[00:28:55] (CR) Legoktm: "As discussed on IRC, I agree that buildpack images should be in a separate heirarchy. I like your suggestion of buster-1 style names. I wr" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) (owner: Legoktm)
[00:37:57] (PS5) Legoktm: Add buildpack images ("stacks") [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686)
[00:46:58] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (wiki_willy) Hi @LSobanski - John is still out with the broken hand and Chris was out sick the entire week, so we'll sync up next week to where things currently are with this one. Thanks,...
[00:48:56] (PS6) Legoktm: Add buildpack images ("stacks") [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686)
[00:53:53] (PS2) Razzi: stats: Add envoy on port 8443 alongside nginx [puppet] - https://gerrit.wikimedia.org/r/634667 (https://phabricator.wikimedia.org/T240439)
[01:02:22] (PS1) Razzi: stats: temporarily switch analytics sites to port 8443 [puppet] - https://gerrit.wikimedia.org/r/634669 (https://phabricator.wikimedia.org/T240439)
[01:24:20] PROBLEM - SSH on analytics1049.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:25:04] RECOVERY - SSH on analytics1049.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:04:12] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:04:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:04:42] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:18:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:19:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:22:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:59:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:02:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:18:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:04:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:05:55] (PS1) Elukey: Decommission analytics1053 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/634673 (https://phabricator.wikimedia.org/T255140)
[06:06:50] (CR) Elukey: [C: +2] Decommission analytics1053 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/634673 (https://phabricator.wikimedia.org/T255140) (owner: Elukey)
[06:11:06] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:12:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:43] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:33:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:43:31] (PS2) ArielGlenn: dumps::web::html Add landing page/readme for pageview-complete dumps [puppet] - https://gerrit.wikimedia.org/r/634650 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[06:57:11] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201017T0700)
[07:03:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:19] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (Volans) p:Medium→High a:Dzahn→RobH The IPs were allocated manually outside of Netbox and as such they could be allocated to a different host by Net...
[08:30:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:48:37] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 69 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:54:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 58 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:07:23] PROBLEM - Apache HTTP on testvm1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[11:08:57] PROBLEM - mediawiki-installation DSH group on testvm1001 is CRITICAL: Host testvm1001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[11:57:32] Operations, Wikimedia-Mailing-lists: Create arbcom-cs mailinglist for Czech Arbitration Committee - https://phabricator.wikimedia.org/T265472 (Urbanecm) Resolved→Open Thank you. However, no email came to the address. Can you verify everything is all right? If it is, could you please send the pass...
[12:53:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:55:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:22:22] !log [urbanecm@mwmaint2001 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Fæ . # T264529
[13:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:30] T264529: Large file upload request for 0399CHRO - https://phabricator.wikimedia.org/T264529
[14:15:09] Operations, serviceops, Growth-Team (Current Sprint), Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (kostajh) >>! In T252391#6548054, @Krinkle wrote: > @kostajh Can you confirm whether something does or does not need to change...
[14:58:30] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (patilise) It is 3 months since the extension was disabled. When will any news be announced?
[15:25:15] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (Dzahn) @RobH Please do not assign service puppet roles on new hosts. That will almost never work. Just add new hosts in site with the "insetup" role and hand over...
[15:27:55] (CR) Dzahn: "please don't add service puppet roles to new hosts before handing over to service owner. just use the "insetup" role during OS install and" [puppet] - https://gerrit.wikimedia.org/r/634598 (https://phabricator.wikimedia.org/T265653) (owner: RobH)
[17:05:14] (PS3) Reedy: Stop installing timidity and freepats on appservers [puppet] - https://gerrit.wikimedia.org/r/445604
[17:14:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:16:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:35:32] Operations, Wikimedia-Mailing-lists: Create arbcom-cs mailinglist for Czech Arbitration Committee - https://phabricator.wikimedia.org/T265472 (JMeybohm) Open→Resolved I did reset the password and send it to both of the supplied mail addresses.
[17:51:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:53:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:42:49] (PS1) Andrew Bogott: bootstrap-vz: add some python3 packages to the stretch manifest [puppet] - https://gerrit.wikimedia.org/r/634741
[18:43:12] (CR) Andrew Bogott: [C: +2] bootstrap-vz: add some python3 packages to the stretch manifest [puppet] - https://gerrit.wikimedia.org/r/634741 (owner: Andrew Bogott)
[21:11:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:16:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:36:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:38:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:41:17] (PS1) Andrew Bogott: wmcs admin scripts: added wmcs-imageusage [puppet] - https://gerrit.wikimedia.org/r/634754 (https://phabricator.wikimedia.org/T263461)
[21:44:55] (CR) Andrew Bogott: [C: +2] wmcs admin scripts: added wmcs-imageusage [puppet] - https://gerrit.wikimedia.org/r/634754 (https://phabricator.wikimedia.org/T263461) (owner: Andrew Bogott)
[21:56:28] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 56.17 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:59:54] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 86.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:12:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:14:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:54:58] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 70 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:06:20] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 52 probes of 567 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:25:44] PROBLEM - snapshot of s2 in eqiad on alert1001 is CRITICAL: snapshot for s2 at eqiad taken more than 3 days ago: Most recent backup 2020-10-14 23:02:40 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:45:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:47:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:58:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets