[00:06:19] PROBLEM - Host cloudcephosd1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:12:16] RECOVERY - Host cloudcephosd1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms
[00:32:06] Critical Alert for device cr2-eqord.wikimedia.org - CDR bills over 98% used
[00:32:10] Critical Alert for device cr3-ulsfo.wikimedia.org - CDR bills over 98% used
[00:33:06] Critical Alert for device cr2-codfw.wikimedia.org - CDR bills over 98% used
[00:33:10] Critical Alert for device cr1-eqiad.wikimedia.org - CDR bills over 98% used
[00:45:51] (CR) Dzahn: [C: +1] "@Ema I tried that as well and could confirm what you reported. Then also tried with the client from the "uwsc" package. But as Mukunda the" [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[00:48:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:31] (PS3) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[01:22:38] (PS4) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[01:23:56] (CR) jerkins-bot: [V: -1] [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: Dave Pifke)
[01:26:16] (PS5) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[01:27:35] (CR) jerkins-bot: [V: -1] [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: Dave Pifke)
[01:30:39] (PS6) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[01:31:56] (CR) jerkins-bot: [V: -1] [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: Dave Pifke)
[01:33:11] (PS7) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[01:34:29] (CR) jerkins-bot: [V: -1] [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: Dave Pifke)
[01:56:29] (PS8) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[02:02:00] (PS9) Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584)
[02:05:58] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:38] PROBLEM - Check the last execution of package_builder_Clean_up_build_directory on deneb is CRITICAL: CRITICAL: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:26:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:28:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:53:56] (PS1) Andrew Bogott: wmcs/ceph/backy: add a bunch of keys needed for the ceph client config [puppet] - https://gerrit.wikimedia.org/r/618875 (https://phabricator.wikimedia.org/T259192)
[02:54:32] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config [puppet] - https://gerrit.wikimedia.org/r/618875 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott)
[03:07:36] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt1006 is OK: OK: synced at Fri 2020-08-07 03:07:35 UTC. https://wikitech.wikimedia.org/wiki/NTP
[03:16:56] RECOVERY - Check systemd state on cloudvirt1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:09] RECOVERY - configured eth on cloudvirt1006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[03:19:53] RECOVERY - dhclient process on cloudvirt1006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[03:19:57] RECOVERY - HP RAID on cloudvirt1006 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:21:17] RECOVERY - Check systemd state on cloudvirt1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:15] RECOVERY - HP RAID on cloudvirt1004 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:30:04] (PS1) Andrew Bogott: wmcs/ceph: split out rbd client profiles [puppet] - https://gerrit.wikimedia.org/r/618876 (https://phabricator.wikimedia.org/T259192)
[03:31:11] RECOVERY - Disk space on cloudvirt1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1006&var-datasource=eqiad+prometheus/ops
[03:39:55] (PS1) Andrew Bogott: Added fake profile::wmcs::backy2::db_pass [labs/private] - https://gerrit.wikimedia.org/r/618877
[03:40:06] (CR) Andrew Bogott: [V: +2 C: +2] Added fake profile::wmcs::backy2::db_pass [labs/private] - https://gerrit.wikimedia.org/r/618877 (owner: Andrew Bogott)
[03:43:12] (CR) Andrew Bogott: [C: +2] wmcs/ceph: split out rbd client profiles [puppet] - https://gerrit.wikimedia.org/r/618876 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott)
[03:48:42] (PS1) Andrew Bogott: wmcs/ceph/backy: remove reference to 'nova' [puppet] - https://gerrit.wikimedia.org/r/618878 (https://phabricator.wikimedia.org/T259192)
[03:49:47] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backy: remove reference to 'nova' [puppet] - https://gerrit.wikimedia.org/r/618878 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott)
[03:52:17] (PS1) Andrew Bogott: wmcs/ceph/backup: remove another nova-specific ref [puppet] - https://gerrit.wikimedia.org/r/618879 (https://phabricator.wikimedia.org/T259192)
[03:53:04] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backup: remove another nova-specific ref [puppet] - https://gerrit.wikimedia.org/r/618879 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott)
[03:58:35] RECOVERY - puppet last run on cloudvirt1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:59:31] RECOVERY - puppet last run on cloudvirt1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:21:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: analytics1041.eqiad.wmnet, webperf1001.eqiad.wmnet, deneb.codfw.wmnet, webperf2001.codfw.wmnet, wdqs1009.eqiad.wmnet, testreduce1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:23:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:59] (PS1) Elukey: druid: fix settings for sending metrics to the local exporter [puppet] - https://gerrit.wikimedia.org/r/618920
[06:34:22] (CR) Elukey: [C: +2] druid: fix settings for sending metrics to the local exporter [puppet] - https://gerrit.wikimedia.org/r/618920 (owner: Elukey)
[06:34:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092 for upgrade', diff saved to https://phabricator.wikimedia.org/P12191 and previous config saved to /var/cache/conftool/dbconfig/20200807-063431-marostegui.json
[06:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:55] no alter tables on Friday mornings
[06:35:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:58] (PS1) Marostegui: db1092: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/618921 (https://phabricator.wikimedia.org/T250666)
[06:36:42] (CR) Marostegui: [C: +2] db1092: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/618921 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[06:46:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:51:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200807T0700)
[07:09:19] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:37] (CR) JMeybohm: [C: +2] changeprop: Fix repository URL in requirements, bump to 0.9.52 [deployment-charts] - https://gerrit.wikimedia.org/r/618790 (owner: JMeybohm)
[07:10:48] (Merged) jenkins-bot: changeprop: Fix repository URL in requirements, bump to 0.9.52 [deployment-charts] - https://gerrit.wikimedia.org/r/618790 (owner: JMeybohm)
[07:13:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:24:45] (PS1) Marostegui: db1092: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/618923
[07:24:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:25:21] (CR) Marostegui: [C: +2] db1092: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/618923 (owner: Marostegui)
[07:39:02] (PS1) Volans: dns: zone generation improvements [software/netbox-extras] - https://gerrit.wikimedia.org/r/618926
[07:46:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P12192 and previous config saved to /var/cache/conftool/dbconfig/20200807-074658-marostegui.json
[07:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:28] !log prometheus codfw lvextend --resize --size +30G /dev/mapper/vg--ssd-prometheus--k8s
[07:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:32] !log prometheus codfw lvextend --resize --size +60G /dev/mapper/vg--hdd-prometheus--global
[07:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:55:56] (CR) Hashar: [C: +1] "Thanks. Just a question, shouldn't we use the full fingerprint instead of just the last few bits?" [puppet] - https://gerrit.wikimedia.org/r/618771 (https://phabricator.wikimedia.org/T259116) (owner: Alexandros Kosiaris)
[07:57:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:03:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:04:21] (CR) Filippo Giunchedi: [C: +1] profile: update mediawiki errors query to count beyond the 10k limit [puppet] - https://gerrit.wikimedia.org/r/618870 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[08:04:55] (CR) Filippo Giunchedi: [C: +1] prometheus: add default count all query [puppet] - https://gerrit.wikimedia.org/r/618869 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[08:07:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P12193 and previous config saved to /var/cache/conftool/dbconfig/20200807-080719-marostegui.json
[08:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:46] Operations, Wikimedia-Logstash, observability, Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (fgiunchedi) Indeed, we should disable telemetry for kibana-next, I'll followup with a patch
[08:11:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:12:31] (PS1) Jcrespo: mariadb: Add proof of concept of memory alert [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490)
[08:13:27] (CR) jerkins-bot: [V: -1] mariadb: Add proof of concept of memory alert [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490) (owner: Jcrespo)
[08:14:56] (PS1) Filippo Giunchedi: kibana: disable telemetry and newsfeed [puppet] - https://gerrit.wikimedia.org/r/618948 (https://phabricator.wikimedia.org/T259794)
[08:16:36] volunteers welcome for ^ if you feel inclined
[08:17:37] (PS2) Jcrespo: mariadb: Add proof of concept of memory alert [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490)
[08:19:30] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/24375/db2102.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490) (owner: Jcrespo)
[08:24:40] (CR) Jcrespo: "Normal output:" [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490) (owner: Jcrespo)
[08:31:33] (PS3) Filippo Giunchedi: Add Debian packaging [debs/karma] - https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948)
[08:32:26] (CR) Marostegui: [C: +1] "Thanks for working on this" [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490) (owner: Jcrespo)
[08:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P12194 and previous config saved to /var/cache/conftool/dbconfig/20200807-084747-marostegui.json
[08:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:15] Operations, observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (akosiaris) p:Triage→Medium
[08:54:39] (CR) Alexandros Kosiaris: [C: +2] aptrepo: Update jenkins gpg release key [puppet] - https://gerrit.wikimedia.org/r/618771 (https://phabricator.wikimedia.org/T259116) (owner: Alexandros Kosiaris)
[09:01:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:06:56] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: Ema)
[09:07:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:08:45] (CR) Jcrespo: [C: +2] mariadb: Add proof of concept of memory alert [puppet] - https://gerrit.wikimedia.org/r/618947 (https://phabricator.wikimedia.org/T172490) (owner: Jcrespo)
[09:15:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1092', diff saved to https://phabricator.wikimedia.org/P12195 and previous config saved to /var/cache/conftool/dbconfig/20200807-091527-marostegui.json
[09:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:14] Puppet, Beta-Cluster-Infrastructure, VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (...
[09:25:26] (PS8) Ema: ATS: add function profile::trafficserver_caching_rules [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692)
[09:27:43] (CR) Ema: [C: +2] ATS: add function profile::trafficserver_caching_rules [puppet] - https://gerrit.wikimedia.org/r/618537 (https://phabricator.wikimedia.org/T259692) (owner: Ema)
[09:31:06] (PS1) Volans: cameras: remove old stale records [dns] - https://gerrit.wikimedia.org/r/618952 (https://phabricator.wikimedia.org/T207965)
[09:35:38] Operations, Domains, Traffic: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (akosiaris) p:Triage→Medium
[09:35:48] Operations, observability: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (akosiaris) p:Triage→Medium
[09:36:30] Operations, Wikimedia-Logstash, observability, Patch-For-Review, Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (akosiaris) p:Triage→Medium
[09:36:43] Operations, DNS, Traffic: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (akosiaris) p:Triage→Medium
[09:36:59] (PS1) Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143)
[09:38:45] (PS1) DCausse: [wdqs] cleanup useless logger [puppet] - https://gerrit.wikimedia.org/r/618954
[09:43:17] (CR) Volans: "I've tried to both ping and check the arp table / ipv6 neighbors on cr1-eqiad and are all no-show." [dns] - https://gerrit.wikimedia.org/r/618952 (https://phabricator.wikimedia.org/T207965) (owner: Volans)
[09:48:17] Operations, Continuous-Integration-Infrastructure, Jenkins: Update Jenkins gpg release key in reprepro - https://phabricator.wikimedia.org/T259116 (hashar) Open→Resolved a:akosiaris thank you
[09:48:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:49:35] * volans looking
[09:50:03] 503 from googleapi...
[10:00:06] akosiaris: can I trouble you for a quick review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/618948 ?
[10:00:58] (CR) Alexandros Kosiaris: [C: +1] kibana: disable telemetry and newsfeed [puppet] - https://gerrit.wikimedia.org/r/618948 (https://phabricator.wikimedia.org/T259794) (owner: Filippo Giunchedi)
[10:01:06] godog: +1ed
[10:02:48] !log reboot deneb via ganeti2021 (hostname config pointing to recdns for some reason)
[10:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:51] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:04:16] akosiaris: \o/ thank you
[10:04:36] (CR) Filippo Giunchedi: [C: +2] kibana: disable telemetry and newsfeed [puppet] - https://gerrit.wikimedia.org/r/618948 (https://phabricator.wikimedia.org/T259794) (owner: Filippo Giunchedi)
[10:05:50] (CR) Alexandros Kosiaris: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/618771 (https://phabricator.wikimedia.org/T259116) (owner: Alexandros Kosiaris)
[10:07:05] Operations, Continuous-Integration-Infrastructure, Jenkins: Update Jenkins gpg release key in reprepro - https://phabricator.wikimedia.org/T259116 (akosiaris) >>! In T259116#6365994, @akosiaris wrote: >> I could not find where we store that key in puppet :-\ > > That's cause we don't store it. We ju...
[10:07:59] RECOVERY - Check the last execution of package_builder_Clean_up_build_directory on deneb is OK: OK: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:09:00] (PS1) Elukey: deneb: fix docker engine package version [puppet] - https://gerrit.wikimedia.org/r/618955
[10:09:06] kormat: --^
[10:09:09] (if you have a min)
[10:09:47] (CR) Kormat: [C: +1] deneb: fix docker engine package version [puppet] - https://gerrit.wikimedia.org/r/618955 (owner: Elukey)
[10:09:54] (CR) Elukey: [C: +2] deneb: fix docker engine package version [puppet] - https://gerrit.wikimedia.org/r/618955 (owner: Elukey)
[10:13:22] Operations: apt key for `thirdparty/ceph-nautilus/buster` has expired. - https://phabricator.wikimedia.org/T259873 (akosiaris)
[10:13:24] (CR) Ayounsi: [C: +1] "You linked the correct task. DNS should be cleaned up and re-added if we deploy the cams." [dns] - https://gerrit.wikimedia.org/r/618952 (https://phabricator.wikimedia.org/T207965) (owner: Volans)
[10:13:30] Operations: apt key for `thirdparty/ceph-nautilus/buster` has expired. - https://phabricator.wikimedia.org/T259873 (akosiaris) p:Triage→High
[10:15:43] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:03] Operations, Toolhub, Wikimedia-Mailing-lists, User-bd808: Create toolhub-dev@lists.wikimedia.org - https://phabricator.wikimedia.org/T259830 (akosiaris) Open→Resolved p:Triage→Medium a:akosiaris List created, the user should have receiver the password.
[10:23:07] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:24:27] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:25:30] Operations, Citoid, serviceops, Patch-For-Review: citoid /api LVS check reports HTTP 200 instead of HTTP 404 - https://phabricator.wikimedia.org/T259469 (akosiaris) Open→Resolved a:akosiaris Thanks @mvolz. Problem resolved.
[10:27:23] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime
[10:27:24] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:25] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash1025.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[10:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:29] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash1025.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[10:27:42] (CR) Hashar: [C: +1] "Bah there is no CI configured for that repository. So that requires a Verified+2 and manual submit ;)" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/404222 (https://phabricator.wikimedia.org/T184882) (owner: Thcipriani)
[10:32:49] the pybal ipvs diff is me, looking
[10:33:21] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:33:23] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:37:07] Operations, Wikimedia-Logstash, observability, Patch-For-Review, Privacy: Kibana next sending telemetry to elastic.co - https://phabricator.wikimedia.org/T259794 (fgiunchedi) Open→Resolved a:fgiunchedi @Rxy telemetry is disabled now and indeed I can't see the requests anymore, ple...
[10:37:10] Operations, Wikimedia-Logstash, Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (fgiunchedi)
[10:41:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:43:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:47:56] (CR) JMeybohm: [C: -1] helmfile: strawman refactoring (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[10:58:05] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[11:05:27] (PS1) Hashar: Dummy change to test CI [software/transferpy] - https://gerrit.wikimedia.org/r/618959 (https://phabricator.wikimedia.org/T253736)
[11:07:46] (Abandoned) Hashar: Dummy change to test CI [software/transferpy] - https://gerrit.wikimedia.org/r/618959 (https://phabricator.wikimedia.org/T253736) (owner: Hashar)
[11:11:20] (CR) Jcrespo: "We could change it to buster, but I am not sure what is the policy for wmf packages." [software/transferpy] - https://gerrit.wikimedia.org/r/618959 (https://phabricator.wikimedia.org/T253736) (owner: Hashar)
[11:11:58] (PS1) Ema: ATS: align req_handling/mapping_rules on traffic-cache-atstext [puppet] - https://gerrit.wikimedia.org/r/618960 (https://phabricator.wikimedia.org/T259692)
[11:15:42] (CR) Ema: [C: +2] ATS: align req_handling/mapping_rules on traffic-cache-atstext [puppet] - https://gerrit.wikimedia.org/r/618960 (https://phabricator.wikimedia.org/T259692) (owner: Ema)
[11:16:35] (CR) Privacybatm: "> Patch Set 1:" [software/transferpy] - https://gerrit.wikimedia.org/r/618959 (https://phabricator.wikimedia.org/T253736) (owner: Hashar)
[11:18:29] (CR) Jcrespo: "Moritz: is there a best practice for wmf packages? What should we call the distro? Is "buster" ok or should we call them wikimedia and ign" [software/transferpy] - https://gerrit.wikimedia.org/r/618959 (https://phabricator.wikimedia.org/T253736) (owner: Hashar)
[11:21:30] (PS6) Ema: ATS: add new backend for phabricator aphlict [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[11:23:29] (CR) Ema: [C: +2] ATS: add new backend for phabricator aphlict [puppet] - https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[11:26:45] (PS5) Ema: cache: add type Profile::Cache::Sites [puppet] - https://gerrit.wikimedia.org/r/618745
[11:45:50] (CR) Hashar: "The distribution in debian/changelog MUST be a valid one, else pbuilder fails since it can not recognize the distribution and we don't kno" [software/transferpy] - https://gerrit.wikimedia.org/r/618959 (https://phabricator.wikimedia.org/T253736) (owner: Hashar)
[11:48:38] (PS1) Hnowlan: api-gateway: hack to support wikifeeds [deployment-charts] - https://gerrit.wikimedia.org/r/618963 (https://phabricator.wikimedia.org/T246265)
[11:52:53] (PS6) Ema: cache: add type Profile::Cache::Sites [puppet] - https://gerrit.wikimedia.org/r/618745
[11:54:14] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/618745 (owner: Ema)
[12:00:53] (CR) Ema: [C: +2] cache: add type Profile::Cache::Sites [puppet] - https://gerrit.wikimedia.org/r/618745 (owner: Ema)
[12:25:25] Operations, Traffic: Generate ATS cache.config from software-agnostic data structures - https://phabricator.wikimedia.org/T259692 (ema) Open→Resolved Done, `profile::trafficserver::backend::caching_rules` is now gone. `cache.config` is generated by parsing `req_handling` and `alternate_domains`....
[12:28:23] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [12:28:40] (03PS1) 10Hnowlan: api-gateway: enable TLS when talking to appservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) [12:34:34] (03PS2) 10Filippo Giunchedi: Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) [12:39:02] (03PS1) 10ZPapierski: Bump the weight of near match for search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618974 (https://phabricator.wikimedia.org/T257922) [12:42:54] (03PS1) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [12:43:16] (03PS8) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [12:44:13] (03CR) 10jerkins-bot: [V: 04-1] cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [12:45:51] (03CR) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [12:52:35] (03PS2) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [13:00:24] (03CR) 10JMeybohm: [C: 04-1] api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) 
(owner: 10Hnowlan) [13:07:52] (03CR) 10DCausse: [C: 03+1] Bump the weight of near match for search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618974 (https://phabricator.wikimedia.org/T257922) (owner: 10ZPapierski) [13:27:30] !log bounce pybal on lvs1016 and then lvs1015 to reset state, logstash1025 reported down but actually up [13:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:54] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [13:28:57] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:53] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:33:37] (03PS1) 10Andrew Bogott: wmcs/ceph/backup: fix some copy/paste errors [puppet] - 10https://gerrit.wikimedia.org/r/618980 [13:35:06] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for David Rochford (Drochford) - https://phabricator.wikimedia.org/T259713 (10akosiaris) 05Open→03Resolved a:03akosiaris I 've reached out to @drochford to double check, I 've just added the user in the wmf ldap group. @drochford, you are... [13:35:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:37:09] (03PS1) 10Alexandros Kosiaris: admin: Add drochford to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/618984 (https://phabricator.wikimedia.org/T259713) [13:39:06] (03PS2) 10Andrew Bogott: wmcs/ceph/backup: fix some copy/paste errors, move db host [puppet] - 10https://gerrit.wikimedia.org/r/618980 [13:39:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add drochford to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/618984 (https://phabricator.wikimedia.org/T259713) (owner: 10Alexandros Kosiaris) [13:41:05] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backup: fix some copy/paste errors, move db host [puppet] - 10https://gerrit.wikimedia.org/r/618980 (owner: 10Andrew Bogott) [13:41:53] (03PS1) 10Marostegui: wikireplicas_dns: Depool dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/618986 (https://phabricator.wikimedia.org/T255408) [13:42:22] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for David Rochford (Drochford) - https://phabricator.wikimedia.org/T259713 (10Liefx) a:05akosiaris→03Liefx Hi guys, I'm so confused Liefx is my account. I was requesting LDAP access for LiefX [13:47:39] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for David Rochford (Drochford) - https://phabricator.wikimedia.org/T259713 (10akosiaris) a:05Liefx→03None @Liefx Please open a dedicated task for your request please. Make sure to explain what you request and why. 
[13:51:17] (03PS2) 10DCausse: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) [13:51:51] (03PS3) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [13:52:05] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [13:53:54] (03CR) 10Kormat: [C: 03+1] wikireplicas_dns: Depool dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/618986 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [13:56:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Please don't add things in stdlib. wmflib is the place to add this. We try to keep stdlib pristine so that upgrades between the versions o" [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [13:57:33] (03PS1) 10Ema: cache: move all VCL files to the same directory [puppet] - 10https://gerrit.wikimedia.org/r/618989 (https://phabricator.wikimedia.org/T241239) [13:58:13] (03PS2) 10Ema: cache: move all VCL files to the same directory [puppet] - 10https://gerrit.wikimedia.org/r/618989 (https://phabricator.wikimedia.org/T241239) [13:58:15] akosiaris: thanks, I had this feeling that was the wrong place but got confirmation by a fellow colleague too [13:58:27] volans: yw [13:58:57] should just be a matter of mv right? [13:59:15] volans: if you ever need someone to tell you "you're doing it wrong", my door is always open! ;) [13:59:39] don't forget to knock! [13:59:54] haha [14:00:03] ema: why loosing the opportunity to have kormat telling me I did it wrong because I didn't knocked? 
[14:01:11] (03PS1) 10Ottomata: Don't allow an env set CONDA_USER_ENV to override $1 in anaconda-activate-stacked-env [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/618990 [14:02:32] volans: that was a highly sophisticated quote from https://it.wikipedia.org/wiki/I_due_superpiedi_quasi_piatti [14:02:49] ahahah [14:02:51] 10+ --^ [14:02:54] sorry, I missed it [14:03:09] akosiaris: what would be the right path for the spec test? [14:03:28] /modules/wmflib/spec doesn't seem to have the hierarchy I was expecting :) [14:03:31] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (... [14:03:43] volans: that would be my suggestion [14:04:02] yeah, but wher einside that [14:04:29] I'll put in functions [14:04:42] it's a bit of a mix but seems ok [14:05:35] (03PS3) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:05:37] (03PS3) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:05:39] (03PS3) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:06:02] (03CR) 10Volans: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:06:22] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:07:19] what did I do wrong... 
[14:07:25] kormat: [14:07:28] please it's your turn [14:07:51] 2 offenses detected [14:10:52] let's retry [14:11:01] (03PS4) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:11:03] (03PS4) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:11:05] (03PS4) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:11:43] (03PS2) 10Hnowlan: api-gateway: enable TLS when talking to appservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) [14:12:32] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:14:31] (03CR) 10Hnowlan: api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [14:18:00] (03CR) 10Kormat: [C: 03+2] Split utilities into separate packages [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618513 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [14:20:45] (03PS5) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:20:47] (03PS5) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:20:49] (03PS5) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:22:21] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:27:53] (03PS1) 10Andrew Bogott: wmcs/ceph/backup: remove quotes that were messing with ferm [puppet] - 10https://gerrit.wikimedia.org/r/618992 [14:29:27] (03CR) 10Andrew Bogott: [C: 03+2] 
wmcs/ceph/backup: remove quotes that were messing with ferm [puppet] - 10https://gerrit.wikimedia.org/r/618992 (owner: 10Andrew Bogott) [14:29:44] (03PS1) 10RLazarus: Fix the name of Newfoundland and Labrador (comment-only change) [dns] - 10https://gerrit.wikimedia.org/r/618993 [14:30:34] (03CR) 10CDanis: [C: 03+1] Fix the name of Newfoundland and Labrador (comment-only change) [dns] - 10https://gerrit.wikimedia.org/r/618993 (owner: 10RLazarus) [14:30:58] (03CR) 10RLazarus: [C: 03+2] Fix the name of Newfoundland and Labrador (comment-only change) [dns] - 10https://gerrit.wikimedia.org/r/618993 (owner: 10RLazarus) [14:33:53] rzl: thank you for your service [14:34:02] o7 [14:34:41] I wanted to do the maple leaf emoji but oh well :P [14:34:51] 🇨🇦 🇨🇦 🇨🇦 🇨🇦 🇨🇦 [14:35:07] haha [14:35:30] (03CR) 10Ppchelko: [C: 04-1] api-gateway: hack to support wikifeeds (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618963 (https://phabricator.wikimedia.org/T246265) (owner: 10Hnowlan) [14:35:46] 🍁 [14:37:08] (03PS6) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:37:11] (03PS6) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:37:13] (03PS6) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:37:55] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:39:28] (03PS7) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:39:29] 7 is the charm right? 
:D [14:39:30] (03PS7) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:39:32] (03PS7) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:40:19] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:40:25] (03PS1) 10Andrew Bogott: Added ipv6 addresses for cloudvirt1004 and cloudvir1006 [dns] - 10https://gerrit.wikimedia.org/r/618995 (https://phabricator.wikimedia.org/T259192) [14:40:54] (03CR) 10jerkins-bot: [V: 04-1] Added ipv6 addresses for cloudvirt1004 and cloudvir1006 [dns] - 10https://gerrit.wikimedia.org/r/618995 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [14:44:47] (03PS8) 10Volans: wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:44:49] (03PS8) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:44:51] (03PS8) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:44:59] (03PS2) 10Andrew Bogott: Added ipv6 addresses for cloudvirt1004 and cloudvir1006 [dns] - 10https://gerrit.wikimedia.org/r/618995 (https://phabricator.wikimedia.org/T259192) [14:45:33] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 (owner: 10Volans) [14:46:03] (03CR) 10jerkins-bot: [V: 04-1] Added ipv6 addresses for cloudvirt1004 and cloudvir1006 [dns] - 10https://gerrit.wikimedia.org/r/618995 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [14:47:26] (03PS3) 10Andrew Bogott: Added ipv6 addresses for cloudvirt1004 and cloudvir1006 [dns] - 10https://gerrit.wikimedia.org/r/618995 (https://phabricator.wikimedia.org/T259192) [14:47:57] (03PS9) 10Volans: wmflib: add 
netmask_to_cidr parser function [puppet] - 10https://gerrit.wikimedia.org/r/618765 [14:47:59] (03PS9) 10Volans: interface::alias: add optional is_service_ip param [puppet] - 10https://gerrit.wikimedia.org/r/618766 [14:48:01] (03PS9) 10Volans: cassandra::instance: use real netmask for IP alias [puppet] - 10https://gerrit.wikimedia.org/r/618767 [14:48:34] (03CR) 10Andrew Bogott: [C: 03+2] Added ipv6 addresses for cloudvirt1004 and cloudvir1006 [dns] - 10https://gerrit.wikimedia.org/r/618995 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [14:58:32] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins: Update Jenkins gpg release key in reprepro - https://phabricator.wikimedia.org/T259116 (10hashar) [15:01:21] !log import DNS names for network devices in Netbox - T258729 [15:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:24] T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1 - https://phabricator.wikimedia.org/T258729 [15:02:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:05:30] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10hashar) [15:06:53] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10hashar) I filed a dupe of this task. The admin interface states the notification server is not reachable. At https... 
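Note on the netmask_to_cidr change above (https://gerrit.wikimedia.org/r/618765): the log does not show the function body, so what follows is only a minimal Python sketch of the general technique, assuming a dotted-quad IPv4 netmask as input and a CIDR prefix length as output. The name is borrowed from the patch title, but the signature, the error handling and the language are assumptions made for illustration, not the wmflib implementation.

```python
import ipaddress


def netmask_to_cidr(netmask: str) -> int:
    """Convert a dotted-quad IPv4 netmask to its CIDR prefix length.

    Hypothetical helper for illustration only; not the code from the
    Gerrit change mentioned in the log.
    """
    # Parse the mask as a 32-bit integer (raises ValueError on bad input).
    mask = int(ipaddress.IPv4Address(netmask))

    # A valid netmask is a run of 1 bits followed by a run of 0 bits.
    # Inverting it must therefore give a value of the form 2**k - 1,
    # which is exactly the case where x & (x + 1) == 0.
    inverted = mask ^ 0xFFFFFFFF
    if inverted & (inverted + 1):
        raise ValueError(f"not a contiguous netmask: {netmask}")

    # The prefix length is the number of 1 bits in the mask.
    return bin(mask).count("1")


if __name__ == "__main__":
    for example in ("255.255.255.0", "255.255.240.0", "255.255.255.255"):
        print(example, "->", netmask_to_cidr(example))
```

The contiguity check is done by hand here because Python's `ipaddress.IPv4Network` also accepts host masks such as `0.0.0.255`, which would hide a mistyped netmask.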
[15:11:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:19:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:24:39] (03PS2) 10Elukey: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [15:28:33] elukey: I will add the debian packaging CI job for debs/hue ;) [15:29:34] hashar: I am still trying to make it work, sadly it is kinda "special" since it needs to pull stuff from pypi/npm when building :( [15:29:42] can't make it work yet on deneb [15:29:43] WHAT? ;D [15:30:07] yes basically upstream has a makefile that packs all in a "build" directory [15:30:12] so I guess it needs network access which we can enable by setting PBUILDER_USENETWORK [15:30:14] and then the daemon runs all from there [15:30:57] didnt know about it, I am trying to pass https_proxy=etc.. but I have failed up to now [15:31:26] pretty sure that pbuilder (or something similar) disables network by default [15:31:30] using some black magic in linux [15:31:42] yep yep [15:32:22] or you can vendor the nodejs/python modules inside the tree bah :\ [15:32:49] elukey: here is the magic for CI https://gerrit.wikimedia.org/r/c/integration/config/+/618999 [15:33:00] hashar: I wanted to do it manually on deneb before releasing the package, but the deps are ~500MB in size (!!) [15:33:11] and I need several packages to make it work (kerberos, sasl, etc..) 
[15:33:18] so the sandbox was nice for this reason [15:33:24] but it is hacky I know [15:33:52] an alternative is to not use a debian package but stuff everything in a git repo and push it with scap [15:34:04] and possibly have the large dependencies stored via git-fat [15:36:02] (03CR) 10Hashar: "recheck https://gerrit.wikimedia.org/r/c/integration/config/+/618999" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [15:40:17] (03CR) 10Hashar: "The CI job fails since the repository is empty. That can be fixed by adding a dummy commit which adds the .gitreview file and rebasing thi" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [15:40:47] elukey: the job is deployed but CI requires at least one commit in the repo before being able to process (known bug). A commit that adds .gitreview is probably sufficient ;) [15:41:01] elukey: anyway I am off, poke me again about it next week. I will be happy to help [15:41:27] hashar: will check thanks a lot! [15:41:41] 10Operations, 10Security-Team, 10SecTeam Discussion, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236 (101) Hello! I am new here and would like to help and develop Wikimedia. What do I need to do for that? [15:45:24] 10Operations, 10Security-Team, 10SecTeam Discussion, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236 (10sbassett) 05Open→03Resolved a:03Reedy I think this can be resolved for now, since I believe @reedy set this up for the #security-team. [16:03:09] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10wkandek) Will, is the unexpected and unknown state due to the Cassandra database state. I remember this being discussed a couple of days ago. 
[16:07:34] brennen: hey, i'm here now if you want to talk about https://phabricator.wikimedia.org/T259855 [16:09:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:09:55] MatmaRex: hey, thanks - so i defer to your judgment here - if individual reverts can happen quickly, that's probably preferable to having the train on .2 over the weekend [16:10:35] brennen: i think so, i'll have it for review in a minute [16:13:10] brennen: https://gerrit.wikimedia.org/r/619005 [16:18:53] Hey hey. [16:18:57] I'm on deck. [16:19:08] MatmaRex: thanks - my home internet connection has chosen a very inopportune time to crap out, soliciting assistance [16:19:16] thanks James_F [16:20:21] i was talking to edsanders about it too (hey, didn't notice you on IRC) [16:20:30] * James_F nods. [16:20:58] MatmaRex: Want to wait to confirm on Beta Cluster, or should we JFDI? [16:21:58] it won't hurt. i tested locally and it works fine [16:25:19] Is there a way to ensure that the new version of these changes (whenever they arrive) have test coverage by CI? 
[16:28:27] dancy: i think the only way would be with browser tests – the problem is somewhere on the boundary between DiscussionTools and Parsoid (i suspect we passed it wrong parameters somehow) [16:28:52] so you'd need to have a whole wiki with Parsoid and RESTBase all set up to reproduce the problem [16:28:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:29:39] It might even be related to the WMF-specific Parsoid config, which is a mess. [16:31:19] MatmaRex: Hopefully we can achieve that some day. [16:32:50] i'm not sure what the current state of browser tests, i think we use them somewhere but they also seem to regularly get rewritten from scratch and i don't really know how one is supposed to write them these days [16:34:32] James_F: brennen: are you backporting? [16:34:37] Yes. [16:34:52] (03PS1) 10Jforrester: Revert new reply API [extensions/DiscussionTools] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618822 (https://phabricator.wikimedia.org/T259855) [16:34:53] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:57] (03CR) 10Jforrester: [C: 03+2] Revert new reply API [extensions/DiscussionTools] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618822 (https://phabricator.wikimedia.org/T259855) (owner: 10Jforrester) [16:36:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:38:25] (03Merged) 10jenkins-bot: Revert new reply API [extensions/DiscussionTools] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/618822 (https://phabricator.wikimedia.org/T259855) (owner: 10Jforrester) [16:40:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:41:50] Deploying now. [16:42:53] !log jforrester@deploy1001 Synchronized php-1.36.0-wmf.3/extensions/DiscussionTools/: T259855 Revert new reply API (duration: 01m 06s) [16:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:56] T259855: DiscussionTools touched unrelated parts of the page - https://phabricator.wikimedia.org/T259855 [16:43:09] (03PS1) 10Andrew Bogott: wmcs/backy2/ceph: add admin keyring so backy can access things [puppet] - 10https://gerrit.wikimedia.org/r/619011 (https://phabricator.wikimedia.org/T259192) [16:43:11] (03PS1) 10Andrew Bogott: wmcs/ceph/backy: add python3 packages [puppet] - 10https://gerrit.wikimedia.org/r/619012 [16:43:39] MatmaRex: Seem fixed to you? [16:45:11] hmm [16:46:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:46:30] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/ceph/backy: add python3 packages [puppet] - 10https://gerrit.wikimedia.org/r/619012 (owner: 10Andrew Bogott) [16:46:36] James_F: i tried replying on https://ko.wikipedia.org/wiki/위키백과:사랑방_(기술)/2020년_7월 and i got an error (which i stupidly didn't copy) about linterrors being undefined. but i tried again and it seems fine now [16:46:53] Might have been a cached code glitch? [16:46:53] i'm not sure if that's a real problem or something about the deployment [16:46:57] Yeah. :-( [16:46:57] yeah [16:47:03] Gah, PHP, you suck so much. [16:47:15] Let's provisionally hope that it's fixed. [16:47:21] (03PS2) 10Andrew Bogott: wmcs/ceph/backy: add python3 packages for rbd and rados [puppet] - 10https://gerrit.wikimedia.org/r/619012 [16:47:23] (03CR) 10Andrew Bogott: [C: 03+2] wmcs/backy2/ceph: add admin keyring so backy can access things [puppet] - 10https://gerrit.wikimedia.org/r/619011 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [16:47:27] for the record i never even tried reproducing the original issue, i was just going to see if replying still works now [16:47:37] * James_F nods. [16:47:38] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] wmcs/ceph/backy: add python3 packages for rbd and rados [puppet] - 10https://gerrit.wikimedia.org/r/619012 (owner: 10Andrew Bogott) [16:48:03] MatmaRex: I reproduced it locally [16:48:04] "TypeError: Cannot read property 'linterrors' of undefined", i got it again [16:48:25] ugh, is there a dependent commit in VE? [16:49:01] Argh, more reverts needed? [16:49:03] edsanders: that TypeError doesn't happen every time, i'm not sure if it's a real bug or some cached code [16:49:33] OK i think i see it. 
it happens when opening in source mode [16:49:52] it's working locally for me [16:49:56] Happy to deploy a small quick fix if you have it. [16:49:59] in source mode [16:51:36] is this deployed or do I need to use WMDebug? [16:51:51] edsanders: Deployed. [16:52:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:52:23] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Nuria) I think L3 doc is signed now so we can proceeed? [16:52:26] (03CR) 10Dzahn: "oh, you merged it. thank you 😊" [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [16:53:16] meanwhile, while i can't get kibana to open whatever, there's a high count of "i/u/User:3429 CAS update failed on user_touched. The version of the user to be saved is older than the current version." in logspam-watch [16:53:38] Source mode is loading fine for me here: https://hu.wikipedia.org/wiki/Szerkeszt%C5%91vita:ESanders_(WMF) [16:53:47] edsanders: James_F: i don't see it any more. shrug [16:54:02] * James_F crosses fingers. 
[16:54:37] i'll drop a note about this on the task, in case someone else also encounters it, but i think it was some deployment issue with incompatible versions running together [16:57:02] (03PS2) 10Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) [16:57:03] (03CR) 10jerkins-bot: [V: 04-1] Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [16:57:05] (03PS3) 10Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) [16:57:40] (03CR) 10jerkins-bot: [V: 04-1] Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [16:58:38] (03PS4) 10Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) [17:01:14] James_F: brennen: thanks, and sorry about this! [17:01:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:09:20] (03PS1) 10Volans: mgmt codfw: migrated Papaul's IP to Netbox [dns] - 10https://gerrit.wikimedia.org/r/619015 (https://phabricator.wikimedia.org/T233183) [17:09:54] (03CR) 10Volans: "I've added the IP in Netbox here:" [dns] - 10https://gerrit.wikimedia.org/r/619015 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:12:04] (03PS3) 10Hnowlan: api-gateway: enable TLS when talking to appservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) [17:13:21] (03PS5) 10Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) [17:15:02] (03CR) 10Urbanecm: "I tested quickly at mwdebug1001 what happens when default => null is used. It seems it doesn't set it to null, but makes IS.php processing" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [17:15:19] (03Abandoned) 10Urbanecm: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618091 (https://phabricator.wikimedia.org/T259004) (owner: 10Urbanecm) [17:15:24] (03PS3) 10Urbanecm: Point muswiki and mhwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) [17:26:30] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10RobH) [17:26:32] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10RobH) [17:26:57] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - 
https://phabricator.wikimedia.org/T259909 (10RobH) [17:27:03] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10RobH) [17:27:49] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10RobH) > 10:23 < robh> : so dc opsen > 10:23 < robh> : we have a number of 'in place upgrades' this fiscal year > 10:23 < robh> : should we add a column for that or... [17:27:51] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10RobH) > 10:23 < robh> : so dc opsen > 10:23 < robh> : we have a number of 'in place upgrades' this fiscal year > 10:23 < robh> : should we add a column for that or... [17:31:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:42:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:44:36] (03PS1) 10Bstorm: paws: monitor the new URLs instead of the deprecated ones [puppet] - 10https://gerrit.wikimedia.org/r/619019 (https://phabricator.wikimedia.org/T211096) [18:03:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:15:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:27:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:34:40] (03CR) 10Volans: "Latest compiler results still looks good: https://puppet-compiler.wmflabs.org/compiler1001/24381/" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [18:34:45] (03PS2) 10Ppchelko: Support wikifeeds in api-gateway. [deployment-charts] - 10https://gerrit.wikimedia.org/r/618963 (https://phabricator.wikimedia.org/T246265) (owner: 10Hnowlan) [18:35:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:48:34] (03CR) 10Ppchelko: api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [18:52:15] (03CR) 10Eevans: [C: 03+1] "As I mentioned to @Volans elsewhere, I'm not sure I have the context necessary to appreciate why this would be necessary, but also can't t" [puppet] - 10https://gerrit.wikimedia.org/r/618767 (owner: 10Volans) [18:54:29] (03CR) 10Ppchelko: api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [19:03:12] (03CR) 10Ppchelko: api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [19:03:59] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:05:48] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:17:44] (03CR) 10Ppchelko: [C: 04-1] "I still believe we gotta use the health_check filter..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [19:18:19] (03CR) 10Ppchelko: [C: 04-1] "- name: envoy.health_check" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [19:22:21] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [19:45:41] (03PS1) 10Herron: logstash: increase 'hdd' hosts heap from 24G to 26G [puppet] - 10https://gerrit.wikimedia.org/r/619032 (https://phabricator.wikimedia.org/T259219) [20:16:18] (03CR) 10Cwhite: [C: 03+2] prometheus: add default count all query [puppet] - 10https://gerrit.wikimedia.org/r/618869 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:17:54] (03CR) 10Cwhite: [C: 03+2] profile: update mediawiki errors query to count beyond the 10k limit [puppet] - 10https://gerrit.wikimedia.org/r/618870 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [20:20:02] 10Operations, 10ops-eqiad, 10netops: new cloudflare xconnect to cr1-eqiad - https://phabricator.wikimedia.org/T259923 (10RobH) p:05Triage→03Medium [20:38:25] 10Operations, 10ops-eqiad: relforge1001's mgmt IP not reachable - https://phabricator.wikimedia.org/T259777 (10wiki_willy) a:03Cmjohnson [20:39:12] (03PS1) 10Ottomata: Revert "Bump refine job refinery version to 0.0.132 to fix $schema field bug" [puppet] - 10https://gerrit.wikimedia.org/r/618825 (https://phabricator.wikimedia.org/T259924) [20:40:46] (03PS2) 10Ottomata: Revert "Bump refine job refinery version to 0.0.132 to fix $schema field bug" [puppet] - 10https://gerrit.wikimedia.org/r/618825 
(https://phabricator.wikimedia.org/T259924) [20:40:51] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Revert "Bump refine job refinery version to 0.0.132 to fix $schema field bug" [puppet] - 10https://gerrit.wikimedia.org/r/618825 (https://phabricator.wikimedia.org/T259924) (owner: 10Ottomata) [20:44:33] (03PS1) 10Dzahn: ATS: set caching to 'websockets' for Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/619036 (https://phabricator.wikimedia.org/T238593) [21:21:53] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Clarakosi) [21:23:57] (03PS10) 10Dave Pifke: [WIP] Swift object servers: enable object-expirer [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) [21:53:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:55:03] (03PS11) 10Dave Pifke: Swift object servers: enable object-expirer [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) [21:59:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:59:56] (03CR) 10Bstorm: [C: 03+2] paws: monitor the new URLs instead of the deprecated ones [puppet] - 10https://gerrit.wikimedia.org/r/619019 (https://phabricator.wikimedia.org/T211096) (owner: 10Bstorm) [22:02:39] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10Tgr) This also prevents me from making a card donation (via the donation link in the sidebar menu, but I imagine click... [22:02:43] (03CR) 10Dave Pifke: "This was tested in beta, and now works as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [22:04:40] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10Tgr) Related: {T122097} [22:12:56] (03CR) 10Dave Pifke: "Puppet compiler output (for a couple of random storage servers), showing this is mostly a no-op unless enabled: https://puppet-compiler.wm" [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [22:26:17] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (10Platonides) It could go both ways. If as an Hungarian with only Hungarian credit card, and temporarily visiting the US...