[00:39:14] Operations, DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (RobH) p:Triage→Medium
[00:48:36] (CR) Cwhite: "@Filippo, I'm not sure what direction to take the debianization part of this. Do you happen to know of a good way to handle the multiple-" [debs/prometheus-icinga-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/626001 (owner: Cwhite)
[00:58:48] Operations, DC-Ops: documented procedure for replacing disks in software RAID servers - https://phabricator.wikimedia.org/T220842 (Papaul) For me it goes back to the service owner. Also I think all the SW RAID servers now have the bootloader installed on both disks if the server has 2 disks, but not sure.
[01:09:37] (PS1) Krinkle: labs: Remove old wgWMECitationUsage* settings for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/626019 (https://phabricator.wikimedia.org/T213969)
[02:12:44] Operations, MW-on-K8s, TechCom-RFC, serviceops, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) One outstanding question is what to do about the restrictions bitfield. In production, firejail will be disabled an...
[02:27:51] (CR) Krinkle: [C: +2] labs: Remove old wgWMECitationUsage* settings for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/626019 (https://phabricator.wikimedia.org/T213969) (owner: Krinkle)
[02:29:05] (Merged) jenkins-bot: labs: Remove old wgWMECitationUsage* settings for Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/626019 (https://phabricator.wikimedia.org/T213969) (owner: Krinkle)
[02:47:10] RECOVERY - SSH on wtp1047.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:47:29] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[02:54:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:56:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:08:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:12:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:06:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:10:02] RECOVERY - SSH on wdqs1005.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:12:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:04:16] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[05:04:16] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:05:38] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:38] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:10:16] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 15.94 ms
[05:10:16]
RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 19.15 ms
[05:11:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:21:02] !log push new pfw policies - T262297
[06:21:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:01] (PS1) Elukey: profile::mjolnir::kafka_msearch_daemon: clean up old code [puppet] - https://gerrit.wikimedia.org/r/626055 (https://phabricator.wikimedia.org/T260305)
[06:22:27] Operations, fundraising-tech-ops, netops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (ayounsi)
[06:24:13] (CR) Elukey: [C: +2] profile::mjolnir::kafka_msearch_daemon: clean up old code [puppet] - https://gerrit.wikimedia.org/r/626055 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[06:32:30] PROBLEM - Check systemd state on db1113 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:35:35] (PS1) Elukey: role::elasticsearch::relforge: add missing hiera config for mjolnir [puppet] - https://gerrit.wikimedia.org/r/626056 (https://phabricator.wikimedia.org/T260305)
[06:36:27] db1113 has a ferm issue, but doesn't seem recent
[06:37:06] I think it was just an old downtime expiring
[06:37:12] (CR) Elukey: [C: +2] role::elasticsearch::relforge: add missing hiera config for mjolnir [puppet] - https://gerrit.wikimedia.org/r/626056 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[06:38:12] RECOVERY - Check systemd state on db1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:57] (PS7) Muehlenhoff: Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877)
[06:52:02] (PS1) Elukey: profile::mjolnir::kafka_msearch_daemon_instance: fix enable value [puppet] - https://gerrit.wikimedia.org/r/626058 (https://phabricator.wikimedia.org/T260305)
[06:54:22] (CR) ZPapierski: [C: +1] profile::mjolnir::kafka_msearch_daemon_instance: fix enable value [puppet] - https://gerrit.wikimedia.org/r/626058 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[06:54:37] Operations, Analytics-Clusters: Segfault for systemd-sysusers.service on stat1007 - https://phabricator.wikimedia.org/T256098 (MoritzMuehlenhoff) This was now fixed in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=20338#c5 And there's now also a bug in Debian to backport it to Buster: https://b...
[06:56:17] (CR) Elukey: [C: +2] profile::mjolnir::kafka_msearch_daemon_instance: fix enable value [puppet] - https://gerrit.wikimedia.org/r/626058 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[06:56:21] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/25001/" [puppet] - https://gerrit.wikimedia.org/r/626058 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[06:58:50] (CR) Muehlenhoff: [C: +2] Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[07:00:12] Operations, netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (ayounsi) p:Triage→Medium
[07:01:00] (PS1) Elukey: role::elasticsearch::cirrus: add missing mjolnir hiera value [puppet] - https://gerrit.wikimedia.org/r/626060 (https://phabricator.wikimedia.org/T260305)
[07:05:02] (CR) ZPapierski: [C: +1] role::elasticsearch::cirrus: add missing mjolnir hiera value [puppet] - https://gerrit.wikimedia.org/r/626060 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[07:07:00] (CR) Elukey: [C: +2] role::elasticsearch::cirrus: add missing mjolnir hiera value [puppet] - https://gerrit.wikimedia.org/r/626060 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[07:12:56] PROBLEM - SSH on wdqs1005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:18:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:21:47] <_joe_> effie: is that you?
^^
[07:25:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=mjolnir site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:25:38] !log restart varnishkafka-webrequest on cp5010 and cp5012, delivery reports errors happening since yesterday's network outage
[07:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={mjolnir,swagger_check_citoid_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:34:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:34:46] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.005604 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:36:32] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[07:36:54] marostegui: ^^might those alerts be T262240?^^
[07:37:22] (PS3) JMeybohm: admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - https://gerrit.wikimedia.org/r/625869
[07:37:54] Urbanecm: Yes, most likely, I was following the spike
[07:37:59] Urbanecm: and it is happening at the moment
[07:38:10] I'm ready to turn off DPL now then
[07:38:19] Urbanecm: +1 from my side
[07:38:32] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:40:02] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (70) node(s) change every puppet run: elastic2060.codfw.wmnet, elastic2050.codfw.wmnet, elastic2044.codfw.wmnet, elastic2033.codfw.wmnet, elastic2051.codfw.wmnet, elastic1037.eqiad.wmnet, elastic1063.eqiad.wmnet, elastic2036.codfw.wmnet, elastic1038.eqiad.wmnet, elastic2042.codfw.wmnet, elastic1064.eqiad.wmn
[07:40:02] odfw.wmnet, elastic2030.codfw.wmnet, elastic2049.codfw.wmnet, elastic1059.eqiad.wmnet, elastic2052.codfw.wmnet, elastic2048.codfw.wmnet, elastic2057.codfw.wmnet, elastic1067.eqiad.wmnet, elastic2046.codfw.wmnet, elastic2041.codfw.wmnet, elastic1043.eqiad.wmnet, elastic1036.eqiad.wmnet, elastic2039.codfw.wmnet, elastic1039.eqiad.wmnet, elastic1062.eqiad.wmnet, elastic2040.codfw.wmnet, webperf1002.eqiad.wmnet, elastic2026.codfw.wmn
[07:40:02] qiad.wmnet, elastic1033.eqiad.wmnet, elastic2054.codfw.wmnet, elastic1045.eqiad.wmnet, elastic2025.codfw.wmnet, elastic2031.codfw.wmnet, elastic1060.eqiad.wmnet, elastic2056.codfw.wmnet, elastic1065.eqiad.wmnet, elastic2032 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:40:18] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[07:40:31] syncing that
[07:41:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable DynamicPageList on ruwikinews (T262240) (duration: 01m 22s)
[07:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:56] wait the elastic puppet run problem is weird
[07:42:26] marostegui: done
[07:42:35] (PS1) Urbanecm: Disable DynamicPageList on ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/626062
[07:43:24] (CR) Urbanecm: [C: +2] "already emergency-deployed" [mediawiki-config] - https://gerrit.wikimedia.org/r/626062 (owner: Urbanecm)
[07:44:06] (Merged) jenkins-bot: Disable DynamicPageList on ruwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/626062 (owner: Urbanecm)
[07:44:29] Urbanecm: thanks so much
[07:44:39] no problem
[07:45:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:57:36] (PS2) Giuseppe Lavagetto: mobileapps: make template for the restbase uri configurable [deployment-charts] - https://gerrit.wikimedia.org/r/625936 (https://phabricator.wikimedia.org/T255876)
[07:57:38] (PS2) Giuseppe Lavagetto: mobileapps: use the service proxy for all calls in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625937 (https://phabricator.wikimedia.org/T255876)
[08:01:22] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.03 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:02:49] (CR) Giuseppe Lavagetto: [C: +2] mobileapps: make template for the restbase uri configurable [deployment-charts] - https://gerrit.wikimedia.org/r/625936 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[08:04:31] (Merged) jenkins-bot: mobileapps: make template for the restbase uri configurable [deployment-charts] - https://gerrit.wikimedia.org/r/625936 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[08:04:58] (CR) Giuseppe Lavagetto: [C: +2] mobileapps: use the service proxy for all calls in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625937 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[08:06:28] (Merged) jenkins-bot: mobileapps: use the service proxy for all calls in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625937 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[08:06:50] Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move mobileapps to use TLS only -
https://phabricator.wikimedia.org/T255876 (Joe)
[08:08:13] (CR) Filippo Giunchedi: [C: +1] nagios-nrpe-server systemd unit: use /run for PID files [puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: Southparkfan)
[08:13:42] RECOVERY - SSH on wdqs1005.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:14:36] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[08:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:13] (PS1) Giuseppe Lavagetto: mobileapps: use the service proxy everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/626102 (https://phabricator.wikimedia.org/T255876)
[08:29:52] (CR) Giuseppe Lavagetto: [C: +2] mobileapps: use the service proxy everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/626102 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[08:29:54] (PS1) Filippo Giunchedi: aptrepo: add new signing key for cassandra [puppet] - https://gerrit.wikimedia.org/r/626103
[08:30:32] (CR) jerkins-bot: [V: -1] aptrepo: add new signing key for cassandra [puppet] - https://gerrit.wikimedia.org/r/626103 (owner: Filippo Giunchedi)
[08:30:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:30:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:39] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12535 and previous config saved to /var/cache/conftool/dbconfig/20200909-083038-kormat.json
[08:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:47] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:57] (CR) Muehlenhoff: [C: +1] "Key is listed in https://downloads.apache.org/cassandra/KEYS, so LGTM" [puppet] - https://gerrit.wikimedia.org/r/626103 (owner: Filippo Giunchedi)
[08:31:09] (Merged) jenkins-bot: mobileapps: use the service proxy everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/626102 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[08:32:28] (PS2) Filippo Giunchedi: aptrepo: add new signing key for cassandra [puppet] - https://gerrit.wikimedia.org/r/626103
[08:32:46] (CR) Hashar: [C: -1] "From a discussion with Moritz, we can have the newer git uploaded to the 'main' component which ensure consistency everywhere and saves us" [puppet] - https://gerrit.wikimedia.org/r/625847 (https://phabricator.wikimedia.org/T262244) (owner: Hashar)
[08:33:26] (CR) Filippo Giunchedi: [C: +2] aptrepo: add new signing key for cassandra [puppet] - https://gerrit.wikimedia.org/r/626103 (owner: Filippo Giunchedi)
[08:34:42] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[08:34:42] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[08:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:17] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot.
T261389', diff saved to https://phabricator.wikimedia.org/P12536 and previous config saved to /var/cache/conftool/dbconfig/20200909-083616-kormat.json
[08:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:19] (PS1) ZPapierski: Bump msearch daemon parallelism [puppet] - https://gerrit.wikimedia.org/r/626105 (https://phabricator.wikimedia.org/T260305)
[08:40:28] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[08:40:28] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[08:40:29] (PS1) Filippo Giunchedi: aptrepo: use elastic 7.9.1 [puppet] - https://gerrit.wikimedia.org/r/626106
[08:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:31] (CR) Filippo Giunchedi: [C: +2] aptrepo: use elastic 7.9.1 [puppet] - https://gerrit.wikimedia.org/r/626106 (owner: Filippo Giunchedi)
[08:42:40] (PS2) Filippo Giunchedi: aptrepo: use elastic 7.9.1 [puppet] - https://gerrit.wikimedia.org/r/626106
[08:44:30] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[08:44:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:44:34] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12537 and previous config saved to /var/cache/conftool/dbconfig/20200909-084433-kormat.json
[08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:00] Puppet, Beta-Cluster-Infrastructure, Developer Productivity: puppetdb on deployment-puppetdb03 keeps getting OOMKilled -
https://phabricator.wikimedia.org/T248041 (hashar) The instance only has 2GB RAM. Maybe the instance flavor can just be changed to get more RAM and then restarted, else we would ne...
[08:45:44] hashar: you can configure puppetdb to use less ram btw
[08:46:05] hashar: e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/625881
[08:47:10] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:48:23] hieradata/cloud.yaml:profile::puppetdb::database::shared_buffers: '7680MB'
[08:48:23] hieradata/cloud/eqiad1/devtools/common.yaml:profile::puppetdb::database::shared_buffers: 768MB
[08:48:25] kormat: nice ;)
[08:51:48] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12538 and previous config saved to /var/cache/conftool/dbconfig/20200909-085147-kormat.json
[08:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:17] (PS1) Hashar: deployment-prep: reduce puppetdb memory usage [puppet] - https://gerrit.wikimedia.org/r/626109 (https://phabricator.wikimedia.org/T248041)
[08:52:24] kormat: ^ ;]
[08:52:44] PROBLEM - SSH on wtp1047.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:53:06] !log upgrade kibana to 7.9.1 on the logstash7 cluster
[08:53:08] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[08:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:21] (CR) Kormat: [C: +1] deployment-prep: reduce puppetdb memory usage [puppet] -
https://gerrit.wikimedia.org/r/626109 (https://phabricator.wikimedia.org/T248041) (owner: Hashar)
[08:53:24] <_joe_> !log restarting restbase on rb2009 (depooled)
[08:53:26] kormat: thank you so much for the great hint ;]
[08:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:53] hashar: no problem :) it's something i ran into myself, and just hacked around locally. but godog came up with the Proper fix
[08:54:17] (CR) Hashar: [C: -1] "I have cherry picked it on the deployment puppetmaster and ran puppet on deployment-puppetdb03 but that does not change anything :-\" [puppet] - https://gerrit.wikimedia.org/r/626109 (https://phabricator.wikimedia.org/T248041) (owner: Hashar)
[08:56:51] (CR) Kormat: [C: +1] "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/626109 (https://phabricator.wikimedia.org/T248041) (owner: Hashar)
[08:58:06] (Abandoned) Hashar: deployment-prep: reduce puppetdb memory usage [puppet] - https://gerrit.wikimedia.org/r/626109 (https://phabricator.wikimedia.org/T248041) (owner: Hashar)
[08:58:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:58:20] kormat: turns out it is already set at 600MB via Horizon ;D
[08:58:39] hashar: oh, and you're still having issues?
[08:58:40] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash1023.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:59:17] Puppet, Beta-Cluster-Infrastructure, Developer Productivity, Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (hashar) The postgresql tunings are: ` name=/etc/postgresql/11/main/tuning.conf maintenance_work_mem = 1GB checkpoin...
[08:59:47] kormat: yeah somehow but I am willing to let it go :]
[09:00:00] hashar: oh. i should have looked closer. if it's the puppetdb process itself, then look at `profile::puppetdb::jvm_opts: '-Xmx256m'`
[09:01:22] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash1023.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[09:02:00] kormat: ah yeah it has -Xmx4G
[09:02:30] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash1023.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[09:03:07] hah that's me ^ fixing
[09:03:52] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:04:03] kormat: did that, thanks for the pointer :]
[09:04:07] hashar: fwiw i have a puppetdb running for an env with ~8 hosts.
it's a 'small' VPS with 2G of ram, and with those two hiera settings it hasn't OOM'd
[09:04:21] hashar: sorry for assuming it was postgres, but glad i could help :)
[09:04:22] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:04:27] (PS2) Giuseppe Lavagetto: cxserver: enable the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625935 (https://phabricator.wikimedia.org/T255879)
[09:04:29] (PS1) Giuseppe Lavagetto: cxserver: enable the service proxy everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879)
[09:04:35] Operations, Discovery, Discovery-Search (Current work): Increase vcores and ram on search-loader VMs - https://phabricator.wikimedia.org/T262385 (elukey)
[09:05:01] Puppet, Beta-Cluster-Infrastructure, Developer Productivity, Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (hashar) The java process is now running with `-Xmx256m`, was `-Xmx4G`, that should help
[09:07:16] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:07:40] (CR) ZPapierski: [C: -1] "This should be merged after https://phabricator.wikimedia.org/T262385 is done."
[puppet] - https://gerrit.wikimedia.org/r/626105 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[09:08:22] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:08:29] (PS1) JMeybohm: pybal: Move from conf1006 to conf1005 as config_host in esams [puppet] - https://gerrit.wikimedia.org/r/626111 (https://phabricator.wikimedia.org/T196487)
[09:09:07] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[09:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:54] (PS1) Elukey: aptrepo: add rock-dkms in the list of packages for the rocm33 component [puppet] - https://gerrit.wikimedia.org/r/626112 (https://phabricator.wikimedia.org/T260442)
[09:10:36] (PS1) JMeybohm: Temporarily remove conf1006 from client SRV records [dns] - https://gerrit.wikimedia.org/r/626113 (https://phabricator.wikimedia.org/T196487)
[09:11:19] !log installing qemu security updates on Buster
[09:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:23] (CR) Klausman: [C: +1] aptrepo: add rock-dkms in the list of packages for the rocm33 component [puppet] - https://gerrit.wikimedia.org/r/626112 (https://phabricator.wikimedia.org/T260442) (owner: Elukey)
[09:15:44] (PS6) Jbond: profile::java: add param to toggle puppet ca trust [puppet] - https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957)
[09:15:58] (PS7) Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957)
[09:16:06] (PS4) Jbond: role:idp_test: add remove puppet CA from the java truststore [puppet] - https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957)
[09:16:23] (PS5) Jbond: role:idp_test: add remove puppet CA from the java truststore [puppet] -
https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957)
[09:16:52] (CR) Jbond: [C: +2] java: add define to update the java trust store (1 comment) [puppet] - https://gerrit.wikimedia.org/r/625623 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[09:16:55] (CR) jerkins-bot: [V: -1] role:idp_test: add the puppet CA to the java truststore [puppet] - https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[09:16:57] (CR) jerkins-bot: [V: -1] profile::java: add param to toggle puppet ca trust [puppet] - https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[09:16:59] (CR) Giuseppe Lavagetto: [C: +1] "Let's wait for the day before the shutdown" [puppet] - https://gerrit.wikimedia.org/r/626111 (https://phabricator.wikimedia.org/T196487) (owner: JMeybohm)
[09:17:01] (CR) jerkins-bot: [V: -1] role:idp_test: add remove puppet CA from the java truststore [puppet] - https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[09:17:24] (PS7) Jbond: profile::java: add param to toggle puppet ca trust [puppet] - https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957)
[09:17:39] (PS4) Jbond: profile::java: add the puppet CA cert to the java truststore by default [puppet] - https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957)
[09:19:19] (PS8) Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957)
[09:19:27] (PS9) Jbond: role:idp_test: add the puppet CA to the java truststore [puppet] - https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957)
[09:19:36] (CR) Jbond: [C: +2] profile::java: add param to toggle puppet ca trust [puppet] -
10https://gerrit.wikimedia.org/r/625624 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:19:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Let's wait until the day before the downtime" [dns] - 10https://gerrit.wikimedia.org/r/626113 (https://phabricator.wikimedia.org/T196487) (owner: 10JMeybohm) [09:20:26] (03CR) 10Jbond: [C: 03+2] role:idp_test: add the puppet CA to the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625625 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:21:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudceph: Add cpufreq tools to set cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/625947 (owner: 10Bstorm) [09:21:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This should probably be enabled for cloudvirts too :-P" [puppet] - 10https://gerrit.wikimedia.org/r/625947 (owner: 10Bstorm) [09:23:03] (03PS6) 10Jbond: role:idp_test: add remove puppet CA from the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957) [09:24:18] (03PS1) 10Jcrespo: trasnsferpy: Add ability to override transferpy defaults for wmf [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) [09:24:20] (03CR) 10Jbond: [C: 03+2] role:idp_test: add remove puppet CA from the java truststore [puppet] - 10https://gerrit.wikimedia.org/r/625630 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:25:20] (03CR) 10Jcrespo: "We can setup the defaults you want, up to you, this is only an example." 
[puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:25:48] (03PS5) 10Jbond: profile::java: add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) [09:26:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:26:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:26:22] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12539 and previous config saved to /var/cache/conftool/dbconfig/20200909-092621-kormat.json [09:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:42] (03CR) 10Jbond: "Testing on O:idp_test worked fine, This Change is now ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:27:22] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [09:28:54] (03PS1) 10Jbond: idp_test: remove puppet ca from trustore [puppet] - 10https://gerrit.wikimedia.org/r/626117 [09:28:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I'm assuming to be merged once we have fully phased out the metrics from alerts/dashboards/etc" [puppet] - 10https://gerrit.wikimedia.org/r/625975 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [09:29:15] (03PS2) 10Jcrespo: trasnsferpy: Add ability to override transferpy defaults for wmf [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) [09:29:29] (03CR) 10Jbond: [C: 03+2] idp_test: remove puppet ca from trustore [puppet] - 10https://gerrit.wikimedia.org/r/626117 (owner: 10Jbond) [09:31:22] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:31:47] (03PS3) 10Jcrespo: transferpy: Add ability to override transferpy defaults for wmf [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) [09:32:09] (03CR) 10Kormat: transferpy: Add ability to override transferpy defaults for wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:33:02] (03CR) 10Jcrespo: transferpy: Add ability to override transferpy defaults for wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:33:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Maybe I'd consider if the percentage is redundant, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009) (owner: 10Effie Mouzeli) [09:33:33] (03PS4) 10Jcrespo: transferpy: Add ability to override transferpy defaults for wmf [puppet] - 
10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) [09:33:54] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12540 and previous config saved to /var/cache/conftool/dbconfig/20200909-093353-kormat.json [09:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:07] (03CR) 10Jcrespo: transferpy: Add ability to override transferpy defaults for wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:34:41] (03Abandoned) 10Hashar: base: upgrade git on stretch to 2.20 [puppet] - 10https://gerrit.wikimedia.org/r/625847 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:35:11] (03CR) 10Jcrespo: "Any suggestions on the best defaults (of course, this is just defaults, they can be always overriden on command line)?" [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:35:14] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [09:36:03] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:38:21] (03PS2) 10Hashar: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) [09:38:49] (03PS1) 10Hnowlan: restbase: return restbase2009 to the host list [puppet] - 10https://gerrit.wikimedia.org/r/626119 [09:39:28] (03CR) 10Jcrespo: "> Patch Set 4: Code-Review+1" [puppet] - 
10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [09:39:30] (03CR) 10jerkins-bot: [V: 04-1] git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:39:58] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (10hashar) @MoritzMuehlenhoff I have upgraded on deploymen... [09:40:10] (03PS3) 10Hashar: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) [09:43:26] (03PS2) 10Hashar: base: enable git protocol version2 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) [09:51:01] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [debs/prometheus-icinga-exporter] (debian/sid) - 10https://gerrit.wikimedia.org/r/626001 (owner: 10Cwhite) [09:52:17] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:52:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:20] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12541 and previous config saved to /var/cache/conftool/dbconfig/20200909-095219-kormat.json [09:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:34] (03PS7) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [09:55:19] (03CR) 10Hnowlan: [C: 03+2] api-proxy: Set 
password for ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277) (owner: 10Hnowlan) [09:56:30] (03Merged) 10jenkins-bot: api-proxy: Set password for ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277) (owner: 10Hnowlan) [09:57:39] (03CR) 10Jbond: [C: 03+2] sslcert::x509_to_pkcs12: add define for creating p12 files [puppet] - 10https://gerrit.wikimedia.org/r/623361 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:57:44] (03CR) 10Jbond: [C: 03+2] base::puppet: add ability to create p12 puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:58:00] (03PS3) 10Jbond: sslcert::x509_to_pkcs12: add define for creating p12 files [puppet] - 10https://gerrit.wikimedia.org/r/623361 (https://phabricator.wikimedia.org/T253957) [09:58:09] (03PS5) 10Jbond: base::puppet: add ability to create p12 puppet cert [puppet] - 10https://gerrit.wikimedia.org/r/623362 (https://phabricator.wikimedia.org/T253957) [10:00:10] (03PS4) 10Jbond: puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) [10:00:49] (03CR) 10jerkins-bot: [V: 04-1] puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [10:01:18] (03PS5) 10Jbond: puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) [10:01:25] (03PS6) 10Jbond: puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) [10:01:58] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P12542 and previous config saved to /var/cache/conftool/dbconfig/20200909-100157-kormat.json [10:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:43] (03CR) 10Jbond: [C: 03+2] puppet ssl p12: enable generation of puppet p12 cert on test cluster [puppet] - 10https://gerrit.wikimedia.org/r/623363 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [10:05:09] (03PS1) 10Filippo Giunchedi: alertmanager: display one row per severity in the Karma UI [puppet] - 10https://gerrit.wikimedia.org/r/626122 [10:09:49] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:54] (03CR) 10Elukey: "> Patch Set 5:" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [10:10:06] (03PS1) 10Jbond: base: puppet use hostcert not publickey [puppet] - 10https://gerrit.wikimedia.org/r/626123 [10:11:14] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: display one row per severity in the Karma UI [puppet] - 10https://gerrit.wikimedia.org/r/626122 (owner: 10Filippo Giunchedi) [10:11:22] (03CR) 10Jbond: [C: 03+2] base: puppet use hostcert not publickey [puppet] - 10https://gerrit.wikimedia.org/r/626123 (owner: 10Jbond) [10:11:31] !log Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442) [10:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:37] T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features - https://phabricator.wikimedia.org/T260442 [10:11:49] jbond42: merging your change too [10:11:56] yes please, thx [10:20:16] PROBLEM - Check systemd state on db1081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:13] marostegui: ^ did a downtime just expire? [10:21:53] huh. the failing service is ferm [10:22:12] kormat: could be that it expired too yeah [10:22:14] let me check [10:22:15] what's the error? [10:22:30] `DNS query for 'prometheus1003.eqiad.wmnet' failed: query timed out` [10:23:14] kormat: yeah, db1081 isn't under maintenance at the moment, so it is not downtimed anymore [10:23:17] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:32] marostegui: alright. i'll reload ferm, and see [10:23:55] that worked v0v [10:23:59] <_joe_> kormat: that's a leftover from yesterday's outage [10:24:04] RECOVERY - Check systemd state on db1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:11] kormat: thank you! [10:24:28] _joe_: huh, ok [10:24:43] I did one of those earlier [10:24:46] ah. failed since 20h. it takes _that_ long for icinga to care? 
[10:24:49] I think service was reloaded [10:24:55] for all hosts not downtimed [10:24:59] kormat: probably was downtimed [10:25:02] hence we didn't notice [10:25:04] and we miss a few that were under maintenance [10:25:10] volans: ah ok [10:25:16] I did one early in the morning that had a similar pattern [10:25:21] jynus: gotcha [10:25:34] (03PS1) 10Jbond: java::cacert: -cacerts is not supported in java 8 [puppet] - 10https://gerrit.wikimedia.org/r/626125 (https://phabricator.wikimedia.org/T253957) [10:25:35] * elukey blames kormat without reading the context [10:25:59] elukey: blaming me for not reading the context is also a high-percentage game [10:26:46] (03CR) 10Elukey: [C: 03+1] "Thanks for the patience :D" [puppet] - 10https://gerrit.wikimedia.org/r/626125 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [10:27:09] * elukey sends wikilove to kormat [10:27:21] I just run a cumin on all eqiad to list failed units [10:27:23] (03CR) 10Jbond: [C: 03+2] java::cacert: -cacerts is not supported in java 8 [puppet] - 10https://gerrit.wikimedia.org/r/626125 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [10:27:27] ferm seems to be failed on 8 of them [10:27:28] (8) an-test-master1001.eqiad.wmnet,an-worker1116.eqiad.wmnet,db[1075,1101,1116].eqiad.wmnet,labstore1007.wikimedia.org,logstash[1025,1030].eqiad.wmnet [10:27:40] I'll do a restart [10:28:00] volans: cheers [10:28:35] !log restarting ferm on failed hosts: an-test-master1001.eqiad.wmnet,an-worker1116.eqiad.wmnet,db[1075,1101,1116].eqiad.wmnet,labstore1007.wikimedia.org,logstash[1025,1030].eqiad.wmnet leftover from yesterday network issue [10:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:36] (03PS1) 10DCausse: [wdqs] use an Integer instead of String for jmx_exporter port [puppet] - 10https://gerrit.wikimedia.org/r/626129 [10:46:33] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% [10:47:18] !log hnowlan@deploy1001 helmfile 
[staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:47:21] wait, did that downtime already expire?! [10:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:00] klausman: I think so yes it was 30 mins [10:48:17] Added another hour [10:48:18] but not a big deal, SAL is up to date so that alarm is fine [10:49:09] (03CR) 10Effie Mouzeli: [C: 03+2] php::admin: export additional opcache metrics [puppet] - 10https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009) (owner: 10Effie Mouzeli) [10:54:54] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [10:57:27] (03PS1) 10Marostegui: db1122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/626131 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1100). [11:00:04] hnowlan: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:45] (03CR) 10Marostegui: [C: 03+2] db1122: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/626131 (owner: 10Marostegui) [11:01:21] 10Operations: Improve process to add/update keys for pwstore repo - https://phabricator.wikimedia.org/T262393 (10MoritzMuehlenhoff) [11:02:27] hnowlan: I'd be happy to deploy this. [11:02:27] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Marostegui) Is there anything pending here that might require power changes? 
[11:02:29] (03CR) 10Effie Mouzeli: [C: 03+2] push-notifications: add proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:02:45] awight: great! [11:03:04] (03CR) 10MarcoAurelio: [C: 04-1] "Little issue." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [11:03:12] hnowlan: I'm just checking to see how far along the new wiki is, if there is a db etc. [11:03:33] awight: patch has a tyop (sic) I've just -1ed [11:03:37] (03Merged) 10jenkins-bot: push-notifications: add proxy settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:03:42] hauskatze: okay, thanks! [11:03:53] !log Stop MySQL on s2 eqiad master to prepare for the PDU maintenance (this will generate lag on s2 on labsdb) T261453 [11:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:00] T261453: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 [11:04:09] (03PS8) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [11:04:21] (03PS1) 10Giuseppe Lavagetto: wikifeeds: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) [11:04:50] (03CR) 10Hnowlan: api-portal: required extended configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [11:05:23] lgtm now [11:05:54] the new dosceditor user group would probably require a WikimediaMessages entry for i18n compat but that can be done later [11:06:10] (03CR) 10Awight: [C: 03+2] "Config window deployment" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [11:06:23] visiting api.wikimedia before it gets private [11:06:57] (03Merged) 10jenkins-bot: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [11:07:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:07:39] hnowlan: Should be live on mwdebug1001 [11:07:47] awight: cool, looking [11:07:54] awight: 1001? [11:08:04] ain't we using 2001 these days? [11:08:08] server switch etc [11:08:15] hnowlan: hehe I'm behind the times. [11:08:23] I'll pull to 2001 now, thanks for the heads-up [11:09:26] hnowlan: on 2001 [11:09:39] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) @MSantos please let us know when you are ready to go on production, so we can perform the fina... [11:10:07] Hmm the read protection isn't behaving like I would expect. [11:10:17] ^ [11:10:29] but I'm in global groups that grants me 'read' everywhere so... [11:10:32] iirc [11:11:28] the skin hasn't changed either [11:11:30] Yeah, it's not behaving like I'd expect. There should also be a skin change [11:11:48] I'm not logged in, so should have gotten locked out. [11:12:20] I can confirm that mwdebug2001 has the updated config files. [11:12:46] No docseditor group at https://api.wikimedia.org/wiki/Special:ListGroupRights either [11:13:14] Does this wiki... know that it's apiportalwiki? 
[11:13:15] I'm assuming some dblist issue here if I had to guess [11:13:48] Urbanecm: 112 [11:14:32] hauskatze: Are you calling the emergency? [11:14:41] If so, how may I help? [11:14:41] !log jiji@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [11:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:49] Urbanecm: yup [11:14:55] Urbanecm: so this is weird [11:15:06] config change not getting applied? [11:15:19] some composer build dblist issue maybe? [11:15:20] That's weird - scap pull should clear caches [11:15:29] !log added Tobias Klausmann to pwstore [11:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:36] (or scap sync-file, idk if you're debugging on mwdebug or prod) [11:16:10] Urbanecm: This was a `scap pull` to mwdebug2001 [11:16:26] Okay. And what does (not) happen? What is the change? [11:16:51] Urbanecm: This https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/624750/8 should have made the wiki private, among other changes. [11:17:19] I'm trying to understand how groupOverrides works, reading the config code... [11:18:10] dumping $wgGroupPermissions from eval.php gives the expected result... [11:18:22] That's... weird [11:18:30] Urbanecm: just to confirm, that api wiki was supposed to be replicated to labs, right? https://phabricator.wikimedia.org/T246946 I don't see anything there that says it doesn't need to be [11:18:33] And special:usergrouprighte? [11:18:47] marostegui: I think so [11:19:39] Urbanecm: ok, thanks! [11:19:48] *special:usergrouprights awight [11:19:49] I ran mw.config.get('wgDBname') in the browser to be absolutely sure, and it looks right. [11:19:51] marostegui: the labs config was removed? [11:20:12] awight: and yet the wiki is public? 
[11:20:18] yup [11:20:24] hauskatze: no, just checking it is intended to be on labs [11:20:26] and no new 'docseditor' right either [11:20:30] Urbanecm: yes the Special:ListGroupRights is incorrect, it doesn't reflect any of the changes. [11:20:56] Hmm [11:21:40] I've never seen this before, but maybe the browser extension is failing to switch me to the debug server. [11:21:52] What if you re-run scap pull? [11:21:52] I can try the full deployment. [11:22:18] OOH [11:22:21] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: run nginx-ingress on ingress dedicated nodes [puppet] - 10https://gerrit.wikimedia.org/r/626133 (https://phabricator.wikimedia.org/T250172) [11:22:25] Sorry everyone. I'm deploying from eqiad. [11:22:50] wait, that still doesn't explain it because I see updated config files on the debug server. [11:22:51] lol [11:23:13] nevertheless I'll try deployment.codfw.wmnet [11:23:26] awight Urbanecm - I think we still use eqiad for deplos? [11:23:38] Seems to be the same machine. [11:24:06] I did a deplo yesterday and I think it was deploy1002 with 1 being eqiad, just wondering [11:24:39] This was the line that raised my hackles: [11:24:40] 11:24:28 Copying from deployment.codfw.wmnet to mwdebug2001.codfw.wmnet [11:24:42] Yes, mwdebug should be codfw, eqiad host is used for deployment [11:25:03] awight: that's normal, deployment.codfw.wmnet points to eqiad [11:25:06] +1 [11:25:55] (03PS1) 10Jbond: base::puppet: move the export_p12 parameter to base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/626134 [11:26:21] oh no, I think I've just realised what the issue is! The endpoint that is serving api.wikimedia.org isn't respecting the mwdebug headers, it just serves requests to the regular appservers [11:26:44] if I curl the staging instance of the api server I can see the correct skin in use [11:26:55] hnowlan: Thanks for finding that :-) So, I'll just go to full deployment then? 
[11:27:16] let me just test one or two things [11:27:20] ack [11:28:13] okay, so nothing is broken then :) [11:28:14] yep, user pages are disabled [11:28:53] Urbanecm: Nice to get your input, and sorry to drag you into a rare deployment "off" :-) [11:29:02] sorry for the distraction/confusion :) [11:29:08] happy to help :) [11:29:15] hnowlan: is there a way to make it respect mwdebug anyway? [11:29:23] (for further deployments) [11:30:13] hnowlan: Any last checks to make or are you okay with me deploying? [11:30:39] awight: I think it looks all good from my end [11:30:50] Urbanecm: definitely, I'll implement that asap [11:30:53] thanks! [11:33:20] Well it makes sense. Testing not working because it was not respecting the headers [11:33:29] We could've been here all day re-scapping around [11:34:12] (03CR) 10Jbond: [C: 03+2] base::puppet: move the export_p12 parameter to base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/626134 (owner: 10Jbond) [11:34:39] !log awight@deploy1001 Synchronized wmf-config: Config: [[gerrit:624750|api-portal: required extended configuration (T261425)]] (duration: 01m 08s) [11:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:46] T261425: Configure API Portal wiki - https://phabricator.wikimedia.org/T261425 [11:35:26] Yup, it's now restricted [11:35:36] Looks good [11:37:25] !log EU Bacon complete [11:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:22] awight: thanks a lot! 
[11:42:45] (03CR) 10JMeybohm: [C: 04-1] citoid: add TLS LVS endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [11:44:23] (03CR) 10JMeybohm: [C: 04-1] citoid: promote https lvs to production status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [11:44:54] (03PS1) 10Jbond: base::expose_puppet_certs: add ability to expose p12 cert [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) [11:46:02] (03CR) 10jerkins-bot: [V: 04-1] base::expose_puppet_certs: add ability to expose p12 cert [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [11:47:55] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:47:56] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:08] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[11:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:22] (03PS1) 10Mvolz: Update citoid to 2020-09-08-122926-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) [11:54:53] 10Operations, 10Citoid, 10Prod-Kubernetes, 10serviceops, and 2 others: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10Mvolz) [11:54:54] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:54:54] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:19] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:56] (03CR) 10JMeybohm: [C: 03+1] cxserver: enable the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/625935 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [12:09:12] (03CR) 10JMeybohm: [C: 03+1] cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [12:09:41] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [12:09:45] 10Operations, 10Citoid, 10Prod-Kubernetes, 10serviceops, and 2 others: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10Mvolz) I'm planning to deploy tomorrow, so I was wondering if I can have clarification on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622585... 
[12:11:38] !log installing zeromq security updates on Buster [12:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [12:23:45] (03PS1) 10Filippo Giunchedi: alertmanager: enable/disable irc service as needed [puppet] - 10https://gerrit.wikimedia.org/r/626140 (https://phabricator.wikimedia.org/T258948) [12:31:07] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:31:07] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:31:10] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12544 and previous config saved to /var/cache/conftool/dbconfig/20200909-123109-kormat.json [12:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:54] (03CR) 10Giuseppe Lavagetto: citoid: add TLS LVS endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [12:32:58] (03CR) 10Giuseppe Lavagetto: citoid: promote https lvs to production status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [12:34:41] PROBLEM - Apache HTTP on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:45] PROBLEM - Apache HTTP on wtp2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:45] PROBLEM - PHP7 rendering on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:45] PROBLEM - PHP7 rendering on wtp2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:47] PROBLEM - Apache HTTP on wtp2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:49] PROBLEM - Apache HTTP on wtp2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:51] PROBLEM - PHP7 rendering on wtp2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:51] PROBLEM - PHP7 rendering on wtp2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:51] PROBLEM - Apache HTTP on wtp2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:53] (03PS1) 10Hnowlan: api-gateway: use x-client-ip instead of x-forwarded-for IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/626146 (https://phabricator.wikimedia.org/T246276) [12:34:59] PROBLEM - PHP7 rendering on wtp2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:01] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:03] PROBLEM - PHP7 rendering on wtp2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:05] PROBLEM - Restbase LVS eqiad on 
restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:35:05] PROBLEM - Apache HTTP on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:07] PROBLEM - PHP7 rendering on wtp2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:09] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:09] PROBLEM - PHP7 rendering on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:11] PROBLEM - Apache HTTP on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:11] PROBLEM - PHP7 rendering on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:15] PROBLEM - Apache HTTP on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cxserver: enable the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/625935 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [12:35:17] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp2006.codfw.wmnet, wtp2007.codfw.wmnet, wtp2012.codfw.wmnet, wtp2019.codfw.wmnet, wtp2009.codfw.wmnet, wtp2016.codfw.wmnet, 
wtp2008.codfw.wmnet, wtp2004.codfw.wmnet, wtp2013.codfw.wmnet, wtp2017.codfw.wmnet, wtp2015.codfw.wmnet, wtp2020.codfw.wmnet, wtp2003.codfw.wmnet, wtp2014.codfw.wmnet are marked down but poole [12:35:17] h.wikimedia.org/wiki/PyBal [12:35:17] PROBLEM - PHP7 rendering on wtp2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:17] PROBLEM - Apache HTTP on wtp2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:17] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:17] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:18] PROBLEM - PHP7 rendering on wtp2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:19] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:19] PROBLEM - PHP7 rendering on wtp2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:25] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:25] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:27] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:35:27] PROBLEM - PHP7 rendering on wtp2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:29] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:29] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:35:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:35] PROBLEM - PHP7 rendering on wtp2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:35] PROBLEM - PHP7 rendering on wtp2006 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:37] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp2004.codfw.wmnet, wtp2018.codfw.wmnet, wtp2012.codfw.wmnet, wtp2003.codfw.wmnet, wtp2013.codfw.wmnet, wtp2008.codfw.wmnet, wtp2002.codfw.wmnet, wtp2010.codfw.wmnet, wtp2006.codfw.wmnet, wtp2015.codfw.wmnet, wtp2001.codfw.wmnet, wtp2020.codfw.wmnet, wtp2007.codfw.wmnet, wtp2019.codfw.wmnet, wtp2009.codfw.wmnet, wtp [12:35:37] wtp2017.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:35:37] PROBLEM - Apache HTTP on wtp2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:39] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:39] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:35:39] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:39] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:39] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:35:43] PROBLEM - Apache HTTP on wtp2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:45] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:45] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:45] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:35:45] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:45] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:45] PROBLEM - restbase endpoints 
health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:47] <_joe_> wat [12:35:47] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:51] PROBLEM - Apache HTTP on wtp2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:51] PROBLEM - PHP7 rendering on wtp2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:53] (03PS1) 10Muehlenhoff: Update netbox Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/626147 [12:35:57] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:57] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:59] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:01] PROBLEM - Apache HTTP on wtp2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:05] PROBLEM - Apache HTTP on wtp2017 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:05] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:15] PROBLEM - PHP7 rendering on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:36:16] <_joe_> is this just parsoid? [12:36:17] PROBLEM - Apache HTTP on wtp2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:17] PROBLEM - Apache HTTP on wtp2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:23] PROBLEM - Apache HTTP on wtp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:23] PROBLEM - PHP7 rendering on wtp2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:36:23] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:23] <_joe_> looks like it [12:36:27] PROBLEM - Apache HTTP on wtp2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:33] PROBLEM - PHP7 rendering on wtp2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:36:34] what on earth [12:36:35] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P12545 and previous config saved to /var/cache/conftool/dbconfig/20200909-123634-kormat.json [12:36:37] PROBLEM - Apache HTTP on wtp2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:37] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:39] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:54] PROBLEM - LVS parsoid-php codfw port 443/tcp - Parsoid/PHP wikitext parser for VisualEditor -eqiad- IPv4 #page on parsoid.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:36:55] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:55] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:55] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:01] 
akosiaris: tell me it has nothing to do with the PDUs again please [12:37:03] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:07] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:07] <_joe_> those servers are severely overloaded [12:37:07] !log beginning scheduled PDU maintenance racks D5 and D6 in eqiad [12:37:08] uh [12:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:11] D6: Interactive deployment shell aka iscap - https://phabricator.wikimedia.org/D6 [12:37:12] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [12:37:14] (Merged) jenkins-bot: cxserver: enable the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625935 (https://phabricator.wikimedia.org/T255879) (owner: Giuseppe Lavagetto) [12:37:14] marostegui: I think it's overload [12:37:34] paged [12:37:43] * jayme here [12:37:45] same [12:37:47] * jbond42 here [12:37:50] yo [12:37:54] <_joe_> so it's mobileapps calling parsoid [12:37:54] akosiaris: gotcha [12:38:01] <_joe_> all zhwiki urls [12:38:14] <_joe_> can someone look at the 5xx on logstash? 
[12:38:47] most in the form of https://zh.wikipedia.org/api/rest_v1/page/summary/ [12:39:00] <godog> two pages actually afaics [12:39:01] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=17&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200&from=1599653714704&to=1599655146405 [12:39:47] <rzl> here, good morning [12:40:00] <bblack> hi :) [12:41:47] <wikibugs> (CR) Volans: "Not sure if we should have 2 separate ones as the first is production and the other kinda testing. I'll leave it to Cas and you to decide, " [puppet] - https://gerrit.wikimedia.org/r/626147 (owner: Muehlenhoff) [12:42:25] <sobanski> Should we pause the PDU maintenance for the time being? [12:42:28] <_joe_> so this is more or less parsoid being overloaded by a specific caller [12:42:41] <_joe_> sobanski: yes please :) [12:43:21] <sobanski> cmjohnson1: Can you hold off on any changes until this incident is resolved? [12:44:55] <cmjohnson1> standing by sobanski [12:46:03] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [12:46:06] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:10] <_joe_> I'll raise the mobileapps / rb timeout ^^ [12:46:18] <_joe_> so maybe we can complete those pages [12:47:09] <icinga-wm> RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:47:15] <_joe_> !log restarting php-fpm on wtp2003 [12:47:19] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:31] <akosiaris> there we go, nice [12:47:40] <_joe_> akosiaris: don't trust it [12:47:55] <akosiaris> _joe_: I am guessing the recovery would be the result of the ban, not the timeout change? 
[12:48:12] <_joe_> akosiaris: possibly, but I don't see a change on the backends in terms of load [12:48:27] <akosiaris> probably retries from restbase ? [12:49:00] <_joe_> dunno [12:49:13] <_joe_> the thing I see is we're still getting the same volume of requests on the backend [12:49:30] <_joe_> can someone check the failing url on the varnishes? [12:52:51] <icinga-wm> PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:55:49] <icinga-wm> PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [12:56:21] <icinga-wm> PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is [12:56:21] <icinga-wm> t structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/meta [12:56:21] <icinga-wm> rieve 
extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:58:23] <wikibugs> (03PS6) 10Elukey: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [13:00:04] <jouncebot> longma and liw: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1300). [13:00:05] <icinga-wm> RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:01:20] <wikibugs> (03PS2) 10Hnowlan: api-gateway: use x-client-ip instead of x-forwarded-for IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/626146 (https://phabricator.wikimedia.org/T246276) [13:01:42] <icinga-wm> PROBLEM - LVS restbase-https codfw port 7443/tcp - RESTBase- restbase.svc.eqiad.wmnet - HTTPS IPv4 #page on restbase.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:02:37] <_joe_> please no mw train now [13:02:42] <_joe_> longma / liw [13:03:10] <liw> _joe_, ack, not doing anything [13:03:21] <icinga-wm> RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:03:23] <liw> _joe_, and longma is probably asleep [13:03:29] <wikibugs> (03CR) 10Muehlenhoff: "Looks good, some comments inline" (033 comments) [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) 
[13:05:18] <wikibugs> (03PS1) 10RLazarus: varnish: temporary ban https://zh\.wikipedia\.org/api/rest_v1/* to stop overload [puppet] - 10https://gerrit.wikimedia.org/r/626149 [13:05:22] <icinga-wm> RECOVERY - LVS restbase-https codfw port 7443/tcp - RESTBase- restbase.svc.eqiad.wmnet - HTTPS IPv4 #page on restbase.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 16683 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:05:53] <icinga-wm> PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response [13:05:53] <icinga-wm> domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:06:03] <wikibugs> (03PS2) 10RLazarus: varnish: temporary ban https://zh.wikipedia.org/api/rest_v1/* to stop overload [puppet] - 10https://gerrit.wikimedia.org/r/626149 [13:06:13] <icinga-wm> PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected sta [13:06:13] <icinga-wm> g: 200): /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} 
(retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve [13:06:13] <icinga-wm> bile-sections returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:09:59] <wikibugs> (03CR) 10BBlack: [C: 03+1] varnish: temporary ban https://zh.wikipedia.org/api/rest_v1/* to stop overload [puppet] - 10https://gerrit.wikimedia.org/r/626149 (owner: 10RLazarus) [13:10:34] <wikibugs> (03CR) 10RLazarus: [C: 03+2] varnish: temporary ban https://zh.wikipedia.org/api/rest_v1/* to stop overload [puppet] - 10https://gerrit.wikimedia.org/r/626149 (owner: 10RLazarus) [13:11:51] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:11:55] <icinga-wm> RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:17:37] <icinga-wm> PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:18:39] <icinga-wm> PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [13:20:29] <icinga-wm> RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [13:20:43] <icinga-wm> PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.ulsfo.wikimedia.org, port=443): Read timed out. 
(read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [13:23:23] <icinga-wm> PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [13:24:18] <wikibugs> (03PS1) 10Klausman: Update prometheus-amd-rocm-stats.py to handle new driver readings [puppet] - 10https://gerrit.wikimedia.org/r/626150 (https://phabricator.wikimedia.org/T262404) [13:24:49] <icinga-wm> RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:25:05] <icinga-wm> RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:27:28] <wikibugs> (03CR) 10Elukey: [C: 03+1] "Really nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/626150 (https://phabricator.wikimedia.org/T262404) (owner: 10Klausman) [13:27:58] <wikibugs> (03CR) 10Kormat: [C: 03+1] Update prometheus-amd-rocm-stats.py to handle new driver readings [puppet] - 10https://gerrit.wikimedia.org/r/626150 (https://phabricator.wikimedia.org/T262404) (owner: 10Klausman) [13:28:07] <icinga-wm> RECOVERY - Apache HTTP on wtp2001 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 6.892 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:28:37] <icinga-wm> RECOVERY - PHP7 rendering on wtp2001 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:28:37] <icinga-wm> RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:28:47] <icinga-wm> RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:28:49] <icinga-wm> RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:28:59] <icinga-wm> RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:29:11] <wikibugs> (03CR) 10Klausman: [C: 03+2] Update prometheus-amd-rocm-stats.py to handle new driver readings [puppet] - 10https://gerrit.wikimedia.org/r/626150 (https://phabricator.wikimedia.org/T262404) (owner: 10Klausman) [13:29:29] <icinga-wm> RECOVERY - Apache HTTP on wtp2020 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:29:35] <icinga-wm> RECOVERY - Apache HTTP on wtp2011 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.393 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [13:29:41] <icinga-wm> RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:29:51] <icinga-wm> RECOVERY - Apache HTTP on wtp2012 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:29:53] <icinga-wm> RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:29:53] <icinga-wm> RECOVERY - Apache HTTP on wtp2003 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 5.217 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:29:55] <icinga-wm> RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:29:55] <icinga-wm> RECOVERY - PHP7 rendering on wtp2018 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 3.989 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:29:57] <icinga-wm> RECOVERY - PHP7 rendering on wtp2010 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 6.806 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:29:57] <icinga-wm> RECOVERY - Apache HTTP on wtp2018 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 2.896 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:29:59] <icinga-wm> RECOVERY - PHP7 rendering on wtp2011 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:29:59] <icinga-wm> RECOVERY - Apache HTTP on wtp2013 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:30:04] 
<icinga-wm> RECOVERY - LVS parsoid-php codfw port 443/tcp - Parsoid/PHP wikitext parser for VisualEditor -eqiad- IPv4 #page on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 17172 bytes in 1.228 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:30:04] <icinga-wm> RECOVERY - PHP7 rendering on wtp2014 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 4.440 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:07] <icinga-wm> RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:07] <icinga-wm> RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:09] <icinga-wm> RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:09] <icinga-wm> RECOVERY - PHP7 rendering on wtp2020 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:11] <icinga-wm> RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:11] <icinga-wm> RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:30:13] <icinga-wm> RECOVERY - PHP7 rendering on wtp2002 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 6.906 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:13] <icinga-wm> RECOVERY - PHP7 rendering on wtp2012 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.096 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:15] <icinga-wm> RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:15] <icinga-wm> RECOVERY - Apache HTTP on wtp2010 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 2.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:30:17] <icinga-wm> RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:21] <icinga-wm> RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:21] <icinga-wm> RECOVERY - PHP7 rendering on wtp2009 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 7.726 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:21] <icinga-wm> RECOVERY - Apache HTTP on wtp2015 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.598 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:30:23] <icinga-wm> RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:23] <icinga-wm> RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:23] <icinga-wm> RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:23] <icinga-wm> RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:23] <icinga-wm> RECOVERY - PHP7 rendering on wtp2003 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.115 
second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:25] <icinga-wm> RECOVERY - Apache HTTP on wtp2002 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 4.061 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:30:27] <icinga-wm> RECOVERY - PHP7 rendering on wtp2019 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:27] <icinga-wm> RECOVERY - PHP7 rendering on wtp2017 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:29] <icinga-wm> RECOVERY - PHP7 rendering on wtp2008 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 9.646 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:39] <icinga-wm> RECOVERY - PHP7 rendering on wtp2013 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:39] <icinga-wm> RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:39] <icinga-wm> RECOVERY - PHP7 rendering on wtp2006 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 1.737 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:30:41] <icinga-wm> RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:41] <icinga-wm> RECOVERY - Apache HTTP on wtp2019 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:30:43] <icinga-wm> RECOVERY - PyBal backends health check on lvs2009 is 
OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:47] <icinga-wm> RECOVERY - Apache HTTP on wtp2016 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 1.227 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:30:49] <icinga-wm> RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:49] <icinga-wm> RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:49] <icinga-wm> RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:49] <icinga-wm> RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:30:53] <icinga-wm> RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:30:53] <icinga-wm> RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:53] <icinga-wm> RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:55] <icinga-wm> RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:30:55] <icinga-wm> RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:55] <icinga-wm> RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:30:59] <icinga-wm> 
RECOVERY - Apache HTTP on wtp2014 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:31:01] <icinga-wm> RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:01] <icinga-wm> RECOVERY - PHP7 rendering on wtp2015 is OK: HTTP OK: HTTP/1.1 302 Found - 646 bytes in 1.877 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:31:01] <icinga-wm> RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:31:03] <icinga-wm> RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:03] <icinga-wm> RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:03] <icinga-wm> RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:03] <icinga-wm> RECOVERY - Apache HTTP on wtp2008 is OK: HTTP OK: HTTP/1.1 302 Found - 632 bytes in 1.763 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:31:11] <icinga-wm> RECOVERY - Apache HTTP on wtp2017 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:31:13] <icinga-wm> RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:31:19] <icinga-wm> RECOVERY - PHP7 rendering on wtp2004 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 0.279 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:31:23] <icinga-wm> RECOVERY - Apache HTTP on wtp2009 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:31:25] <icinga-wm> RECOVERY - Apache HTTP on wtp2007 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:31:25] <icinga-wm> RECOVERY - PHP7 rendering on wtp2016 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:31:37] <icinga-wm> RECOVERY - PHP7 rendering on wtp2007 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:31:39] <icinga-wm> RECOVERY - Apache HTTP on wtp2006 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:31:41] <wikibugs> (03CR) 10Ppchelko: [C: 03+2] api-gateway: use x-client-ip instead of x-forwarded-for IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/626146 (https://phabricator.wikimedia.org/T246276) (owner: 10Hnowlan) [13:32:05] <icinga-wm> RECOVERY - Apache HTTP on wtp2004 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:32:59] <wikibugs> (03Merged) 10jenkins-bot: api-gateway: use x-client-ip instead of x-forwarded-for IP [deployment-charts] - 10https://gerrit.wikimedia.org/r/626146 (https://phabricator.wikimedia.org/T246276) (owner: 10Hnowlan) [13:38:49] <wikibugs> (03PS1) 10BBlack: varnish: downgrade zhwikirb ban to ratelimit at 5/s [puppet] - 10https://gerrit.wikimedia.org/r/626153 [13:41:04] <wikibugs> (03CR) 10Jbond: [C: 03+1] varnish: downgrade zhwikirb ban to ratelimit at 5/s [puppet] - 
10https://gerrit.wikimedia.org/r/626153 (owner: 10BBlack) [13:43:16] <wikibugs> (03PS2) 10BBlack: varnish: downgrade zhwikirb ban to ratelimit at 1/s [puppet] - 10https://gerrit.wikimedia.org/r/626153 [13:46:03] <wikibugs> (03PS3) 10BBlack: varnish: downgrade zhwikirb ban to ratelimit at 1/s [puppet] - 10https://gerrit.wikimedia.org/r/626153 [13:46:49] <wikibugs> (03CR) 10Jbond: [C: 03+1] varnish: downgrade zhwikirb ban to ratelimit at 1/s [puppet] - 10https://gerrit.wikimedia.org/r/626153 (owner: 10BBlack) [13:48:08] <wikibugs> (03PS4) 10Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - 10https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087) [13:50:37] <wikibugs> (03CR) 10Vgutierrez: [C: 03+1] varnish: downgrade zhwikirb ban to ratelimit at 1/s [puppet] - 10https://gerrit.wikimedia.org/r/626153 (owner: 10BBlack) [13:51:39] <wikibugs> (03CR) 10BBlack: [C: 03+2] varnish: downgrade zhwikirb ban to ratelimit at 1/s [puppet] - 10https://gerrit.wikimedia.org/r/626153 (owner: 10BBlack) [13:52:02] <wikibugs> (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - set cors: '*' [deployment-charts] - 10https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087) (owner: 10Ottomata) [13:53:13] <icinga-wm> PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:54:02] <bblack> !log deployed https://gerrit.wikimedia.org/r/626153 [13:54:06] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:26] <wikibugs> (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/626140 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:57:01] <icinga-wm> RECOVERY - Mobileapps LVS 
eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:57:13] <marostegui> !log Restart mysql on db1115 T231769 [13:57:18] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:19] <stashbot> T231769: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769 [14:00:48] <logmsgbot> !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:00:52] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:33] <logmsgbot> !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:02:33] <logmsgbot> !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:02:36] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:39] <icinga-wm> PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:02:40] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:47] <icinga-wm> PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [14:04:11] <icinga-wm> PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:30] <moritzm> not sure about the dbmonitor alert, tendril working fine for me [14:04:31] <icinga-wm> RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 91725 bytes in 1.234 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [14:04:35] <icinga-wm> PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:39] <icinga-wm> PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:09] <icinga-wm> PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:23] <icinga-wm> PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus2003 is CRITICAL: CRITICAL: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:07:10] <marostegui> ^ that is expected because of db1115's restart [14:08:21] <icinga-wm> RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:23] <icinga-wm> RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:53] <icinga-wm> RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:49] <icinga-wm> RECOVERY - Check systemd 
state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:06] <wikibugs> (03CR) 10Ppchelko: [C: 03+1] restbase: return restbase2009 to the host list [puppet] - 10https://gerrit.wikimedia.org/r/626119 (owner: 10Hnowlan) [14:12:29] <wikibugs> 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [14:12:57] <wikibugs> 10Operations, 10fundraising-tech-ops, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10Jgreen) [14:12:59] <wikibugs> 10Operations, 10fundraising-tech-ops, 10netops, 10observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (10Jgreen) 05Open→03Resolved p:05Triage→03Medium a:03Jgreen Config change is deployed to puppet and appears to be working fro... [14:13:24] <wikibugs> (03PS1) 10Giuseppe Lavagetto: envoy: add a new endpoint for services calling restbase [puppet] - 10https://gerrit.wikimedia.org/r/626158 [14:13:31] <icinga-wm> RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:13:45] <wikibugs> (03PS1) 10Giuseppe Lavagetto: mobileapps: use a non-retry, long-lasting restbase endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/626159 [14:13:52] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] envoy: add a new endpoint for services calling restbase [puppet] - 10https://gerrit.wikimedia.org/r/626158 (owner: 10Giuseppe Lavagetto) [14:14:59] <icinga-wm> RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus2003 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers 
[14:22:02] <logmsgbot> !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:22:07] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:20] <logmsgbot> !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:29:23] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:42] <wikibugs> (03CR) 10Muehlenhoff: [C: 03+1] "If it works fine on idp-test, +1 on enabling it by default." [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [14:33:50] <wikibugs> (03CR) 10Hnowlan: [C: 03+2] restbase: return restbase2009 to the host list [puppet] - 10https://gerrit.wikimedia.org/r/626119 (owner: 10Hnowlan) [14:36:38] <wikibugs> (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [14:39:39] <wikibugs> (03PS2) 10Jbond: ase::expose_puppet_certs: add ability to expose p12 cert [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) [14:42:55] <wikibugs> (03CR) 10Hnowlan: "Removing +2 for the time being until ready to deploy" [puppet] - 10https://gerrit.wikimedia.org/r/626119 (owner: 10Hnowlan) [14:43:55] <wikibugs> (03CR) 10Jcrespo: "I am going to deploy as is, manuel or anyone can later update the options to whatever is preferred as default. 
But at least is would be pu" [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [14:44:00] <wikibugs> (03CR) 10Jcrespo: [C: 03+2] transferpy: Add ability to override transferpy defaults for wmf [puppet] - 10https://gerrit.wikimedia.org/r/626115 (https://phabricator.wikimedia.org/T257601) (owner: 10Jcrespo) [14:44:51] <wikibugs> 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089 (10Gehel) [14:45:18] <wikibugs> 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089 (10Gehel) 05Open→03Resolved a:03Gehel [14:47:07] <wikibugs> (03CR) 10Muehlenhoff: "This script is only called manually, we don't need to care about Py2 compat. We can simply drop the backwards compat layer and change the " [puppet] - 10https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [14:50:48] <wikibugs> (03PS1) 10Dbarratt: Enable $wgAllowCrossOrigin on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626164 (https://phabricator.wikimedia.org/T262425) [14:53:12] <wikibugs> 10Operations, 10Discovery-Search, 10Wikimedia-Logstash, 10observability, 10Epic: [Epic] Migrate log transport to kafka for Search Platform applications - https://phabricator.wikimedia.org/T224911 (10Gehel) 05Open→03Declined [14:53:46] <wikibugs> (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: enable/disable irc service as needed [puppet] - 10https://gerrit.wikimedia.org/r/626140 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:54:26] <wikibugs> (03PS1) 10BBlack: varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 [14:54:40] <wikibugs> 
(03PS1) 10Ottomata: eventgate - Set default cors only if not provided in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/626166 (https://phabricator.wikimedia.org/T262087) [14:56:05] <wikibugs> (03CR) 10Ottomata: [C: 03+2] eventgate - Set default cors only if not provided in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/626166 (https://phabricator.wikimedia.org/T262087) (owner: 10Ottomata) [14:58:39] <wikibugs> (03PS2) 10BBlack: varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 [14:58:41] <icinga-wm> PROBLEM - Check size of conntrack table on prometheus1003 is CRITICAL: connect to address 10.64.0.123 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [14:58:51] <icinga-wm> PROBLEM - Check systemd state on prometheus1003 is CRITICAL: connect to address 10.64.0.123 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:35] <herron> [40966560.571128] Out of memory: Kill process 21099 (prometheus) score 357 or sacrifice child [15:00:47] <icinga-wm> RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:37] <icinga-wm> PROBLEM - Thanos sidecar cannot connect to Prometheus on icinga1001 is CRITICAL: cluster=prometheus instance=prometheus1003 job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [15:01:51] <icinga-wm> PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [15:02:33] <icinga-wm> RECOVERY - Check size of conntrack table on prometheus1003 is OK: OK: nf_conntrack is 3 % full 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [15:03:33] <icinga-wm> RECOVERY - Thanos sidecar cannot connect to Prometheus on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [15:03:39] <icinga-wm> PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:03:45] <wikibugs> (03PS3) 10BBlack: varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 [15:04:19] <wikibugs> (03PS1) 10Ottomata: eventgate - move cors setting to earlier in conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/626169 [15:05:39] <icinga-wm> RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [15:05:49] <wikibugs> (03CR) 10Jbond: [C: 03+1] varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 (owner: 10BBlack) [15:05:59] <wikibugs> (03PS4) 10BBlack: varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 [15:06:43] <wikibugs> (03CR) 10Jbond: [C: 03+1] varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 (owner: 10BBlack) [15:06:54] <godog> herron: ooof, I guess a big query [15:07:04] <wikibugs> (03CR) 10BBlack: [C: 03+2] varnish: zhwikirb limiter only for known variants [puppet] - 10https://gerrit.wikimedia.org/r/626165 (owner: 10BBlack) [15:07:17] <herron> yeah :/ [15:07:18] <wikibugs> (03CR) 10Ottomata: [C: 03+2] eventgate - move cors setting to earlier in conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/626169 (owner: 10Ottomata) [15:07:32] <wikibugs> (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - move cors setting to earlier in conf file [deployment-charts] - 10https://gerrit.wikimedia.org/r/626169 (owner: 10Ottomata) [15:07:38] <herron> systemd restarted it, and I bounced the ops thanos sidecar which was logging errors [15:10:25] <wikibugs> (03PS11) 10Herron: kibana: move kibana.yml settings to parameters [puppet] - 10https://gerrit.wikimedia.org/r/622651 [15:11:28] <herron> !log prometheus1003: systemctl restart thanos-sidecar@ops.service [15:11:32] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:37] <herron> jftr [15:12:09] <wikibugs> 10Operations, 10Traffic: Cache Accept-language optimisation - https://phabricator.wikimedia.org/T262428 (10jbond) p:05Triage→03Medium [15:12:36] <wikibugs> (03PS1) 10Jcrespo: remote_backup: Instead of using a preassigned port, autoselect one [software/wmfbackups] - 
10https://gerrit.wikimedia.org/r/626172 (https://phabricator.wikimedia.org/T138562) [15:13:44] <wikibugs> (03PS1) 10BBlack: varnish: Minor bugfix for prev commit 33803341 [puppet] - 10https://gerrit.wikimedia.org/r/626173 [15:13:53] <wikibugs> (03PS2) 10Krinkle: mediawiki: update alerts on logstash logs [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [15:14:32] <wikibugs> (03PS2) 10BBlack: varnish: Minor bugfix for prev commit 33803341 [puppet] - 10https://gerrit.wikimedia.org/r/626173 [15:14:45] <wikibugs> (03CR) 10BBlack: [V: 03+2 C: 03+2] varnish: Minor bugfix for prev commit 33803341 [puppet] - 10https://gerrit.wikimedia.org/r/626173 (owner: 10BBlack) [15:15:06] <logmsgbot> !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:15:06] <logmsgbot> !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[15:15:10] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:14] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:52] <wikibugs> (03PS2) 10Jcrespo: remote_backup: Instead of using a preassigned port, autoselect one [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/626172 (https://phabricator.wikimedia.org/T138562) [15:18:59] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10observability, 10serviceops: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (10jijiki) p:05Triage→03Medium [15:19:18] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10observability, 10serviceops: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (10jijiki) [15:20:17] <logmsgbot> !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [15:20:17] <logmsgbot> !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:20:21] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:30] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:03] <wikibugs> (03CR) 10Jcrespo: "This needs some testing." 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/626172 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:23:16] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10observability, 10serviceops: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (10JMeybohm) [15:24:03] <bd808> jouncebot: next [15:24:04] <jouncebot> In 0 hour(s) and 35 minute(s): Striker (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1600) [15:26:25] <icinga-wm> RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:26:42] <logmsgbot> !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [15:26:43] <logmsgbot> !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [15:26:46] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:50] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:56] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:31:12] <wikibugs> 10Operations, 10Analytics, 10Patch-For-Review: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10Ottomata) Ok! I think we are good to go! We'll need to add a wgEventStreams stream config entry and then redeploy (or just restart) eventgate-logging-ex... 
[15:35:59] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:36:13] <wikibugs> (03PS3) 10Jbond: base::expose_puppet_certs: add ability to expose p12 cert [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) [15:40:42] <wikibugs> (03CR) 10Alexandros Kosiaris: [C: 04-1] "I like the idea. Ideally we should have no such services as this" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626158 (owner: 10Giuseppe Lavagetto) [15:41:53] <wikibugs> (03CR) 10Cwhite: mediawiki: update alerts on logstash logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [15:46:04] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [15:48:15] <wikibugs> (03PS4) 10Jbond: base::expose_puppet_certs: add ability to expose p12 cert [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) [15:48:18] <wikibugs> (03PS1) 10Jbond: spec: update spec files to use shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/626176 [15:48:56] <logmsgbot> !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:49:00] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:26] <wikibugs> (03PS1) 10Vgutierrez: varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) [15:52:37] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [15:53:34] <mdholloway> _joe_: o/ it looks 
like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626102 (mobileapps: use the service proxy everywhere) caused an incorrect configuration to be used, causing user-visible breakage. ok to revert? [15:53:38] <wikibugs> (03CR) 10Jbond: [C: 03+2] spec: update spec files to use shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/626176 (owner: 10Jbond) [15:53:44] <mdholloway> ^ mateusbs17 [15:54:02] <mdholloway> (by the way, i was out sick yesterday, sorry i missed your message on the staging version of that change) [15:54:37] <wikibugs> (03PS5) 10Jbond: base::expose_puppet_certs: add ability to expose p12 cert [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) [15:54:53] <logmsgbot> !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:56] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:26] <wikibugs> 10Operations, 10Puppet, 10Patch-For-Review, 10User-crusnov, 10User-jbond: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10akosiaris) Let me add a number of use cases for this: * Kubernetes nodes currently have manually set in hiera their datacenter and rack row setup... [15:56:01] <wikibugs> 10Operations, 10DNS, 10Traffic: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (10CKoerner_WMF) Hello friends. Is there anything I need to do to help move this along? [15:56:45] <_joe_> mdholloway: uhh sorry, please go on [15:56:51] <_joe_> mdholloway: do you have a task? [15:57:01] <_joe_> we had an outage on mobileapps earlier [15:57:19] <mdholloway> _joe_: not yet, we just got pinged on slack about it [15:57:38] <_joe_> so if this is still ongoing, yes revert [15:57:45] <_joe_> we can debug the problem together [15:58:00] <akosiaris> mdholloway: cebwiki? 
[15:58:05] <_joe_> frankly I checked the swagger spec and it was all healthy, and the metrics were ok [15:58:11] <_joe_> then alex noticed that in the logs yes [15:58:12] <bearND> PCS run into CSP issues. example: https://en.wikipedia.org/api/rest_v1/page/mobile-html/Earth [15:58:31] <akosiaris> I was noticing something in logstash during today's restbase/parsoid outage [15:58:43] <_joe_> bearND: what's the issue there? [15:58:51] <_joe_> sorry it's not obvious :) [15:59:26] <bearND> _joe_: I'm not sure what is wrong in the CSP header yet, but it doesn't load any of our JS and CSS. [15:59:31] <wikibugs> (03CR) 10Herron: [C: 03+2] kibana: move kibana.yml settings to parameters [puppet] - 10https://gerrit.wikimedia.org/r/622651 (owner: 10Herron) [15:59:34] <_joe_> oh I see [15:59:58] <_joe_> that's strange, heh, for now let's rollbck [16:00:04] <jouncebot> bd808: Your horoscope predicts another unfortunate Striker deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1600). [16:00:06] <akosiaris> _joe_: Content Security Policy: The page’s settings blocked the loading of a resource at http://localhost:6011/meta.wikimedia.org/v1/data/css/mobile/base (“style-src”). [16:00:10] <bearND> _joe_: It has localhost:6011 in there. `Refused to load the stylesheet 'http://localhost:6011/meta.wikimedia.org/v1/data/css/mobile/base' because it violates the following Content Security Policy directive: "style-src app://meta.wikimedia.org https://meta.wikimedia.org app://*.wikipedia.org https://*.wikipedia.org 'self' 'unsafe-inline'". 
Note that 'style-src-elem' was not explicitly set, so 'style-src' is used as a fallback.` [16:00:10] <_joe_> and maybe add a check on that [16:00:15] <akosiaris> it's referencing localhost:6011 [16:00:48] <bearND> It's supposed to use the external origins instead [16:00:53] <wikibugs> (03PS2) 10Vgutierrez: varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) [16:01:03] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [16:01:12] <_joe_> bearND: ouch, I see, so the problem is one of those templates is also used to output urls [16:01:17] <_joe_> ok, gotcha [16:01:22] <mdholloway> the issue is that `mobile_html_rest_api_base_uri` and `mobile_html_rest_api_base_uri_template` are expected to use public domains. they're injected into [16:01:24] <mdholloway> yes, exactly [16:01:45] <_joe_> ok, maybe we can revert just that? [16:01:53] <_joe_> but ok either way [16:01:54] <mdholloway> *injected into page html for use on the client [16:02:05] <_joe_> mdholloway: let me do this :) [16:02:12] <mdholloway> k [16:02:37] <akosiaris> _joe_: should we use staging instead to debug this further? 
Unless you are clear on what needs fixing [16:02:50] <akosiaris> and just revert production that is* [16:02:57] <wikibugs> (03PS1) 10MSantos: Revert "mobileapps: use the service proxy everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626037 [16:03:10] <logmsgbot> !log bd808@deploy1001 Started deploy [striker/deploy@e120c6c]: Deploying r20200909 tag (T262323, T144111) [16:03:11] <_joe_> akosiaris: pretty clear [16:03:16] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:17] <stashbot> T144111: Allow self-service creation of Maniphest projects for Tools - https://phabricator.wikimedia.org/T144111 [16:03:18] <stashbot> T262323: "is webservice" checkbox is required on new tools - https://phabricator.wikimedia.org/T262323 [16:03:27] <akosiaris> _joe_: ok, fine by me [16:03:34] <wikibugs> (03CR) 10Mholloway: [C: 04-1] "Let's hold for now pending outcome of discussion in #wikimedia-operations" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626037 (owner: 10MSantos) [16:03:45] <wikibugs> (03PS1) 10Giuseppe Lavagetto: mobileapps: don't override uri templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/626178 [16:03:49] <_joe_> mdholloway: ^^ [16:04:00] <_joe_> I think this should restore the defaults that should be ok in production [16:04:30] <logmsgbot> !log bd808@deploy1001 Finished deploy [striker/deploy@e120c6c]: Deploying r20200909 tag (T262323, T144111) (duration: 01m 21s) [16:04:37] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:44] <_joe_> I'm going to fast-deploy it [16:05:05] <_joe_> I'm thinking we might have a caching problem [16:05:11] <_joe_> in restbase, specifically [16:05:15] <logmsgbot> !log bd808@deploy1001 Started deploy [striker/deploy@e120c6c]: Deploying r20200909 tag (T262323, T144111) [take 2] [16:05:21] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:25] <logmsgbot> !log 
bd808@deploy1001 Finished deploy [striker/deploy@e120c6c]: Deploying r20200909 tag (T262323, T144111) [take 2] (duration: 00m 11s) [16:05:32] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:39] <_joe_> mdholloway: I'm going to be bold and merge my change [16:05:42] <wikibugs> (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: don't override uri templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/626178 (owner: 10Giuseppe Lavagetto) [16:05:47] <wikibugs> (03PS2) 10MSantos: Revert "mobileapps: use the service proxy everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626037 [16:05:54] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] Revert "mobileapps: use the service proxy everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626037 (owner: 10MSantos) [16:05:56] <mdholloway> _joe_: yes, lgtm [16:06:27] <bd808> !log scap3 of Striker to labweb1001 failing. Will investigate. [16:06:30] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:33] <logmsgbot> !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [16:06:33] <logmsgbot> !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [16:06:37] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:40] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:04] <_joe_> running eqiad first (non active) to be sure I didn't screw up something else [16:07:16] <mateusbs17> _joe_ mdholloway related task is here btw https://phabricator.wikimedia.org/T262437 [16:10:00] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [16:10:00] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[16:10:03] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:07] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:19] <logmsgbot> !log bd808@deploy1001 Started deploy [striker/deploy@e120c6c]: Deploying r20200909 tag (T262323, T144111) [take 3] [16:10:25] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:26] <stashbot> T144111: Allow self-service creation of Maniphest projects for Tools - https://phabricator.wikimedia.org/T144111 [16:10:27] <stashbot> T262323: "is webservice" checkbox is required on new tools - https://phabricator.wikimedia.org/T262323 [16:10:30] <logmsgbot> !log bd808@deploy1001 Finished deploy [striker/deploy@e120c6c]: Deploying r20200909 tag (T262323, T144111) [take 3] (duration: 00m 11s) [16:10:36] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:48] <bd808> third time was the magic :) [16:11:33] <bd808> the failures were caused by bad file permissions on pyc bytecode in an old release that was ready for cleanup [16:12:47] <akosiaris> _joe_: bearND: I can see CSS and images now for https://en.wikipedia.org/api/rest_v1/page/mobile-html/Earth [16:12:55] <akosiaris> I did not even need to do cache-busting tricks [16:12:58] <_joe_> yep, me too [16:12:59] <wikibugs> (03CR) 10Jdlrobson: "Can we deploy this now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623392 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [16:13:03] <_joe_> akosiaris: are you logged in? [16:13:15] <akosiaris> both [16:13:21] <_joe_> uhm [16:13:22] <akosiaris> both anonymous and logged in [16:13:46] <_joe_> yeah... [16:13:54] <_joe_> mdholloway: can you/others confirm? [16:14:07] <_joe_> it seems this type of urls is not cached by rb? [16:14:15] <bearND> It's working for me as well. 
Thank you [16:14:27] <_joe_> heh sorry :/ [16:14:29] <mdholloway> Yes, https://en.wikipedia.org/api/rest_v1/page/mobile-html/Earth looks good to me. [16:14:40] <mdholloway> I too am surprised about cache cleanup not being needed... [16:14:56] <_joe_> mdholloway: so the cache for those pages seems to be short-lived on the edge [16:15:21] <_joe_> https://en.wikipedia.org/api/rest_v1/page/mobile-html/Italy is still showing up wrong fwiw [16:15:21] <marostegui> !log Stop mysql on db2125 for on-site maintenance T260670 [16:15:26] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:28] <stashbot> T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [16:15:49] <_joe_> uh not anymore [16:16:06] <_joe_> ok, I would think we wait for more user reports before clearing any cache? [16:16:43] <mdholloway> sounds good to me. [16:16:44] <akosiaris> +1 [16:17:13] <mdholloway> btw, I just did action=purge on the Italy page and now that's fixed [16:17:14] <_joe_> although somehow from firefox I can't click on the links [16:17:24] <_joe_> it was before [16:17:35] <_joe_> well we hit different caches [16:17:45] <_joe_> anyways, the caching is short for these pages AIUI [16:17:47] <mdholloway> ah [16:17:57] <wikibugs> 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10Papaul) p:05Triage→03Medium [16:18:23] <_joe_> so for me on an EU cache was already ok when I said 18:15:49 <_joe_> uh not anymore [16:18:31] <mdholloway> bearND: do you know what could be going on with the links? 
i think that might be expected but i can't recall why [16:18:36] <mdholloway> ah, i see [16:19:09] <_joe_> https://en.wikipedia.org/api/rest_v1/page/mobile-html/Barack_Obama is still cached wrong for me though [16:19:24] <_joe_> so yes, it was your action=purge that did the trick [16:19:33] <mdholloway> wrong here too [16:19:37] <icinga-wm> RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [16:20:11] <_joe_> mdholloway: I don't think it's varnish that needs purging though [16:20:19] <_joe_> I think it's restbase :/ [16:20:25] <bearND> mdholloway: The head element in the pages has/had the wrong links, like `<link rel="stylesheet" href="http://localhost:6011/meta.wikimedia.org/v1/data/css/mobile/base">` [16:20:47] <_joe_> uh did someone purge obama's page too? [16:21:01] <mdholloway> i haven't [16:21:07] <akosiaris> _joe_: me. but not purge. Just Ctrl+Shift+F5 [16:21:14] <_joe_> ok, *now* I'm confused [16:21:30] <akosiaris> the links indeed don't work though [16:22:20] <_joe_> I don't think that is a bug akosiaris [16:22:59] <_joe_> we send out cache-control: s-maxage=1209600, max-age=0, must-revalidate so browser caches are *not* an issue I think [16:25:16] <_joe_> ok, https://en.wikipedia.org/api/rest_v1/page/mobile-html/Politics is consistently stale for me [16:26:05] <mdholloway> bearND: regarding the links, clicking them does nothing even on the fixed version of the page, can you confirm whether that is expected? 
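The symptom bearND describes above (a `<link>` in the page head pointing at `http://localhost:6011/...`) is the kind of thing the "add a check on that" idea could look for. A minimal sketch, assuming the rendered HTML is already available as a string; the class and function names are hypothetical and not part of mobileapps or restbase:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LocalhostLinkFinder(HTMLParser):
    """Collect (tag, url) pairs whose href/src host is a loopback name."""

    LOOPBACK_HOSTS = ("localhost", "127.0.0.1")

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            try:
                host = urlsplit(value).hostname
            except ValueError:
                # Malformed URL: not what this check is looking for.
                continue
            if host in self.LOOPBACK_HOSTS:
                self.hits.append((tag, value))


def find_localhost_urls(html):
    """Return every href/src in the document that points at localhost."""
    finder = LocalhostLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

A non-empty result for a public mobile-html response would indicate that an internal service-proxy URL leaked into client-facing markup, which is what the CSP `style-src` directive then blocked.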
[16:26:26] <_joe_> let me try to ban it from the caches only [16:26:33] <mateusbs17> mdholloway: I believe it will trigger an event in the console.log and do nothing [16:27:07] <mdholloway> mateusbs17: Got it, thanks. [16:27:32] <mateusbs17> the links do the redirection on broken pages because pagelib js is never loaded [16:30:03] <joewalshwmf> the cache will likely be broken for pages that were edited while the config change was active but haven't been edited (or purged) since, correct? [16:30:45] <_joe_> yes [16:31:16] <_joe_> I'm still trying to understand if restbase is caching these too [16:31:43] <bearND> Pchelolo is probably the best person to ask [16:31:44] <joewalshwmf> yes, AFAIK they are cached in restbase [16:32:11] * Pchelolo is reading what is needed from /me [16:32:42] <wikibugs> (03CR) 10Filippo Giunchedi: [C: 03+2] Add Alertmanager client [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625660 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:32:49] <wikibugs> (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add Alertmanager client [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625660 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:33:01] <_joe_> Pchelolo: we might need to ban all pages under mobile-html cached today before 16:10Z [16:33:05] <_joe_> from restbase [16:33:11] <wikibugs> (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] Add Icinga AM client [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:33:16] <wikibugs> (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+1] Add Icinga AM client [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:33:19] <wikibugs> (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add Icinga AM client [debs/prometheus-icinga-exporter] - 
10https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:33:30] <Pchelolo> _joe_: ban === remove from storage? [16:33:36] <_joe_> invalidate [16:34:00] <_joe_> Pchelolo: for a single url, sending a PURGE request will work? [16:34:07] <_joe_> a purge to restbase I mean [16:34:25] <bearND> Pchelolo: any page that has `<link>` elements in the `<head>` with `href="http://localhost:6011/meta.wikimedia.org/v1/data/css/mobile/pcs"` [16:34:34] <Pchelolo> no, you send a request with 'Cache-Control: no-cache' from inside wmf cluster [16:34:40] <_joe_> ok [16:34:53] <bearND> but probably faster to do it by timestamp [16:34:54] <_joe_> Pchelolo: that will overwrite the cache in restbase? [16:35:02] <_joe_> bearND: yeah that was my point [16:35:08] <Pchelolo> if the render actually changes - yes [16:36:00] <_joe_> ok so [16:36:13] <_joe_> purging from restbase is what is needed [16:36:38] <_joe_> as soon as I sent that request to restbase, https://en.wikipedia.org/api/rest_v1/page/mobile-html/Politics fixed itself [16:37:17] <akosiaris> interestingly, that article was last edited in Aug 27th [16:37:29] <wikibugs> 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10Papaul) [16:37:57] <Pchelolo> akosiaris: template updates could cause it to change at some random point in time [16:38:02] <akosiaris> true [16:38:19] <_joe_> Pchelolo: so yes, anything for mobile-html generated between 08:40 and 16:10 [16:38:33] <akosiaris> and it seems to have a ton of templates [16:38:55] <_joe_> akosiaris: what article? Politics? 
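Pchelolo's answer above is that a stale restbase entry is refreshed by requesting the page with `Cache-Control: no-cache` from inside the cluster, and that the stored copy is overwritten only if the render actually changes. A minimal sketch of building such a request follows. The base URL is a placeholder and the `/{domain}/v1/page/mobile-html/{title}` path layout is an assumption inferred from the URLs quoted in this log, not a confirmed restbase interface:

```python
from urllib.parse import quote
from urllib.request import Request


def build_rerender_request(base_url, domain, title):
    """Build (but do not send) a GET that asks for a fresh render.

    base_url is a hypothetical internal restbase address; the path
    layout is assumed, see the note above.
    """
    path = f"/{domain}/v1/page/mobile-html/{quote(title, safe='')}"
    return Request(
        base_url.rstrip("/") + path,
        headers={"Cache-Control": "no-cache"},
        method="GET",
    )


# Example: the request object for one of the pages discussed above.
req = build_rerender_request(
    "http://restbase.example.internal", "en.wikipedia.org", "Politics"
)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` from a host that can reach restbase directly; that step is left out here because the real endpoint is not given in the log.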
[16:39:01] <akosiaris> yup [16:39:07] <Pchelolo> _joe_: that would be a very interesting task :) [16:39:20] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) I upgrade the the Network firmware on the host [16:39:27] <_joe_> Pchelolo: we don't have a way to invalidate the cache in restbase? [16:39:53] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) Thank you Papaul. I have upgraded mariadb and will start repooling the host tomorrow [16:39:57] <_joe_> Pchelolo: as an alternative, we could make restbase invalidate any content from page-htlm that includes a link to localhost [16:40:06] <Pchelolo> _joe_: we can drop everything, but we can't drop 'anything between TS1 and TS2' [16:40:20] <Pchelolo> what we can do is read all kafka messages and issue the requests [16:40:28] <_joe_> mdholloway / mateusbs17 / bearND sorry again [16:40:45] <Pchelolo> gimme a sec [16:41:03] <_joe_> for the future, how can I ensure I can ping you all better? [16:42:32] <bearND> Here's a task for this: https://phabricator.wikimedia.org/T262437 [16:44:59] <_joe_> Pchelolo: I was thinking we might be able to create a cassandra query to find the pages cached in that interval [16:45:17] <Pchelolo> timestamp is not a part of the key [16:45:29] <_joe_> oh so not possible to query against? [16:45:37] <mdholloway> _joe_: it happens! :) for me, phabricator pings are probably best. i actually did see your comment to me on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/625619 when i logged on this morning, but by then it was too late, unfortunately. 
[16:45:38] <Pchelolo> urandom will know better [16:45:40] <_joe_> else we could read every wentry, yes [16:45:49] <_joe_> mdholloway: :/ [16:46:39] <_joe_> urandom: say we want to remove/invalidate all the restbase cache entries for mobile-html in an interval of time [16:46:43] <mdholloway> not sure about the best way to ping the entire team. [16:46:46] <_joe_> is there a way to do that in cassandra? [16:47:12] <_joe_> mdholloway: I tried to search for you in slack but I didn't find a team room [16:47:19] <mdholloway> probably an email to product-infrastructure@ is best in a case like this [16:47:24] <_joe_> anyways, this is for laters [16:47:26] <_joe_> ack, noted [16:47:28] <mdholloway> indeed [16:47:56] <mdholloway> team room is private :/ [16:48:07] <bearND> _joe_: for me IRC pings tend to be faster. I wonder if we should update https://wikitech.wikimedia.org/wiki/Mobileapps_(service) to info about that. [16:48:53] <mdholloway> probably https://www.mediawiki.org/wiki/Wikimedia_Product/Wikimedia_Product_Infrastructure_team as well [16:49:00] <urandom> _joe_: looking [16:49:22] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [16:49:24] <hauskatze> hnowlan: hi, still having permissions issues in apiportalwiki? [16:49:29] <_joe_> anyways totally my bad, that's the second to last service I needed to convert, the last being wikifeeds, and I jumped the gun [16:50:13] <Pchelolo> urandom: tldr, is there an efficient way of querying cass by writetime, it not being a part of partition/clustering key? I think no [16:50:29] <urandom> Pchelolo: yeah, you are right. No. [16:50:37] <_joe_> if we don't have a way to do this smartly, we can just decide to nuke it all? 
[16:50:37] <Pchelolo> we can write a script to get all the kafka messages for that time period, but kafkacat installed on our machines it too old, doesn't support timestamp offsets yet [16:50:41] <urandom> I mean, it can be done, but it's a full table scan [16:50:59] <_joe_> Pchelolo: how many req/s does the mobile-html endpoing gets on restbase? [16:51:21] <Pchelolo> looking [16:51:41] <akosiaris> https://phabricator.wikimedia.org/T262437 [16:51:44] <akosiaris> sorry [16:51:45] <_joe_> I'm thinking of temporarily repooling eqiad and trying to delete those things wiki-by-wiki [16:51:50] <akosiaris> https://grafana.wikimedia.org/d/000000577/restbase-external-overview?viewPanel=12&orgId=1 [16:52:05] <_joe_> akosiaris: I am not sure how to read those data [16:52:06] <ottomata> Pchelolo: i have a compiled version of kafkacat on stat1004 :p [16:52:14] <ottomata> i think its all statically linked, so you can copy it around [16:52:24] <akosiaris> _joe_: the answer is 40 [16:52:30] <akosiaris> it would be awesome if it was 42 [16:52:30] <hnowlan> hauskatze: sorta - we've solved the issue by adding users to the docseditor group, but the assumption was that the change in wgAddGroups would add bureaucrats to the docseditor group automatically [16:52:32] <Pchelolo> _joe_: https://grafana.wikimedia.org/d/000000577/restbase-external-overview?viewPanel=13&orgId=1 - ~50/s external reqs [16:52:41] <ottomata> stat007* [16:53:08] <_joe_> I see "avg: 80 req/s [16:53:32] <hauskatze> hnowlan: hmm, wgAddGroups/wgRemoveGroups just lets you configure the permissions you can grant/remove via Special:UserRights [16:53:40] <_joe_> whereas mobileapps gets around 1k req/s [16:53:54] <akosiaris> _joe_: It is indeed 42!!! GET: page_mobile-html_-title 33 53 42! [16:53:55] <akosiaris> :P [16:54:11] <_joe_> we can even risk purging it all from cassandra then [16:54:21] <_joe_> akosiaris: what do you think? 
it's a marginal amount of traffic [16:54:31] <ottomata> 16:54:21 [@stat1007:/home/otto] $ /home/otto/kafkacat -Q -b kafka-jumbo1006.eqiad.wmnet -t codfw.mediawiki.revision-create:0:1599636733000 [16:54:31] <ottomata> codfw.mediawiki.revision-create [0] offset 56282941 [16:55:03] <hauskatze> hnowlan: instead of bureaucrat, the 'sysop' group has 'docseditor' permissions; maybe that's the issue [16:55:13] <akosiaris> _joe_: a 1k rps on mobileapps for mobile-html ? [16:55:16] <akosiaris> I don't think so [16:55:21] <_joe_> no I mean overall [16:55:22] <akosiaris> https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?viewPanel=12&orgId=1 [16:55:29] <_joe_> mobile-html is less than 1 req/s [16:55:30] <akosiaris> a, yeah total is about that [16:56:05] <akosiaris> ok, but what does "all from cassandra" mean ? [16:56:18] <Pchelolo> if we bump the content-type version on that in MCS, and bump the required version in RESTBase it will rerender them on demand [16:56:18] <akosiaris> what's the WHERE something=something I mean? [16:56:21] <_joe_> all the cache keys for mobile-html [16:56:35] <_joe_> Pchelolo: that seems like a good idea [16:56:42] <Pchelolo> 2 tiny config changes, 2 deployments [16:56:55] <_joe_> and easy to revert if the traffic is too much [16:57:01] <_joe_> +1 [16:57:13] <akosiaris> so let it organically fix itself? fine by me [16:57:56] <Pchelolo> ok. lemme make the changes. [16:57:58] <_joe_> yeah let's go that way [16:58:29] <Pchelolo> hnowlan: I've jumped the gun merging addition of rb2009 to scap targets didn't I? it can't be deployed yet? [16:58:35] <bearND> FWIW, on that dashboard i see domain_v1_page_mobile-html_--title_--revision-_--tid at around 150 req/s. 
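The kafkacat query ottomata ran above takes a `topic:partition:timestamp` argument, and the timestamp in his example is in milliseconds since the Unix epoch. A small sketch of turning a UTC wall-clock time, such as the bounds of the window _joe_ mentioned, into that form; the helper names are hypothetical:

```python
from datetime import datetime, timezone


def utc_to_epoch_ms(year, month, day, hour, minute, second=0):
    """Convert a UTC wall-clock time to milliseconds since the epoch."""
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


def kafkacat_offset_query(topic, partition, epoch_ms):
    """Format the topic:partition:timestamp triple used with `kafkacat -Q -t`."""
    return f"{topic}:{partition}:{epoch_ms}"


# Bounds of the 08:40-16:10 UTC window on 2020-09-09.
window_start_ms = utc_to_epoch_ms(2020, 9, 9, 8, 40)
window_end_ms = utc_to_epoch_ms(2020, 9, 9, 16, 10)
```

For reference, the value in ottomata's command, 1599636733000, corresponds to 2020-09-09 07:32:13 UTC.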
[16:58:58] <_joe_> bearND: yeah, not sure what that's linked to [16:59:10] <_joe_> but good news, we have more capacity for mobileapps [16:59:17] <_joe_> a whole datacenter that's depooled right now :P [16:59:27] <hnowlan> hauskatze: ohhh I see [16:59:35] <hnowlan> Pchelolo: it can be yes, it's depooled [16:59:48] <Pchelolo> ok. then it's ok. thank you hnowlan [17:00:00] <hauskatze> hnowlan: are you talking about that elsewhere, maybe in -cpt? [17:00:12] <hauskatze> to avoid fragmentation of talks, etc. [17:01:02] <bearND> _joe_: I did notice plenty of timeouts today but hopefully that was just due to the same issue and is fixed by this revert. https://phabricator.wikimedia.org/T262432 [17:01:08] <_joe_> while Pchelolo writes the patches, can anyone find a page that still renders incorectly? [17:01:10] <hnowlan> hauskatze: there's some conversation in T261425 but the rest has been in dm. Wouldn't hurt to redirect conversation though [17:01:11] <stashbot> T261425: Configure API Portal wiki - https://phabricator.wikimedia.org/T261425 [17:01:18] <_joe_> bearND: no that was due to an outage :) [17:01:30] <hauskatze> hnowlan: k, I come from that Task :) [17:02:29] <hnowlan> oh ofc, heh [17:02:49] <wikibugs> (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [17:03:20] <hnowlan> hauskatze: I think you've clarified enough for us to figure out next steps though, especially around wgAddGroups. Thanks a lot! [17:03:39] <hauskatze> hnowlan: ok, happy to help [17:04:17] <mdholloway> _joe_: ah, right, is there a task about the mobileapps outage you mentioned? 
[17:04:26] <mdholloway> i'm curious to know what happened there [17:04:46] <wikibugs> 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [17:04:47] <_joe_> we can't discuss the details right now [17:05:20] <Pchelolo> bearND: mdholloway I don't quite know what I'm doing in MCS, please have a look https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/626185 [17:06:11] <Pchelolo> would you mind deploying it if it looks ok to you? [17:08:00] <mdholloway> Pchelolo: sure, can that happen now or do we need to wait for anything? [17:08:18] <Pchelolo> this one can go any time [17:08:19] <mdholloway> jouncebot: next [17:08:19] <jouncebot> In 0 hour(s) and 51 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1800) [17:08:19] <jouncebot> In 0 hour(s) and 51 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1800) [17:08:23] <Pchelolo> restbase one would depend on it [17:08:35] <wikibugs> 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) Update: Due to the accident/injury at the data center today, @Jclark-ctr will try and complete the upgrade of these 2x PDUs tomorrow (on Thur, Sept 10... [17:08:37] <_joe_> mdholloway: just deploy if you need - this is a production issue [17:09:10] <mdholloway> cool. will deploy momentarily. 
[17:10:56] <wikibugs> (03PS1) 10Filippo Giunchedi: base: add remote syslog queues [puppet] - 10https://gerrit.wikimedia.org/r/626186 (https://phabricator.wikimedia.org/T226703) [17:12:26] <Pchelolo> mdholloway: please ping me when done, I'll do https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/626187 [17:13:01] <wikibugs> (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/25011/" [puppet] - 10https://gerrit.wikimedia.org/r/626186 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [17:15:37] <wikibugs> 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10MNovotny_WMF) I approve this request (Mike Raish's supervisor) [17:16:48] <wikibugs> 10Puppet, 10Beta-Cluster-Infrastructure, 10Developer Productivity, 10Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (10bd808) >>! In T248041#6445873, @hashar wrote: > The java process is now running with `-Xmx256m`, was `-Xmx4G`, that... [17:18:12] <wikibugs> (03PS1) 10Mholloway: Update mobileapps to 2020-09-09-171242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626189 (https://phabricator.wikimedia.org/T262437) [17:18:50] <wikibugs> (03CR) 10Ppchelko: [C: 03+1] Update mobileapps to 2020-09-09-171242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626189 (https://phabricator.wikimedia.org/T262437) (owner: 10Mholloway) [17:18:51] <logmsgbot> !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:18:55] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:17] <wikibugs> (03CR) 10Bstorm: [C: 03+1] "I wonder if there's a way I could get kubernetes yaml validation squeezed into puppet CI one of these days. 
I might propose something if I" [puppet] - 10https://gerrit.wikimedia.org/r/626133 (https://phabricator.wikimedia.org/T250172) (owner: 10Arturo Borrero Gonzalez) [17:19:24] <wikibugs> (03PS1) 10Cicalese: Rename docseditor right to edit-docs. Allow bureaucrats to read. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626190 (https://phabricator.wikimedia.org/T261425) [17:19:40] <wikibugs> (03CR) 10Mholloway: [C: 03+2] Update mobileapps to 2020-09-09-171242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626189 (https://phabricator.wikimedia.org/T262437) (owner: 10Mholloway) [17:20:37] <icinga-wm> PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:53] <wikibugs> (03Merged) 10jenkins-bot: Update mobileapps to 2020-09-09-171242-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626189 (https://phabricator.wikimedia.org/T262437) (owner: 10Mholloway) [17:21:37] <_joe_> mdholloway: want me to deploy? [17:21:58] <mdholloway> _joe_: yes, please! [17:22:22] <logmsgbot> !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:22:26] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:57] <wikibugs> (03CR) 10Bstorm: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:23:23] <_joe_> I'll go with codfw next, once I confirm the spec still works [17:24:07] <_joe_> spec is correct, proceeding with codfw so Pchelolo can deploy sooner [17:24:55] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[17:24:55] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [17:24:59] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:03] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] <icinga-wm> RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:56] <wikibugs> (03CR) 10Bstorm: wikireplicas: create multiinstance roles and profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:28:10] <Pchelolo> _joe_: it's better to deploy RB after both DCs are deployed for mobileapps. it can produce some logspam and unnesessary rerenders if deployed out of order [17:28:28] <logmsgbot> !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:28:31] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:51] <_joe_> Pchelolo: only codfw is pooled [17:28:58] <_joe_> but still, let's do the full deployment [17:29:17] <logmsgbot> !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [17:29:17] <logmsgbot> !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [17:29:20] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:24] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:54] <Pchelolo> question: now that mcs is contacted via envoy, how do I do just a random curl request to it? 
[17:33:09] <_joe_> port 4102 [17:33:12] <_joe_> and https [17:33:24] <_joe_> service-checker-swagger -t 40 $(dig +short mobileapps.svc.eqiad.wmnet) https://mobileapps.discovery.wmnet:4102 is what I'm running right now [17:33:43] <_joe_> deployment finished [17:34:19] <wikibugs> (03CR) 10BBlack: "Of course, I forgot to rebase/pull first and wrote the wrong hash in the title. The commit this was a fixup for is actually 775916a7 aka " [puppet] - 10https://gerrit.wikimedia.org/r/626173 (owner: 10BBlack) [17:35:17] <_joe_> Pchelolo: you can proceed with restbase [17:35:25] <Pchelolo> doing [17:35:56] <logmsgbot> !log ppchelko@deploy1001 Started deploy [restbase/deploy@dc3b955]: Require mobile-html 1.2.2 T262437 [17:36:01] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:02] <stashbot> T262437: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 [17:37:03] <Pchelolo> I really hope my plan will actually work :) [17:37:13] <_joe_> Pchelolo: let's hope [17:37:21] <_joe_> lmk when it's done [17:37:36] <_joe_> actually, do you know a host where this is already deployed? [17:38:00] <Pchelolo> one sec, it's deploying to restbase2010.codfw.wmnet [17:38:10] <Pchelolo> ok, done deploying to restbase2010.codfw.wmnet [17:38:46] <_joe_> Pchelolo: works! [17:38:55] <Pchelolo> pampampam :) [17:39:15] <_joe_> https://en.wikipedia.org/api/rest_v1/page/mobile-html/United_States was bad, I requested it to restbase2010 and tada! 
[17:39:42] <Pchelolo> in the beautiful wmf infrastructure of the future we will no need to deploy both MCS and restbase for a thing like this [17:39:54] <_joe_> yep :P [17:40:17] <akosiaris> 🙌 [17:41:20] <_joe_> yes confirmed works on all pages I had [17:41:29] <_joe_> the caching at the edge is very short-lived [17:41:38] <_joe_> oh also rb sends a purge [17:41:56] <logmsgbot> !log ppchelko@deploy1001 Finished deploy [restbase/deploy@dc3b955]: Require mobile-html 1.2.2 T262437 (duration: 06m 00s) [17:42:01] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:02] <stashbot> T262437: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 [17:42:06] <_joe_> ok [17:42:10] <Pchelolo> hnowlan: oh.. adding 2009 actually did break it [17:42:14] <_joe_> if you had test pages, please test them [17:42:57] <logmsgbot> !log ppchelko@deploy1001 Started deploy [restbase/deploy@b90472d]: Require mobile-html 1.2.2 T262437, take 2 [17:43:01] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:03] <_joe_> Pchelolo: oh? [17:44:24] <Pchelolo> _joe_: it's ok. we re-added restbase2009 when it wasn't yet ready to be readded [17:44:42] <_joe_> ok I just resolved the bug [17:44:49] <_joe_> that would've been sad [17:44:57] <Pchelolo> so deploy failed. I re-removed it [17:45:09] <Pchelolo> all good. restbase deploy take awhile [17:45:17] <_joe_> great [17:45:27] <_joe_> I am browsing the wikis like crazy on the ios app [17:45:35] <_joe_> and I don't find "bad" pages anymore [17:46:10] <mdholloway> i'm requesting from /page/mobile-html for various pages and everything looks good [17:46:45] <_joe_> thanks all for your patience and availability [17:47:35] <mdholloway> thanks all for helping to get it resolved! 
[17:47:43] <icinga-wm> PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:47:46] <mdholloway> i've started a convo about how best to contact the team [17:48:29] <bearND> I've updated the wikitech page: https://wikitech.wikimedia.org/wiki/Mobileapps_(service). [17:48:36] <bearND> I guess this warrants an incident report. We should probably also look into how we can use staging better. AFAIK nobody noticed an issue when the config was deployed to staging only. Frankly, I don't even know how to access the service in staging. [17:48:57] <icinga-wm> PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:48:57] <wikibugs> 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) [17:49:22] <wikibugs> 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10RobH) [17:49:25] <_joe_> bearND: if we had anything in the spec testing for the css addresses, we would've caught this when deploying to staging [17:49:58] <_joe_> I ran service-checker-swagger -t 40 $(dig +short kubestage1001.eqiad.wmnet) https://mobileapps.discovery.wmnet:4102 from deploy1001, which tests the openapi spec [17:50:13] <_joe_> and yes, this warrants an incident report [17:50:33] <_joe_> I think I should work on it, but I'd do it tomorrow (it's pretty late here) [17:50:44] <toni_> hi all, still seeing a couple of issues - Mulan_2020 is not showing CSS on my end (but maybe it just takes a while to propagate), but more importantly I noticed an article where the mobile-html response looks fine, but the mobile-html-offline-resources still shows 
localhost:6011 items [17:51:01] <Pchelolo> toni_: the deploy has not yet been completely finished [17:51:22] <Pchelolo> oh... offline-resources was affected too??? [17:51:28] <_joe_> yeah sorry toni_ I saw the message saying the deployment was done [17:51:29] <bearND> _joe_: Good idea about a test for the CSS and JS in the mobile-html output. [17:51:37] <_joe_> Pchelolo: seems so [17:51:42] <_joe_> :/ [17:51:46] <Pchelolo> it's not stored in restbase, so here it's varnish.. [17:52:03] <toni_> https://en.wikipedia.org/api/rest_v1/page/mobile-html/Melanocortin, https://en.wikipedia.org/api/rest_v1/page/mobile-html-offline-resources/Melanocortin [17:52:04] <_joe_> can I have a url? [17:52:31] <_joe_> toni_: uh ok [17:52:33] <_joe_> thanks [17:52:35] <logmsgbot> !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b90472d]: Require mobile-html 1.2.2 T262437, take 2 (duration: 09m 38s) [17:52:40] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:41] <stashbot> T262437: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 [17:52:42] <logmsgbot> !log ppchelko@deploy1001 Started deploy [restbase/deploy@b90472d]: Require mobile-html 1.2.2 T262437, feed timeout [17:52:47] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:53] <_joe_> Pchelolo: we need another deploy, don't we? [17:53:08] <Pchelolo> this one is cause the feed check timeouted... [17:53:10] <bearND> https://en.wikipedia.org/api/rest_v1/page/mobile-html-offline-resources/Politics also has `"https://{{ domain }}/api/rest_v1/data/i18n/pcs"`, which looks like an unrelated bug. [17:53:31] <bearND> or additional bug, I should say [17:53:51] <Pchelolo> as for offline resources... it's cached only by Varnish. I am not sure how RESTBase can help here... 
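A rough illustration of the check discussed above, a test on the CSS and JS links in the mobile-html output that would have failed in staging. This is only a sketch: the class and function names are hypothetical, the sample markup in the usage note is made up, and it is not part of service-checker-swagger or the mobileapps spec. It collects stylesheet and script URLs from a page and reports any that point at a loopback host.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hosts that should never appear in links served to clients (assumed list).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


class AssetCollector(HTMLParser):
    """Collects URLs from <link href=...> and <script src=...> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Only stylesheet-style and script tags are of interest here.
        wanted = {"link": "href", "script": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def find_localhost_assets(html):
    """Return the asset URLs in `html` whose host is a loopback address."""
    collector = AssetCollector()
    collector.feed(html)
    return [u for u in collector.urls if urlparse(u).hostname in LOCAL_HOSTS]
```

A check like this could be run against a staging response before promoting a config change, for example `find_localhost_assets(body)` on the body of a /page/mobile-html/ request, failing if the returned list is non-empty.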
[17:54:06] <wikibugs> (03PS9) 10Bstorm: wikireplicas: create multiinstance roles and profiles [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) [17:54:28] <_joe_> bearND: what can have caused that though? [17:54:46] <Pchelolo> we can't use the same method cause restbase will never be hit if the wrong page is in Varnish... [17:55:01] <wikibugs> (03CR) 10Andrew Bogott: [C: 03+2] "I tested this on cloudcephosd1015 and it seems fine." [puppet] - 10https://gerrit.wikimedia.org/r/625947 (owner: 10Bstorm) [17:55:02] <Pchelolo> and it has s-maxage=1209600, max-age=86400 [17:55:03] <bearND> hmm, the cache-control on that is a day: `s-maxage=1209600, max-age=86400` [17:56:54] <_joe_> yes [17:57:17] <_joe_> we need to purge those urls from the varnishes, yes [17:58:14] <bearND> _joe_ i suspect the `mobile_html_local_rest_api_base_uri_template` variable is causing this. [17:58:54] <_joe_> bearND: ok, let me revert *that* change [17:59:05] <_joe_> before we move on with the rest [17:59:29] <logmsgbot> !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b90472d]: Require mobile-html 1.2.2 T262437, feed timeout (duration: 06m 47s) [17:59:33] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:35] <stashbot> T262437: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 [17:59:49] <wikibugs> (03PS1) 10Giuseppe Lavagetto: Revert "mobileapps: make template for the restbase uri configurable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626039 [17:59:50] <Pchelolo> gosh, that feed timeout on deploy problem is hitting us again... 
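For reference, the header quoted above, `s-maxage=1209600, max-age=86400`, carries two different lifetimes: max-age (one day) applies to browsers, while s-maxage (14 days) is the one shared caches such as Varnish honor. The sketch below works the arithmetic; the function names are hypothetical, and the parser is simplified (it ignores quoted directive values).

```python
SECONDS_PER_DAY = 86400


def parse_cache_control(header):
    """Split a Cache-Control value into {directive: int seconds or None}."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        value = value.strip()
        # Directives without a numeric value (e.g. must-revalidate) map to None.
        directives[name.strip().lower()] = int(value) if sep and value.isdigit() else None
    return directives


def shared_cache_ttl_days(header):
    """Lifetime in days for a shared cache: s-maxage wins over max-age."""
    d = parse_cache_control(header)
    seconds = d.get("s-maxage")
    if seconds is None:
        seconds = d.get("max-age")
    return None if seconds is None else seconds / SECONDS_PER_DAY


def browser_ttl_days(header):
    """Lifetime in days for a private (browser) cache: max-age only."""
    seconds = parse_cache_control(header).get("max-age")
    return None if seconds is None else seconds / SECONDS_PER_DAY
```

With the header from the log, `shared_cache_ttl_days` gives 14.0 and `browser_ttl_days` gives 1.0, so unless it is purged or banned, a bad offline-resources response could stay in the edge cache for up to two weeks, not one day.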
[17:59:56] <logmsgbot> !log ppchelko@deploy1001 Started deploy [restbase/deploy@b90472d]: Require mobile-html 1.2.2 T262437, feed timeout [18:00:01] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] <jouncebot> longma and liw: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1800). [18:00:04] <jouncebot> RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1800). [18:00:04] <jouncebot> hip and davidwbarratt: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:10] <bearND> _joe_: That's the weird thing. I thought you already did in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626178/1/helmfile.d/services/mobileapps/values.yaml [18:00:14] <davidwbarratt> here! [18:00:24] <_joe_> bearND: yes the problem is elsehwere [18:00:32] <hip> here [18:00:47] <Pchelolo> unrelated question: did the mw train not go today? I see group1 is still on wmf.6 [18:00:47] <_joe_> bearND: I'm fixing this [18:00:51] <RoanKattouw> I would normally deploy but I'm still in line at the grocery store [18:01:13] <longma> the train goes at 12PM pst (in an hour) Pchelolo [18:01:32] <bearND> _joe_: oh, it's the missing `local` in that variable. It should be called mobile_html_local_rest_api_base_uri_template. 
[18:01:42] <_joe_> yes [18:01:51] <_joe_> sigh I really screwed up today, sorry [18:01:56] <wikibugs> (03CR) 10Mholloway: wikifeeds: use the service proxy in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [18:02:51] <logmsgbot> !log ppchelko@deploy1001 Finished deploy [restbase/deploy@b90472d]: Require mobile-html 1.2.2 T262437, feed timeout (duration: 02m 55s) [18:02:55] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:21] <wikibugs> (03PS2) 10Giuseppe Lavagetto: Revert "mobileapps: make template for the restbase uri configurable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626039 [18:04:30] <_joe_> bearND: now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626039 should DTRT [18:05:19] <bearND> _joe_: not sure about dropping the protocol from mobile_html_rest_api_base_uri [18:05:24] <davidwbarratt> maybe Urbanecm could do the config deploy? [18:05:30] <_joe_> bearND: it's how it was before [18:05:44] <_joe_> that's one of my changes from this morning [18:05:48] <bearND> ah, ok. Then +1 :) [18:05:55] <davidwbarratt> or Niharika ? [18:05:58] <wikibugs> (03CR) 10BearND: [C: 03+1] Revert "mobileapps: make template for the restbase uri configurable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626039 (owner: 10Giuseppe Lavagetto) [18:06:07] <wikibugs> (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "mobileapps: make template for the restbase uri configurable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626039 (owner: 10Giuseppe Lavagetto) [18:06:13] <_joe_> bearND: thanks :) [18:06:35] <Pchelolo> btw, did you see a nifty trick from change-prop and horrific {{ templating? Look at how changeprop chart does it with ` symbols [18:06:52] <Urbanecm> hip: are you around too? 
[18:06:55] <hip> yep [18:07:03] <Urbanecm> okay, thanks [18:07:29] <Urbanecm> hip: are you able to test that change from a mwdebug host? [18:07:33] <hip> yep [18:07:47] <wikibugs> (03Merged) 10jenkins-bot: Revert "mobileapps: make template for the restbase uri configurable" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626039 (owner: 10Giuseppe Lavagetto) [18:08:01] <wikibugs> (03PS1) 10Urbanecm: Revert "Revert "Enable MediaWiki client errors on commonswiki and metawiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626040 [18:08:07] <wikibugs> (03PS2) 10Urbanecm: Revert "Revert "Enable MediaWiki client errors on commonswiki and metawiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626040 [18:08:12] <wikibugs> (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Enable MediaWiki client errors on commonswiki and metawiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626040 (owner: 10Urbanecm) [18:08:39] <Urbanecm> hip: cool! I'll ping you once it's there :-) [18:09:06] <wikibugs> (03Merged) 10jenkins-bot: Revert "Revert "Enable MediaWiki client errors on commonswiki and metawiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626040 (owner: 10Urbanecm) [18:09:07] <hip> Urbanecm: thanks, I'll be here :) [18:09:27] <Urbanecm> hip: I pulled that onto mwdebug2001 - let me know how it works! [18:09:30] <_joe_> https://phabricator.wikimedia.org/P12548 this is the change in config [18:09:40] <_joe_> bearND, mdholloway ^^ [18:10:07] <_joe_> {{ domain }} vs {{domain}} [18:10:09] <_joe_> :D [18:10:18] <_joe_> looking at the code now it's obvious [18:10:33] <hip> Urbanecm: working [18:10:34] <logmsgbot> !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:10:38] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:39] <Urbanecm> thanks hip ! 
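The `{{ domain }}` vs `{{domain}}` remark above suggests a placeholder that is matched literally, so a template written with inner spaces is left unsubstituted (which would explain the `https://{{ domain }}/api/rest_v1/data/i18n/pcs` URL seen earlier). The sketch below only illustrates that failure mode; it is an assumption about the mechanism, not the actual mobileapps code, and both function names are hypothetical.

```python
import re


def render_strict(template, domain):
    """Literal substitution: only the exact token {{domain}} is replaced."""
    return template.replace("{{domain}}", domain)


# Whitespace-tolerant variant: accepts {{domain}}, {{ domain }}, and so on.
_PLACEHOLDER = re.compile(r"\{\{\s*domain\s*\}\}")


def render_tolerant(template, domain):
    """Regex substitution that ignores spaces inside the braces."""
    # A function replacement avoids backslash escapes in `domain` being interpreted.
    return _PLACEHOLDER.sub(lambda _match: domain, template)
```

With a strict renderer, `render_strict("https://{{ domain }}/api/rest_v1", "en.wikipedia.org")` returns the template unchanged, while the tolerant variant returns `https://en.wikipedia.org/api/rest_v1`. Either approach works as long as the config and the code agree on one spelling.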
[18:10:44] <Urbanecm> I'm syncing that [18:11:03] <wikibugs> (03PS2) 10Urbanecm: Enable $wgAllowCrossOrigin on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626164 (https://phabricator.wikimedia.org/T262425) (owner: 10Dbarratt) [18:11:09] <wikibugs> (03CR) 10Urbanecm: [C: 03+2] Enable $wgAllowCrossOrigin on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626164 (https://phabricator.wikimedia.org/T262425) (owner: 10Dbarratt) [18:11:10] <hip> cheers [18:11:50] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [18:11:52] <wikibugs> (03Merged) 10jenkins-bot: Enable $wgAllowCrossOrigin on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626164 (https://phabricator.wikimedia.org/T262425) (owner: 10Dbarratt) [18:11:58] <_joe_> toni_: do you still see pages that are wrong? I'm working on the offline resources issue [18:12:14] <_joe_> but for the mobile-page part we should be out of the woods [18:12:54] <logmsgbot> !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 85e36ae12e7467a559e3d52c58cc3a71ffd09ded: Enable MediaWiki client errors on commonswiki and metawiki (T255585) (duration: 01m 06s) [18:12:59] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:00] <stashbot> T255585: Extend client-side error logging coverage - https://phabricator.wikimedia.org/T255585 [18:13:08] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) [18:13:22] <Urbanecm> hip: should be deployed! [18:13:32] <davidwbarratt> oops sorr y I'm back [18:13:42] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[18:13:42] <Urbanecm> davidwbarratt: no problem, you're just in time :) [18:13:42] <logmsgbot> !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [18:13:45] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:49] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:51] <Urbanecm> davidwbarratt: your patch is available at mwdebug2001 [18:13:57] <davidwbarratt> awesome! let me test [18:13:58] <Urbanecm> could you test and lmk? [18:14:01] <Urbanecm> thanls! [18:15:23] <toni_> yep, still seeing localhosts in the offline resources calls [18:15:33] <davidwbarratt> Urbanecm it's perfect! [18:15:43] <Urbanecm> davidwbarratt: thanks! Going to deploy :) [18:15:47] <davidwbarratt> sweet! [18:15:56] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member ge-3/0/9 { ... } + member ge-3/0/10; [edit interfaces interface-r... [18:15:58] <logmsgbot> !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [18:16:02] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:13] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) [18:16:47] <_joe_> bearND: can you confirm https://en.wikipedia.org/api/rest_v1/page/mobile-html-offline-resources/Melanocortin?test is correct? 
[18:17:17] <logmsgbot> !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: b226330c1b3bd3dae113e375e2afb4d6af774cde: Enable $wgAllowCrossOrigin on all wikis (T262425) (duration: 01m 04s) [18:17:22] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:23] <stashbot> T262425: Enable $wgAllowCrossOrigin on all Wikimedia wikis - https://phabricator.wikimedia.org/T262425 [18:17:32] <Urbanecm> davidwbarratt: should be live! [18:17:40] <Urbanecm> anything else? :) [18:17:47] <logmsgbot> !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [18:17:47] <logmsgbot> !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:17:50] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:53] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:02] <joewalshwmf> _joe_: i'm still seeing a bad page response at https://en.wikipedia.org/api/rest_v1/page/mobile-html/Hua_Mulan [18:18:41] <_joe_> joewalshwmf: uh I don't [18:18:51] <davidwbarratt> Urbanecm awesome! thanks! [18:18:58] <_joe_> I'm trying with the IOS app [18:18:58] <Urbanecm> cool! [18:19:04] <Urbanecm> !log Morning B&C window done [18:19:05] <_joe_> your browser might have it cached locall? [18:19:08] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:43] <_joe_> !log banning urls ^/api/rest_v1/page/mobile-html-offline-resources/ from varnish caches [18:19:44] <joewalshwmf> I cleared my local cache earlier and it persisted [18:19:46] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:48] <joewalshwmf> but now it looks like it's fine [18:19:56] <_joe_> that's.. 
strange [18:20:01] <mdholloway> _joe_: joewalshwmf: i just purged it with Special:Purge [18:20:09] <mdholloway> it was bad when i requested it too [18:21:07] <_joe_> Pchelolo: ^^ [18:21:11] <_joe_> this is quite strange [18:21:58] <joewalshwmf> Here's another one that still has the bug: https://en.wikipedia.org/api/rest_v1/page/mobile-html/Tenet_(film) [18:22:37] <_joe_> ok please don't purge it [18:22:44] <mdholloway> ok [18:22:50] <joewalshwmf> and: https://en.wikipedia.org/api/rest_v1/page/mobile-html/Mulan_(2020_film) [18:22:54] <_joe_> I see it correctly [18:22:56] <mdholloway> were ^/api/rest_v1/page/mobile-html/ URLs ever banned from Varnish? [18:23:03] <joewalshwmf> maybe it's an issue with parenthesis? [18:23:06] <_joe_> no mdholloway [18:23:21] <_joe_> joewalshwmf: ok the second one I see badly too [18:23:27] <mdholloway> same here [18:23:51] <Pchelolo> curl -i 'https://en.wikipedia.org/api/rest_v1/page/mobile-html/Mulan_(2020_film)' | grep 'age: ' [18:23:55] <Pchelolo> 7718 [18:24:03] <mdholloway> https://en.wikipedia.org/api/rest_v1/page/mobile-html/Tenet_(film) is good here now (I didn't purge), but https://en.wikipedia.org/api/rest_v1/page/mobile-html/Mulan_(2020_film) is bad [18:24:04] <Pchelolo> they're not proactively purged [18:24:22] <Pchelolo> when varnish expires, restbase is reached, it's rerendered [18:24:40] <Pchelolo> but if it's still in varnish, restbase cn't do anything [18:25:07] <_joe_> cache-control: s-maxage=1209600, max-age=0, must-revalidate [18:25:10] <_joe_> heh [18:25:17] <Pchelolo> curl -i 'https://en.wikipedia.org/api/rest_v1/page/mobile-html/Mulan_(2020_film)' | grep 'content-type' -> Mobile-HTML/1.2.1" [18:25:29] <_joe_> yes [18:25:45] <_joe_> why some pages are so persistently cached, while most aren't though [18:25:52] <wikibugs> (03PS1) 10Ppchelko: Mobileapps: Use backtick to simplify templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/626196 [18:26:24] <_joe_> ok I need to take a break, I still didn't have 
dinner [18:26:47] <wikibugs> (03CR) 10Ppchelko: "isn't backpack syntax pretty?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626196 (owner: 10Ppchelko) [18:27:33] <wikibugs> (03PS2) 10Cicalese: Rename docseditor right to edit-docs. Allow bureaucrats to read. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626190 (https://phabricator.wikimedia.org/T261425) [18:27:37] <_joe_> Pchelolo: mdholloway I'll be back as soon as I am done [18:27:46] <Pchelolo> bon apetit _joe_ [18:28:13] <mdholloway> bon app [18:31:10] <wikibugs> (03CR) 10Mholloway: [C: 03+1] "Ooh, lovely!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626196 (owner: 10Ppchelko) [18:32:21] <wikibugs> (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/626197 (owner: 10CRusnov) [18:34:26] <wikibugs> (03PS2) 10Ppchelko: Enable OAuthRateLimiter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) [18:35:57] <wikibugs> (03CR) 10BearND: [C: 03+1] "Looks simpler. I don't have the local setup to test this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/626196 (owner: 10Ppchelko) [18:38:53] <_joe_> I'm back but I'm struggling to clean the caches for those urls, the instructions on wikitech are clearly outdated [18:39:16] <wikibugs> (03PS1) 10Papaul: DNS: Add production DNS for db2141 [dns] - 10https://gerrit.wikimedia.org/r/626198 [18:39:22] <bearND> _joe: yes, https://en.wikipedia.org/api/rest_v1/page/mobile-html-offline-resources/Melanocortin?test looks correct to me. 
[18:39:24] <_joe_> so I will ask for advice to people in the traffic team, and hopefully they can fix this [18:39:33] <_joe_> yes so now it's just caching [18:39:46] <_joe_> for the pages, it expires quite fast AFAICT [18:40:15] <_joe_> for the offline resources, it can take up to 1 day, but I hope we can find a way around it [18:40:48] <wikibugs> (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for db2141 [dns] - 10https://gerrit.wikimedia.org/r/626198 (owner: 10Papaul) [18:41:19] <bearND> The mobile-html-offline-resources affects the native apps ability to download all needed related resources when saving a page for offline use. So, the impact is not felt right away. [18:41:34] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) [18:41:34] <bearND> or at least as frequently [18:41:38] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10serviceops, 10Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (10Joe) 05Resolved→03... 
[18:43:33] <wikibugs> (03PS25) 10CRusnov: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [18:43:58] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [18:45:15] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10serviceops, 10Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (10bearND) Is there a se... [18:45:38] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10serviceops, 10Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (10Joe) Sadly, we still... [18:47:11] <_joe_> bearND, mdholloway rzl is taking over from me (I'll be around but not in an 100% active role) [18:47:23] <rzl> 👋 I'm catching up on backlog now [18:47:51] <bearND> thank you! [18:47:59] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10serviceops, 10Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (10Joe) To clarify furth... [18:48:31] <rzl> AIUI, everything is resolved except for banning /api/rest_v1/page/mobile-html/* and /api/rest_v1/page/mobile-html-offline-resources/* (for all domains) from ATS, is that correct? 
[18:50:13] <rzl> I can see https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Forcing_a_cache_miss_(similar_to_ban) but I'm not excited about trying it for the first time without an expert nearby :) I'll try to chase down someone from traffic [18:51:13] <rzl> (it does look like e.ma just updated that section so I'm reasonably optimistic it's up-to-date) [18:51:27] <mdholloway> I think that's correct. My hypothesis on the apparently short cache life of some mobile-html responses is that there are a lot of different possible sources of purge requests that could result in the existing content getting purged well before the current max-age (1209600) elapses [18:51:27] <_joe_> rzl: it will need to be banned from ats first, and varnish afterwards [18:51:45] <rzl> oh, because we cleaned varnish and then re-poisoned it from ats? [18:51:47] <_joe_> mdholloway: that's s-maxage [18:51:52] <_joe_> current max-age is 0 [18:51:59] <mdholloway> ah, sorry, meant s-maxage [18:52:28] <_joe_> mdholloway: anyways, we want to purge those too, we just don't know how [18:52:40] <mdholloway> ah, i see. [18:53:12] <wikibugs> (03CR) 10Cwhite: [V: 03+2 C: 03+2] parse_service_problem doesn't need instance-local data move parse_service problem to global function and have am import it clean up imports [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/625934 (owner: 10Cwhite) [18:55:30] <wikibugs> (03PS1) 10Papaul: DHCP: Add MAC address for db2141 [puppet] - 10https://gerrit.wikimedia.org/r/626202 (https://phabricator.wikimedia.org/T260819) [19:00:04] <jouncebot> longma and liw: Time to snap out of that daydream and deploy Mediawiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T1900). [19:00:33] <longma> Are we still holding the train? 
[19:01:17] <_joe_> longma: no, outage is long over [19:01:37] <longma> okay, thanks for confirming [19:03:16] <wikibugs> (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for db2141 [puppet] - 10https://gerrit.wikimedia.org/r/626202 (https://phabricator.wikimedia.org/T260819) (owner: 10Papaul) [19:08:02] <wikibugs> (03PS1) 10Jeena Huneidi: group1 wikis to 1.36.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626205 [19:08:04] <wikibugs> (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.36.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626205 (owner: 10Jeena Huneidi) [19:08:47] <wikibugs> (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626205 (owner: 10Jeena Huneidi) [19:10:21] <logmsgbot> !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.8 [19:10:30] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:25] <logmsgbot> !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.8 (duration: 01m 03s) [19:11:28] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:32] <wikibugs> (03PS1) 10Papaul: Add db2124 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/626206 (https://phabricator.wikimedia.org/T260819) [19:13:11] <longma> I'm seeing a lot of fatals on grafana [19:14:15] <wikibugs> (03PS2) 10Papaul: Add db2141 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/626206 (https://phabricator.wikimedia.org/T260819) [19:14:55] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:15:07] <wikibugs> (03CR) 10Papaul: [C: 03+2] Add db2141 to site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/626206 (https://phabricator.wikimedia.org/T260819) (owner: 10Papaul) [19:15:30] <longma> looks like they are dropping off though [19:16:49] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:24:27] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2141.codfw.wmnet... [19:25:22] <wikibugs> (03PS1) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [19:26:34] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [19:30:18] <wikibugs> (03CR) 10Giuseppe Lavagetto: [C: 04-1] "it's ts.client_request.get_uri() , not get_url() AFAICT" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [19:30:50] <wikibugs> (03PS2) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [19:31:57] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) ` │ │ reuse-parts: Recipe device matching failed │... 
[19:39:25] <wikibugs> (03PS3) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [19:40:28] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [19:41:53] <wikibugs> (03PS4) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [19:42:20] <Pchelolo> longma: was the train finished? can I deploy a few config changes? [19:42:45] <wikibugs> (03PS1) 10Andrew Bogott: wmcs-k8s-node-upgrade.py: minor usage edit [puppet] - 10https://gerrit.wikimedia.org/r/626213 (https://phabricator.wikimedia.org/T260614) [19:42:46] <longma> Pchelolo: yes, go ahead [19:42:47] <wikibugs> (03PS1) 10Andrew Bogott: wmcs-package-build.py: update default hosts to use .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626214 (https://phabricator.wikimedia.org/T260614) [19:42:49] <wikibugs> (03PS1) 10Andrew Bogott: Hiera: replace some commented refs to .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/626215 (https://phabricator.wikimedia.org/T260614) [19:42:51] <wikibugs> (03PS1) 10Andrew Bogott: hiera_lookup.rb: update usage statement to not reference .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/626216 [19:42:52] <Pchelolo> cool, thank you [19:42:53] <wikibugs> (03PS1) 10Andrew Bogott: designate.conf: update comments [puppet] - 10https://gerrit.wikimedia.org/r/626217 (https://phabricator.wikimedia.org/T260614) [19:42:55] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 
(https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [19:43:04] <Pchelolo> CindyCicaleseWMF: shall we? [19:43:17] <CindyCicaleseWMF> We shall! [19:44:22] <Pchelolo> ok. I mostly summoned you to sit here and wait for now :) [19:44:29] <wikibugs> (03CR) 10Andrew Bogott: [C: 03+2] hiera_lookup.rb: update usage statement to not reference .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/626216 (owner: 10Andrew Bogott) [19:44:45] <wikibugs> (03PS1) 10Ppchelko: Rename docseditor right to edit-docs [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626044 (https://phabricator.wikimedia.org/T261425) [19:44:47] <wikibugs> (03CR) 10Andrew Bogott: [C: 03+2] designate.conf: update comments [puppet] - 10https://gerrit.wikimedia.org/r/626217 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [19:45:04] <CindyCicaleseWMF> I can do that ;-) [19:45:08] <wikibugs> (03CR) 10Andrew Bogott: [C: 03+2] wmcs-package-build.py: update default hosts to use .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626214 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [19:45:28] <wikibugs> (03CR) 10Andrew Bogott: [C: 03+2] Hiera: replace some commented refs to .eqiad.wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/626215 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [19:45:49] <wikibugs> (03CR) 10Andrew Bogott: [C: 03+2] wmcs-k8s-node-upgrade.py: minor usage edit [puppet] - 10https://gerrit.wikimedia.org/r/626213 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [19:46:27] <wikibugs> (03PS5) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [19:47:41] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 
(https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [19:48:52] <wikibugs> (03CR) 10Ppchelko: [C: 03+2] Rename docseditor right to edit-docs [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626044 (https://phabricator.wikimedia.org/T261425) (owner: 10Ppchelko) [19:49:31] <wikibugs> (03CR) 10Ppchelko: [C: 03+2] Rename docseditor right to edit-docs. Allow bureaucrats to read. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626190 (https://phabricator.wikimedia.org/T261425) (owner: 10Cicalese) [19:50:18] <wikibugs> 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10Andrew) [19:50:24] <wikibugs> (03Merged) 10jenkins-bot: Rename docseditor right to edit-docs. Allow bureaucrats to read. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626190 (https://phabricator.wikimedia.org/T261425) (owner: 10Cicalese) [19:50:41] <Pchelolo> CindyCicaleseWMF: so the desired behavior is that having group 'docseditor' I will be able to read it? [19:51:09] <wikibugs> 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) @Marostegui problem with partman recipe. Can you please check. Thanks [19:51:19] <CindyCicaleseWMF> you should now be able to read with sysop or bureaucrat or docseditor [19:51:30] <Pchelolo> hm, I can read it right now.. [19:51:37] <Pchelolo> with docseditor [19:52:01] <CindyCicaleseWMF> that's good [19:52:15] <Pchelolo> ok, anyway, never mind [19:52:16] <wikibugs> (03Merged) 10jenkins-bot: Rename docseditor right to edit-docs [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626044 (https://phabricator.wikimedia.org/T261425) (owner: 10Ppchelko) [19:54:35] <Pchelolo> is someone in the process of deploying MobileFrontend? 
[19:55:02] <Pchelolo> why was it left dirty on the deployment server? [19:56:55] <CindyCicaleseWMF> beta api portal appears to be working as designed now [19:57:02] <wikibugs> (03PS6) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [19:58:10] <Pchelolo> CindyCicaleseWMF: ok, both changes are on mwdebug2001 [19:58:19] <wikibugs> (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [19:58:50] <Pchelolo> I can see various pages being a docseditor [19:58:57] <Pchelolo> I think it's ok [20:00:04] <jouncebot> chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T2000). [20:01:42] <CindyCicaleseWMF> Great - looks good to me, too [20:01:57] <logmsgbot> !log ppchelko@deploy1001 Synchronized php-1.36.0-wmf.8/skins/WikimediaApiPortal: Backport gerrit:626044, T261425 (duration: 01m 12s) [20:02:02] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:03] <stashbot> T261425: Configure API Portal wiki - https://phabricator.wikimedia.org/T261425 [20:02:25] <icinga-wm> RECOVERY - SSH on wtp1047.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:03:33] <Pchelolo> ok CindyCicaleseWMF, it should all be where it is supposed to be. 
please test it [20:03:39] <wikibugs> (03CR) 10Volans: [C: 03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/626197 (owner: 10CRusnov) [20:03:43] <logmsgbot> !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:626190 T261425 (duration: 01m 03s) [20:03:47] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:16] <wikibugs> (03PS3) 10Ppchelko: Enable OAuthRateLimiter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) [20:04:19] <wikibugs> (03CR) 10Ppchelko: [C: 03+2] Enable OAuthRateLimiter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [20:05:05] <wikibugs> (03Merged) 10jenkins-bot: Enable OAuthRateLimiter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [20:05:25] <CindyCicaleseWMF> Pchelolo: Looks good. Thanks for the deploy! 
[20:05:33] <Pchelolo> sure thing [20:05:44] <wikibugs> (03PS7) 10RLazarus: trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) [20:07:17] <wikibugs> (03PS1) 10Andrew Bogott: Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 [20:07:19] <wikibugs> (03PS1) 10Andrew Bogott: wikireplica_dns.yaml: add .eqiad1.wikimedia.cloud cnames [puppet] - 10https://gerrit.wikimedia.org/r/626221 (https://phabricator.wikimedia.org/T260614) [20:08:22] <wikibugs> (03CR) 10BBlack: [C: 03+1] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [20:08:52] <wikibugs> (03CR) 10RLazarus: [C: 03+2] trafficserver: Cache-ban pages with localhost links from page content service [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [20:09:49] <wikibugs> (03Abandoned) 10Andrew Bogott: wikireplica_dns.yaml: add .eqiad1.wikimedia.cloud cnames [puppet] - 10https://gerrit.wikimedia.org/r/626221 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [20:10:03] <wikibugs> (03CR) 10RLazarus: [C: 03+2] trafficserver: Cache-ban pages with localhost links from page content service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626210 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [20:13:29] <logmsgbot> !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:625914 (duration: 01m 03s) [20:13:33] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:50] <wikibugs> (03CR) 10CRusnov: [C: 03+2] netbox/configuration.py: Bump pagination default to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/626197 (owner: 10CRusnov) [20:24:41] <rzl> bearND, 
mdholloway: took a while for me to learn how, but the ATS cache ban is rolling out now :) I'll follow that up with another Varnish ban and then we should be all set [20:26:05] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Wikimedia-Logstash, and 3 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10Mholloway) It looks like mobileapps and proton (chromium-render) have essentially the same `logging` config (see http... [20:26:41] <mdholloway> rzl: excellent, thank you! [20:26:50] <bearND> Thank you! [20:32:48] <wikibugs> (03PS1) 10Ottomata: eventstreams - bump to image version 2020-09-09-201733-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626223 (https://phabricator.wikimedia.org/T261556) [20:34:59] <wikibugs> (03PS1) 10Brennen Bearnes: logspam-watch: display seconds and refresh each cycle [puppet] - 10https://gerrit.wikimedia.org/r/626224 (https://phabricator.wikimedia.org/T260826) [20:35:40] <wikibugs> 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10wiki_willy) a:03Jclark-ctr [20:35:57] <wikibugs> 10Operations, 10ops-eqiad, 10serviceops: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (10wiki_willy) This one looks like it's under warranty, just installed a year ago [20:36:15] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:41:51] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:52:38] <rzl> bearND, mdholloway: done -- there should be no buggy pages still in the cache, can you try a few urls to verify? [20:52:59] <mdholloway> rzl: sure, looking... [21:00:53] <bearND> toni_ joewalshwmf ^^ [21:05:08] <wikibugs> (03CR) 10Hashar: "+Jbond cause he is apparently familiar with the ferm module which this change is based upon." [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [21:19:23] <wikibugs> (03CR) 10Ahmon Dancy: [C: 03+1] logspam-watch: display seconds and refresh each cycle [puppet] - 10https://gerrit.wikimedia.org/r/626224 (https://phabricator.wikimedia.org/T260826) (owner: 10Brennen Bearnes) [21:20:33] <toni_> I am still seeing localhosts in https://en.wikipedia.org/api/rest_v1/page/mobile-html-offline-resources/Melanocortin, lmk if you see the same [21:21:31] <rzl> interesting, I'm not -- what's in the Server response header? [21:21:39] <mdholloway> toni_: i had a bad copy of that page in browser cache, but refreshing fixed it [21:21:50] <rzl> and, check-- haha, that :) ^ [21:22:47] <toni_> ah, yes, refresh worked. and now my device is no longer failing too. thanks! I think we're good [21:22:48] <mdholloway> sorry, just realized i'd forgotten to follow up. i didn't find any new responses with bad content :) [21:23:15] <rzl> perfect, thanks for checking [21:23:26] <rzl> I'll close the task, let me know if anything resurfaces but at this point it shouldn't :) [21:23:35] <mdholloway> rzl: thanks for taking care of that! [21:24:56] <wikibugs> (03CR) 10Jeena Huneidi: [C: 03+2] "Thanks hashar!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 (owner: 10Hashar) [21:26:15] <wikibugs> (03Merged) 10jenkins-bot: Add CI entry point to run tox [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 (owner: 10Hashar) [21:29:03] <wikibugs> 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10serviceops, 10Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (10RLazarus) 05Open→0... [21:44:46] <wikibugs> (03CR) 10Cwhite: modules/service/files/logstash_checker.py: Fix Python3 PEP8 errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:01:08] <wikibugs> (03PS1) 10Cicalese: Allow public access to API Portal main page for private launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 [22:04:16] <wikibugs> 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10MBeat33) Advancement stakeholders are talking about email address processes currently, fyi, so this task remains relevant. The issue with the misrouting of the jimmy@ email replies has been solved by for... 
[22:13:34] <wikibugs> (03PS2) 10Catrope: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239) [22:20:38] <wikibugs> (03PS1) 10Nskaggs: Convert maintain-meta_p.py to python3 [puppet] - 10https://gerrit.wikimedia.org/r/626235 (https://phabricator.wikimedia.org/T218426) [22:22:53] <wikibugs> (03PS1) 10Ebernhardson: InterleavedResultSet should implement SearchMetricsProvider [extensions/CirrusSearch] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626045 [22:24:23] <icinga-wm> PROBLEM - kubelet operational latencies on kubernetes2011 is CRITICAL: instance=kubernetes2011.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:26:17] <icinga-wm> RECOVERY - kubelet operational latencies on kubernetes2011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:29:03] <wikibugs> (03CR) 10Bstorm: "Maybe the black formatting changes should be on another patch. That would be easier to review separately since that patch should be comple" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626235 (https://phabricator.wikimedia.org/T218426) (owner: 10Nskaggs) [22:29:22] <wikibugs> (03PS2) 10Cicalese: Allow public access to API Portal main page for private launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) [22:33:12] <wikibugs> (03CR) 10Bstorm: "I found something more useful to comment on! It may or may not be a problem yet, though. I haven't tried it. 
It just jumps out as a likely" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626235 (https://phabricator.wikimedia.org/T218426) (owner: 10Nskaggs) [22:41:24] <wikibugs> (03CR) 10Bstorm: Convert maintain-meta_p.py to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626235 (https://phabricator.wikimedia.org/T218426) (owner: 10Nskaggs) [22:51:00] <wikibugs> (03CR) 10Bstorm: "I just checked the database for meta_p and found that it uses utf8 character set, not utf8mb4. IIRC, that means encoding is a deathtrap he" [puppet] - 10https://gerrit.wikimedia.org/r/626235 (https://phabricator.wikimedia.org/T218426) (owner: 10Nskaggs) [22:52:48] <wikibugs> 10Operations, 10observability: Logstash-next fails to load properly. - https://phabricator.wikimedia.org/T262492 (10colewhite) [23:00:04] <jouncebot> RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200909T2300). [23:00:04] <jouncebot> ebernhardson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:42] <ebernhardson> i'll ship it [23:05:04] <wikibugs> (03CR) 10Ebernhardson: [C: 03+2] "evening backport window" [extensions/CirrusSearch] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626045 (owner: 10Ebernhardson) [23:16:09] <DannyS712> Urbanecm I just +2'ed the fix for T262463 - should it be backported? 
[23:16:09] <stashbot> T262463: Call to a member function preSaveTransform() on boolean - https://phabricator.wikimedia.org/T262463 [23:26:27] <wikibugs> (03Merged) 10jenkins-bot: InterleavedResultSet should implement SearchMetricsProvider [extensions/CirrusSearch] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626045 (owner: 10Ebernhardson) [23:37:54] <logmsgbot> !log ebernhardson@deploy1001 Synchronized php-1.36.0-wmf.8/extensions/CirrusSearch/includes/Search/InterleavedResultSet.php: Repair passing interleaved search metrics from backend to frontend (duration: 01m 04s) [23:37:59] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:39] <ebernhardson> backport window complete [23:51:00] <logmsgbot> !log dpifke@deploy1001 Started deploy [performance/arc-lamp@55fccc6]: Deploying https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/622915 [23:51:03] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:05] <logmsgbot> !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@55fccc6]: Deploying https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/622915 (duration: 00m 05s) [23:51:09] <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log