[00:34:13] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32231016 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:36:03] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 198632 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:31:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:46:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:00:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:02:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:46:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:01:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:25:47] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:31:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:53:28] (03CR) 10Ayounsi: [C: 03+1] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615237 (owner: 10Muehlenhoff) [05:00:23] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) Thanks @Jclark-ctr - I am going to depool this host so it is ready for when you arrive at the DC. [05:08:53] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Marostegui) This being a backups source doesn't require depooling, but we need to check with @jcrespo when this host can be powered off. 
[05:11:45] (03PS1) 10Marostegui: dbproxy1012: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/615324 (https://phabricator.wikimedia.org/T255408) [05:12:08] 10Operations, 10netops: ripe-atlas-eqiad IPv6 unreachable - https://phabricator.wikimedia.org/T258018 (10ayounsi) According to https://radar.qrator.net portmap is open to the world but I was not able to reproduce. [05:12:28] (03CR) 10Marostegui: [C: 03+2] dbproxy1012: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/615324 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [05:13:35] 10Operations, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10ayounsi) 05Resolved→03Open p:05Triage→03High a:05ayounsi→03None This has been alerting since a few days ago. It might be worth following up with the vendor instead of rebooting the console servers... [05:16:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126', diff saved to https://phabricator.wikimedia.org/P12002 and previous config saved to /var/cache/conftool/dbconfig/20200722-051607-marostegui.json [05:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:27:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:30:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:40:13] (03CR) 10Ayounsi: [C: 03+1] Update analytics-in(4|6) filters [homer/public] - 10https://gerrit.wikimedia.org/r/614702 (owner: 10Elukey) [05:41:16] (03CR) 10Ayounsi: Add term idp to analytics-in4/6 filters (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/615160 (owner: 10Elukey) [05:50:58] 10Operations, 10ops-eqiad: Interface errors on asw2-d-eqiad:xe-7/0/0 (ms-be1037) - https://phabricator.wikimedia.org/T257541 (10ayounsi) 05Open→03Resolved Indeed, looks all good now! [05:53:04] (03PS1) 10Marostegui: mariadb: Move db1084 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/615331 (https://phabricator.wikimedia.org/T253217) [05:54:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Joe) [05:54:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1084 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/615331 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [06:04:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 to clone db1084', diff saved to https://phabricator.wikimedia.org/P12003 and previous config saved to /var/cache/conftool/dbconfig/20200722-060432-marostegui.json [06:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:39] (03PS5) 10Giuseppe Lavagetto: Add nskaggs key and grant access to WMCS related groups [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [06:08:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add nskaggs key and grant access to WMCS related groups [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [06:09:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add nskaggs key 
and grant access to WMCS related groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [06:09:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [06:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:05] 10Operations, 10SRE-Access-Requests: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Joe) 05Open→03Resolved Access should now be granted. If something doesn't work, feel free to reopen this task! [06:11:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:27] !log Stop MySQL on db1107 [06:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:34] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) >>! In T236327#6324507, @Jclark-ctr wrote: > @elukey can you let me know your availability for scheduling this project? Any time that you are... [06:38:51] (03CR) 10Elukey: [C: 03+2] Update analytics-in(4|6) filters [homer/public] - 10https://gerrit.wikimedia.org/r/614702 (owner: 10Elukey) [06:39:17] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10Joe) >>! In T258364#6323872, @CGlenn wrote: > No worries @Joe ! No worries. I wasn't sure either. > > Should I put in a new ticket to add wikimediafoundation.org as a prope... 
[06:47:17] !log update analytics-in4/6 filters on cr1/cr2 eqiad (ref https://gerrit.wikimedia.org/r/c/operations/homer/public/+/614702) [06:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:45] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615237 (owner: 10Muehlenhoff) [07:16:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615237 (owner: 10Muehlenhoff) [07:18:03] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: install statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [07:18:03] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10fgiunchedi) 05Open→03Resolved All good, thanks @Jclark-ctr ` Cache Board Present: True Cache Status: OK Cache Ratio: 10% Read / 90% Write ` [07:26:53] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:32:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 45 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:35:37] (03PS1) 10Muehlenhoff: Remove access for petarpetkovic, h78na [puppet] - 10https://gerrit.wikimedia.org/r/615405 [07:36:05] (03CR) 10JMeybohm: [C: 03+2] Check if images are debian based before generating report [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [07:37:09] (03CR) 10JMeybohm: [C: 03+2] _scaffold: Update envoy to 1.14.4-1 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [07:37:11] (03Merged) 10jenkins-bot: Check if images are debian based before generating report [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [07:39:29] (03CR) 10JMeybohm: [C: 03+2] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 (owner: 10JMeybohm) [07:40:33] (03Merged) 10jenkins-bot: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 (owner: 10JMeybohm) [07:40:46] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) @aaron @krinkle thoughts? :) [07:40:52] !log stop db1145 for hw maintenance T258249 [07:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:58] T258249: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 [07:43:12] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for petarpetkovic, h78na [puppet] - 10https://gerrit.wikimedia.org/r/615405 (owner: 10Muehlenhoff) [07:44:36] (03PS1) 10Marostegui: instances.yaml: Add db1084 [puppet] - 10https://gerrit.wikimedia.org/r/615406 (https://phabricator.wikimedia.org/T253217) [07:45:27] (03CR) 10Muehlenhoff: Switch Turnilo to CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [07:45:32] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) Backups were taken from db1145 today and the host put down. Please ping here when maintenance is complete. [07:45:32] I plan to update cxserver. Anything else on deploy1001? Or should I go ahead? 
[07:46:20] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1084 [puppet] - 10https://gerrit.wikimedia.org/r/615406 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [07:49:50] !log import docker-report 0.0.6-1 to buster-wikimedia [07:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1084 to s1, depooled T253217', diff saved to https://phabricator.wikimedia.org/P12005 and previous config saved to /var/cache/conftool/dbconfig/20200722-075040-marostegui.json [07:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:45] T253217: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 [07:53:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Increase es1020 to 50% pooled in es4 T257284', diff saved to https://phabricator.wikimedia.org/P12006 and previous config saved to /var/cache/conftool/dbconfig/20200722-075312-kormat.json [07:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:18] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [07:57:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 and db1107', diff saved to https://phabricator.wikimedia.org/P12007 and previous config saved to /var/cache/conftool/dbconfig/20200722-075749-marostegui.json [07:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:09] OK. I'll update cxserver after 20 minutes. 
[08:01:29] (03CR) 10Muehlenhoff: [C: 03+2] Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [08:01:47] (03PS1) 10Marostegui: db1084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/615409 (https://phabricator.wikimedia.org/T253217) [08:02:30] (03CR) 10Marostegui: [C: 03+2] db1084: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/615409 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [08:02:37] (03PS1) 10Filippo Giunchedi: webrequest: add rsync server to migrate data [puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) [08:03:03] moritzm: you can merge my change anytime, yours seems more sensible [08:03:20] will do in a few sec [08:03:49] no rush [08:04:12] looking for a volunteer to sanity check / +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/615410 should be straightforward [08:05:09] !log updated docker-report to 0.0.6-1 on deneb [08:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:38] godog: i'll take a look [08:05:54] kormat: thank you sir, appreciate it [08:06:29] (03CR) 10Kormat: [C: 03+1] webrequest: add rsync server to migrate data [puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [08:06:53] (03PS14) 10Gehel: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:07:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [08:07:19] (03CR) 10Elukey: webrequest: add rsync server to migrate data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [08:07:47] 
(03CR) 10Gehel: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1001/24042/" [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:09:51] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:57] that's me [08:10:09] ah ok I am building on it and I thought I broke it :P [08:12:19] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. JMeybohm I've introduced a new bug in docker-report, fixing asap - The acknowledgement expires at: 2020-07-26 08:10:37. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:46] !log Turnilo switched to CAS [08:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 and db1107', diff saved to https://phabricator.wikimedia.org/P12008 and previous config saved to /var/cache/conftool/dbconfig/20200722-081330-marostegui.json [08:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:51] jayme: ack, no prob. Out of curiosity, are you by any chance working on the garbage collection of obsolete docker images from debmonitor? [08:13:51] (03CR) 10Gehel: [C: 03+1] "As expected, NOOP for production servers: https://puppet-compiler.wmflabs.org/compiler1002/24045/" [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [08:14:54] volans: No. 
I tried to fix docker-report to not try to generate reports for non debian images [08:14:57] !log kormat@cumin1001 dbctl commit (dc=all): 'Increase es1020 to 75% pooled in es4, reduce es1021 to weight 25 T257284', diff saved to https://phabricator.wikimedia.org/P12009 and previous config saved to /var/cache/conftool/dbconfig/20200722-081457-kormat.json [08:15:02] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=scb.* [08:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:04] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [08:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:19] (03CR) 10Gehel: [C: 03+2] "NOOP for production hosts: https://puppet-compiler.wmflabs.org/compiler1003/24047/" [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:15:21] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [08:15:25] (03CR) 10Filippo Giunchedi: webrequest: add rsync server to migrate data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [08:15:29] (03PS7) 10Gehel: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:16:01] !log increase codfw mobileapps kubernetes traffic to 96% T218733. Take #2. 
Let's see if I can reproduce the weird increases in p99 latencies and figure out their cause [08:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:06] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [08:16:08] jayme: ah ok, got it (nerdsniping attempt failed) [08:16:23] (03PS10) 10Gehel: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [08:16:24] eheh, nice try anyways :-) [08:16:42] (03PS2) 10Filippo Giunchedi: webrequest: add rsync server to migrate data [puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) [08:16:45] we should plan to do it though, they are growing and growing [08:17:01] (03CR) 10Gehel: [C: 03+2] add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [08:17:14] second attempt? :P [08:18:29] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks all!" 
[puppet] - 10https://gerrit.wikimedia.org/r/615410 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [08:19:56] !log volans@cumin1001 START - Cookbook sre.dns.netbox [08:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:05] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-07-20-200559-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615150 (https://phabricator.wikimedia.org/T257674) (owner: 10KartikMistry) [08:20:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12010 and previous config saved to /var/cache/conftool/dbconfig/20200722-082023-marostegui.json [08:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:00] (03Merged) 10jenkins-bot: Update cxserver to 2020-07-20-200559-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615150 (https://phabricator.wikimedia.org/T257674) (owner: 10KartikMistry) [08:21:19] (03PS1) 10JMeybohm: Manually pull the image before creating a container [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) [08:21:21] (03PS1) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615412 [08:22:17] (03PS2) 10Gehel: airflow: Include refinery python dependencies [puppet] - 10https://gerrit.wikimedia.org/r/615240 (owner: 10Ebernhardson) [08:22:19] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[08:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 and db1107', diff saved to https://phabricator.wikimedia.org/P12012 and previous config saved to /var/cache/conftool/dbconfig/20200722-082309-marostegui.json [08:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:02] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:22] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would add a wrapper around pulling / creating the image, and return false if either fails. I'd prefer to have a false positive than repo" (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [08:25:27] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [08:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:16] (03CR) 10Gehel: [C: 03+2] airflow: Include refinery python dependencies [puppet] - 10https://gerrit.wikimedia.org/r/615240 (owner: 10Ebernhardson) [08:28:01] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[08:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:19] !log Updated cxserver to 2020-07-20-200559-production (T257674) [08:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:24] T257674: Create Moroccan Arabic Wikipedia - https://phabricator.wikimedia.org/T257674 [08:31:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12013 and previous config saved to /var/cache/conftool/dbconfig/20200722-083140-marostegui.json [08:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:59] (03PS1) 10Filippo Giunchedi: rsync: fix quickdatacopy sync script [puppet] - 10https://gerrit.wikimedia.org/r/615413 (https://phabricator.wikimedia.org/T254480) [08:32:10] (03CR) 10JMeybohm: "> Patch Set 1: Code-Review-1" (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [08:32:41] (03PS6) 10Gehel: [wdqs] drop updater mode config [puppet] - 10https://gerrit.wikimedia.org/r/602353 (owner: 10DCausse) [08:33:17] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/24049/centrallog1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615413 (https://phabricator.wikimedia.org/T254480) (owner: 10Filippo Giunchedi) [08:33:26] (03PS2) 10JMeybohm: Manually pull the image before creating a container [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) [08:33:29] (03PS2) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615412 [08:34:04] (03CR) 10Gehel: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24048/" [puppet] - 10https://gerrit.wikimedia.org/r/602353 (owner: 10DCausse) [08:35:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool 
db1084 and db1107', diff saved to https://phabricator.wikimedia.org/P12014 and previous config saved to /var/cache/conftool/dbconfig/20200722-083535-marostegui.json [08:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:03] 10Operations, 10Beta-Cluster-Infrastructure: deployment-puppetmaster04: git-sync-upstream is failing with a merge conflict since 2020-07-17T08:50:01Z - https://phabricator.wikimedia.org/T258451 (10jbond) @Krinkle I have manually resolved the conflicts and things look good to me, is someone able to confirm that t... [08:37:58] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [08:38:32] (03PS25) 10Gehel: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [08:38:36] (03PS1) 10Muehlenhoff: xhgui: Switch to extra_pkgs [puppet] - 10https://gerrit.wikimedia.org/r/615414 [08:39:20] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/615413 (https://phabricator.wikimedia.org/T254480) (owner: 10Filippo Giunchedi) [08:39:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12015 and previous config saved to /var/cache/conftool/dbconfig/20200722-083926-marostegui.json [08:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:39] 10Operations, 10DBA, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat) [08:39:41] (03PS26) 10Gehel: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [08:40:19] (03CR) 10Filippo Giunchedi: "LGTM overall! 
I think though 'statsv' as an instance name is too specific, maybe sth like 'xt' for "external" as we might want to pull oth" [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [08:40:24] (03CR) 10Gehel: [C: 03+1] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/24050/" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [08:41:23] (03PS3) 10Gehel: [wdqs] bump vocabulary and inline URI handler version [puppet] - 10https://gerrit.wikimedia.org/r/605536 (https://phabricator.wikimedia.org/T255399) (owner: 10DCausse) [08:41:59] !log kormat@cumin1001 dbctl commit (dc=all): 'Increase es1020 to 100% pooled in es4, reduce es1021 to weight 0 T257284', diff saved to https://phabricator.wikimedia.org/P12016 and previous config saved to /var/cache/conftool/dbconfig/20200722-084159-kormat.json [08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:05] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [08:42:45] (03CR) 10Filippo Giunchedi: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) (owner: 10Filippo Giunchedi) [08:43:28] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: add logs retention [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) (owner: 10Filippo Giunchedi) [08:43:59] (03PS3) 10JMeybohm: Manually pull the image before creating a container [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) [08:44:01] (03PS3) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615412 [08:44:03] (03CR) 10Gehel: [C: 03+2] [wdqs] bump vocabulary and inline URI handler version [puppet] - 10https://gerrit.wikimedia.org/r/605536 (https://phabricator.wikimedia.org/T255399) (owner: 
10DCausse) [08:44:06] (03CR) 10Filippo Giunchedi: [C: 03+2] rsync: fix quickdatacopy sync script [puppet] - 10https://gerrit.wikimedia.org/r/615413 (https://phabricator.wikimedia.org/T254480) (owner: 10Filippo Giunchedi) [08:45:10] godog: can I merge your puppet change with mine? [08:45:19] gehel: yes please! thank you [08:45:34] done [08:46:04] (03PS2) 10Vgutierrez: ATS: Add missing PIDFile for non-default instances [puppet] - 10https://gerrit.wikimedia.org/r/567009 [08:46:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1126', diff saved to https://phabricator.wikimedia.org/P12017 and previous config saved to /var/cache/conftool/dbconfig/20200722-084613-marostegui.json [08:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:40] (03PS3) 10Vgutierrez: ATS: Add missing PIDFile for non-default instances [puppet] - 10https://gerrit.wikimedia.org/r/567009 [08:48:15] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:48:19] 10Operations, 10DBA, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat) [08:48:23] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Set up replication for zarcillo - https://phabricator.wikimedia.org/T257816 (10Kormat) 05Open→03Resolved Monitoring is not properly in place, but going to track that in T258566. 
[08:48:56] !log restarting blazegraph on wdqs1010 (testing new vocab) [08:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:11] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:50:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/567009 (owner: 10Vgutierrez) [08:51:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Manually pull the image before creating a container [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [08:53:02] (03PS3) 10Filippo Giunchedi: icinga: add logs retention [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) [08:53:29] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: add logs retention [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) (owner: 10Filippo Giunchedi) [08:53:30] (03CR) 10Jbond: [C: 03+2] labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 (owner: 10Jbond) [08:54:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:55:03] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 44 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:57:48] (03CR) 10JMeybohm: [C: 03+2] Manually pull the image before creating a 
container [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [08:57:59] (03PS1) 10Jbond: Revert "labs - hiera: migrate to hiera version5" [puppet] - 10https://gerrit.wikimedia.org/r/615205 [08:58:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "labs - hiera: migrate to hiera version5" [puppet] - 10https://gerrit.wikimedia.org/r/615205 (owner: 10Jbond) [08:59:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "labs - hiera: migrate to hiera version5" [puppet] - 10https://gerrit.wikimedia.org/r/615205 (owner: 10Jbond) [08:59:15] (03Merged) 10jenkins-bot: Manually pull the image before creating a container [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615411 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [08:59:53] (03CR) 10Muehlenhoff: [C: 03+2] Configure yarn/testcluster for LDAP auth only [puppet] - 10https://gerrit.wikimedia.org/r/615225 (owner: 10Muehlenhoff) [09:00:05] (03PS1) 10Jbond: labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/615426 [09:00:55] (03CR) 10jerkins-bot: [V: 04-1] labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/615426 (owner: 10Jbond) [09:01:17] (03PS2) 10Jbond: labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/615426 [09:02:09] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:02:58] (03PS4) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615412 [09:04:37] (03CR) 10JMeybohm: [C: 03+2] New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615412 (owner: 10JMeybohm) [09:05:01] RECOVERY - rsyslog in eqiad is failing to deliver messages on 
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:05:40] (03Merged) 10jenkins-bot: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/615412 (owner: 10JMeybohm) [09:06:37] !log restarting blazegraph on all wdqs nodes - new vocabulary [09:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:50] !log import docker-report 0.0.7-1 to buster-wikimedia [09:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:32] !log updated docker-report to 0.0.7-1 on deneb [09:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:51] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:11:17] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [09:13:03] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:13:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:16:47] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) Hi @Jclark-ctr, if the replacement can be done with no downtime, go for it. If downtime is required let me know when you'll be doing the replacement and I'll take the system down. Just for refe... [09:20:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:20:37] (03PS1) 10Alexandros Kosiaris: mobileapps: Bump memory limit by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/615416 (https://phabricator.wikimedia.org/T218733) [09:21:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Bump memory limit by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/615416 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [09:21:48] (03PS1) 10Filippo Giunchedi: wikimedia.org: lower librenms TTL in preparation for failover [dns] - 10https://gerrit.wikimedia.org/r/615417 (https://phabricator.wikimedia.org/T247967) [09:22:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:22:25] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [09:22:40] (03Merged) 10jenkins-bot: mobileapps: Bump memory limit by 25% [deployment-charts] - 10https://gerrit.wikimedia.org/r/615416 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [09:22:48] (03CR) 10Jbond: [C: 03+2] labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/615426 (owner: 10Jbond) [09:23:28] (03PS2) 10Filippo Giunchedi: wikimedia.org: lower librenms/smokeping TTL in preparation for failover [dns] - 10https://gerrit.wikimedia.org/r/615417 (https://phabricator.wikimedia.org/T247967) [09:24:11] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [09:25:25] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [09:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:48] !log bump memory limits for mobileapps by 25% T218733 [09:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:53] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:25:56] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/615159 (owner: 10Jbond) [09:26:26] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [09:26:48] 10Operations, 10serviceops, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) 05Open→03Resolved ` Jul 22 09:15:57 deneb docker-report-releng[3273]: INFO[docker-report] Building debmonitor report for docker-registry.wikimedia.org/rel... [09:27:15] !log akosiaris@deploy2001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[09:27:15] !log akosiaris@deploy2001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [09:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, the same was done for the idp CNAME and it was very helpful." [dns] - 10https://gerrit.wikimedia.org/r/615417 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [09:28:41] (03CR) 10JMeybohm: [C: 03+2] blubberoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615245 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:29:35] (03Merged) 10jenkins-bot: blubberoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615245 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:29:43] 10Operations, 10DBA, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Marostegui) > Do we need to cover the case where db1115 is the active tendril node, but db2093 is the active zarcillo one? If so i'm not sure we can easily use mariadb::monitor_readonly... [09:33:15] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [09:34:03] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [09:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:26] 10Operations, 10serviceops, 10Patch-For-Review: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (10JMeybohm) K8s sidecar(s) have one more: ` [2020-07-22 09:34:36.435][1][warning][misc] [source/common/protobuf/utility.cc:198] Using deprecated option 'envoy.api.v2... 
[09:40:30] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=mobileapps,name=scb.* [09:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:45] !log increase codfw mobileapps kubernetes traffic to 100% T218733 [09:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:53] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:40:56] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [09:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:06] Woo. [09:41:35] after 2 setbacks, this time I think we got it [09:41:49] * James_F crosses his fingers. [09:43:14] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [09:43:15] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [09:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:37] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks Moritz!" [dns] - 10https://gerrit.wikimedia.org/r/615417 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [09:43:42] (03CR) 10Jbond: [C: 03+1] "LGTM but note netmon2001 is not in a fit state to fail over currently." [dns] - 10https://gerrit.wikimedia.org/r/615417 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [09:43:46] (03PS3) 10Filippo Giunchedi: wikimedia.org: lower librenms/smokeping TTL in preparation for failover [dns] - 10https://gerrit.wikimedia.org/r/615417 (https://phabricator.wikimedia.org/T247967) [09:43:52] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[09:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:45:48] dammit [09:45:53] spoke too soon [09:46:00] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=mobileapps,name=scb.* [09:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:31] !log codfw mobileapps kubernetes traffic back to 96% T218733 again. scb pooled again. [09:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:49] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:46:49] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:47:14] it amazes me that that last 4% managed to cause an issue. 
[09:48:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_mobileapps_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:40] (03CR) 10JMeybohm: [C: 03+2] citoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615246 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:50:05] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [09:50:21] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:50:43] (03Merged) 10jenkins-bot: citoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615246 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [09:51:24] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[09:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:53] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [09:52:58] !log centrallog1001 lvextend /srv by 130G [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:11] (03PS4) 10Vgutierrez: ATS: Add missing PIDFile for non-default instances [puppet] - 10https://gerrit.wikimedia.org/r/567009 [09:54:20] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [09:54:53] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:55:19] !log akosiaris@deploy2001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [09:55:19] !log akosiaris@deploy2001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [09:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:37] !log bump memory in codfw mobileapps another 20% T218733 [09:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:42] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:57:00] (03PS2) 10Effie Mouzeli: Add k8s dummy tokens for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/613101 (https://phabricator.wikimedia.org/T256973) [09:57:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:58:47] !log Deploy MCR schema change on s4 codfw master (lag will appear on codfw) - T238966 [09:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [09:59:32] (03PS2) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [10:00:28] (03CR) 10Effie Mouzeli: [C: 03+2] Add k8s dummy tokens for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/613101 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:01:20] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:01:22] (03PS1) 10Filippo Giunchedi: profile: sync weblog data to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/615420 (https://phabricator.wikimedia.org/T247968) [10:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:51] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] Add k8s dummy tokens for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/613101 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:02:48] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [10:03:47] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: sync weblog data to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/615420 (https://phabricator.wikimedia.org/T247968) (owner: 10Filippo Giunchedi) [10:04:30] effie: I puppet-merged your labs/private change too btw [10:04:54] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' 
command on namespace 'citoid' for release 'production' . [10:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:22] (03PS3) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [10:07:48] godog: tx! [10:08:01] I was just looking which poor soul would do so [10:08:36] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=mobileapps,name=scb.* [10:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:58] (03PS2) 10Effie Mouzeli: Kubernetes: Create token stanzas for push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) [10:11:39] 10Operations, 10DBA, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat) p:05Triage→03Medium a:03Kormat [10:12:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_mobileapps_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:12:43] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=mobileapps,name=scb.* [10:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:23] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:15:37] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:15:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:17:57] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:18:34] (03PS2) 10JMeybohm: cxserver: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615247 (https://phabricator.wikimedia.org/T256843) [10:19:05] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24051/deploy1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [10:19:11] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:19:26] !log akosiaris@deploy2001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [10:19:26] !log akosiaris@deploy2001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . 
[10:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:41] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:19:48] (03CR) 10JMeybohm: [C: 03+2] cxserver: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615247 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [10:20:25] (03PS15) 10Effie Mouzeli: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [10:20:47] (03CR) 10Effie Mouzeli: charts for push-notification service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [10:20:48] (03Merged) 10jenkins-bot: cxserver: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615247 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [10:20:49] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:21:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:22:46] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [10:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:24:52] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:24:54] !log upload prometheus-swagger-exporter_0.3-1+deb10u1 to apt1001 buster repo [10:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:33] (03PS1) 10Muehlenhoff: profile::idp::client::httpd: Default priority to 50 [puppet] - 10https://gerrit.wikimedia.org/r/615422 [10:37:58] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Joe) I will just acknowledge the alert given you are tracking the situation. Btw if I remember correctly, those log files are used to regenerate svgs via a script... [10:39:37] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:56] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:40:20] again? [10:40:35] akosiaris: related to what was discussed few minutes ago? [10:41:07] pybal on lvs2009 seems to be happy [10:41:49] weird ... 
[10:43:16] of course the icinga check and pybal proxyfetch test hits different urls (both against api.php but different params) [10:43:30] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24704 bytes in 0.487 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:43:59] api queries on codfw used to take a long time, but not sure if just because of coldness [10:44:11] a few seconds ago, not anymore [10:44:37] this is what's being checked by icinga: en.wikipedia.org!/w/api.php?action=query&meta=siteinfo [10:45:30] pybal just checks /w/api.php [10:45:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:46:12] I don't have performance metrics because all good metrics are of actual user requests, not codfw [10:46:15] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-30m&to=now doesn't point to anything. I mean sure we have a drop of 10% in rps and latency increases, but we are talking about 40rps, so inconsequential [10:46:59] could it be the monitoring ? [10:47:14] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24052/" [puppet] - 10https://gerrit.wikimedia.org/r/615422 (owner: 10Muehlenhoff) [10:48:48] (03PS1) 10Volans: GC: add time-based GC for Image objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 [10:49:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:50:04] akosiaris: you thinking icinga or pybal? [10:50:45] nothing wrong with pybal [10:50:59] (03PS2) 10Jbond: profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) [10:51:06] <_joe_> so let's see what's wrong [10:51:47] <_joe_> the problem is that apparently the query we make to apis is taking longer than 10 seconds [10:51:54] (03PS1) 10Ema: ATS: force cache revalidation on a few selected wikis [puppet] - 10https://gerrit.wikimedia.org/r/615446 (https://phabricator.wikimedia.org/T256750) [10:52:06] _joe_: that is consistent with my anecdotal test [10:52:11] (03CR) 10jerkins-bot: [V: 04-1] profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [10:52:35] (03PS3) 10Jbond: profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) [10:52:56] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 disagrees [10:53:14] <_joe_> it can be that one specific appserver responded slowly around that time [10:53:20] I was about to say [10:53:34] <_joe_> but that's strange given in theory this goes round-robin [10:53:37] and we don't notice it because lack of traffic [10:53:44] (03CR) 10jerkins-bot: [V: 04-1] profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [10:53:51] and only rarely gets hit by health check? 
[10:53:58] (03PS4) 10Jbond: profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) [10:54:31] maybe we can do a manual health check on all codfw api app servers and check latency? [10:54:40] (03Abandoned) 10Jbond: profile::thanos::frontend: only support SSO on thanos [puppet] - 10https://gerrit.wikimedia.org/r/615216 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [10:54:47] the same one pybal does [10:55:26] <_joe_> not the one that pybal does [10:55:30] <_joe_> the one that icinga does [10:55:34] (03PS4) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [10:55:40] <_joe_> but I'll rather look at logs [10:57:15] (03CR) 10Vgutierrez: [C: 03+1] ATS: force cache revalidation on a few selected wikis [puppet] - 10https://gerrit.wikimedia.org/r/615446 (https://phabricator.wikimedia.org/T256750) (owner: 10Ema) [10:57:28] (03PS1) 10Muehlenhoff: Remove obsolete references to Yarn on hadoop::ui role [puppet] - 10https://gerrit.wikimedia.org/r/615447 (https://phabricator.wikimedia.org/T258152) [10:58:35] <_joe_> can someone look at the icinga event log for the first occurrence of the error? [10:58:40] <_joe_> I need the timestamp [10:59:28] 10:39 on IRC [10:59:39] I will check icinga [11:00:00] I think he wants the first error, icinga requires 3 before paging [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T1100). [11:00:04] jan_drewniak and jan_drewniak: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:11] \o/ [11:00:15] 10:36:11 [11:00:17] _joe_: should we delay the window? [11:00:28] <_joe_> Urbanecm: no reason to [11:00:32] unless I am mistaken to get soft [11:00:50] _joe_: okay, thanks, just seeing the conversation above and wasn't sure if it affects deployments [11:00:51] heh, apparently jouncebot doesn’t dedupe the ircnicks [11:01:15] jan_drewniak: hey, around? :) [11:01:18] o/ [11:01:22] <_joe_> we can move that discussion to #sre [11:01:24] Lucas_WMDE: you know where to file bugs :D [11:01:36] jan_drewniak: as discussed yesterday, wanna try yourself? [11:02:12] yes, also ema needs a heads-up before this deploy, there's some caching stuff that needs to happen after [11:02:29] jan_drewniak: hi, yeah I'm around :) [11:02:58] jan_drewniak: okay, cool! https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers are the docs, if you have any questions, ask :) [11:05:47] (03PS10) 10Jdrewniak: Enable instrumentation for wikis in the desktop improvements testing group (round 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T258058) (owner: 10Jdlrobson) [11:08:08] (03CR) 10Jdrewniak: [C: 03+2] Enable instrumentation for wikis in the desktop improvements testing group (round 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T258058) (owner: 10Jdlrobson) [11:08:33] (03PS3) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [11:08:50] (03PS2) 10JMeybohm: eventgate-analytics-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615248 (https://phabricator.wikimedia.org/T256843) [11:09:00] (03Merged) 10jenkins-bot: Enable instrumentation for wikis in the desktop improvements testing group (round 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T258058) (owner: 
10Jdlrobson) [11:10:13] (03PS2) 10Muehlenhoff: rename: rename source package to wmf-laptop [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/615224 (owner: 10Jbond) [11:10:49] (03PS2) 10Ema: ATS: force cache revalidation on a few selected wikis [puppet] - 10https://gerrit.wikimedia.org/r/615446 (https://phabricator.wikimedia.org/T256750) [11:11:02] (03CR) 10JMeybohm: [C: 03+2] eventgate-analytics-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615248 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [11:11:06] Urbanecm: hey so for a patch were I have a lot different folders to sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/614888 do I run sync-file on the whole [11:11:06] wmf-config folder? or run it multiple times for each directory? [11:11:17] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "Thanks, merging!" (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/615224 (owner: 10Jbond) [11:11:45] jan_drewniak: you first need to sync the .dblist file, and then IS.php. There is no need to sync the .yaml files IIRC, they aren't loaded by production [11:12:00] (03Merged) 10jenkins-bot: eventgate-analytics-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615248 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [11:13:11] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [11:13:11] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [11:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:36] Urbanecm: so first `scap sync-file wmf-config/config/dblists/desktop-improvements.dblist ` and then... [11:14:13] jan_drewniak: no, the path is wrong. 
It's from /srv/mediawiki-staging, so scap sync-file dblists/desktop-improvements.dblist [11:14:31] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 55 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:15:17] Urbanecm: right, sorry, and after that file is synced, then I sync the MWConfigCacheGenerator.php file? [11:16:06] yup [11:16:43] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [11:18:37] !log jdrewniak@deploy1001 Synchronized dblists/desktop-improvements.dblist: Config: [[gerrit:614888|Enable instrumentation for wikis in the desktop improvements testing group (T254228)]] (duration: 01m 18s) [11:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:43] T254228: Deploy new version of vector skin to all wikis as a user preference - https://phabricator.wikimedia.org/T254228 [11:20:03] !log jdrewniak@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:614888|Enable instrumentation for wikis in the desktop improvements testing group (T254228)]] (duration: 01m 05s) [11:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:21] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:20:21] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [11:22:07] Urbanecm: `Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.06, After: 2.00, Threshold: 1.00)` Is that something I should worry about? [11:22:24] jan_drewniak: there should be a logstash link to check [11:22:34] it may be a background noise, it may be something related to what you synced [11:22:43] aha https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b ok I'll check = [11:24:13] Doesn't look related to any of the wikis involved in the sync [11:24:43] yeah, looks like known errors [11:24:45] let's continue :) [11:25:39] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [11:25:44] Urbanecm: so for individual files like wmf-config/config/euwiki.yaml can I sync the entire wmf-config/config folder? or do I do those individually too? [11:26:04] jan_drewniak: you can sync just the folder [11:27:45] ok, I'll do that and then the InitialiseSettings.php file [11:28:08] yup [11:28:20] !log jdrewniak@deploy1001 Synchronized wmf-config/config: Config: [[gerrit:614888|Enable instrumentation for wikis in the desktop improvements testing group (T254228)]] (duration: 01m 05s) [11:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:25] T254228: Deploy new version of vector skin to all wikis as a user preference - https://phabricator.wikimedia.org/T254228 [11:30:10] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[11:30:10] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:25] !log jdrewniak@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:614888|Enable instrumentation for wikis in the desktop improvements testing group (T254228)]] (duration: 01m 04s) [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:55] Urbanecm: phew, ok all files synced! was that a deploy? [11:31:11] yeah, why shouldn't it be? :) [11:31:18] congrats :) [11:32:18] woohoo! Just wanted to make sure that was it :) The next one will be pretty simple I guess, just one file. [11:32:53] cool! [11:33:21] (03PS8) 10Jdrewniak: Enable desktop improvements by default for testing group (round 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614889 (owner: 10Jdlrobson) [11:34:38] (03CR) 10Jdrewniak: [C: 03+2] Enable desktop improvements by default for testing group (round 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614889 (owner: 10Jdlrobson) [11:35:29] (03Merged) 10jenkins-bot: Enable desktop improvements by default for testing group (round 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614889 (owner: 10Jdlrobson) [11:37:01] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [11:37:59] Urbanecm: one more thing, I'm getting an error mwdebug1002 `IOError: [Errno 2] Invalid branch directory: u'/srv/mediawiki-staging/php-1.35.0-wmf.41'` [11:38:26] jan_drewniak: what did you do to get that error? [11:38:43] just `jdrewniak@mwdebug1002:~$ scap sync`? 
[11:39:04] ah, to test changes at mwdebug, the command is `scap pull` [11:39:24] doy! thanks [11:39:46] np [11:42:00] !log jdrewniak@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:614889|Enable desktop improvements by default for testing group (round 1) (T254227)]] (duration: 01m 05s) [11:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:06] T254227: Switch test wikis to new version of vector by default - https://phabricator.wikimedia.org/T254227 [11:43:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615422 (owner: 10Muehlenhoff) [11:45:07] jan_drewniak: I guess this is done too? 🙂 [11:45:35] Urbanecm: yes! thank you so much for your help today! [11:45:41] happy to help! [11:45:41] (03PS1) 10Marostegui: dbproxy1019: Reduce labsdb1009 weight [puppet] - 10https://gerrit.wikimedia.org/r/615456 [11:45:56] in that case, I'll deploy an update of interwiki cache :) [11:46:07] (03CR) 10Ema: [C: 03+2] ATS: force cache revalidation on a few selected wikis [puppet] - 10https://gerrit.wikimedia.org/r/615446 (https://phabricator.wikimedia.org/T256750) (owner: 10Ema) [11:46:14] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Reduce labsdb1009 weight [puppet] - 10https://gerrit.wikimedia.org/r/615456 (owner: 10Marostegui) [11:46:16] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615457 [11:46:18] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615457 (owner: 10Urbanecm) [11:46:37] ema: ok to merge your change? [11:47:01] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615457 (owner: 10Urbanecm) [11:47:12] marostegui: yes please [11:47:16] ok, merging! 
[11:47:37] ema: merged [11:48:22] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 15s) [11:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:23] !log A:cp-text force puppet run to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/615446 T256750 [11:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:28] T256750: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 [11:50:36] (03CR) 10JMeybohm: [C: 04-1] ratelimit: add new docker image (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [11:50:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:52:24] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=mobileapps,name=scb.* [11:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:34] !log EU B&C window done [11:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:15] (03PS1) 10Muehlenhoff: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/615459 [11:54:01] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [11:54:01] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[11:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:25] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=mobileapps,name=scb.* [11:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:08] (03CR) 10Addshore: [C: 03+1] [sdoc] fix entity source base URIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) (owner: 10DCausse) [11:55:26] <_joe_> the error spike is from parsoid [11:55:47] ok, looks like memory solved, it's CPU now that was an issue. interestingly adding another 4% traffic to mobileapps@kubernetes makes the CPU usage skyrocket [11:56:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:56:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] Create namespaces/calico rules for push-notifications (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/613097 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [11:56:52] !log A:cp-text varnish ban euwiki T256750 [11:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:58] T256750: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 [11:59:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/data/css/mobile/pagelib (Get CSS bundle from wikimedia-page-library) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/de [11:59:15] (retrieve en-wiktionary definitions 
for cat) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [11:59:25] those are expected ^ ignore them [11:59:29] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_mobileapps_cluster_codfw,swagger_check_restbase_ulsfo} site={codfw,eqiad,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:59:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:01:05] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24053/" [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff) [12:01:44] !log A:cp-text varnish ban frwiktionary T256750 [12:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:03:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:04:28] (03PS1) 10Muehlenhoff: Turnilo: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/615461 [12:04:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:05:37] !log A:cp-text varnish ban ptwikiversity T256750 [12:05:41] (03PS2) 10JMeybohm: eventgate-analytics: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615249 (https://phabricator.wikimedia.org/T256843) [12:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:42] T256750: CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 [12:05:43] (03PS1) 10Muehlenhoff: Remove now obsolete Kibana CAS config [puppet] - 10https://gerrit.wikimedia.org/r/615462 [12:05:51] (03PS2) 10JMeybohm: eventgate-logging-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615250 (https://phabricator.wikimedia.org/T256843) [12:05:58] (03PS2) 10JMeybohm: eventgate-main: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615251 (https://phabricator.wikimedia.org/T256843) [12:11:57] (03PS2) 10Effie Mouzeli: Create namespaces rules for push-notifications [deployment-charts] - 10https://gerrit.wikimedia.org/r/613097 (https://phabricator.wikimedia.org/T256973) [12:14:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:14:15] (03CR) 10Effie Mouzeli: [C: 03+2] Kubernetes: Create token stanzas for push-notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/613104 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:15:31] (03CR) 10JMeybohm: [C: 03+2] eventgate-analytics: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615249 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [12:16:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:16:16] (03Merged) 10jenkins-bot: eventgate-analytics: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615249 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [12:16:34] 10Operations, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment - https://phabricator.wikimedia.org/T256750 (10ema) euwiki, frwiktionary, and ptwikiversity done today. All good. [12:17:26] (03CR) 10Effie Mouzeli: Create namespaces rules for push-notifications (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/613097 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:17:30] !log akosiaris@cumin1001 conftool action : set/weight=0; selector: dc=codfw,service=mobileapps,name=scb.* [12:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:36] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:51] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[12:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:17] (03PS1) 10Urbanecm: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615464 [12:22:19] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615464 (owner: 10Urbanecm) [12:22:21] (03CR) 10Effie Mouzeli: [C: 03+2] Create namespaces rules for push-notifications [deployment-charts] - 10https://gerrit.wikimedia.org/r/613097 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [12:23:00] (03Merged) 10jenkins-bot: Update interwiki cache for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615464 (owner: 10Urbanecm) [12:23:21] (03PS3) 10Effie Mouzeli: Create namespaces rules for push-notifications [deployment-charts] - 10https://gerrit.wikimedia.org/r/613097 (https://phabricator.wikimedia.org/T256973) [12:25:43] (03CR) 10Ema: [C: 03+1] ATS: Add missing PIDFile for non-default instances [puppet] - 10https://gerrit.wikimedia.org/r/567009 (owner: 10Vgutierrez) [12:26:28] (03PS9) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) [12:28:29] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=mobileapps,name=scb.* [12:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:47] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:37] (03PS1) 10Kormat: mariadb: Use defined type in profile::mariadb::mysql_role [puppet] - 10https://gerrit.wikimedia.org/r/615465 [12:36:39] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=scb.* [12:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:39] (03CR) 10Filippo Giunchedi: "Haven't checked PCC but LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:41:11] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [12:42:57] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [12:43:03] (03CR) 10Muehlenhoff: profile::thanos::frontend: Add SSO (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:46:42] (03PS5) 10Jbond: profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) [12:46:55] (03CR) 10Jbond: "updated thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:47:11] (03Abandoned) 10Jbond: profile::thanos::frontend: enable SSO on thanos-fe2003 [puppet] - 10https://gerrit.wikimedia.org/r/615214 (owner: 10Jbond) [12:47:26] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10CDanis) Do you think you'd have time to attempt this in the next few days? 
The eqiad anchor being down has some impact on our monitoring, and we have a RIPE engineer waiting o... [12:47:31] (03Abandoned) 10Jbond: profile::thanos::frontend: enable sso for all thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/615215 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [12:47:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:49:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ayounsi) There has been some confusions and some informal IRC discussions about how best t... [12:49:27] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:52:06] mobileapps is going to complain a bit again [12:52:18] but I think I am getting to the source of all of this [12:52:36] _joe_: is restbase talking to mobileapps via envoy, right? 
[12:53:15] <_joe_> yes [12:53:41] <_joe_> via envoy on restbase [12:54:18] ok, that explains the exorbitant amount of requests I see on scb2* instead of kubernetes [12:54:34] and why everything breaks down when I remove them [12:54:55] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:54:57] turns out, I thought I had moved to kubernetes ~97%, but I had not [12:55:01] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:55:05] cause envoy persistent connection! [12:55:21] despite their very low weight, scb hosts end up gathering the majority of connections [12:55:23] ha [12:55:54] I wonder how much I have really moved over [12:55:54] <_joe_> akosiaris: oh right, rebalancing isn't working? [12:56:06] <_joe_> you should try to just depool them one by one [12:56:10] nope, that functionality is dead in the water now [12:56:16] <_joe_> akosiaris: is it codfw? [12:56:21] yeah, I just did that and that's how I figured it out [12:56:22] <_joe_> what functionality is? 
[12:56:37] I am not witnessing the slow accumulation of active connections on scb hosts [12:56:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:56:57] the reason of course is that the kubernetes installation can't handle the load, but that's the actually fixable part [12:57:58] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10Cmjohnson) @CDanis I am going to need an adapter to connect to the scs. See the attached image {F31944423} [12:58:00] <_joe_> akosiaris: I would add more pods rather than adding resources to the single pods tbh [12:58:15] _joe_: yeah that's what I am about to do [12:58:31] I am trying to calculate how many [12:58:52] PROBLEM - SSH on analytics1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:58:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:58:53] at least now I got some numbers [12:58:53] (03PS1) 10Filippo Giunchedi: smokeping: don't sync data between hosts [puppet] - 10https://gerrit.wikimedia.org/r/615473 (https://phabricator.wikimedia.org/T258491) [12:58:53] (03PS1) 10Filippo Giunchedi: librenms: add passive server for rsync server [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T258491) [12:58:53] (03PS1) 10Filippo Giunchedi: install_server: reinstall netmon2001 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/615475 (https://phabricator.wikimedia.org/T258491) [13:00:04] longma and liw: Your horoscope predicts another unfortunate Mediawiki train - American+European Version (secondary timeslot) deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T1300). [13:00:11] (03PS1) 10Kormat: mariadb::monitor::prometheus: Remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) [13:00:17] RECOVERY - SSH on analytics1077 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:00:20] <_joe_> akosiaris: I run watch -n 5 ipvsadm -Lt mobileapps.svc.codfw.wmnet:8888 to monitor connections btw [13:00:35] _joe_: that makes 2 of us [13:01:01] <_joe_> akosiaris: can I try to depool one scb host? [13:01:23] _joe_: but you can also witness it here https://w.wiki/Xak [13:01:26] <_joe_> I don't get why connections to k8s are not persisted though [13:01:52] _joe_: cause of this https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=94&fullscreen&orgId=1&from=now-15m&to=now&refresh=1m [13:02:10] <_joe_> ok so we need more pods? [13:02:14] essentially they are closed because they fail [13:02:15] <_joe_> or more cpu? [13:02:24] <_joe_> how many pods do you have running? [13:02:28] and after a while they end up on scbs where they are persisted [13:02:33] very few it seems (12?) [13:02:50] I am helping effie with push-notifications and I'll get to calculate how many we need [13:03:04] 12 was clearly very very optimistic [13:04:30] (03CR) 10Kormat: "Adding arturo for the WMCS files."
[puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat) [13:05:06] (03CR) 10Vgutierrez: [C: 03+2] ATS: Add missing PIDFile for non-default instances [puppet] - 10https://gerrit.wikimedia.org/r/567009 (owner: 10Vgutierrez) [13:05:57] PROBLEM - SSH on an-worker1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:07:24] (03CR) 10Elukey: [C: 03+1] Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff) [13:07:39] RECOVERY - SSH on an-worker1095 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:08:11] (03PS2) 10Filippo Giunchedi: smokeping: don't sync data between hosts [puppet] - 10https://gerrit.wikimedia.org/r/615473 (https://phabricator.wikimedia.org/T247967) [13:08:13] (03PS2) 10Filippo Giunchedi: librenms: add passive server for rsync server [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) [13:08:15] (03PS2) 10Filippo Giunchedi: install_server: reinstall netmon2001 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/615475 (https://phabricator.wikimedia.org/T247967) [13:08:17] (03CR) 10Elukey: Modernise Apache config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff) [13:08:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:09:01] (03CR) 10Elukey: [C: 03+1] Remove obsolete references to Yarn on hadoop::ui role [puppet] - 10https://gerrit.wikimedia.org/r/615447 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [13:12:07] RECOVERY - MediaWiki exceptions and fatals per 
minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:32] (03CR) 10JMeybohm: chartmuseum: Add systemd timer to package and push charts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:13:44] (03PS1) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:13:51] (03CR) 10JMeybohm: [C: 03+2] eventgate-logging-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615250 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:14:08] (03CR) 10jerkins-bot: [V: 04-1] thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [13:14:55] (03Merged) 10jenkins-bot: eventgate-logging-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615250 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:15:36] (03PS2) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:16:01] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:16:01] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[13:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:30] (03PS1) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:17:54] (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1002/24054/" [puppet] - 10https://gerrit.wikimedia.org/r/615465 (owner: 10Kormat) [13:18:00] (03PS1) 10Vgutierrez: ATS: Move from /var/run/trafficserver to /run/trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/615480 [13:18:03] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:18:03] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:18:06] (03PS6) 10Jbond: profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/615480 (owner: 10Vgutierrez) [13:18:29] (03CR) 10jerkins-bot: [V: 04-1] ATS: Move from /var/run/trafficserver to /run/trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/615480 (owner: 10Vgutierrez) [13:18:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [13:19:18] (03PS2) 10Vgutierrez: ATS: Move from /var/run/trafficserver to /run/trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/615480 [13:19:56] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for 
release 'production' . [13:19:56] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:39] (03CR) 10Jbond: [C: 03+2] profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [13:20:50] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/615480 (owner: 10Vgutierrez) [13:22:02] (03CR) 10JMeybohm: [C: 03+2] eventgate-main: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615251 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:22:38] (03PS3) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:22:58] (03Merged) 10jenkins-bot: eventgate-main: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615251 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [13:25:35] (03PS4) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [13:25:38] (03PS4) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:26:15] (03PS1) 10Alexandros Kosiaris: mobileapps: Bump 20x the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/615484 (https://phabricator.wikimedia.org/T218733) [13:26:56] (03PS5) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:27:24] (03PS2) 10Kormat: 
mariadb::monitor::prometheus: Remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) [13:27:40] _joe_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/615484 [13:27:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615462 (owner: 10Muehlenhoff) [13:28:22] _joe_: reasoning at https://w.wiki/Xaq [13:30:38] (03CR) 10Kormat: "PCC run: https://puppet-compiler.wmflabs.org/compiler1002/24062/" [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat) [13:30:59] (03CR) 10Jbond: librenms: add passive server for rsync server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:31:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "This is going to be fun" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615484 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [13:31:53] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:32:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615473 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:32:26] (03CR) 10JMeybohm: [C: 04-1] GC: add time-based GC for Image objects (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [13:32:28] (03Merged) 10jenkins-bot: mobileapps: Bump 20x the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/615484 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [13:32:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615475 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:33:58] !log akosiaris@deploy1001 helmfile 
[CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [13:33:58] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [13:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] (03PS2) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:17] (03CR) 10Hnowlan: ratelimit: add new docker image (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [13:34:29] wow, that a lot akosiaris :-o [13:34:42] 10Operations, 10ops-codfw: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (10jcrespo) [13:35:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [13:36:36] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:36:36] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[13:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:12] (03PS5) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [13:37:42] (03PS3) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:39:02] (03CR) 10Volans: "REply inline, CR will follow" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [13:39:05] (03PS4) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:42:00] (03PS1) 10Volans: templates: add support for private templates [software/homer] - 10https://gerrit.wikimedia.org/r/615488 [13:42:03] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: reinstall netmon2001 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/615475 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [13:42:08] (03PS3) 10Filippo Giunchedi: install_server: reinstall netmon2001 with Buster [puppet] - 10https://gerrit.wikimedia.org/r/615475 (https://phabricator.wikimedia.org/T247967) [13:43:56] (03PS6) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:44:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615465 (owner: 10Kormat) [13:45:49] (03PS7) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:46:21] (03CR) 10Kormat: [C: 03+2] mariadb: Use defined type in profile::mariadb::mysql_role [puppet] - 10https://gerrit.wikimedia.org/r/615465 (owner: 
10Kormat) [13:46:28] jbond42: thanks :) [13:47:23] (03CR) 10JMeybohm: [C: 04-1] ratelimit: add new docker image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [13:47:29] (03CR) 10Volans: "Re-correcting myself 😊" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [13:49:00] (03PS1) 10Alexandros Kosiaris: mobileapps: Bump quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/615494 (https://phabricator.wikimedia.org/T218733) [13:49:58] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:49:58] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] mobileapps: Bump quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/615494 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [13:50:40] (03PS3) 10Kormat: mariadb::monitor::prometheus: Remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) [13:51:23] (03Merged) 10jenkins-bot: mobileapps: Bump quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/615494 (https://phabricator.wikimedia.org/T218733) (owner: 10Alexandros Kosiaris) [13:51:29] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:51:58] (03PS5) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 
10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [13:54:11] (03PS1) 10Jbond: add dummy thanos key [labs/private] - 10https://gerrit.wikimedia.org/r/615495 [13:54:16] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [13:54:17] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [13:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] add dummy thanos key [labs/private] - 10https://gerrit.wikimedia.org/r/615495 (owner: 10Jbond) [13:55:04] (03PS8) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [13:55:57] (03CR) 10JMeybohm: [C: 03+1] "> Patch Set 1:" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [13:56:35] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 46 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:57:07] (03CR) 10Ema: [C: 03+1] ATS: Move from /var/run/trafficserver to /run/trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/615480 (owner: 10Vgutierrez) [13:58:01] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:58:06] (03PS9) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [14:01:16] PROBLEM - 
kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:01:43] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:02:05] (03CR) 10Ayounsi: [C: 03+1] "There is no PII information in LibreNMS so it's fine to not encrypt it if it impacts performance." [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:04:33] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:04:39] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [14:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:45] akosiaris: are the kubelet latency alerts of any concern, or are they just a result of all the new pods shuffling around? [14:05:31] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:05:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:05:49] cdanis: result of all those new pods [14:06:18] 👍 [14:06:43] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:04] (03CR) 10Vgutierrez: [C: 03+2] ATS: Move from /var/run/trafficserver to /run/trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/615480 (owner: 10Vgutierrez) [14:10:31] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:12:49] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:12:53] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:12:53] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[14:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:42] (03PS2) 10JMeybohm: eventstreams: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615252 (https://phabricator.wikimedia.org/T256843) [14:13:51] (03PS2) 10JMeybohm: mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843) [14:13:58] (03PS2) 10JMeybohm: proton: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615254 (https://phabricator.wikimedia.org/T256843) [14:14:02] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10RobH) >>! In T258221#6326025, @Cmjohnson wrote: > @CDanis I am going to need an adapter to connect to the scs. See the attached image {F31944423} Chris, This is the db9 to s... [14:16:25] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:18:55] PROBLEM - Host analytics1075 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:18] elukey: FYI ^^^ [14:20:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:07] RECOVERY - Host analytics1075 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:21:17] it's up, but very high loadavg [14:21:20] 14:21:01 up 410 days, 21:23, 1 user, load average: 403.47, 304.74, 155.73 [14:21:21] PROBLEM - Host db1145.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:21:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [14:21:40] some heavy hadoop stuff? [14:22:10] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Jclark-ctr) @jcrespo maintenance is completed [14:22:19] (03PS1) 10Giuseppe Lavagetto: helmfile: strawman refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) [14:23:15] (03CR) 10Cwhite: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [14:23:28] also some OOM killing on analytics1075 [14:24:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:25:55] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete Kibana CAS config [puppet] - 10https://gerrit.wikimedia.org/r/615462 (owner: 10Muehlenhoff) [14:26:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:26:33] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:26:59] RECOVERY - Host db1145.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [14:28:03] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:29:43] (03PS2) 10Volans: templates: add support for private templates [software/homer] - 10https://gerrit.wikimedia.org/r/615488 [14:31:29] (03PS1) 10Jcrespo: Revert "db1145: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/615430 (https://phabricator.wikimedia.org/T258249) [14:31:42] (03PS2) 10Jcrespo: Revert "db1145: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/615430 (https://phabricator.wikimedia.org/T258249) [14:32:05] PROBLEM - Host an-worker1094 is DOWN: PING CRITICAL - Packet loss = 100% [14:32:15] (03CR) 10Jcrespo: [C: 03+2] Revert "db1145: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/615430 (https://phabricator.wikimedia.org/T258249) (owner: 10Jcrespo) [14:32:43] (03CR) 10JMeybohm: [C: 03+2] eventstreams: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615252 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [14:32:52] moritzm: 20ddc23078 ok to merge? 
[14:33:01] (03PS1) 10Jbond: thanos: add descovery addresses for thanos.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/615500 [14:33:38] (03PS2) 10Jbond: thanos: add LVS/discovery records for thanos.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/615500 [14:33:43] RECOVERY - Host an-worker1094 is UP: PING WARNING - Packet loss = 60%, RTA = 0.25 ms [14:33:54] (03PS3) 10Jbond: thanos: add LVS/discovery records for thanos.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/615500 [14:34:00] (03Merged) 10jenkins-bot: eventstreams: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615252 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [14:34:07] jbond42: moritz may be temporarily unavailable, do you know if 20ddc23078 is safe to merge (related to CAS)? [14:34:07] (03PS1) 10Cmjohnson: Revert "Adding production dns for cloudcephosd1004-1015" [dns] - 10https://gerrit.wikimedia.org/r/615431 [14:34:17] (03PS2) 10Cmjohnson: Revert "Adding production dns for cloudcephosd1004-1015" [dns] - 10https://gerrit.wikimedia.org/r/615431 [14:35:01] 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10Jclark-ctr) @hnowlan Replaced failed drive Failed drive ICN BTHC62300066800NGN [14:35:02] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:35:02] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[14:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:37] jynus: sorry, yes [14:35:39] (03CR) 10Cmjohnson: [C: 03+2] Revert "Adding production dns for cloudcephosd1004-1015" [dns] - 10https://gerrit.wikimedia.org/r/615431 (owner: 10Cmjohnson) [14:35:40] jynus: yes should be fine [14:35:46] ok, I was reviewing the change [14:35:53] it seemed trivial [14:35:56] but better ask [14:36:03] yeah, it's just cleanup [14:36:21] thanks for checking [14:36:42] the one day I won't check it will be the day there is an ongoing emergency and I shouldn't have merged :-P [14:37:42] (03PS1) 10Mholloway: Use new naming convention for deployment-docker-mobileapps01 [puppet] - 10https://gerrit.wikimedia.org/r/615502 (https://phabricator.wikimedia.org/T256794) [14:39:29] (03CR) 10Filippo Giunchedi: "Idea LGTM! Will need to check what's envoy behavior with multiple servers and global cert + server-specific cert" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:40:15] (03CR) 10Ayounsi: [C: 03+1] "tested!" [software/homer] - 10https://gerrit.wikimedia.org/r/615488 (owner: 10Volans) [14:40:17] 10Operations, 10Beta-Cluster-Infrastructure: deployment-puppetmaster04: git-sync-upstream is failing with a merge conflict since 2020-07-17T08:50:01Z - https://phabricator.wikimedia.org/T258451 (10Mholloway) 05Open→03Resolved a:03Mholloway Yes, this is working well again, thank you @jbond. 
[14:40:30] 10Operations, 10Beta-Cluster-Infrastructure: deployment-puppetmaster04: git-sync-upstream is failing with a merge conflict since 2020-07-17T08:50:01Z - https://phabricator.wikimedia.org/T258451 (10Mholloway) a:05Mholloway→03jbond [14:40:58] (03CR) 10Volans: [C: 03+2] templates: add support for private templates [software/homer] - 10https://gerrit.wikimedia.org/r/615488 (owner: 10Volans) [14:41:03] (03PS16) 10Alexandros Kosiaris: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [14:41:20] (03CR) 10jerkins-bot: [V: 04-1] charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [14:42:07] (03Merged) 10jenkins-bot: templates: add support for private templates [software/homer] - 10https://gerrit.wikimedia.org/r/615488 (owner: 10Volans) [14:42:29] (03PS1) 10Jbond: thanos: add new lvs thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615504 (https://phabricator.wikimedia.org/T151009) [14:42:56] (03PS4) 10Jbond: thanos: add LVS/discovery records for thanos.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/615500 (https://phabricator.wikimedia.org/T151009) [14:43:20] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:43:20] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [14:43:21] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=mobileapps,name=scb.* [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] (03CR) 10Jcrespo: "Subtle issue." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [14:45:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/615500 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:46:30] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10jcrespo) 05Open→03Resolved Everything looking good. Thanks, @Jclark-ctr ! [14:46:36] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete references to Yarn on hadoop::ui role [puppet] - 10https://gerrit.wikimedia.org/r/615447 (https://phabricator.wikimedia.org/T258152) (owner: 10Muehlenhoff) [14:46:45] (03PS10) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [14:47:00] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:47:11] (03PS2) 10Jbond: thanos: add new lvs thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615504 (https://phabricator.wikimedia.org/T151009) [14:47:26] (03PS6) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [14:47:33] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [14:48:26] (03PS1) 10Marostegui: check_mariadb.py: Quick fix [puppet] - 10https://gerrit.wikimedia.org/r/615506 [14:48:39] (03CR) 10Marostegui: check_mariadb.py: Add check for the event_scheduler (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/604379 (https://phabricator.wikimedia.org/T254738) (owner: 10Marostegui) [14:48:43] 10Operations, 10Beta-Cluster-Infrastructure: deployment-puppetmaster04: git-sync-upstream is 
failing with a merge conflict since 2020-07-17T08:50:01Z - https://phabricator.wikimedia.org/T258451 (10jbond) 05Resolved→03Open Great thanks, resolving [14:48:52] 10Operations, 10Beta-Cluster-Infrastructure: deployment-puppetmaster04: git-sync-upstream is failing with a merge conflict since 2020-07-17T08:50:01Z - https://phabricator.wikimedia.org/T258451 (10jbond) 05Open→03Resolved [14:49:04] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move yarn.wikimedia.org to a separate Buster VM - https://phabricator.wikimedia.org/T258152 (10MoritzMuehlenhoff) 05Open→03Resolved Yarn us now running on a separate Ganeti VM using Buster (an-tool1008.eqiad.wmnet) [14:49:07] (03CR) 10Filippo Giunchedi: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [14:49:37] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) 05Open→03Stalled [14:49:38] 10Operations, 10Traffic, 10Patch-For-Review: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [14:49:58] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:50:37] (03CR) 10Jcrespo: [C: 03+1] check_mariadb.py: Quick fix [puppet] - 10https://gerrit.wikimedia.org/r/615506 (owner: 10Marostegui) [14:50:40] PROBLEM - thanos.wikimedia.org requires authentication on thanos-fe1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 404 Not Found https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:50:49] (03PS2) 10Muehlenhoff: profile::idp::client::httpd: Default priority to 50 [puppet] - 10https://gerrit.wikimedia.org/r/615422 [14:50:51] (03CR) 10Marostegui: [C: 03+2] check_mariadb.py: Quick fix [puppet] - 10https://gerrit.wikimedia.org/r/615506 (owner: 10Marostegui) 
[14:51:50] (03CR) 10Filippo Giunchedi: "I think for all intended purposes we can map/proxy thanos.wikimedia.org to thanos-query.discovery.wmnet internally, without another servic" [dns] - 10https://gerrit.wikimedia.org/r/615500 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:51:55] (03PS1) 10Hnowlan: Add vendor modules for v1.5.0 [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/615507 [14:52:41] !log add accept-data and remove bogus v6 IP from ulsfo sandbox vlan [14:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:23] (03CR) 10Filippo Giunchedi: "Yeah I don't think we need this, I think we can map/proxy thanos.wikimedia.org to thanos-query.discovery.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/615504 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:55:17] (03Abandoned) 10Hnowlan: Add vendor modules for v1.5.0 [software/envoyproxy/ratelimiter] - 10https://gerrit.wikimedia.org/r/615507 (owner: 10Hnowlan) [14:56:21] (03PS1) 10Hnowlan: Add vendor modules for v1.5.0 [software/envoyproxy/ratelimiter] (v1.5.0-vendor) - 10https://gerrit.wikimedia.org/r/615508 (https://phabricator.wikimedia.org/T254907) [14:56:49] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] Add vendor modules for v1.5.0 [software/envoyproxy/ratelimiter] (v1.5.0-vendor) - 10https://gerrit.wikimedia.org/r/615508 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [14:57:17] (03CR) 10Filippo Giunchedi: "Envoy config looks good to my untrained eye, modulo inline comment I think this is good to go" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:57:24] 10Operations, 10Traffic: Backend naming in VCL needs to use fqdn+port - https://phabricator.wikimedia.org/T138546 (10ema) 05Open→03Declined Varnish backends are gone since T227432, this is now unnecessary. 
[14:58:17] 10Operations, 10ops-codfw, 10DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (10wiki_willy) a:03Papaul [14:59:58] 10Operations, 10ops-codfw, 10DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (10wiki_willy) @jcrespo - just a heads up, we won't have anyone onsite in the next couple weeks, so this may need to wait until August or we could utilize remote hands beforehand. Thanks, Willy [15:02:11] 10Operations, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (10ema) 05Open→03Resolved >>! In T162035#5415777, @gerritbot wrote: > Change 53033... [15:02:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, even though I don't fully understand these classes. Adding Brooke." [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat) [15:02:37] 10Operations, 10ops-codfw, 10DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (10jcrespo) p:05Triage→03Low No urgency then, this has been ongoing for a few days and I checked and there is no planned maintenance and no user impact, but it would be nice to have it done by Sept... [15:03:01] 10Operations, 10Traffic, 10Patch-For-Review: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768 (10ema) 05Open→03Resolved a:03ema Both cache and upload now have limited transient: ` hieradata/role/common/cache/text.yaml:profile::cache::varnish::frontend::transient_... 
[15:03:05] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 59 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:05:08] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:06:45] (03CR) 10Muehlenhoff: [C: 03+2] profile::idp::client::httpd: Default priority to 50 [puppet] - 10https://gerrit.wikimedia.org/r/615422 (owner: 10Muehlenhoff) [15:09:09] (03PS1) 10Muehlenhoff: Also remove priority for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/615509 [15:09:19] (03PS11) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) [15:09:33] (03PS5) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [15:12:19] (03CR) 10Hnowlan: ratelimit: add new docker image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [15:12:21] (03PS1) 10Jbond: thanos: orrect file path [labs/private] - 10https://gerrit.wikimedia.org/r/615510 [15:12:34] (03CR) 10Jbond: [V: 03+2 C: 03+2] thanos: orrect file path [labs/private] - 10https://gerrit.wikimedia.org/r/615510 (owner: 10Jbond) [15:13:51] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24072/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [15:13:53] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 46 probes of 563 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:14:11] (03Abandoned) 10Jbond: thanos: add new lvs thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615504 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [15:14:19] (03Abandoned) 10Jbond: thanos: add LVS/discovery records for thanos.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/615500 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [15:16:31] _joe_: interestingly, mobileapps active connections now linger at the 600 mark whereas previously they were at the 1000 mark [15:16:47] but now the pods can take it so it's fine. [15:16:54] <_joe_> how many? [15:17:00] I went a bit overboard [15:17:04] 240 :P [15:17:08] A bit [15:17:10] <_joe_> yeah that seemed ludicrous [15:17:22] <_joe_> can we scale that back to like 30? [15:17:36] 30 might not cut it, but 100 might [15:17:52] <_joe_> how big is a pod? [15:17:55] but I 'd rather wait out and see how the day goes [15:17:58] <_joe_> that seems still too much [15:18:36] limit is at CPU: 1.3 and memory at 600 [15:18:54] are a bit lower. 1.1 and 450 [15:19:02] requests are a bit lower. 1.1 and 450 [15:19:15] <_joe_> and ok, that's about 2 workers [15:19:35] <_joe_> looking at what happens on scb [15:19:50] where it is keeping 25% of the cpu constantly in use [15:19:59] <_joe_> yes [15:20:04] mobileapps is quite possibly the biggest service we got cpu wise [15:21:23] <_joe_> so ok, it uses overall 24 cpus on scb [15:21:26] <_joe_> give or take [15:22:14] <_joe_> I would expect you to need 30-40 pods, not more [15:22:26] (03CR) 10Mholloway: [C: 04-1] "I think this needs to be updated to use the deployment-prep.eqiad1.wikimedia.cloud name rather than the legacy .eqiad.wmflabs." 
[puppet] - 10https://gerrit.wikimedia.org/r/612406 (https://phabricator.wikimedia.org/T256795) (owner: 10MSantos) [15:23:29] interestingly now ~1000 connections/s are going to k8s per LVS stats [15:23:46] so at least it can serve everything just fine. [15:24:22] I 'll leave it be at 240 for today and fiddle with it tomorrow. Probably start decreasing in 50% steps (120, 60, 30) [15:25:23] _joe_: total CPU btw per https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=28&fullscreen&orgId=1&from=now-1h&to=now&refresh=1m is more than 30 [15:25:29] I doubt 30-40 will cut it [15:25:44] <_joe_> akosiaris: that's more cpu than what we used outside of k8s [15:26:24] _joe_: yeah, I don't particularly mind tbh [15:26:45] there is also damn 512c999 in play [15:26:49] <_joe_> yes [15:26:52] <_joe_> on that note [15:26:59] <_joe_> we need to upgrade to buster :) [15:27:01] jumping to 4.19 might not be feasible btw [15:27:08] but rather straight to buster [15:27:11] <_joe_> yep [15:27:37] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [15:27:38] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
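[editor's note] The pod-sizing exchange above (about 24 CPUs used on scb, pod requests of 1.1 CPU, a plan to go from 240 replicas down in 50% steps) can be checked with a little arithmetic. The sketch below is only an illustration: the function names and the 1.5x headroom factor are assumptions, not anything the participants stated, and memory and connection limits are ignored.

```python
import math


def pods_needed(total_cpu: float, cpu_per_pod: float, headroom: float = 1.0) -> int:
    """Smallest pod count whose summed CPU request covers total_cpu * headroom."""
    return math.ceil(total_cpu * headroom / cpu_per_pod)


def halving_steps(start: int, floor: int) -> list[int]:
    """Replica counts reached by halving from start, stopping at floor."""
    steps = []
    n = start
    while n // 2 >= floor:
        n //= 2
        steps.append(n)
    return steps


# Figures quoted in the chat: ~24 CPUs on scb, requests of 1.1 CPU per pod.
bare_minimum = pods_needed(24, 1.1)        # no headroom at all
with_headroom = pods_needed(24, 1.1, 1.5)  # 1.5x headroom is an assumed factor
schedule = halving_steps(240, 30)          # the "120, 60, 30" plan
```

With these inputs the bare minimum is 22 pods and the assumed 1.5x headroom gives 33, which falls inside the 30-40 range mentioned in the chat. The later remark that total CPU on k8s is already above 30 would push the estimate higher.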
[15:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:00] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 144.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [15:31:09] !log updated stretch installer image to Stretch 9.13 release T258407 [15:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:15] T258407: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 [15:31:39] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [15:34:40] (03PS1) 10Bartosz Dziewoński: OOUI: Backport I3d88853fdf9915d2b08063c80ecaf7d92828a5df [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615433 (https://phabricator.wikimedia.org/T258256) [15:35:48] (03PS1) 10Bartosz Dziewoński: OOUI: Backport I3d88853fdf9915d2b08063c80ecaf7d92828a5df [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615434 (https://phabricator.wikimedia.org/T258256) [15:36:06] (03PS5) 10Cwhite: profile: add prometheus instance for statsv metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [15:36:26] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:37:17] (03CR) 10jerkins-bot: [V: 04-1] profile: add prometheus instance for statsv metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [15:37:29] (03PS6) 10Cwhite: profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [15:37:58] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:38:41] (03CR) 10jerkins-bot: [V: 04-1] profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [15:40:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 51 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:45:52] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:47:52] (03PS2) 10Giuseppe Lavagetto: helmfile: strawman refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) [15:53:06] (03CR) 10Giuseppe Lavagetto: GC: add time-based GC for Image objects (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [15:53:08] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:54:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:55:22] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:56:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [15:56:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:59:48] (03CR) 10Volans: "replies inline" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [16:02:22] (03PS7) 10Cwhite: profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [16:03:22] (03PS1) 10Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) [16:07:14] (03CR) 10Giuseppe Lavagetto: Add discovery and disabled LVS components for API gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:07:51] (03CR) 10Giuseppe Lavagetto: GC: add time-based GC for 
Image objects (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans) [16:08:21] (03PS1) 10Cmjohnson: Adding cloudcephosd servers to private vlan [dns] - 10https://gerrit.wikimedia.org/r/615513 (https://phabricator.wikimedia.org/T251619) [16:08:46] (03CR) 10jerkins-bot: [V: 04-1] Adding cloudcephosd servers to private vlan [dns] - 10https://gerrit.wikimedia.org/r/615513 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [16:10:52] (03PS2) 10Cmjohnson: Adding cloudcephosd servers to private vlan [dns] - 10https://gerrit.wikimedia.org/r/615513 (https://phabricator.wikimedia.org/T251619) [16:11:04] (03CR) 10Filippo Giunchedi: [C: 03+2] smokeping: don't sync data between hosts [puppet] - 10https://gerrit.wikimedia.org/r/615473 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi) [16:11:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:12:25] (03PS7) 10ArielGlenn: dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) [16:14:00] (03CR) 10ArielGlenn: [C: 03+2] dumps rsync refactor, better opts and flags handling [puppet] - 10https://gerrit.wikimedia.org/r/614755 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [16:14:29] (03PS6) 10Jbond: use dnsmasq: add configuration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 [16:14:52] (03CR) 10Cmjohnson: [C: 03+2] Adding cloudcephosd servers to private vlan [dns] - 10https://gerrit.wikimedia.org/r/615513 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [16:15:23] (03PS1) 10Dzahn: ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 
(https://phabricator.wikimedia.org/T191183) [16:17:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:17:51] (03PS2) 10Dzahn: ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 (https://phabricator.wikimedia.org/T191183) [16:19:03] (03CR) 10Paladox: [C: 03+1] ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn) [16:20:21] (03CR) 10Dzahn: [C: 03+2] Remove line saying ldaplist will be removed 30 August 2016 [puppet] - 10https://gerrit.wikimedia.org/r/613360 (owner: 10Reedy) [16:22:36] (03CR) 10Andrew Bogott: [C: 03+1] "I'm not sure why deployment-docker-mobileapps01 doesn't have .eqiad.wmflabs dns entries; I just double-checked and new VMs are getting tho" [puppet] - 10https://gerrit.wikimedia.org/r/615502 (https://phabricator.wikimedia.org/T256794) (owner: 10Mholloway) [16:26:04] (03PS4) 10ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) [16:26:14] (03CR) 10Filippo Giunchedi: "LGTM! See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [16:28:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Andrew) @ayounsi, the POC hosts are currently hosting a small amount of user workload. Wi... 
[16:28:39] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:28:58] <_joe_> again! [16:28:59] why does it say paper jam when there is no paper jam [16:29:10] computer says no [16:29:11] form feed error [16:29:19] pc load letter?? [16:29:22] heh [16:29:26] rzl: yes that was what i was trying to think of [16:29:32] <_joe_> ok,seriously, wth? [16:29:34] in a meeting but shout if I can help [16:29:53] same time as yesterday? [16:29:54] <_joe_> can someone ack the alert? [16:30:00] ack [16:30:03] I/F is in our weekly, but if help is needed we can end early -- let me know [16:30:21] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24704 bytes in 0.506 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:30:35] not even any log messages this time [16:30:37] recoered already :-/ [16:30:39] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-30m&to=now no sign of trouble ofc [16:31:03] i sent the "ack number" back to victorops [16:31:05] <_joe_> no 5xx [16:31:12] I would like to downtime it for one day until we discover the reason for the alert, today or tomorrow, as I think it shouldn't affect user traffic? 
[16:31:50] I can create a ticket meanwhile [16:32:05] <_joe_> jynus: please do [16:32:11] but I won't downtime it without some +1s [16:32:39] (03PS8) 10Cwhite: profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [16:32:44] I think short term it causes more issues than help alerting on it, clearly it is flappting [16:32:45] +1 as long as it is just the codfw api jynus [16:32:52] yes, only that check indeed [16:33:02] <_joe_> so the interesting part is [16:33:09] <_joe_> https doesn't page [16:33:12] I will do it only for 24 hours [16:33:14] <_joe_> and checks the same uri [16:33:24] (03CR) 10Cwhite: profile: add prometheus instance for external metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [16:33:28] and if someone can dig today from the current timezone [16:33:31] <_joe_> this is mindblowing tbh [16:33:41] or otherwise was can dig more in the morning [16:33:56] I will create the ticket with the previous discussion [16:34:18] (03CR) 10BryanDavis: "The missing deployment-docker-mobileapps01.deployment-prep.eqiad.wmflabs DNS entry is a bug. Cloud VPS *should* be creating both . I have a vague suspect this has something to do with the move to force the use of https [16:34:45] <_joe_> but then why not eqiad [16:34:55] that 80/tcp is the first thing I thought [16:35:12] but I assumed later it was the canonical place [16:35:17] icinga logs show it happens more often but the other times it did not get past "soft alert" state so no pages [16:36:42] <_joe_> we should probably set a curl running from icinga1001 with a timeout and log whenever it takes more than 10 seconds [16:36:51] <_joe_> in a loop, I mean [16:37:25] <_joe_> but yes, if no one does, I'll look tomorrow [16:37:36] the alert is for both codfw and eqiad combined? 
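[editor's note] The suggestion above, to run a request in a loop from the monitoring host with a timeout and log whenever it takes more than 10 seconds, could look roughly like the sketch below. It uses Python's standard library rather than curl. The URL path, the 30-second timeout, the 5-second interval, and all function names are assumptions for illustration; the log does not show the real check command.

```python
import time
import urllib.request
from typing import Callable, Optional

# Assumed values: only the host, port 80 and the 10-second threshold appear in the log.
URL = "http://api.svc.codfw.wmnet/"
SLOW_SECONDS = 10.0
TIMEOUT_SECONDS = 30.0


def http_fetch(url: str, timeout: float) -> None:
    """Fetch the URL and discard the body; raises on HTTP errors or timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def probe_once(
    url: str,
    fetch: Callable[[str, float], None] = http_fetch,
    timeout: float = TIMEOUT_SECONDS,
    slow: float = SLOW_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return a log line if the request failed or was slow, otherwise None."""
    start = clock()
    try:
        fetch(url, timeout)
    except Exception as exc:
        return f"FAIL {url} after {clock() - start:.2f}s: {exc!r}"
    elapsed = clock() - start
    if elapsed > slow:
        return f"SLOW {url} took {elapsed:.2f}s"
    return None


def run(url: str = URL, interval: float = 5.0) -> None:
    """Probe forever, printing a UTC-timestamped line for each slow or failed request."""
    while True:
        line = probe_once(url)
        if line:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            print(stamp, line, flush=True)
        time.sleep(interval)
```

Calling `run()` starts the loop. The timeout is set above the 10-second threshold so that a slow but successful response is logged with its real duration instead of being cut off as a failure. The `fetch` and `clock` parameters exist only so the timing logic can be exercised without network access.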
[16:37:51] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo) [16:37:54] "api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet " [16:38:18] I created a template at LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 to downtime the icinga alert first [16:38:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [16:38:25] at https://phabricator.wikimedia.org/T258614 [16:38:50] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [16:38:56] I've been broad with tags (svops, traffic, obs) [16:39:16] will now fill in the past pages [16:41:49] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codf... [16:41:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Bstorm) @ayounsi I agree about the choice of public-b, I think. If we don't need to cross... 
[16:47:03] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10jcrespo) [16:47:14] so the check on the "eqiad" host checks only eqiad, but the "codfw" check checks both codfw and eqiad [16:47:34] I've put more info on the ticket [16:48:11] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10CDanis) First occurrence was June 17th, 15:10 UTC: `Jun 17 15:10:38 icinga1001 icinga: SERVICE ALERT: api.s... [16:48:23] mutante, cdanis there was some conversation starting at 11:01 on #wikimedia-sre [16:48:31] (03PS1) 10Hnowlan: kubernetes: add namespace for api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/615521 (https://phabricator.wikimedia.org/T254906) [16:48:37] I think that is all that we know so far [16:49:08] mutante: I believe the occurrence of 'eqiad' in the codfw one was just a templating error [16:49:19] ok. i am trying to find the _actual_ check command [16:50:47] (03PS2) 10Volans: GC: add time-based GC for Image objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 [16:51:24] host is not even in puppet_hosts though .. ehm... [16:52:47] ah, it's one of those in /etc/nagios insted of /etc/icinga .. right [16:53:14] I believe it is something like the 'eqiad' comes from $::site as expanded on the 'active' icinga server [16:57:03] yea, so it's just the service description. the check_command is the same. 
[16:57:32] it's also the ping/HOST check on that host https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet [16:58:35] yah [16:58:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Bstorm) NOTE: For any spectators, we are setting up a meeting to make sure we are all sync... [16:58:52] so really "just" network connectivity from icinga1001 to codfw, never eqiad [17:00:01] it isn't that, though; it started seeing soft-fails on icinga2001 on the same day [17:00:11] (03CR) 10Cwhite: [V: 03+2 C: 03+2] debianization [debs/grafana-loki] (debian/sid) - 10https://gerrit.wikimedia.org/r/610864 (owner: 10Cwhite) [17:00:18] same time too [17:00:21] 15:10 UTC [17:01:36] it happens more often, like once per hour at random times [17:01:48] just most of the times it's short enough to never get to HARD [17:01:54] I was just looking at the first occurrence of a soft alert [17:01:56] it didn't used to happen [17:02:06] it started happening around here https://sal.toolforge.org/log/D1fUwnIBj_Bg1xd30Riv [17:06:57] (03PS6) 10Ryan Kemper: [wdqs] overrides default blazegraph ns [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse) [17:16:13] (03PS1) 10Ssingh: dnsrecursor: add a parameter to set the use-incoming-edns-subnet option [puppet] - 10https://gerrit.wikimedia.org/r/615526 [17:16:48] (03PS1) 10Cmjohnson: Revert "Adding cloudcephosd servers to private vlan" [dns] - 10https://gerrit.wikimedia.org/r/615436 [17:17:53] (03CR) 10Cmjohnson: [C: 03+2] Revert "Adding cloudcephosd servers to private vlan" [dns] - 10https://gerrit.wikimedia.org/r/615436 (owner: 10Cmjohnson) [17:22:33] (03CR) 10Ssingh: "Similar to I2dfc2a7b11499946892217bb980ea83b77804162, I tested this on dnsbox2001 and cloudservices1003:" [puppet] - 10https://gerrit.wikimedia.org/r/615526 (owner: 10Ssingh) [17:23:50] ^ 
I wonder what's an easy way to refer to another change in gerrit in the comment formatting instead of this long change-ID string [17:24:23] cdanis: yea, you pointed at the right thing indeed. mw2335-mw2339 are a mix-up. they are API appservers in confctl and regular appservers in site.pp. so that's my bad. i will fix it after getting a coffee [17:24:59] sukhe: you don't have to use the full string, the beginning of it will do [17:25:29] ah OK great, I will use the first 5 chars as in git commits :) [17:26:17] (03PS1) 10Gergő Tisza: Localisation updates from https://translatewiki.net. [extensions/OAuth] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615437 [17:26:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2335.codfw.wmnet [17:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:25] mutante: thanks! [17:29:25] (03PS1) 10Ssingh: wikidough: enable support for EDNS Client Subnet [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) [17:30:05] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) a:03Dzahn mw2335 - mw2339 are configured as API appservers in confctl but they are regular appserv... [17:31:44] cdanis: thanks as well, that SAL entry was the right call [17:31:51] 🍻 [17:32:15] * volans re-iterate the need of raw-socket checkers to avoid these situations [17:34:37] volans: something to keep in mind for L4LBv2, for sure [17:35:14] something we could have right now already as an external check ;) [17:45:16] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:08] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [17:51:44] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: add a parameter to set the use-incoming-edns-subnet option [puppet] - 10https://gerrit.wikimedia.org/r/615526 (owner: 10Ssingh) [17:54:30] (03PS1) 10Dzahn: conftool-data: move mw2335-mw2339 to regular appservers [puppet] - 10https://gerrit.wikimedia.org/r/615537 (https://phabricator.wikimedia.org/T258614) [17:58:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2336.codfw.wmnet [17:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:33] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2337.codfw.wmnet [17:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2338.codfw.wmnet [17:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] longma and liw: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T1800). Please do the needful. [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T1800). [18:00:04] MatmaRex and tgr: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:45] hi [18:01:04] \o/ [18:01:37] I can deploy today! 
[18:02:38] (03CR) 10Urbanecm: [C: 03+2] "B&C" [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615433 (https://phabricator.wikimedia.org/T258256) (owner: 10Bartosz Dziewoński) [18:02:40] (03CR) 10Urbanecm: [C: 03+2] "B&C" [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615434 (https://phabricator.wikimedia.org/T258256) (owner: 10Bartosz Dziewoński) [18:02:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:03:39] jouncebot: 1, it's not wikipedia and [[WP:LIGHTBULB]]s [18:03:57] i'm away from my desk, i'll be able to test in five minutes [18:04:08] no problem, I'm waiting for CI [18:05:25] (03CR) 10Urbanecm: [C: 03+2] "B&C, removal of spam messages" [extensions/OAuth] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615437 (owner: 10Gergő Tisza) [18:06:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:06:29] tgr: I've +2'ed your backport, seems simple enough, I'll do that [18:08:07] (i'm here now) [18:09:15] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: enable delete API for docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/610191 (owner: 10BryanDavis) [18:11:15] MatmaRex: ack [18:11:16] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:23] still waiting on CI [18:12:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:12:58] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (10faidon) 05Resolved→03Open We still seem to have remnants of PIM-RP: ` faidon@re0.cr2-codfw> show configuration | display set | match 208.80.153.194 set interfaces lo0 unit... [18:15:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:16:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2339.codfw.wmnet [18:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:26] the selenium jobs sure have been taking foreer recently [18:17:30] forever* [18:18:08] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] overrides default blazegraph ns [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse) [18:21:06] Hello. 
[18:21:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:22:15] (03CR) 10Ryan Kemper: [C: 03+1] "Taking off my +2 while figuring out if the hieradata is supposed to be set to `wdq` or if that's a typo instead" [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse) [18:22:17] (03CR) 10Ssingh: [C: 03+2] dnsrecursor: add a parameter to set the use-incoming-edns-subnet option [puppet] - 10https://gerrit.wikimedia.org/r/615526 (owner: 10Ssingh) [18:23:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:17] Hello, i can provision a wikimedia puppet with Vagrant??? [18:24:07] SpainDist: just MediaWiki itself. puppet is unrelated [18:24:20] errrr [18:24:27] (03Merged) 10jenkins-bot: OOUI: Backport I3d88853fdf9915d2b08063c80ecaf7d92828a5df [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615433 (https://phabricator.wikimedia.org/T258256) (owner: 10Bartosz Dziewoński) [18:24:33] (03Merged) 10jenkins-bot: OOUI: Backport I3d88853fdf9915d2b08063c80ecaf7d92828a5df [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615434 (https://phabricator.wikimedia.org/T258256) (owner: 10Bartosz Dziewoński) [18:24:37] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/OAuth] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615437 (owner: 10Gergő Tisza) [18:25:10] mutante: fyi, this user you are talking to is a globally banned user [18:25:37] mutante And i request a new URL Shortner, and the domain is wm.gd [18:26:10] robh or _joe_: could you please kickban SpainDist? 
[18:26:30] Wiki13 No [18:26:35] Wiki13 And i request a new URL Shortner, and the domain is wm.gd [18:27:21] MatmaRex: pulled onto mwdebug1001 [18:27:46] thanks, looking [18:28:24] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw233[5-9].codfw.wmnet [18:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:19] Wiki13: done [18:29:25] ty [18:29:27] Urbanecm: seems good [18:29:33] (03CR) 10Dzahn: [C: 03+2] conftool-data: move mw2335-mw2339 to regular appservers [puppet] - 10https://gerrit.wikimedia.org/r/615537 (https://phabricator.wikimedia.org/T258614) (owner: 10Dzahn) [18:29:39] np [18:30:01] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/24076/" [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [18:30:05] Thanks, will sync [18:30:13] him being here would only distract and waste peoples' time, hence the request :) [18:30:44] yeah someone joining and then requesting the same thing over and over by pinging everyone they see in the channel is not good irc behavior. [18:31:42] also, hes a WMF banned user, another reason to block him here imo [18:32:31] that certainly helps [18:33:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2335.codfw.wmnet [18:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:16] !log urbanecm@deploy1001 Started scap: 9529cf8d2570bbf6dd1e919c966f5954e39dbd67: b66ec9143bd96cbf3a20b70f6aa3f2d6d7963bb5: OOUI backport; 93755a6a92923ae390e3a04b19421c8562568d2a: i18n changes for OAuth, removal of spam messages [18:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:35] MatmaRex: this will take a while, running a full scap to do the i18n changes too [18:33:57] Urbanecm: thanks! 
(and sorry - had to go afk, lost track of time) [18:34:10] np [18:34:24] sure [18:34:27] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/24075/" [puppet] - 10https://gerrit.wikimedia.org/r/611373 (owner: 10DCausse) [18:36:20] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw233[6-9].codfw.wmnet [18:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw233[6-9].codfw.wmnet [18:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:24] (03PS2) 10Herron: prometheus: introduce role::prometheus::pop [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) [18:39:28] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw233[5-9].codfw.wmnet [18:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:55] (03CR) 10CDanis: [C: 03+1] conftool-data: move mw2335-mw2339 to regular appservers [puppet] - 10https://gerrit.wikimedia.org/r/615537 (https://phabricator.wikimedia.org/T258614) (owner: 10Dzahn) [18:40:16] (03CR) 10Herron: "> We _might_ need to tweak profile::base::domain_search to match prometheus role, but maybe that's not needed in pops." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [18:42:58] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:57] (03PS2) 10MSantos: update proton beta instance for restbase [puppet] - 10https://gerrit.wikimedia.org/r/612406 (https://phabricator.wikimedia.org/T256795) [18:47:16] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [18:48:44] (03CR) 10Dzahn: [C: 03+2] update proton beta instance for restbase [puppet] - 10https://gerrit.wikimedia.org/r/612406 (https://phabricator.wikimedia.org/T256795) (owner: 10MSantos) [18:48:51] (03PS3) 10Dzahn: update proton beta instance for restbase [puppet] - 10https://gerrit.wikimedia.org/r/612406 (https://phabricator.wikimedia.org/T256795) (owner: 10MSantos) [18:51:11] (03PS2) 10Dzahn: Use new naming convention for deployment-docker-mobileapps01 [puppet] - 10https://gerrit.wikimedia.org/r/615502 (https://phabricator.wikimedia.org/T256794) (owner: 10Mholloway) [18:51:21] (03CR) 10Dzahn: [C: 03+2] Use new naming convention for deployment-docker-mobileapps01 [puppet] - 10https://gerrit.wikimedia.org/r/615502 (https://phabricator.wikimedia.org/T256794) (owner: 10Mholloway) [18:54:42] finally check canaries stage [18:59:46] (03CR) 10Paladox: "We didn't really get around this, scap pulled on the phab host and it happened to use the http url when pulling, not ssh it seems." [puppet] - 10https://gerrit.wikimedia.org/r/565712 (owner: 10Paladox) [19:00:04] longma and liw: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T1900). 
[19:02:37] please hold the train for a bit [19:02:44] scap sync I started earlier didnt finish [19:04:48] (03PS7) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [19:06:44] Urbanecm: alrighty [19:09:30] (03PS8) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [19:10:36] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:44] !log mw2335 - mw2339 - scap pull [19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:42] !log urbanecm@deploy1001 Finished scap: 9529cf8d2570bbf6dd1e919c966f5954e39dbd67: b66ec9143bd96cbf3a20b70f6aa3f2d6d7963bb5: OOUI backport; 93755a6a92923ae390e3a04b19421c8562568d2a: i18n changes for OAuth, removal of spam messages (duration: 42m 26s) [19:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:52] longma: I'm done [19:15:56] MatmaRex: tgr it should be synced! [19:16:28] Thanks [19:16:44] thank Urbanecm [19:16:51] Urbanecm: scap is fast now [19:19:41] Is it? :) [19:21:38] (03PS1) 10Jeena Huneidi: group1 wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615549 [19:21:40] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615549 (owner: 10Jeena Huneidi) [19:22:23] (03PS8) 10Paladox: Phabricator: Make scap's manage_user configurable [puppet] - 10https://gerrit.wikimedia.org/r/565712 [19:22:25] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615549 (owner: 10Jeena Huneidi) [19:22:44] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/565712 (owner: 10Paladox) [19:23:38] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Make scap's manage_user configurable [puppet] - 10https://gerrit.wikimedia.org/r/565712 (owner: 10Paladox) [19:25:10] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.1 [19:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:12] (03PS9) 10Paladox: Phabricator: Make scap's manage_user configurable [puppet] - 10https://gerrit.wikimedia.org/r/565712 [19:26:14] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.1 (duration: 01m 03s) [19:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:10] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:31:33] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24078/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/565712 (owner: 10Paladox) [19:32:00] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:44:04] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:59] (03CR) 10Dzahn: "systemd-sysuser should replace both user{} and group{} at the same time. 
I think repeating it with "m" instead of "u" also lets us do the " [puppet] - 10https://gerrit.wikimedia.org/r/607646 (owner: 10Legoktm) [19:45:54] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [19:47:31] cdanis: it stopped since the fix while it kept happening every hour before. calling it tentatively resolved. https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codfw+port+80%2Ftcp+-+MediaWiki+API+cluster-+api.svc.eqiad.wmnet+IPv4+%23page [19:47:40] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [19:47:41] mutante: lgtm :) [19:47:48] cool [19:47:55] mutante: also I am writing a quick Python script we can run as an NRPE to find this in the future [19:48:37] nice, thank you [19:49:07] i went through other new(ish) appservers in codfw but found no more [19:50:56] `check_http` doesn't support it, but, it's not too hard to make `requests` in python3 have a specific source address [19:51:52] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) So this happened whenever the check ended up talking to one of the servers in that 2335 - 2339 range.... 
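Editor's note on the 19:50:56 remark about making an HTTP check use a specific source address: the following is a minimal sketch of that idea, not the script being discussed in the channel. It swaps in the standard library's `http.client` (whose `HTTPConnection` accepts a `source_address` tuple) in place of `requests`, so that it runs without third-party packages. The names `fetch_from_source` and `start_loopback_server`, and the loopback addresses used, are hypothetical and exist only for illustration.

```python
import http.client
import http.server
import threading


def fetch_from_source(host, port, path, source_ip, timeout=5.0):
    """GET http://host:port/path with the client socket bound to source_ip.

    Returns (status, body_text, local_ip), where local_ip is the address
    the client socket was actually bound to.
    """
    # Port 0 lets the kernel pick an ephemeral source port.
    conn = http.client.HTTPConnection(
        host, port, timeout=timeout, source_address=(source_ip, 0)
    )
    try:
        # Connect explicitly so the bound address can be read before the
        # response is consumed (the socket may be released afterwards).
        conn.connect()
        local_ip = conn.sock.getsockname()[0]
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read().decode()
        return resp.status, body, local_ip
    finally:
        conn.close()


class _EchoClientHandler(http.server.BaseHTTPRequestHandler):
    """Illustrative handler: replies with the client IP the server saw."""

    def do_GET(self):
        body = self.client_address[0].encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the demo quiet.
        pass


def start_loopback_server():
    """Start a throwaway HTTP server on 127.0.0.1 (hypothetical test target)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), _EchoClientHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

To try it locally, start the loopback server, call `fetch_from_source("127.0.0.1", server.server_address[1], "/", "127.0.0.1")`, and compare the returned body with the requested source IP. A real NRPE check would presumably point at a realserver and use an address configured on the monitoring host; those details are not in the log and are left out here.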
[19:52:39] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) 05Open→03Resolved [19:53:10] 10Operations, 10Traffic, 10observability, 10serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (10Dzahn) [19:53:11] 10Operations, 10serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) [20:00:04] halfak and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T2000). [20:00:42] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24080/" [puppet] - 10https://gerrit.wikimedia.org/r/615414 (owner: 10Muehlenhoff) [20:05:18] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/24079/" [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [20:07:55] 10Operations, 10Traffic, 10observability, 10serviceops: monitoring for mismatched LVS realserver addresses/configurations - https://phabricator.wikimedia.org/T258648 (10CDanis) [20:08:01] (03CR) 10QChris: [C: 03+1] ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn) [20:09:50] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:38] (03CR) 10Dzahn: [C: 04-1] "The class dnsrecursor does not have "allow_incoming_ecs". 
It only has "allow_edns_whitelist" which is also what the commit message mention" [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:14:07] (03PS3) 10Dzahn: ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 (https://phabricator.wikimedia.org/T191183) [20:14:56] (03PS1) 10Catrope: GrowthExperiments (labs): Use correct syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615556 [20:16:44] (03CR) 10Ssingh: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:17:34] (03PS3) 10Herron: prometheus: introduce role::prometheus::pop [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) [20:17:57] (03PS2) 10Dzahn: wikidough: enable support for EDNS Client Subnet [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:19:15] (03CR) 10Ottomata: "Ok, this actually works! Once again, I slightly changed approaches. I couldn't get debhelper + dpkg-buildpackage to build a working anac" [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [20:21:58] (03CR) 10Herron: [C: 03+2] prometheus: introduce role::prometheus::pop [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [20:22:17] (03CR) 10Catrope: [C: 03+2] "Well this is embarassing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615556 (owner: 10Catrope) [20:22:43] (03CR) 10Dzahn: [C: 03+1] "Ah, yes, i did not see that because it needed a rebase on top of the other changes. I see it now. 
Looks good to me, compiler looks ok as w" [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:23:00] (03Merged) 10jenkins-bot: GrowthExperiments (labs): Use correct syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615556 (owner: 10Catrope) [20:23:35] (03PS4) 10Dzahn: ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 (https://phabricator.wikimedia.org/T191183) [20:23:40] (03CR) 10Dzahn: [C: 03+2] ATS: remove gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/615514 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn) [20:25:15] (03PS1) 10Dzahn: remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) [20:31:24] (03CR) 10Ssingh: [C: 03+2] wikidough: enable support for EDNS Client Subnet [puppet] - 10https://gerrit.wikimedia.org/r/615531 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [20:33:45] (03Abandoned) 10QChris: gerrit: Use `gerrit-replica.wikimedia.org` as canonical host for gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/608214 (https://phabricator.wikimedia.org/T256567) (owner: 10QChris) [20:38:00] (03CR) 10QChris: [C: 04-1] "Meanwhile gerrit.wmfusercontent.org has been removed (Ie9124d5527db640201da7693099d598d226a51a5)." 
[puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [20:39:31] (03PS1) 10Ssingh: dnsdist: update template for dnsdist.conf (improves 7f583962) [puppet] - 10https://gerrit.wikimedia.org/r/615564 [20:41:04] (03CR) 10Ssingh: "No Puppet code change; updated configuration template: https://puppet-compiler.wmflabs.org/compiler1001/24083/malmok.wikimedia.org/index.h" [puppet] - 10https://gerrit.wikimedia.org/r/615564 (owner: 10Ssingh) [20:41:38] (03CR) 10Ssingh: [C: 03+2] dnsdist: update template for dnsdist.conf (improves 7f583962) [puppet] - 10https://gerrit.wikimedia.org/r/615564 (owner: 10Ssingh) [20:43:06] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:55:27] (03PS1) 10Ssingh: dnsdist: update module comments (no code change) [puppet] - 10https://gerrit.wikimedia.org/r/615568 [20:57:31] (03CR) 10Ssingh: "PCC confirms no change (as expected): https://puppet-compiler.wmflabs.org/compiler1001/24084/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615568 (owner: 10Ssingh) [20:57:47] (03CR) 10Ssingh: [C: 03+2] dnsdist: update module comments (no code change) [puppet] - 10https://gerrit.wikimedia.org/r/615568 (owner: 10Ssingh) [20:59:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:01:58] (03PS1) 10Dzahn: ores: add envoy-proxy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/615569 (https://phabricator.wikimedia.org/T210411) [21:02:17] (03CR) 10jerkins-bot: [V: 04-1] ores: add envoy-proxy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/615569 
(https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:03:52] (03PS2) 10Dzahn: ores: add envoy-proxy for TLS termination behind ATS [puppet] - 10https://gerrit.wikimedia.org/r/615569 (https://phabricator.wikimedia.org/T210411) [21:10:32] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:24:30] (03PS1) 10RLazarus: Add a TestCase field for POST form data. [software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 [21:25:43] (03CR) 10jerkins-bot: [V: 04-1] Add a TestCase field for POST form data. [software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 (owner: 10RLazarus) [21:26:32] (03PS1) 10Catrope: GrowthExperiments (beta): Send help panel to mentors on cswiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615571 (https://phabricator.wikimedia.org/T250235) [21:26:57] (03CR) 10Catrope: [C: 03+2] GrowthExperiments (beta): Send help panel to mentors on cswiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615571 (https://phabricator.wikimedia.org/T250235) (owner: 10Catrope) [21:27:43] (03Merged) 10jenkins-bot: GrowthExperiments (beta): Send help panel to mentors on cswiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615571 (https://phabricator.wikimedia.org/T250235) (owner: 10Catrope) [21:39:09] (03PS1) 10Ppchelko: Limit concurrency for processMediaModeration job [deployment-charts] - 10https://gerrit.wikimedia.org/r/615572 (https://phabricator.wikimedia.org/T258653) [21:44:25] (03PS2) 10RLazarus: Add a TestCase field for POST form data. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 [21:45:06] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:26] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Ferdi2005) @Dzahn it.Wikinews is really poorly indicizated, an article "Apple passa ad ARM e annuncia altre novità" is at the sixth page of Google. Is th... [21:58:12] (03CR) 10BryanDavis: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/610191 (owner: 10BryanDavis) [22:02:29] (03CR) 10BryanDavis: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/610191 (owner: 10BryanDavis) [22:05:43] 10Operations, 10SRE-Access-Requests: Adding Italian Wikinews to Google Search Console to add it to Google News - https://phabricator.wikimedia.org/T253988 (10Dzahn) @Ferdi2005 This is blocked by T254437. That is outside of my control now. I would ask you to please ping (the assignee) there. 
[22:07:23] !log remove downtime on api.svc.codfw.wmnet T258614 [22:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:29] T258614: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 [22:11:04] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:51] (03PS1) 10BryanDavis: docker::registry: Allow param config to override defaults [puppet] - 10https://gerrit.wikimedia.org/r/615581 [22:12:16] (03CR) 10BryanDavis: "Follow up at I77446efe52101d873ad1037d4d071875df632f2e" [puppet] - 10https://gerrit.wikimedia.org/r/610191 (owner: 10BryanDavis) [22:14:25] (03PS1) 10Ebernhardson: airflow: Allow scap deploy user to set variables [puppet] - 10https://gerrit.wikimedia.org/r/615582 [22:15:39] (03CR) 10jerkins-bot: [V: 04-1] airflow: Allow scap deploy user to set variables [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [22:18:09] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) Per comment on ticket above this access is not needed, closing [22:18:24] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) 05Stalled→03Declined [22:20:45] (03PS2) 10Ebernhardson: airflow: Allow scap deploy user to set variables [puppet] - 10https://gerrit.wikimedia.org/r/615582 [22:24:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:30:59] (03CR) 10BryanDavis: "PCC output: 
https://puppet-compiler.wmflabs.org/compiler1002/24085/" [puppet] - 10https://gerrit.wikimedia.org/r/615581 (owner: 10BryanDavis) [22:31:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:33:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:37:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:48:00] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:51:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200722T2300). Please do the needful. 
[23:03:43] (03PS1) 10Legoktm: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615439 (https://phabricator.wikimedia.org/T258664) [23:09:08] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:10:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:10:48] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:12:04] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:13:38] I'll use the backport window to deploy my reverts for T258664, but it's going to take a little while for it to pass jenkins [23:13:39] T258664: 25% latency regression July 2nd due to InstallerExtensionSelector service running in production - https://phabricator.wikimedia.org/T258664 [23:15:58] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [23:17:51] (03PS2) 10Legoktm: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615439 (https://phabricator.wikimedia.org/T258664) [23:17:53] (03PS1) 10Legoktm: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615440 (https://phabricator.wikimedia.org/T258664) [23:19:40] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [23:40:10] (03CR) 10Legoktm: [C: 03+2] Revert "Add a new type of database to the installer from extension" [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615439 (https://phabricator.wikimedia.org/T258664) (owner: 10Legoktm) [23:40:16] (03CR) 10Legoktm: [C: 03+2] Revert "Add a new type of database to the installer from extension" [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615440 (https://phabricator.wikimedia.org/T258664) (owner: 10Legoktm) [23:41:36] not sure jenkins will finish in the span of the window :| [23:45:28] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state