[00:00:25] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:43] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:21] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:29] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:51] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:35] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46932008 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:34:31] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 23696 and 55 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:20] (03PS1) 10Krinkle: SkinMustache::generateHTML: Call headElement after getTemplateData [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619092 (https://phabricator.wikimedia.org/T259872) [01:12:35] (03PS2) 10Krinkle: skins: Call headElement() after getTemplateData() in SkinMustache [core] (wmf/1.36.0-wmf.3) - 10https://gerrit.wikimedia.org/r/619092 (https://phabricator.wikimedia.org/T259872) [02:08:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:10:39] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:19:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:24:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:28:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:23:17] (03PS1) 10VulpesVulpes825: Create TemplateEditor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619170 (https://phabricator.wikimedia.org/T260012) [04:44:36] (03PS1) 10Marostegui: report_users: Fix path detection [software] - 10https://gerrit.wikimedia.org/r/619171 [04:44:49] (03PS2) 10Marostegui: wikireplicas_dns: Depool dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/618986 (https://phabricator.wikimedia.org/T255408) [04:46:08] (03CR) 10Marostegui: [C: 03+2] report_users: Fix path detection [software] - 10https://gerrit.wikimedia.org/r/619171 (owner: 10Marostegui) [04:46:16] (03CR) 10Marostegui: [C: 03+2] wikireplicas_dns: Depool dbproxy1019 [puppet] - 10https://gerrit.wikimedia.org/r/618986 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [04:46:39] !log Depool dbproxy1019 for reimage T255408 [04:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:42] T255408: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 [04:47:07] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:48:57] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:01:43] (03PS1) 10Marostegui: install_server: Reimage dbproxy1019 to Buster. [puppet] - 10https://gerrit.wikimedia.org/r/619172 (https://phabricator.wikimedia.org/T255408) [05:02:30] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1019 to Buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/619172 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [06:42:47] !log Stop replication on s8 codfw master to deploy MCR change, this will generate lag on s8 codfw T238966 [06:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:50] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [06:43:28] !log Remove revision triggers from db2094:3318 T238966 [06:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:33] (03CR) 10Gilles: [C: 03+1] Swift object servers: enable object-expirer [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [07:01:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:31] (03PS1) 10Elukey: profile::analytics::refinery::job::refine: exclude MobileWebUIClickTracking [puppet] - 10https://gerrit.wikimedia.org/r/619254 [07:16:08] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::refine: exclude MobileWebUIClickTracking [puppet] - 10https://gerrit.wikimedia.org/r/619254 (owner: 10Elukey) [07:19:54] (03PS1) 10KartikMistry: Enable Content Translation in Sundanese Wikipedia as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619255 (https://phabricator.wikimedia.org/T258502) [07:48:40] gilles: o/ [07:48:44] are you around? 
[07:52:22] (03PS1) 10Elukey: profile::webperf::processors: add generic dashboard_links [puppet] - 10https://gerrit.wikimedia.org/r/619256 (https://phabricator.wikimedia.org/T225739) [07:52:57] (03CR) 10JMeybohm: "As you're generating a quite complex envoy config here, you might want do extend the envoy config validation from: https://gerrit.wikimedi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [07:57:29] (03PS2) 10Elukey: profile::webperf::processors: fix prometheus monitors [puppet] - 10https://gerrit.wikimedia.org/r/619256 (https://phabricator.wikimedia.org/T225739) [07:59:18] (03CR) 10Elukey: [C: 03+2] profile::webperf::processors: fix prometheus monitors [puppet] - 10https://gerrit.wikimedia.org/r/619256 (https://phabricator.wikimedia.org/T225739) (owner: 10Elukey) [08:02:13] (03PS1) 10Jcrespo: mariadb: Add memory monitoring to core (mw) db hosts [puppet] - 10https://gerrit.wikimedia.org/r/619257 (https://phabricator.wikimedia.org/T172490) [08:04:52] (03PS9) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 [08:05:12] (03CR) 10JMeybohm: Add basic sre.discovery.pool and sre.discovery.depool (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [08:06:17] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1001/24385/db1083.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619257 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [08:17:00] (03PS1) 10Jcrespo: mariadb: Add memory check to most other mariadb roles other than core [puppet] - 10https://gerrit.wikimedia.org/r/619258 (https://phabricator.wikimedia.org/T172490) [08:21:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [08:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:22] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [08:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:44] (03PS1) 10ZPapierski: Replace a query service during data reload [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) [08:42:09] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:42:17] PROBLEM - Juniper alarms on mr1-esams is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 91.198.174.247 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:46:00] (03PS2) 10ZPapierski: Replace a query service during data reload [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) [08:51:56] elukey: I am. what's up? [08:52:09] (03PS3) 10ZPapierski: Replace a query service during data reload [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) [08:52:35] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add memory monitoring to core (mw) db hosts [puppet] - 10https://gerrit.wikimedia.org/r/619257 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [08:52:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) (owner: 10Dave Pifke) [09:02:35] gilles: nothing urgent, it was related to puppet broken on webperf nodes. I sent a patch, also added a comment to the task [09:02:46] https://gerrit.wikimedia.org/r/619256 [09:05:26] (03CR) 10Volans: [C: 03+1] "LGTM, ship it! 
😊" [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [09:06:12] thanks [09:06:49] RECOVERY - Juniper alarms on mr1-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:08:41] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 46, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:08:57] (03CR) 10Ayounsi: [C: 03+1] mgmt codfw: migrated Papaul's IP to Netbox [dns] - 10https://gerrit.wikimedia.org/r/619015 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [09:09:21] (03PS2) 10Jcrespo: mariadb: Add memory check to most other mariadb roles other than core [puppet] - 10https://gerrit.wikimedia.org/r/619258 (https://phabricator.wikimedia.org/T172490) [09:18:20] (03PS1) 10Marostegui: Revert "wikireplicas_dns: Depool dbproxy1019" [puppet] - 10https://gerrit.wikimedia.org/r/619093 [09:18:30] (03CR) 10JMeybohm: [C: 03+2] Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [09:19:27] (03Merged) 10jenkins-bot: Add basic sre.discovery.pool and sre.discovery.depool [cookbooks] - 10https://gerrit.wikimedia.org/r/618738 (owner: 10JMeybohm) [09:19:52] (03CR) 10Marostegui: [C: 03+2] Revert "wikireplicas_dns: Depool dbproxy1019" [puppet] - 10https://gerrit.wikimedia.org/r/619093 (owner: 10Marostegui) [09:21:36] !log Promote dbproxy1019 back T255408 [09:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:39] T255408: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 [09:31:32] Hey there! I need someone to revoke my production key (username neilpquinn-wmf).
[09:31:58] (03CR) 10Marostegui: [C: 03+1] mariadb: Add memory check to most other mariadb roles other than core [puppet] - 10https://gerrit.wikimedia.org/r/619258 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [09:32:32] ^ elukey you around? [09:33:53] Also pinging jfishback [09:34:09] Not sure who else I can ping [09:34:31] nshahquinn: we are handling it [09:34:31] nshahquinn: message received! [09:35:25] marostegui volans thank you! 😊 [09:37:55] nshahquinn: I am yes :) [09:38:23] elukey: nice, but looks like it's already handled. [09:38:32] yep [09:38:33] Thank you! [09:39:37] FWIW, my laptop display died and I just gave it for repair. Nothing worse than that. [09:40:24] okok thanks for letting us know :) [09:41:17] (03CR) 10Filippo Giunchedi: [C: 03+2] Add Debian packaging [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:41:19] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Add Debian packaging [debs/karma] - 10https://gerrit.wikimedia.org/r/618764 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:49:16] !log jayme@cumin1001 START - Cookbook sre.discovery.depool [09:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:37] (03PS1) 10Vgutierrez: admin: Revoke neilpquinn-wmf ssh key [puppet] - 10https://gerrit.wikimedia.org/r/619265 [09:51:43] (03CR) 10Volans: [C: 03+1] "LGTM, as requested by the user."
[puppet] - 10https://gerrit.wikimedia.org/r/619265 (owner: 10Vgutierrez) [09:54:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.depool (exit_code=0) [09:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:40] !log Updated container for Jenkins job operations-puppet-tests-buster-docker https://gerrit.wikimedia.org/r/619266 [09:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:03] !log Updated container for Jenkins job operations-dns-lint-docker https://gerrit.wikimedia.org/r/619267 [09:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:50] (03PS3) 10Hnowlan: Support wikifeeds in api-gateway. [deployment-charts] - 10https://gerrit.wikimedia.org/r/618963 (https://phabricator.wikimedia.org/T246265) [10:02:01] (03CR) 10Vgutierrez: [C: 03+2] admin: Revoke neilpquinn-wmf ssh key [puppet] - 10https://gerrit.wikimedia.org/r/619265 (owner: 10Vgutierrez) [10:02:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC LGTM https://puppet-compiler.wmflabs.org/compiler1001/24386/logstash1020.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619032 (https://phabricator.wikimedia.org/T259219) (owner: 10Herron) [10:04:01] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [10:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:24] (03PS1) 10Elukey: admin: remove krb flag from user neilpquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/619269 [10:05:26] nshahquinn: your SSH key has been revoked as requested, thanks for pinging us <3 [10:05:28] (03CR) 10jerkins-bot: [V: 04-1] admin: remove krb flag from user neilpquinn-wmf [puppet] - 10https://gerrit.wikimedia.org/r/619269 (owner: 10Elukey) [10:06:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=chartmuseum site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:07:54] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:10:22] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:25] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619269 (owner: 10Elukey) [10:14:23] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:59] (03CR) 10Hnowlan: [C: 03+2] Support wikifeeds in api-gateway. [deployment-charts] - 10https://gerrit.wikimedia.org/r/618963 (https://phabricator.wikimedia.org/T246265) (owner: 10Hnowlan) [10:18:01] (03Merged) 10jenkins-bot: Support wikifeeds in api-gateway. [deployment-charts] - 10https://gerrit.wikimedia.org/r/618963 (https://phabricator.wikimedia.org/T246265) (owner: 10Hnowlan) [10:18:49] !log jayme@cumin1001 START - Cookbook sre.discovery.pool [10:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:04] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.pool (exit_code=0) [10:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:57] !log jayme@cumin1001 START - Cookbook sre.discovery.depool [10:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:12] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:52] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.depool (exit_code=0) [10:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T1030). [10:32:22] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [10:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:04] !log jayme@cumin1001 START - Cookbook sre.discovery.pool [10:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:14] !log jayme@cumin1001 END (FAIL) - Cookbook sre.discovery.pool (exit_code=99) [10:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:01] !log jayme@cumin1001 START - Cookbook sre.discovery.pool [10:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:54] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619273 (https://phabricator.wikimedia.org/T128546) [10:40:07] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619273 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:40:52] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619273 
(https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:42:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.pool (exit_code=0) [10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:619273| Bumping portals to master (T128546)]] (duration: 01m 01s) [10:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:16] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:44:12] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:619273| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:34] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618989 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:57:26] (03CR) 10Ema: [C: 03+2] cache: move all VCL files to the same directory [puppet] - 10https://gerrit.wikimedia.org/r/618989 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T1100). [11:00:04] VulpesVulpes825: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:06] VulpesVulpes825: around? :-) [11:01:13] Present and waiting for deployment [11:01:25] excellent! [11:01:27] I can deploy today [11:01:42] Urbanecm: Great! 
[11:02:23] (03CR) 10Urbanecm: [C: 03+2] Add WN as an alias to project namespace in Portuguese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619150 (owner: 10VulpesVulpes825) [11:03:10] (03Merged) 10jenkins-bot: Add WN as an alias to project namespace in Portuguese Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619150 (owner: 10VulpesVulpes825) [11:03:35] VulpesVulpes825: ^^ is at mwdebug1001 [11:03:45] (03PS2) 10Urbanecm: Create TemplateEditor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619170 (https://phabricator.wikimedia.org/T260012) (owner: 10VulpesVulpes825) [11:04:07] Urbanecm: Testing... [11:05:28] Urbanecm: WN alias works on ptwikinews at mwdebug1001 [11:05:34] thanks, syncing [11:06:19] !log urbanecm@deploy1001 sync-file aborted: 010f63ed64c599712e9ac11ed7fced666cc88ca1: Add WN as an alias to project namespace in Portuguese Wikinews (T259959) (duration: 00m 00s) [11:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:22] T259959: Add WN as an alias to project namespace in Portuguese Wikinews - https://phabricator.wikimedia.org/T259959 [11:06:44] (03CR) 10Urbanecm: [C: 03+2] Create TemplateEditor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619170 (https://phabricator.wikimedia.org/T260012) (owner: 10VulpesVulpes825) [11:07:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 010f63ed64c599712e9ac11ed7fced666cc88ca1: Add WN as an alias to project namespace in Portuguese Wikinews (T259959) (duration: 00m 58s) [11:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:29] VulpesVulpes825: done [11:07:32] (03Merged) 10jenkins-bot: Create TemplateEditor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619170 (https://phabricator.wikimedia.org/T260012) (owner: 10VulpesVulpes825) [11:08:00] VulpesVulpes825: ^^ is at mwdebug1001 [11:08:50] (03PS1) 10Ayounsi:
Re-prioritize peering over transit - ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/619278 (https://phabricator.wikimedia.org/T259614) [11:08:59] !log Run mwscript namespaceDupes.php --wiki=ptwikinews --fix (T259959) [11:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:38] !log Run mwscript namespaceDupes.php --wiki=ptwikinews --fix --add-prefix=T259959 (T259959) [11:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:46] Urbanecm: Special:Protected pages now display Template Editor Protected on zhwiki at mwdebug1001, so the patch should be working [11:11:05] thanks [11:12:05] (03CR) 10Giuseppe Lavagetto: helmfile: strawman refactoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [11:13:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ba0b2abc0414940749c28a1f82ffbbfd94cd0fc5: Create TemplateEditor group on zhwiki (T260012) (duration: 00m 58s) [11:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:09] T260012: Create TemplateEditor group on zhwiki - https://phabricator.wikimedia.org/T260012 [11:13:16] !log Re-prioritize peering over transit - ulsfo - T259614 [11:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:19] T259614: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 [11:13:32] (03PS2) 10Urbanecm: Increase autoconfirmed threshold for Chinese Wikinews to 7 days and 20 edits at least [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619081 (https://phabricator.wikimedia.org/T259869) (owner: 10Hamish) [11:13:42] (03CR) 10Urbanecm: [C: 03+2] Increase autoconfirmed threshold for Chinese Wikinews to 7 days and 20 edits at least [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619081 (https://phabricator.wikimedia.org/T259869) (owner: 
10Hamish) [11:14:06] VulpesVulpes825: group synced, waiting for CI on the autoconfirmed threshold patch [11:14:50] (03Merged) 10jenkins-bot: Increase autoconfirmed threshold for Chinese Wikinews to 7 days and 20 edits at least [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619081 (https://phabricator.wikimedia.org/T259869) (owner: 10Hamish) [11:15:56] VulpesVulpes825: I have 0 edits there, and it no longer considers me autoconfirmed, syncing... [11:16:26] Urbanecm: Great to hear! [11:16:46] (03CR) 10Ayounsi: [C: 03+2] Re-prioritize peering over transit - ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/619278 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [11:17:08] (03Merged) 10jenkins-bot: Re-prioritize peering over transit - ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/619278 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [11:17:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a15e3a22da13af89e9a2b76fdb24d6b5bebe6ec4: Increase autoconfirmed threshold for Chinese Wikinews to 7 days and 20 edits at least (T259869) (duration: 00m 58s) [11:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:16] T259869: Increase autoconfirmed threshold for Chinese Wikinews - https://phabricator.wikimedia.org/T259869 [11:17:24] VulpesVulpes825: done :) [11:17:27] anything else? [11:17:49] Urbanecm: That is it. Thank you so much for your help. [11:17:55] happy to help then!
[11:20:11] (03PS1) 10Ayounsi: Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/619101 [11:20:48] (03PS2) 10Ayounsi: Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/619101 [11:20:53] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/619101 (owner: 10Ayounsi) [11:21:19] !log repool ulsfo [11:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:46] (03PS2) 10Volans: dns: zone generation improvements [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/618926 [11:22:58] (03PS1) 10Urbanecm: Regenerate shnwiktionary logo from source svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619279 (https://phabricator.wikimedia.org/T260010) [11:23:27] (03CR) 10Urbanecm: [C: 03+2] Regenerate shnwiktionary logo from source svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619279 (https://phabricator.wikimedia.org/T260010) (owner: 10Urbanecm) [11:24:08] (03Merged) 10jenkins-bot: Regenerate shnwiktionary logo from source svg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619279 (https://phabricator.wikimedia.org/T260010) (owner: 10Urbanecm) [11:24:38] (03PS2) 10Urbanecm: add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:25:25] (03CR) 10jerkins-bot: [V: 04-1] add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:27:04] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: c5c96ca419b3ab13c90cb55646be2aa9a07c8527: Regenerate shnwiktionary logo from source svg (T260010) (duration: 00m 58s) [11:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:07] T260010: Regenerate Wiktionary Shan Logo - 
https://phabricator.wikimedia.org/T260010 [11:27:29] !log standardize cr2-eqiad interfaces [11:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:24] !log Purge https://en.wikipedia.org/static/images/project-logos/shnwiktionary*.png with purgeList.php (T260010) [11:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:14] (03PS3) 10Urbanecm: add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:29:39] (03PS4) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [11:29:41] (03PS1) 10Ema: cache: remove 'cache_ats' from 'wikimedia_clusters' [puppet] - 10https://gerrit.wikimedia.org/r/619282 (https://phabricator.wikimedia.org/T241239) [11:29:43] (03PS1) 10Ema: cache: remove cache_ats_ definitions from monitoring::groups [puppet] - 10https://gerrit.wikimedia.org/r/619283 (https://phabricator.wikimedia.org/T241239) [11:30:07] (03CR) 10jerkins-bot: [V: 04-1] cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:31:16] (03PS5) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [11:31:34] (03CR) 10Urbanecm: [C: 03+2] add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:31:38] (03PS4) 10Urbanecm: add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:31:43] (03CR) 10Urbanecm: add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 
(https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:31:45] (03CR) 10jerkins-bot: [V: 04-1] cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:31:47] (03CR) 10Urbanecm: [C: 03+2] add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:32:45] (03Merged) 10jenkins-bot: add two extra namespaces for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619151 (https://phabricator.wikimedia.org/T259987) (owner: 10Ashot1997) [11:35:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 177148777e9c11a9f936d5d8b4d1c201ba9bf7fb: add two extra namespaces for hywiki (T259987) (duration: 00m 59s) [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] T259987: Two new namespaces for hy.wikipedia.org - https://phabricator.wikimedia.org/T259987 [11:37:01] !log Run `mwscript namespaceDupes.php --wiki=hywiki --fix` at mwmaint1002 (T259987) [11:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:37] (03PS2) 10Ema: ATS: set caching to 'websockets' for Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/619036 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [11:39:06] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:39:28] (03PS2) 10Urbanecm: Search Work NS by default at bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617284 (https://phabricator.wikimedia.org/T258982) [11:39:32] (03CR) 10Urbanecm: [C: 03+2] Search Work NS by default at bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617284 (https://phabricator.wikimedia.org/T258982) (owner: 10Urbanecm) [11:40:17] (03Merged) 10jenkins-bot: Search Work NS by 
default at bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617284 (https://phabricator.wikimedia.org/T258982) (owner: 10Urbanecm) [11:40:46] (03CR) 10Ema: [C: 03+2] ATS: set caching to 'websockets' for Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/619036 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [11:41:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0d8366f6951afae25617bea9402aa52e26f34e5d: Search Work NS by default at bnwikisource (T258982) (duration: 00m 59s) [11:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:59] T258982: Add "work" namespace to search results for Bengali Wikisource - https://phabricator.wikimedia.org/T258982 [11:46:39] (03PS1) 10Urbanecm: Regenerate Bengali Wikipedia logo from source SVG [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619284 (https://phabricator.wikimedia.org/T259292) [11:46:53] (03CR) 10Urbanecm: [C: 03+2] Regenerate Bengali Wikipedia logo from source SVG [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619284 (https://phabricator.wikimedia.org/T259292) (owner: 10Urbanecm) [11:47:36] (03Merged) 10jenkins-bot: Regenerate Bengali Wikipedia logo from source SVG [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619284 (https://phabricator.wikimedia.org/T259292) (owner: 10Urbanecm) [11:47:58] (03PS6) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [11:48:23] (03CR) 10jerkins-bot: [V: 04-1] cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:49:20] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: bbbf7018b1b4db1a99f4828d1907c77a19158884: Regenerate Bengali Wikipedia logo from source SVG (T259292) (duration: 00m 59s) [11:49:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:24] T259292: Regenerate Bengali Wikipedia logo - https://phabricator.wikimedia.org/T259292 [11:51:13] (03PS1) 10Urbanecm: Define Portal namespace for tiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619285 (https://phabricator.wikimedia.org/T259295) [11:51:31] (03CR) 10Urbanecm: [C: 03+2] Define Portal namespace for tiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619285 (https://phabricator.wikimedia.org/T259295) (owner: 10Urbanecm) [11:52:33] (03Merged) 10jenkins-bot: Define Portal namespace for tiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619285 (https://phabricator.wikimedia.org/T259295) (owner: 10Urbanecm) [11:52:48] (03PS4) 10Hnowlan: api-gateway: enable TLS when talking to appservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) [11:54:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 14b2897760d2658be70765fbb8e56ad552ce7a81: Define Portal namespace for tiwiki (T259295) (duration: 00m 59s) [11:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:36] T259295: Portal namespace configuration for ti.wikipedia - https://phabricator.wikimedia.org/T259295 [11:55:50] !log Run `mwscript namespaceDupes.php --wiki=tiwiki --fix` at mwmaint1002 (T259295) [11:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:23] !log EU B&C window done [11:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:24] (03PS2) 10Ayounsi: Netbox driven interfaces for cr1/2-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/618827 [12:01:26] (03PS1) 10Ayounsi: Add vrrp_bandwidth_threshold variable [homer/public] - 10https://gerrit.wikimedia.org/r/619286 [12:07:16] !log standardize cr1-eqiad interfaces [12:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:07] 
(03PS2) 10Ema: cache: remove 'cache_ats' from 'wikimedia_clusters' [puppet] - 10https://gerrit.wikimedia.org/r/619282 (https://phabricator.wikimedia.org/T241239) [12:16:51] (03CR) 10Ayounsi: [C: 03+2] "This change is ready for review." [homer/public] - 10https://gerrit.wikimedia.org/r/618827 (owner: 10Ayounsi) [12:17:18] (03Merged) 10jenkins-bot: Netbox driven interfaces for cr1/2-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/618827 (owner: 10Ayounsi) [12:18:20] (03CR) 10Ayounsi: [C: 03+2] Add vrrp_bandwidth_threshold variable [homer/public] - 10https://gerrit.wikimedia.org/r/619286 (owner: 10Ayounsi) [12:18:42] (03Merged) 10jenkins-bot: Add vrrp_bandwidth_threshold variable [homer/public] - 10https://gerrit.wikimedia.org/r/619286 (owner: 10Ayounsi) [12:19:59] (03CR) 10Ema: [C: 03+2] cache: remove 'cache_ats' from 'wikimedia_clusters' [puppet] - 10https://gerrit.wikimedia.org/r/619282 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [12:32:54] (03PS1) 10ZPapierski: Add a weekly reload job for wcqs data reload [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) [12:33:05] (03PS1) 10Ayounsi: Re-prioritize peering over transit - eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/619290 (https://phabricator.wikimedia.org/T259614) [12:33:25] (03CR) 10jerkins-bot: [V: 04-1] Add a weekly reload job for wcqs data reload [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [12:34:02] !log Re-prioritize peering over transit - eqsin - T259614 [12:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] T259614: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 [12:35:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:37:27] (03CR) 10Ayounsi: [C: 03+2] Re-prioritize peering over transit - eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/619290 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [12:37:34] (03CR) 10ZPapierski: "PCC shows expected changes for maintenance file tracking - https://puppet-compiler.wmflabs.org/compiler1001/24390/" [puppet] - 10https://gerrit.wikimedia.org/r/619259 (https://phabricator.wikimedia.org/T259543) (owner: 10ZPapierski) [12:37:49] (03Merged) 10jenkins-bot: Re-prioritize peering over transit - eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/619290 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [12:43:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:45:32] (03PS1) 10Kormat: mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 [12:45:57] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [12:56:23] (03PS7) 10Ema: cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) [12:56:47] (03CR) 10jerkins-bot: [V: 04-1] cache: remove '_ats' suffix from DC names [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [12:58:33] (03CR) 10JMeybohm: [C: 04-1] helmfile: strawman refactoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:14:39] 
PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:18:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:21:44] (03CR) 10Filippo Giunchedi: [C: 03+1] mariadb: Add memory check to most other mariadb roles other than core [puppet] - 10https://gerrit.wikimedia.org/r/619258 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [13:25:47] (03CR) 10Elukey: [C: 03+1] mariadb: Add memory check to most other mariadb roles other than core [puppet] - 10https://gerrit.wikimedia.org/r/619258 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [13:25:53] (03CR) 10Hnowlan: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [13:27:46] (03CR) 10Jcrespo: "This is unrelated to this patch, but I think it would be relevant. This lag uses prometheus metrics from "show slave status". 
This has som" [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [13:28:47] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [13:29:17] (03PS3) 10Filippo Giunchedi: alertmanager: add IRC notifier [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) [13:29:18] (03PS3) 10Filippo Giunchedi: role: add alertmanager::irc to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/617689 (https://phabricator.wikimedia.org/T258948) [13:29:38] (03CR) 10jerkins-bot: [V: 04-1] alertmanager: add IRC notifier [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:29:54] (03CR) 10jerkins-bot: [V: 04-1] role: add alertmanager::irc to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/617689 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:31:21] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/24392/" [puppet] - 10https://gerrit.wikimedia.org/r/617689 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:31:32] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] alertmanager: add IRC notifier [puppet] - 10https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:31:39] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] role: add alertmanager::irc to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/617689 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:45:01] (03CR) 10Jcrespo: "Looks ok: https://puppet-compiler.wmflabs.org/compiler1003/24393/db2098.codfw.wmnet/index.html" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [13:45:06] (03CR) 10Jcrespo: [C: 03+1] mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) 
[13:47:13] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10hashar) Went to https://phabricator.wikimedia.org/config/issue/aphlict.connect/ again and it states: > **Issue... [13:47:18] (03PS4) 10Giuseppe Lavagetto: helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) [13:48:49] PROBLEM - nova-compute proc minimum on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:50:16] (03PS1) 10Filippo Giunchedi: prometheus: Introduce Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) [13:50:41] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Introduce Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [13:50:45] RECOVERY - nova-compute proc minimum on cloudvirt1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:52:56] (03CR) 10Elukey: "> Patch Set 5:" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [13:53:01] (03PS2) 10Filippo Giunchedi: prometheus: Introduce Alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) [13:53:03] (03PS1) 10Filippo Giunchedi: add alertmanager to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/619296 [13:53:33] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add memory check to most other mariadb roles other than core [puppet] - 10https://gerrit.wikimedia.org/r/619258 
(https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [13:53:43] (03CR) 10jerkins-bot: [V: 04-1] add alertmanager to alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/619296 (owner: 10Filippo Giunchedi) [13:55:27] !log Re-prioritize peering over transit - codfw - T259614 [13:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:31] T259614: Re-prioritize peering over transit - https://phabricator.wikimedia.org/T259614 [13:55:45] (03PS6) 10Elukey: Initial release of wmflib [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) [13:56:02] (03PS5) 10Giuseppe Lavagetto: helmfile: refactoring blubberoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) [13:59:49] (03PS1) 10Ayounsi: Re-prioritize peering over transit - codfw [homer/public] - 10https://gerrit.wikimedia.org/r/619299 (https://phabricator.wikimedia.org/T259614) [14:00:49] (03CR) 10Ayounsi: [C: 03+2] Re-prioritize peering over transit - codfw [homer/public] - 10https://gerrit.wikimedia.org/r/619299 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [14:01:14] (03Merged) 10jenkins-bot: Re-prioritize peering over transit - codfw [homer/public] - 10https://gerrit.wikimedia.org/r/619299 (https://phabricator.wikimedia.org/T259614) (owner: 10Ayounsi) [14:03:35] (03CR) 10Ppchelko: "Just one tiny little thingy inlined" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [14:04:04] (03PS2) 10Kormat: mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 [14:04:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [14:07:19] (03CR) 10Herron: [C: 03+2] logstash: increase 'hdd' hosts heap from 24G to 
26G [puppet] - 10https://gerrit.wikimedia.org/r/619032 (https://phabricator.wikimedia.org/T259219) (owner: 10Herron) [14:09:22] (03PS6) 10Herron: kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) [14:09:48] (03CR) 10jerkins-bot: [V: 04-1] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [14:12:27] (03CR) 10Ppchelko: api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [14:12:54] (03CR) 10Volans: [C: 03+1] "LGTM!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:13:05] \o/ [14:13:37] (03CR) 10Filippo Giunchedi: [C: 03+1] cache: remove cache_ats_ definitions from monitoring::groups [puppet] - 10https://gerrit.wikimedia.org/r/619283 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [14:14:00] (03CR) 10Ema: [C: 03+2] cache: remove cache_ats_ definitions from monitoring::groups [puppet] - 10https://gerrit.wikimedia.org/r/619283 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [14:14:03] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:14:03] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:06] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:06] 04Critical Alert for device cr1-eqiad.wikimedia.org - CDR bills over 98% used got acknowledged [14:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:17] (03PS2) 10Ema: cache: remove 
cache_ats_ definitions from monitoring::groups [puppet] - 10https://gerrit.wikimedia.org/r/619283 (https://phabricator.wikimedia.org/T241239) [14:15:06] 04Critical Alert for device cr2-codfw.wikimedia.org - CDR bills over 98% used got acknowledged [14:15:10] 04Critical Alert for device cr2-eqord.wikimedia.org - CDR bills over 98% used got acknowledged [14:15:15] 04Critical Alert for device cr3-ulsfo.wikimedia.org - CDR bills over 98% used got acknowledged [14:15:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [14:15:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:10] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:32] 10Operations, 10ops-eqiad, 10netops: new cloudflare xconnect to cr1-eqiad - https://phabricator.wikimedia.org/T259923 (10RobH) [14:21:15] (03PS7) 10Herron: kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) [14:21:31] (03CR) 10jerkins-bot: [V: 04-1] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [14:22:22] (03CR) 10Hnowlan: api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [14:22:48] (03PS5) 10Hnowlan: api-gateway: enable TLS when talking to appservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 
(https://phabricator.wikimedia.org/T235272) [14:25:03] (03CR) 10Elukey: [C: 03+2] Initial release of wmflib [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:26:20] (03Merged) 10jenkins-bot: Initial release of wmflib [software/pywmflib] - 10https://gerrit.wikimedia.org/r/617403 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [14:31:00] (03CR) 10Herron: [V: 03+2] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [14:31:19] (03CR) 10Herron: [V: 03+2 C: 03+2] kafkamon: add role::kafka::monitoring_buster, assign kafkamon[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/618359 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [14:36:31] (03PS3) 10Jcrespo: mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [14:36:33] (03PS1) 10Jcrespo: mariadb: Reduce s7 memory usage for dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/619301 (https://phabricator.wikimedia.org/T172490) [14:36:57] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [14:37:00] (03PS2) 10Jcrespo: mariadb: Reduce s7 memory usage for dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/619301 (https://phabricator.wikimedia.org/T172490) [14:41:18] (03PS4) 10Jcrespo: mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [14:41:33] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [14:42:44] (03PS5) 10Kormat: mariadb: Give better name to sustained replication lag alert [puppet] - 
10https://gerrit.wikimedia.org/r/619291 [14:43:26] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on cloudvirt1032 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.74: Connection reset by peer andrew bogott rebuild in progress https://wikitech.wikimedia.org/wiki/NTP [14:43:29] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1032 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.74: Connection reset by peer andrew bogott rebuild in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:44:21] (03CR) 10Kormat: [V: 03+2 C: 03+2] mariadb: Give better name to sustained replication lag alert [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [14:47:21] PROBLEM - Check systemd state on kafkamon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:24] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock the run of sre.dns.netbox cookbook as this fixes an issue with addresses not attached to any interface." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/618926 (owner: 10Volans) [14:47:44] 10Operations, 10Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (10MF-Warburg) Mail from @Ladsgroup to list admins: > Mailman allows us to upgrade mailing list by mailing list, that's good but we haven't found a way to keep the old version and t... 
[14:47:48] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reduce s7 memory usage for dbstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/619301 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [14:48:24] !log volans@cumin1001 START - Cookbook sre.dns.netbox [14:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:23] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (... [14:53:47] (03CR) 10Ppchelko: [C: 03+2] api-gateway: enable TLS when talking to appservers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [14:54:54] (03Merged) 10jenkins-bot: api-gateway: enable TLS when talking to appservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/618972 (https://phabricator.wikimedia.org/T235272) (owner: 10Hnowlan) [14:54:59] (03CR) 10JMeybohm: [C: 04-1] helmfile: refactoring blubberoid (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:55:49] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Volans) Any news on this? Now it seems that bohrium's mgmt IP has been taken by pki1001 but keeping the old DNS name, see https://netbox.wikimedia.org/ipam/ip-addresses/155/ [14:59:18] liw: hi. would you be able to help out with jenkins issues on the puppet repo? 
[14:59:51] the `docker-registry.wikimedia.org/releng/operations-puppet:0.7.5` docker image is broken [15:00:01] https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/8828/console is the first build that failed using it [15:01:14] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:11] (03CR) 10Ebernhardson: [C: 03+1] "lgtm. Can ship at any time, this doesn't turn on the test it only makes the backend configuration available to invoke." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [15:02:59] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:32] kormat, I'm afraid I don't know how to even diagnose that, or to fix and build a new Docker image :( [15:04:49] kormat, hasharAway will be back later today, and should be able to help [15:04:53] liw: hah, ok :) [15:05:08] 10Operations: jenkins CI broken for operations/puppet - https://phabricator.wikimedia.org/T260063 (10Kormat) [15:05:17] i filed a task for it, to keep track ^ [15:05:22] liw: thanks anyway [15:05:49] kormat: thanks [15:06:09] kormat, sorry I couldn't be helpful :( [15:07:31] (03PS1) 10Hnowlan: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619305 [15:10:26] (03CR) 10Hnowlan: [C: 03+2] api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619305 (owner: 10Hnowlan) [15:11:05] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:29] (03Merged) 10jenkins-bot: api-gateway: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/619305 (owner: 10Hnowlan) [15:11:40] 10Operations, 
10ops-eqiad, 10netops: new cloudflare xconnect to cr1-eqiad - https://phabricator.wikimedia.org/T259923 (10RobH) [15:12:20] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) [15:12:33] PROBLEM - Check systemd state on kafkamon2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:15] elukey: FYI ^^^ burrow is complaining [15:13:39] (03PS1) 10Jcrespo: mariadb: Increase labsdb* memory monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/619306 (https://phabricator.wikimedia.org/T172490) [15:13:53] (03PS2) 10Jcrespo: mariadb: Increase labsdb* memory monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/619306 (https://phabricator.wikimedia.org/T172490) [15:14:25] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Increase labsdb* memory monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/619306 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [15:14:52] cc ottomata too (burrow ^^^) [15:15:03] volans: I think it is due to herron's work on the new VMs [15:15:16] (upgrade to buster etc..) [15:15:19] ah [15:16:57] yep yep it runs kafka::monitoring_buster [15:17:19] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:19] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) [15:18:27] (03PS1) 10Cmjohnson: Manual removal of asset tags mgmt ip associated with ganeti100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/619308 (https://phabricator.wikimedia.org/T255553) [15:18:34] 10Operations, 10ops-eqiad, 10netops: cloudflare CLF-20200806 dmarc to router patch - https://phabricator.wikimedia.org/T259923 (10RobH) p:05Medium→03High The equinix side of things has been completed via order 1-199908135743. Ports 11/12 on the dmarc panel should connect to our router cr1-eqiad:xe-3/0/5... [15:19:09] volans: ahh I know why, zookeeper is refusing to work, firewall rules probably [15:22:09] (03PS1) 10Hnowlan: api-gateway: Correct puppet CA path [deployment-charts] - 10https://gerrit.wikimedia.org/r/619309 [15:22:20] k [15:22:23] thx for looking [15:22:51] (03CR) 10Ppchelko: [C: 03+2] api-gateway: Correct puppet CA path [deployment-charts] - 10https://gerrit.wikimedia.org/r/619309 (owner: 10Hnowlan) [15:24:07] (03Merged) 10jenkins-bot: api-gateway: Correct puppet CA path [deployment-charts] - 10https://gerrit.wikimedia.org/r/619309 (owner: 10Hnowlan) [15:25:30] (03PS1) 10Elukey: zookeeper: allow new kafkamon vms to contact zookeeper main clusters [puppet] - 10https://gerrit.wikimedia.org/r/619310 (https://phabricator.wikimedia.org/T252773) [15:26:38] 10Operations, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services): jenkins CI broken for operations/puppet - https://phabricator.wikimedia.org/T260063 (10thcipriani) Change from this morning seems innocuous 1bb229918b88bfdcd2eeea1c89002dfb1a63f401 but let's revert b629dc949c50... 
[15:30:36] RECOVERY - Check systemd state on kafkamon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:50] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10mforns) ping @akosiaris? :] [15:32:09] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10mforns) [15:32:15] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:58] PROBLEM - Check systemd state on kafkamon2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:19] (03CR) 10Cmjohnson: [C: 03+2] Manual removal of asset tags mgmt ip associated with ganeti100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/619308 (https://phabricator.wikimedia.org/T255553) (owner: 10Cmjohnson) [15:37:30] 10Puppet, 10Analytics, 10VPS-Projects: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10mforns) p:05Triage→03Medium Maybe a good task for onboarding. 
[15:38:23] 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10Cmjohnson) [15:38:34] 10Operations, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission ganeti100[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T255553 (10Cmjohnson) 05Open→03Resolved [15:38:54] 10Operations: edtadros is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T260070 (10Dzahn) [15:39:06] 10Operations: edtadros is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T260070 (10Dzahn) p:05Triage→03Medium [15:39:56] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): `releng/operations-puppet:0.7.5` docker image broke the `operations-puppet-tests-buster-docker` CI job - https://phabricator.wikimedia.org/T260063 (10thcipriani) [15:40:33] (03PS1) 10Hnowlan: api-gateway: fix configmap mounting only the config [deployment-charts] - 10https://gerrit.wikimedia.org/r/619313 [15:41:02] (03PS1) 10Cmjohnson: Left over mgmt ip for bohrium, server was decom'd 2 years ago [dns] - 10https://gerrit.wikimedia.org/r/619314 (https://phabricator.wikimedia.org/T206315) [15:43:02] (03CR) 10Cmjohnson: [C: 03+2] Left over mgmt ip for bohrium, server was decom'd 2 years ago [dns] - 10https://gerrit.wikimedia.org/r/619314 (https://phabricator.wikimedia.org/T206315) (owner: 10Cmjohnson) [15:47:54] (03PS1) 10Andrew Bogott: cloudvirt103[1,2]: update nic names for Buster [puppet] - 10https://gerrit.wikimedia.org/r/619315 (https://phabricator.wikimedia.org/T259399) [15:49:09] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt103[1,2]: update nic names for Buster [puppet] - 10https://gerrit.wikimedia.org/r/619315 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [15:49:50] RECOVERY - Disk space on netbox1001 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netbox1001&var-datasource=eqiad+prometheus/ops [15:50:00] (03PS2) 10Hnowlan: api-gateway: fix configmap mounting only the config [deployment-charts] - 10https://gerrit.wikimedia.org/r/619313 [15:50:04] (03CR) 10Ppchelko: [C: 03+2] api-gateway: fix configmap mounting only the config [deployment-charts] - 10https://gerrit.wikimedia.org/r/619313 (owner: 10Hnowlan) [15:50:44] (03CR) 10Ppchelko: api-gateway: fix configmap mounting only the config [deployment-charts] - 10https://gerrit.wikimedia.org/r/619313 (owner: 10Hnowlan) [15:51:07] (03CR) 10Ppchelko: [C: 03+2] api-gateway: fix configmap mounting only the config [deployment-charts] - 10https://gerrit.wikimedia.org/r/619313 (owner: 10Hnowlan) [15:51:24] (03Merged) 10jenkins-bot: api-gateway: fix configmap mounting only the config [deployment-charts] - 10https://gerrit.wikimedia.org/r/619313 (owner: 10Hnowlan) [15:51:58] PROBLEM - Host db2087.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:52:35] checking ^ [15:52:44] ah, is mgmt [15:52:54] papaul: are you around db2087 by any chance? ^ [15:53:05] (03PS1) 10Cmjohnson: Updating an-test-worker1003 mgmt ip address - wrong asset tag number [dns] - 10https://gerrit.wikimedia.org/r/619316 (https://phabricator.wikimedia.org/T255520) [15:53:34] (03CR) 10Cmjohnson: [C: 03+2] Updating an-test-worker1003 mgmt ip address - wrong asset tag number [dns] - 10https://gerrit.wikimedia.org/r/619316 (https://phabricator.wikimedia.org/T255520) (owner: 10Cmjohnson) [15:53:46] marostegui: yes [15:53:51] papaul: maybe loose cable? 
[15:53:56] (03CR) 10CRusnov: [C: 03+2] "I have cleaned up some old entries that definitely will not get rotated in once this is merged, and freed a large amount of space, so I am" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [15:54:11] marostegui: maybe, since i am working in that rack [15:54:15] will check [15:54:16] looks like it is back [15:54:27] (03CR) 10CRusnov: [C: 03+2] rotatedump: Enhance to retain period copies (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/562408 (https://phabricator.wikimedia.org/T231512) (owner: 10CRusnov) [15:54:29] papaul: thanks! [15:54:33] marostegui: np [15:55:21] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:30] (03CR) 10Dzahn: [C: 03+2] gerrit: Switch favicon to green git logo [puppet] - 10https://gerrit.wikimedia.org/r/619163 (https://phabricator.wikimedia.org/T257218) (owner: 10QChris) [15:57:58] RECOVERY - Host db2087.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [15:58:14] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): `releng/operations-puppet:0.7.5` docker image broke the `operations-puppet-tests-buster-docker` CI job - https://phabricator.wikimedia.org/T260063 (10hashar) **TLDR puppet 5.5.21 breaks it... 
[15:58:19] (03CR) 10Dzahn: "winner of the poll with 56% approval" [puppet] - 10https://gerrit.wikimedia.org/r/619163 (https://phabricator.wikimedia.org/T257218) (owner: 10QChris) [15:59:26] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] !log volans@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:46] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:59:47] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) [15:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:41] (03PS3) 10Cparle: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [16:04:42] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:05] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): `releng/operations-puppet:0.7.5` docker image broke the `operations-puppet-tests-buster-docker` CI job - https://phabricator.wikimedia.org/T260063 (10hashar) If I got it right, our puppet... [16:08:54] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[16:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:11] (03PS1) 10Hashar: Gemfile: upgrade rspec-puppet to 2.7.x [puppet] - 10https://gerrit.wikimedia.org/r/619319 (https://phabricator.wikimedia.org/T260063) [16:15:14] (03PS1) 10Hashar: Gemfile: match Buster puppet version [puppet] - 10https://gerrit.wikimedia.org/r/619320 (https://phabricator.wikimedia.org/T260063) [16:15:17] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): `releng/operations-puppet:0.7.5` docker image broke the `operations-puppet-tests-buster-docker` CI job - https://phabricator.wikimedia.org/T260063 (10hashar) a:03hashar [16:17:28] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10wiki_willy) a:03Cmjohnson [16:17:34] (03CR) 10Hashar: "I usually pass those changes through Alexandros. rspec-puppet 2.7.x is pretty straightforward." [puppet] - 10https://gerrit.wikimedia.org/r/619319 (https://phabricator.wikimedia.org/T260063) (owner: 10Hashar) [16:20:20] (03CR) 10Hashar: "When we build the CI image we run "bundle install --clean" which snapshots whatever is the latest version available matching the ~> 5.5.10," [puppet] - 10https://gerrit.wikimedia.org/r/619320 (https://phabricator.wikimedia.org/T260063) (owner: 10Hashar) [16:21:19] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) ps1-c8-eqiad is the new pdu, I am not sure if it's setup or not. I will ask robh, he did all the setups for rows A and B. bohrium issue and the asset tag issue have been f... 
[16:26:58] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Vgutierrez) @akosiaris is on vacation, I'll handle this ASAP [16:27:35] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Vgutierrez) a:05DVrandecic→03Vgutierrez [16:27:37] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) [16:28:10] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) I manually updated the cloudcephosd servers in netbox 1004 is 10.65.0.8 1005 is 10.65.0.9 [16:29:12] (03PS2) 10Vgutierrez: admin: Add dvrandecic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/618243 (https://phabricator.wikimedia.org/T259388) (owner: 10Alexandros Kosiaris) [16:33:11] (03CR) 10Vgutierrez: [C: 03+2] admin: Add dvrandecic to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/618243 (https://phabricator.wikimedia.org/T259388) (owner: 10Alexandros Kosiaris) [16:34:24] mutante: may I merge Dzahn: gerrit: Switch favicon to green git logo (5d8829b3b6)? [16:36:25] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Nuria) @DVrandecic will also need a kerberos password [16:37:08] (03CR) 10Marostegui: [C: 03+1] "ok with those values! 
we'll see if we need to tweak them even more later" [puppet] - 10https://gerrit.wikimedia.org/r/619306 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [16:37:42] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Increase labsdb* memory monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/619306 (https://phabricator.wikimedia.org/T172490) (owner: 10Jcrespo) [16:39:36] (03CR) 10CDanis: [C: 03+2] Gemfile: upgrade rspec-puppet to 2.7.x [puppet] - 10https://gerrit.wikimedia.org/r/619319 (https://phabricator.wikimedia.org/T260063) (owner: 10Hashar) [16:39:42] (03CR) 10CDanis: [C: 03+2] Gemfile: match Buster puppet version [puppet] - 10https://gerrit.wikimedia.org/r/619320 (https://phabricator.wikimedia.org/T260063) (owner: 10Hashar) [16:40:14] mutante: I'm merging that one [16:41:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:32] jynus: ok to merge Jcrespo: mariadb: Increase labsdb* memory monitoring thresholds (ca43ae42fa) [16:41:34] ? 
[16:41:38] lol [16:41:41] yeah, I was waiting for the bus [16:41:48] ok proceeding [16:43:56] PROBLEM - Host elastic2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:43:56] PROBLEM - Host elastic2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:44:12] PROBLEM - Host ms-fe2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:44:25] a mgmt switch down?^ [16:45:26] PROBLEM - Host ms-be2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:56] all rack C2 [16:46:37] papaul may be doing msw swaps this week [16:46:45] i don't know if maint mode is included for all hosts in a rack [16:46:46] (03PS6) 10Hnowlan: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) [16:46:52] as that's kinda hard to filter in icinga [16:47:04] in fact... i'm not sure if it's easily possible to regex into a search for icinga [16:47:14] is icinga rack aware of each server? [16:47:54] (03PS1) 10Vgutierrez: admin: Set krb flag for dvrandecic [puppet] - 10https://gerrit.wikimedia.org/r/619325 (https://phabricator.wikimedia.org/T259388) [16:48:28] https://phabricator.wikimedia.org/T253154 is the msw work task [16:48:45] but i don't see a date for c2 [16:49:39] i've asked in the dcops channel [16:49:53] RECOVERY - Host elastic2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.02 ms [16:49:53] RECOVERY - Host elastic2046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.96 ms [16:50:08] RECOVERY - Host ms-fe2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms [16:51:22] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) @volans or @crusnov I think I fixed ps1-c8-eqiad please verify. Thanks! 
Chris [16:51:24] RECOVERY - Host ms-be2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [16:52:51] (03CR) 10Vgutierrez: [C: 03+2] admin: Set krb flag for dvrandecic [puppet] - 10https://gerrit.wikimedia.org/r/619325 (https://phabricator.wikimedia.org/T259388) (owner: 10Vgutierrez) [16:54:38] vgutierrez: please do. it's the winner of the poll with 56% approval. sorry for leaving it open [16:54:48] np, already merged :) [16:54:49] all: gerrit favicon will change ^ [16:54:59] but the point was that it doesn't look like meta [16:55:37] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (10Vgutierrez) 05Open→03Resolved >>! In T259388#6373672, @Nuria wrote: > @DVrandecic will also need a kerberos password ` v... [16:56:33] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928 (10jcrespo) [16:57:06] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10wiki_willy) a:03Cmjohnson [16:58:36] cdanis: what's the best way to figure out which rack those elastic instances are in? 
(you already mentioned they're in rack c2 so i'm asking how to, not the actual answer to be clear) [16:58:54] ryankemper: I just looked at https://netbox.wikimedia.org [16:58:58] (03PS1) 10Cmjohnson: Removing old mgmt (asset tag) for ms-be101[678] [dns] - 10https://gerrit.wikimedia.org/r/619326 (https://phabricator.wikimedia.org/T252008) [16:59:05] ah, duh :P [16:59:06] thx [16:59:32] ryankemper: you can also look in icinga and it will show there in its dependencies as well [16:59:37] Member of [16:59:39] Equipment with asw-c-codfw as uplink, codfw elasticsearch servers [16:59:52] (from the page for the host) [17:00:04] ryankemper: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T1700). [17:02:42] thanks mr bot, deploy started in ~10 mins [17:02:50] starting* [17:04:26] (03CR) 10Cmjohnson: [C: 03+2] Removing old mgmt (asset tag) for ms-be101[678] [dns] - 10https://gerrit.wikimedia.org/r/619326 (https://phabricator.wikimedia.org/T252008) (owner: 10Cmjohnson) [17:06:06] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:47] 10Operations, 10Phabricator, 10Traffic, 10serviceops, and 2 others: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) >>! In T238593#6372927, @hashar wrote: > Seems like setting caching to `websockets` solved the connection... 
[17:09:06] (03PS1) 10Dzahn: arclamp: discard stderr output of generate-svg cronjob [puppet] - 10https://gerrit.wikimedia.org/r/619328 [17:12:15] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Volans) @Cmjohnson thanks for the updates, in order: * ps1-c8-eqiad seems ok from netbox PoV * cloudcephosd: the DNS name for .8 had remained 1005, I've fixed it now, looks ok * bohri... [17:12:41] 10Puppet, 10Beta-Cluster-Infrastructure, 10VPS-Projects: Puppet failures on deployment-docker-changeprop01, deployment-docker-cpjobqueue01, deployment-push-notifications01, deployment-docker-mobileapps01, and deployment-docker-proton01 due to Docker version pinning - https://phabricator.wikimedia.org/T259812 (... [17:12:43] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:59] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:15:32] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:00] (03PS1) 10Dzahn: arclamp: remove absented resources for xenon [puppet] - 10https://gerrit.wikimedia.org/r/619329 [17:16:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:44] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:52] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Volans) [17:18:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:20:21] (03CR) 10Hnowlan: "> Patch Set 5:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:20:43] (03PS7) 10Hnowlan: api-gateway: open parts of the admin interface internally [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) [17:25:06] (03CR) 10Ppchelko: [C: 04-1] api-gateway: open parts of the admin interface internally (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:27:04] (03PS1) 10Cmjohnson: Updating mgmt dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619330 (https://phabricator.wikimedia.org/T259826) [17:28:04] (03CR) 10Dave Pifke: [C: 04-1] "These emails are reporting on a real problem, namely that flamegraph.pl is failing due to out of memory errors (T259167; latest attempt to" [puppet] - 10https://gerrit.wikimedia.org/r/619328 (owner: 10Dzahn) [17:29:15] (03CR) 10Dave Pifke: [C: 03+1] arclamp: remove absented resources for xenon [puppet] - 10https://gerrit.wikimedia.org/r/619329 (owner: 10Dzahn) [17:30:01] (03CR) 10Cmjohnson: [C: 03+2] Updating mgmt dns for pki1001 [dns] - 10https://gerrit.wikimedia.org/r/619330 (https://phabricator.wikimedia.org/T259826) (owner: 10Cmjohnson) [17:31:10] RECOVERY - Check systemd state on kafkamon1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:13] (03CR) 10Ppchelko: [C: 04-1] "A more general question:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/616121 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:31:28] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [17:31:28] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [17:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:41] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:43] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [17:34:43] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:57] (03PS2) 10Dzahn: arclamp: send mail from generate-svg cronjob to perf-team [puppet] - 10https://gerrit.wikimedia.org/r/619328 [17:36:20] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/619328 (owner: 10Dzahn) [17:37:04] PROBLEM - Check systemd state on kafkamon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:11] (03PS3) 10Dzahn: arclamp: send mail from generate-svg cronjob to perf-team [puppet] - 10https://gerrit.wikimedia.org/r/619328 [17:37:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:15] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [17:38:15] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:09] (03CR) 10Dzahn: [C: 03+1] "yea, '~/puppet$ grep -r environment * | grep -i mailto' shows we already do this for some other cases" [puppet] - 10https://gerrit.wikimedia.org/r/619328 (owner: 10Dzahn) [17:39:19] (03CR) 10Dave Pifke: [C: 03+1] "LGTM, thanks." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619328 (owner: 10Dzahn) [17:39:59] (03CR) 10Dzahn: [C: 03+2] arclamp: send mail from generate-svg cronjob to perf-team [puppet] - 10https://gerrit.wikimedia.org/r/619328 (owner: 10Dzahn) [17:40:59] (03PS2) 10Dzahn: arclamp: remove absented resources for xenon [puppet] - 10https://gerrit.wikimedia.org/r/619329 [17:41:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:05] (03CR) 10Dzahn: [C: 03+2] arclamp: remove absented resources for xenon [puppet] - 10https://gerrit.wikimedia.org/r/619329 (owner: 10Dzahn) [17:44:06] (03CR) 10C. Scott Ananian: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618155 (owner: 10Arlolra) [17:44:16] (03PS2) 10C. 
Scott Ananian: Be explicit about disabling nativeGallery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618155 (owner: 10Arlolra) [17:44:45] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) [17:45:23] (03PS1) 10Ottomata: Add eventgate-analytics stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619331 (https://phabricator.wikimedia.org/T251935) [17:45:47] 10Operations, 10ops-eqiad, 10DC-Ops: Check Netbox/dns/reality inconsistencies - https://phabricator.wikimedia.org/T259283 (10Cmjohnson) 05Open→03Resolved updated the DNS name in netbox and did a manual update to pki1001 to reflect the new name allocation for the asset tag. Resolving this task [17:46:42] !log robh@cumin1001 START - Cookbook sre.dns.netbox [17:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:58] (03CR) 10Ottomata: [C: 03+2] Add eventgate-analytics stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619331 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [17:48:32] 10Operations, 10ops-eqiad, 10SRE-swift-storage, 10decommission-hardware, 10Patch-For-Review: decommission ms-be101[678] - https://phabricator.wikimedia.org/T252008 (10Cmjohnson) [17:48:39] 10Operations, 10ops-eqiad, 10SRE-swift-storage, 10decommission-hardware, 10Patch-For-Review: decommission ms-be101[678] - https://phabricator.wikimedia.org/T252008 (10Cmjohnson) 05Open→03Resolved [17:49:53] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventStreamConfig - Add eventgate-analytics streams - T251935 (duration: 01m 02s) [17:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:56] T251935: All EventGate instances should use EventStreamConfig - https://phabricator.wikimedia.org/T251935 [17:50:36] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:50:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:51] (03PS1) 10Cmjohnson: Removing dbproxy1003 mgmt dns for decom [dns] - 10https://gerrit.wikimedia.org/r/619333 (https://phabricator.wikimedia.org/T256216) [17:52:01] (03PS1) 10Ottomata: eventgate-analytics - Use MW EventStreamConfig API, only in staging for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/619334 (https://phabricator.wikimedia.org/T251935) [17:52:03] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:16] (03CR) 10Cmjohnson: [C: 03+2] Removing dbproxy1003 mgmt dns for decom [dns] - 10https://gerrit.wikimedia.org/r/619333 (https://phabricator.wikimedia.org/T256216) (owner: 10Cmjohnson) [17:52:46] (03PS2) 10Ottomata: eventgate-analytics - Use MW EventStreamConfig API, only in staging for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/619334 (https://phabricator.wikimedia.org/T251935) [17:54:31] (03CR) 10Herron: [C: 03+2] zookeeper: allow new kafkamon vms to contact zookeeper main clusters [puppet] - 10https://gerrit.wikimedia.org/r/619310 (https://phabricator.wikimedia.org/T252773) (owner: 10Elukey) [17:54:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:46] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - Use MW EventStreamConfig API, only in staging for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/619334 (https://phabricator.wikimedia.org/T251935) (owner: 10Ottomata) [17:56:53] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[17:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T1800). [18:00:05] zpapierski and Tchanders: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:21] I can deploy [18:00:29] I'm around for zpapierski's patch [18:00:56] i just added a patch to the backport deploy window [18:00:58] and i'm around [18:01:06] RECOVERY - Check systemd state on kafkamon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:16] (03PS2) 10Catrope: Bump the weight of near match for search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618974 (https://phabricator.wikimedia.org/T257922) (owner: 10ZPapierski) [18:01:21] (03CR) 10Catrope: [C: 03+2] Bump the weight of near match for search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618974 (https://phabricator.wikimedia.org/T257922) (owner: 10ZPapierski) [18:01:44] OK, we'll do this in order of appearance [18:01:51] ryankemper first, cscott second [18:01:51] OK [18:01:55] Sounds good [18:02:04] (03Merged) 10jenkins-bot: Bump the weight of near match for search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618974 (https://phabricator.wikimedia.org/T257922) (owner: 10ZPapierski) [18:02:09] And Tchanders third [18:02:10] 10Operations: netbox1001's root partition is filling up - https://phabricator.wikimedia.org/T258912 (10crusnov) 05Open→03Resolved I have fixed this problem and placed a long term fix in place. 
[18:02:22] RECOVERY - Check systemd state on kafkamon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:35] 10Operations: edtadros is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T260070 (10Aklapper) `@edtadros` is a WMF contractor according to T256435 [18:03:02] ryankemper: Your change is now on mwdebug1002 for testing (not sure if that's appropriate/necessary for your change) [18:03:25] Either test it there and let me know when you're satisfied that it works, or tell me to go ahead and deploy straight to production [18:03:26] Thanks, no manual testing here so you're good to roll out to the rest of the fleet [18:03:30] OK great [18:04:02] RoanKattouw: no manual testing for mine either; it just belt-and-suspenders a default before the next version of parsoid rolls out on the train [18:04:20] (03CR) 10Catrope: [C: 03+2] Be explicit about disabling nativeGallery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618155 (owner: 10Arlolra) [18:04:25] i'll be testing it w/ group0 tomorrow [18:04:37] cscott: OK so this is expected to be a no-op? 
(at this time at least) [18:04:43] RoanKattouw: yes [18:04:47] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump the weight of near match for search (T257922) (duration: 00m 59s) [18:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:50] T257922: Search does not return exact title match - https://phabricator.wikimedia.org/T257922 [18:05:01] (03Merged) 10jenkins-bot: Be explicit about disabling nativeGallery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618155 (owner: 10Arlolra) [18:05:38] Thanks for rolling this RoanKattouw [18:07:09] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Explicitly disable nativeGallery in Parsoid settings (no-op) (duration: 00m 58s) [18:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:07] (03PS2) 10Catrope: Enable Special:Investigate on French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618839 (https://phabricator.wikimedia.org/T257891) (owner: 10Tchanders) [18:08:10] (03CR) 10Catrope: [C: 03+2] Enable Special:Investigate on French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618839 (https://phabricator.wikimedia.org/T257891) (owner: 10Tchanders) [18:08:42] (03Merged) 10jenkins-bot: Enable Special:Investigate on French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618839 (https://phabricator.wikimedia.org/T257891) (owner: 10Tchanders) [18:09:36] RECOVERY - Disk space on netbox2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netbox2001&var-datasource=codfw+prometheus/ops [18:10:14] (03CR) 10Dzahn: "confirmed with cumin that all the files are "no such file or directory" on webperf*" [puppet] - 10https://gerrit.wikimedia.org/r/619329 (owner: 10Dzahn) [18:10:34] Tchanders: Your Special:Investigate patch is ready for testing on mwdebug1002 [18:10:45] Thanks - 
having a look [18:11:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: Decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Cmjohnson) [18:11:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: Decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Cmjohnson) 05Open→03Resolved [18:11:48] Thanks RoanKattouw, looks good [18:12:11] Tchanders: You can also access Special:InvestigateBlock to make sure it shows up correctly. [18:12:52] That looks fine [18:13:07] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Special:Investigate on frwiki (T257891) (duration: 00m 58s) [18:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:09] T257891: Rollout Special:Investigate on French wikipedia - https://phabricator.wikimedia.org/T257891 [18:13:43] Thanks Tchanders & RoanKattouw! [18:15:13] Thank you RoanKattouw! [18:15:41] (03PS1) 10Cmjohnson: Removing mgmt dns entries for dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/619338 (https://phabricator.wikimedia.org/T255406) [18:15:54] (03PS1) 10Cicalese: Configured additional settings for API Portal beta wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) [18:16:40] (03PS1) 10Cmjohnson: mend [dns] - 10https://gerrit.wikimedia.org/r/619340 [18:17:01] (03Abandoned) 10Cmjohnson: mend [dns] - 10https://gerrit.wikimedia.org/r/619340 (owner: 10Cmjohnson) [18:17:07] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@3e12dbb]: 0.3.44 [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:36] thanks RoanKattouw ! 
[18:17:49] !log volans@cumin1001 START - Cookbook sre.dns.netbox [18:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:51] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns entries for dbproxy1008 [dns] - 10https://gerrit.wikimedia.org/r/619338 (https://phabricator.wikimedia.org/T255406) (owner: 10Cmjohnson) [18:19:11] subbu: do you have the regression test results for 870be90c6260309b5735213543562bf8da4605f6 or should i re-run the test script? [18:19:44] i had run it on thu / fri i think. [18:20:16] yeah, maybe just comment that the rt test run was clean. usually i put the rt test results as the first comment on the patch. [18:20:34] https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/619336 [18:20:37] and also, fyi: i only pick the section of regression with semantic errors on them ... and maybe add a couple syntactic ones ... plus look at the rtselser page and pick up the section of titles that had semantic errors ... just to make sure there were no shifts in that section. [18:20:49] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:38] my rule of thumb used to be "the first page of results", but for the last few deploys i used a largest list of titles just to make sure i wasn't missing something. 
[18:22:02] but that's a good example of unwritten practices ;) [18:22:06] (03PS2) 10Volans: mgmt codfw: migrated Papaul's IP to Netbox [dns] - 10https://gerrit.wikimedia.org/r/619015 (https://phabricator.wikimedia.org/T233183) [18:22:25] (03PS1) 10Ppchelko: WIP: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [18:23:24] (03CR) 10Volans: [C: 03+2] mgmt codfw: migrated Papaul's IP to Netbox [dns] - 10https://gerrit.wikimedia.org/r/619015 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:24:09] cscott, yes .. part of my todos for the next 2 weeks. [18:26:07] (03PS1) 10Cmjohnson: Missed the asset tag for dbproxy1008 mgmt removal [dns] - 10https://gerrit.wikimedia.org/r/619343 (https://phabricator.wikimedia.org/T255406) [18:26:25] (03CR) 10Dzahn: "also confirmed xenon cron is gone and arclamp cron is there (also there is now MAILTO=performance-team@wikimedia.org in /var/spool/cron/c" [puppet] - 10https://gerrit.wikimedia.org/r/619329 (owner: 10Dzahn) [18:27:03] (03CR) 10Cmjohnson: [C: 03+2] Missed the asset tag for dbproxy1008 mgmt removal [dns] - 10https://gerrit.wikimedia.org/r/619343 (https://phabricator.wikimedia.org/T255406) (owner: 10Cmjohnson) [18:27:48] (03CR) 10Ottomata: "Looks good, we can maybe bikeshed some of the other field names a bit when you write a schema, but looks good as a start! 
:)" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [18:28:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Cmjohnson) [18:28:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Cmjohnson) 05Open→03Resolved [18:31:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:32:25] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@3e12dbb]: 0.3.44 (duration: 15m 18s) [18:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:36] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:19] (03CR) 10Ppchelko: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese) [18:40:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:45] (03PS2) 10Volans: cameras: remove old stale records [dns] - 10https://gerrit.wikimedia.org/r/618952 (https://phabricator.wikimedia.org/T207965) [18:41:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:44:29] (03CR) 10RobH: [C: 03+1] cameras: remove old stale records [dns] - 10https://gerrit.wikimedia.org/r/618952 (https://phabricator.wikimedia.org/T207965) (owner: 10Volans) [18:45:37] (03CR) 10Volans: [C: 03+2] cameras: remove old stale records [dns] - 10https://gerrit.wikimedia.org/r/618952 (https://phabricator.wikimedia.org/T207965) (owner: 10Volans) [18:46:33] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Dzahn) sounds like this was declined instead of resolved. is there more to cleanup? [18:50:16] 10Operations, 10ops-eqiad, 10Patch-For-Review: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10wiki_willy) Hi @Dzahn - we decided that this was no longer needed. There's plenty of cameras onsite run by Equinix (for eqiad) and CyrusOne (for codfw), which we could always default... 
[19:21:01] (03PS1) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:22:30] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [19:23:00] (03PS1) 10Herron: lists: make lists1001 primary mailman host [puppet] - 10https://gerrit.wikimedia.org/r/619354 (https://phabricator.wikimedia.org/T224586) [19:23:54] (03PS2) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:25:11] (03PS3) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:26:38] (03PS2) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [19:26:40] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [19:27:18] (03PS4) 10Volans: mgmt: netbox-generated data for mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/617509 (https://phabricator.wikimedia.org/T233183) [19:29:06] (03PS1) 10Herron: lists: stop automatically sycing fermium to lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/619355 (https://phabricator.wikimedia.org/T224586) [19:29:15] (03PS4) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:30:48] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: add 
basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [19:31:20] (03CR) 10BPirkle: [C: 03+1] "Change looks good. Have not tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619339 (https://phabricator.wikimedia.org/T259569) (owner: 10Cicalese) [19:31:26] (03PS5) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:32:24] (03PS6) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:33:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:43] (03CR) 10jerkins-bot: [V: 04-1] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [19:34:40] (03PS7) 10Andrew Bogott: wmcs/ceph/backy: add basic backup script, wmcs-backup-instances [puppet] - 10https://gerrit.wikimedia.org/r/619350 (https://phabricator.wikimedia.org/T259192) [19:40:44] 10Operations, 10Patch-For-Review: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10Dzahn) The existing link already redirects to https://diff.wikimedia.org/2018/01/09/technology-department-highlights/ Nothing seems broken here. [19:41:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:21] (03CR) 10Dzahn: "the link does not seem to be broken. it redirects to the relevant post from 2018 now on diff.wikimedia.org." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/619129 (https://phabricator.wikimedia.org/T259978) (owner: 10Aklapper) [20:00:04] halfak and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T2000). [20:09:45] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): `releng/operations-puppet:0.7.5` docker image broke the `operations-puppet-tests-buster-docker` CI job - https://phabricator.wikimedia.org/T260063 (10hashar) Thanks @cdanis for the reviews... [20:13:08] !log Updated container for Jenkins job operations-puppet-tests-buster-docker https://gerrit.wikimedia.org/r/c/integration/config/+/619359/ [20:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:56] (03CR) 10Hashar: "recheck T260063" [puppet] - 10https://gerrit.wikimedia.org/r/619295 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [20:15:13] (03CR) 10Hashar: "recheck T260063" [puppet] - 10https://gerrit.wikimedia.org/r/619296 (owner: 10Filippo Giunchedi) [20:15:59] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): `releng/operations-puppet:0.7.5` docker image broke the `operations-puppet-tests-buster-docker` CI job - https://phabricator.wikimedia.org/T260063 (10hashar) 05Open→03Resolved I have u... [20:16:14] 10Operations, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO (Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1))): `releng/operations-puppet:0.7.5` docker image broke the `operations-p... 
- https://phabricator.wikimedia.org/T260063 [20:17:54] (03CR) 10Hashar: "recheck T260063" [puppet] - 10https://gerrit.wikimedia.org/r/619269 (owner: 10Elukey) [20:18:11] (03CR) 10Hashar: "recheck T260063" [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [20:18:21] (03CR) 10Hashar: "recheck T260063" [puppet] - 10https://gerrit.wikimedia.org/r/618975 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [20:24:03] (03Abandoned) 10Hashar: Add flake8 rule for selected modules [puppet] - 10https://gerrit.wikimedia.org/r/263866 (owner: 10John Vandenberg) [20:24:29] (03PS4) 10Dzahn: hiera: switch releases server to releases1001, remove 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/618621 (https://phabricator.wikimedia.org/T247652) [20:30:35] (03CR) 10Dave Pifke: [C: 03+1] "Thanks for fixing this. Filed a follow-up task (T260086) to build a more relevant dashboard." [puppet] - 10https://gerrit.wikimedia.org/r/619256 (https://phabricator.wikimedia.org/T225739) (owner: 10Elukey) [20:32:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:21] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) >>! In T259002#6370883, @Urbanecm wrote: >>>! In T259002#6340904, @Pcoombe wrote: >... [20:42:02] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) Oh, one more thing that I couldn't find in the settings: all users should have acce... 
[20:42:04] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:23] 10Operations, 10Fundraising-Backlog, 10User-Urbanecm, 10User-dancy, 10Wiki-Setup (Create): New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Ladsgroup) Okay then, I think we have everything we need. IMO, except the section (s5 or s3)... [20:45:03] (03PS1) 10Dzahn: add thankyou.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/619361 (https://phabricator.wikimedia.org/T259002) [20:46:49] (03CR) 10Dzahn: [C: 03+2] hiera: switch releases server to releases1001, remove 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/618621 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [20:46:59] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/619361 (https://phabricator.wikimedia.org/T259002) (owner: 10Dzahn) [21:00:04] Reedy and sbassett: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T2100). [21:28:15] (03CR) 10Ppchelko: "Ok. An unexpected complication.... "Only string values are supported in the JSON access log format"..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [21:29:47] (03PS3) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [21:32:58] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:54] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:01] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/619361 (https://phabricator.wikimedia.org/T259002) (owner: 10Dzahn) [21:48:26] 10Operations, 10Fundraising-Backlog, 10Patch-For-Review, 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) >>! In T259002#6374471, @Ladsgroup wrote: > Okay then, I think we have everything we ne... [21:50:38] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [21:50:54] 10Operations, 10Fundraising-Backlog, 10Patch-For-Review, 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) [21:53:55] (03CR) 10Dzahn: [C: 03+2] add thankyou.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/619361 (https://phabricator.wikimedia.org/T259002) (owner: 10Dzahn) [21:53:59] (03PS2) 10Dzahn: add thankyou.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/619361 (https://phabricator.wikimedia.org/T259002) [21:56:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [21:57:20] 10Operations, 10Fundraising-Backlog, 10Patch-For-Review, 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Dzahn) @Pcoombe thankyou.wikipedia.org has been added to DNS Is an '.m.' mobile link needed/wa... 
[22:01:40] (03PS1) 10Dzahn: httpbb: move test_releases.yaml to releases subdir [puppet] - 10https://gerrit.wikimedia.org/r/619387 [22:05:47] (03PS1) 10Dzahn: arclamp: also send mail to perf-team for compress cron job [puppet] - 10https://gerrit.wikimedia.org/r/619389 [22:06:28] (03CR) 10Dzahn: arclamp: send mail from generate-svg cronjob to perf-team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619328 (owner: 10Dzahn) [22:07:29] (03CR) 10Dzahn: [C: 03+2] httpbb: move test_releases.yaml to releases subdir [puppet] - 10https://gerrit.wikimedia.org/r/619387 (owner: 10Dzahn) [22:08:23] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10colewhite) [22:08:46] (03CR) 10RLazarus: [C: 03+1] httpbb: move test_releases.yaml to releases subdir [puppet] - 10https://gerrit.wikimedia.org/r/619387 (owner: 10Dzahn) [22:15:10] (03PS5) 10Dave Pifke: arclamp: Run & scrape Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) [22:15:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:18:37] (03CR) 10Dave Pifke: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/619389 (owner: 10Dzahn) [22:20:00] (03CR) 10Dave Pifke: "> LGTM, just noting that Prometheus will scrape metrics for each arclamp host in each site. 
Pointing this out in case it is relevant for e" [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: 10Dave Pifke) [22:24:28] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1002/24399/" [puppet] - 10https://gerrit.wikimedia.org/r/618388 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [22:28:33] (03PS2) 10Dave Pifke: arclamp: require python-swiftclient [puppet] - 10https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776) [22:32:25] (03PS2) 10Dzahn: arclamp: also send mail to perf-team for compress cron job [puppet] - 10https://gerrit.wikimedia.org/r/619389 [22:33:40] (03CR) 10Dzahn: [C: 03+2] arclamp: also send mail to perf-team for compress cron job [puppet] - 10https://gerrit.wikimedia.org/r/619389 (owner: 10Dzahn) [22:33:48] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:59] (03CR) 10Dzahn: [C: 03+2] "merging it since Filippo as the Prometheus expert already said +1" [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: 10Dave Pifke) [22:36:09] (03PS6) 10Dzahn: arclamp: Run & scrape Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: 10Dave Pifke) [22:38:03] (03CR) 10Cwhite: [C: 03+2] prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618799 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [22:38:12] (03PS3) 10Cwhite: prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/618799 (https://phabricator.wikimedia.org/T256418) [22:38:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:39:03] (03CR) 10Dzahn: "webperf1002: Notice: /Stage[main]/Arclamp/Cron[arclamp_compress_logs]/environment: defined 'environment' as MAILTO=performance-team@wikime" [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: 10Dave Pifke) [22:39:19] (03PS3) 10Dzahn: arclamp: require python-swiftclient [puppet] - 10https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [22:39:32] (03PS4) 10Dave Pifke: arclamp: require python-swiftclient [puppet] - 10https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776) [22:40:12] (03CR) 10Dzahn: [C: 03+2] arclamp: require python-swiftclient [puppet] - 10https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [22:40:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:41:31] (03CR) 10Dzahn: "webperf1002: Package[python-swiftclient]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/618626 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [22:41:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:41:59] 10Operations, 10Fundraising-Backlog, 10Patch-For-Review, 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) >>! In T259002#6374625, @Dzahn wrote: > @Pcoombe > > thankyou.wikipedia.org has been a... 
[22:43:21] (03Abandoned) 10Dave Pifke: [WIP] webperf: add APT repository [puppet] - 10https://gerrit.wikimedia.org/r/613320 (owner: 10Dave Pifke) [22:44:03] (03PS4) 10Dzahn: ATS: switch releases.wm to new buster backend servers [dns] - 10https://gerrit.wikimedia.org/r/618412 (https://phabricator.wikimedia.org/T247652) [22:44:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:46:27] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw daniel_zahn the exporter for arclamp has just been added - its new, so i think this is normal for now https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:46:34] dpifke: ^ i guess that's normal when it's new but also shows it was added [22:50:19] Makes sense if the cron job hadn't run there yet and Prometheus tried to scrape it. [22:50:38] The /metrics file is there now so it should succeed. [22:50:45] (03PS4) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [22:50:51] yea, "<60% available" or so also seemed normal when it's brandnew [22:50:58] cool [22:51:29] (03CR) 10Ppchelko: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [22:53:19] 10Operations, 10observability, 10Performance-Team (Radar), 10Platform Team Workboards (External Code Reviews): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10colewhite) Change to heap metrics merged into service-runner/prom... 
[22:53:40] (03CR) 10Ppchelko: Modify api-gateway access logging to conform to schema (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [22:54:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:54:27] :) [22:56:08] 10Operations, 10Fundraising-Backlog, 10Patch-For-Review, 10User-Urbanecm, and 2 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) @Ladsgroup Any idea how to get rid of mobile subdomain assumption? Is removing mobilefr... [22:57:15] 10Operations, 10observability, 10Performance-Team (Radar), 10Platform Team Workboards (External Code Reviews): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10colewhite) 05Openโ†’03Resolved a:03colewhite Change to router... [22:57:19] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10colewhite) [22:59:44] 10Operations: expired puppet cert on scb1001 - https://phabricator.wikimedia.org/T260094 (10Dzahn) [23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200810T2300). [23:01:03] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Reedy) >>! In T259002#6374756, @Urbanecm wrote: > @Ladsgroup Any idea... 
[23:01:30] 10Operations: expired puppet cert on scb1001 - https://phabricator.wikimedia.org/T260094 (10Dzahn) an attempt to "`puppet cert clean scb1001.eqiad.wmnet`" to then create a new signing request and sign a new cert fails with: ` Error: Could not find a serial number for scb1001.eqiad.wmnet ` [23:09:37] ACKNOWLEDGEMENT - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet last ran 2 days ago daniel_zahn https://phabricator.wikimedia.org/T260094 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:19:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:21:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:26:09] Jdlrobson: Is this safe to backport? 
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/619092/ [23:26:13] I can deploy it now if you're around [23:29:14] (03PS1) 10Dzahn: releases: allow http connections also from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/619393 [23:30:34] (03CR) 10jerkins-bot: [V: 04-1] releases: allow http connections also from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/619393 (owner: 10Dzahn) [23:32:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:17] (03PS2) 10Dzahn: releases: allow http connections also from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/619393 [23:33:45] (03PS3) 10Dzahn: releases: allow http connections also from cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/619393 (https://phabricator.wikimedia.org/T247652) [23:34:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/24402/releases1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/619393 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [23:42:18] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:39] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (10aaron) [23:44:44] (03PS5) 10Dzahn: ATS: switch releases.wikimedia.org to buster backends [dns] - 10https://gerrit.wikimedia.org/r/618412 (https://phabricator.wikimedia.org/T247652) [23:45:08] (03PS6) 10Dzahn: ATS: switch releases.wikimedia.org to buster backends [dns] - 10https://gerrit.wikimedia.org/r/618412 (https://phabricator.wikimedia.org/T247652) [23:46:18] (03CR) 10Dzahn: [C: 03+2] ATS: switch releases.wikimedia.org to buster backends [dns] - 10https://gerrit.wikimedia.org/r/618412 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [23:51:10] (03PS1) 10Dave Pifke: arclamp: configurable email address for cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/619395 [23:52:10] !log https://releases.wikimedia.org switched to new backends running Debian buster. files have been synced of course. [23:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:14] !log https://releases.wikimedia.org switched to new backends running Debian buster. files have been synced. httpbb tests have been created and pass. (T247652) [23:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:16] T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 [23:53:35] (03PS2) 10Dave Pifke: arclamp: configurable email address for cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/619395 [23:56:34] (03CR) 10Dzahn: [C: 03+1] "ah, nice. but you don't need to go one step further and add a lookup() in the parameter and put it in Hiera for labs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/619395 (owner: 10Dave Pifke) [23:56:53] (03CR) 10Dave Pifke: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1003/24403/" [puppet] - 10https://gerrit.wikimedia.org/r/619395 (owner: 10Dave Pifke)