[00:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T0000).
[00:00:31] jouncebot: no phabricator deployment today.
[00:03:06] Operations, SRE-Access-Requests: Failed to SSH - https://phabricator.wikimedia.org/T253820 (jwang)
[00:03:47] Operations, SRE-Access-Requests: Failed to SSH - https://phabricator.wikimedia.org/T253820 (jwang)
[00:06:01] (PS3) Ryan Kemper: debian/changelog: Make spaces match previous entry [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/599140
[00:24:45] Puppet, Puppet-infrastructure-modernization, cloud-services-team (Kanban): broken puppet on codfw1dev VMs - https://phabricator.wikimedia.org/T253817 (Andrew) Open→Resolved a:Andrew I upgraded the puppetmaster packages on labtestpuppetmaster2001 and things are working for now. This isn't...
[00:34:36] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18135120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:30] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:29] (PS6) RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance [puppet] - https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340)
[00:48:04] (CR) RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: RLazarus)
[00:55:29] (PS2) Andrew Bogott: M5 grants: allow designate access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/599137 (https://phabricator.wikimedia.org/T253780)
[00:55:31] (PS1) Andrew Bogott: codfw1dev: add a second ldap server to keystone conf [puppet] - https://gerrit.wikimedia.org/r/599144
[00:56:56] (CR) Andrew Bogott: [C: +2] codfw1dev: add a second ldap server to keystone conf [puppet] - https://gerrit.wikimedia.org/r/599144 (owner: Andrew Bogott)
[01:09:51] Operations, SRE-Access-Requests: Failed to SSH - https://phabricator.wikimedia.org/T253820 (CDanis) Open→Resolved a:CDanis `bast1002.eqiad.wmnet` isn't a host that exists :) If you target `bast1002.wikimedia.org` instead, that should work just fine. It's confusing, but the error message you'...
[01:23:42] (CR) Krinkle: Wikidata client wikis: Define entity sources configuration (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[01:25:23] (CR) Krinkle: Wikidata client wikis: Define entity sources configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[01:35:20] (PS4) EBernhardson: Rename role::wdqs to role::wdqs::public [puppet] - https://gerrit.wikimedia.org/r/598884
[01:35:21] (PS1) EBernhardson: query_service: Move shared config into common file [puppet] - https://gerrit.wikimedia.org/r/599145
[01:35:23] (PS1) EBernhardson: WIP: Consolidate query_service profile duplication [puppet] - https://gerrit.wikimedia.org/r/599146
[01:57:17] (PS1) Krinkle: scap: Remove commit and sync steps from 'update-interwiki-cache' [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107)
[01:59:23] (PS2) Krinkle: scap: Remove commit and sync steps from 'update-interwiki-cache' [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107)
[02:01:25] (PS3) Krinkle: scap: Remove commit and sync steps from 'update-interwiki-cache' [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107)
[02:02:38] Operations, WMF-Legal, Graphite, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (Krinkle)
[02:17:02] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:20:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:32:10] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:44] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:12:07] (PS1) Ppchelko: Enable kafka purges production on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781)
[03:14:42] (PS2) Ppchelko: Enable kafka purges production on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781)
[03:15:38] (CR) Ppchelko: "This will send purges to the same 'resource-purge' topic for group0 wikis only. This does not disable HTCP purges." [mediawiki-config] - https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[03:19:30] (CR) Krinkle: Enable kafka purges production on group0 wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[03:23:37] Pchelolo: btw, is the schema for resource-purge in git? It's the one that restbase uses already, right? I can't seem to find it in mediawiki event schemas
[03:23:46] no worries if it's a new one, just checking if I'm looking in the wrong place
[03:24:03] Krinkle: it's using resource_change schema
[03:24:32] Andrew keeps moving the mapping around, I forgot where is the mapping these days
[03:25:09] Ah, I see. schema != stream
[03:25:12] I keep forgetting this
[03:25:15] okay, got it
[03:29:39] (PS3) Ppchelko: Enable kafka purges production on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781)
[03:30:09] (CR) Ppchelko: Enable kafka purges production on group0 wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[04:22:14] !log Stop MySQL on db1141 - T249188
[04:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:17] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:29:00] Operations: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (CDanis)
[04:35:54] Operations: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (CDanis)
[04:38:28] Operations: planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) - https://phabricator.wikimedia.org/T253824 (CDanis)
[04:43:45] (PS1) Marostegui: dbproxy1018: Pool db1141 into analytics role [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188)
[04:44:11] !log Run check_private data on db1141 - T249188
[04:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:17] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[04:44:27] (CR) Marostegui: [C: -2] "Not ready until check_private data has confirmed this host is redacted entirely" [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[04:56:05] (PS1) Marostegui: mariadb: Promote db1081 to s4 master [puppet] - https://gerrit.wikimedia.org/r/599155 (https://phabricator.wikimedia.org/T253808)
[04:56:40] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/599155 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[04:56:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:58:08] (PS1) Marostegui: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/599156 (https://phabricator.wikimedia.org/T253808)
[04:58:41] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/599156 (https://phabricator.wikimedia.org/T253808) (owner: Marostegui)
[04:58:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:14:58] Operations, SRE-Access-Requests: Failed to SSH - https://phabricator.wikimedia.org/T253820 (jwang) Resolved→Open Thank you for your prompt response. @CDanis Still failed as it was asking for a password. I have tried all the possible passwords. Seems not working. Can you suggest the next step? I...
[05:32:06] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:38] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:01] Operations, ops-eqiad, DBA, Patch-For-Review, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) Data check between db1138 and db1081 (candidate master) finished successfully.
[05:51:29] (PS10) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087)
[05:52:23] (CR) jerkins-bot: [V: -1] Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[05:52:57] (PS11) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087)
[05:53:25] (PS8) WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087)
[05:54:48] (PS6) WMDE-leszek: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975)
[05:54:59] (PS7) WMDE-leszek: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975)
[05:55:36] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:59:16] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:14:28] Operations, ops-eqiad, DBA, Patch-For-Review, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) Blocked a maintenance window on the deployment's calendar for tomorrow.
[06:21:38] (CR) Giuseppe Lavagetto: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: RLazarus)
[06:27:55] (CR) Giuseppe Lavagetto: "Overall LGTM. I think there is one important piece lacking: if we want to use a on-host memcached, we need to install and configure memcac" [puppet] - https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: RLazarus)
[06:29:08] Operations, SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (ayounsi) p:Triage→Medium
[06:30:09] Operations, SRE-Access-Requests: Requesting access for deployment and restricted group for cicalese - https://phabricator.wikimedia.org/T253676 (ayounsi)
[06:30:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1081 from API and set its weight to 0 on main traffic - preparation for tomorrow's failover T253808', diff saved to https://phabricator.wikimedia.org/P11329 and previous config saved to /var/cache/conftool/dbconfig/20200528-063037-marostegui.json
[06:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:43] T253808: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808
[06:39:45] (CR) Giuseppe Lavagetto: [C: +1] blubberoid: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/598759 (https://phabricator.wikimedia.org/T253396) (owner: JMeybohm)
[06:47:20] Operations, MediaWiki-General, serviceops, Service-Architecture: Create a grafana dashboard to monitor services proxied via envoy - https://phabricator.wikimedia.org/T247388 (Joe) Open→Resolved
[06:47:26] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[06:47:44] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (Joe) Open→Resolved
[06:47:48] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[06:48:49] Operations, SRE-Access-Requests: Failed to SSH - https://phabricator.wikimedia.org/T253820 (ayounsi) a:CDanis→ayounsi When doing `ssh -v bast1002.wikimedia.org` It will not apply the `User jiawang` configuration as it's only applied for hosts `bast` `*.wmnet` and `stat7`. Note that everything at...
[06:49:24] Operations: Failed to SSH - https://phabricator.wikimedia.org/T253820 (ayounsi)
[06:49:27] Operations, serviceops, Kubernetes, Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (Joe) Picking this up again - we already migrated the CDN to use https - do we need to do something for CI?
[06:51:03] Operations, Traffic, conftool, Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (Joe) The deploy strategy is simply adding the new users to etcd, move most hosts to use `conftool` as the root user immediately, and then progressively m...
[06:54:18] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:40] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/22807/ DTRT" [puppet] - https://gerrit.wikimedia.org/r/597805 (https://phabricator.wikimedia.org/T97972) (owner: Giuseppe Lavagetto)
[06:58:35] (CR) Ayounsi: "Thanks!" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: Ayounsi)
[06:58:44] (PS5) Ayounsi: Anycast: introduce new "deterministic" variable [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666)
[07:02:24] (CR) Giuseppe Lavagetto: profile::etcd::tlsproxy: add additional users for pools (1 comment) [puppet] - https://gerrit.wikimedia.org/r/597806 (https://phabricator.wikimedia.org/T97972) (owner: Giuseppe Lavagetto)
[07:02:52] (PS15) Giuseppe Lavagetto: profile::etcd::tlsproxy: add additional users for pools [puppet] - https://gerrit.wikimedia.org/r/597806 (https://phabricator.wikimedia.org/T97972)
[07:05:01] (PS1) Ayounsi: Move bpirkle from group restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/599272 (https://phabricator.wikimedia.org/T253640)
[07:06:20] (CR) Dzahn: [C: +1] "if you are ready to open it to the world, sure, looks good" [puppet] - https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[07:12:01] (CR) Muehlenhoff: [C: +1] Move bpirkle from group restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/599272 (https://phabricator.wikimedia.org/T253640) (owner: Ayounsi)
[07:12:50] (PS1) Ladsgroup: Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945)
[07:15:08] (CR) Ayounsi: [C: +2] Move bpirkle from group restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/599272 (https://phabricator.wikimedia.org/T253640) (owner: Ayounsi)
[07:16:37] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment rights for BPirkle - https://phabricator.wikimedia.org/T253640 (ayounsi) Open→Resolved a:ayounsi Done!
[07:20:48] (PS16) Giuseppe Lavagetto: profile::etcd::tlsproxy: add additional users for pools [puppet] - https://gerrit.wikimedia.org/r/597806 (https://phabricator.wikimedia.org/T97972)
[07:22:25] (PS1) Filippo Giunchedi: Move ms-be101[678] to spares [puppet] - https://gerrit.wikimedia.org/r/599275 (https://phabricator.wikimedia.org/T252008)
[07:22:40] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/22826/" [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: Ayounsi)
[07:24:23] (PS3) Gilles: Optimise all static PNGs losslessly [mediawiki-config] - https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108)
[07:26:40] (CR) Gilles: [C: +2] Optimise all static PNGs losslessly [mediawiki-config] - https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[07:27:30] (Merged) jenkins-bot: Optimise all static PNGs losslessly [mediawiki-config] - https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[07:27:55] (CR) Filippo Giunchedi: [C: +2] Move ms-be101[678] to spares [puppet] - https://gerrit.wikimedia.org/r/599275 (https://phabricator.wikimedia.org/T252008) (owner: Filippo Giunchedi)
[07:27:57] Operations, Continuous-Integration-Infrastructure, Doxygen, Patch-For-Review, and 2 others: Update Doxygen to 1.8.18 - https://phabricator.wikimedia.org/T253793 (hashar) p:Triage→Medium
[07:28:04] (PS2) Filippo Giunchedi: Move ms-be101[678] to spares [puppet] - https://gerrit.wikimedia.org/r/599275 (https://phabricator.wikimedia.org/T252008)
[07:29:17] Operations, netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down - TTN-0004110251 - https://phabricator.wikimedia.org/T253610 (ayounsi) As of 6h ago: > OSP crews are hands off at this time and all customers have confirmed restored. There will be no additional updates after this one. Thank you. But the...
[07:31:33] !log gilles@deploy1001 Synchronized static/apple-touch: T252108 Deploying optimised static PNGs (duration: 01m 12s)
[07:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:37] T252108: Optimise production wiki logos - https://phabricator.wikimedia.org/T252108
[07:32:54] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP Ayounsi https://phabricator.wikimedia.org/T253610#6171534 - The acknowledgement expires at: 2020-05-29 07:32:36. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:54] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T253610#6171534 - The acknowledgement expires at: 2020-05-29 07:32:36. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:34] !log gilles@deploy1001 Synchronized static/images: T252108 Deploying optimised static PNGs (duration: 01m 39s)
[07:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:45] Operations, observability, Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (Marostegui)
[07:38:45] (PS2) Privacybatm: transferpy: Package transferpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736)
[07:39:08] (CR) jerkins-bot: [V: -1] transferpy: Package transferpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: Privacybatm)
[07:39:50] Operations, ops-eqiad, DBA, Patch-For-Review, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) Incident report (draft status) created: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200528-s4_(commonswik...
[07:49:26] (CR) Marostegui: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[07:49:40] (CR) Hashar: "Note that from the zuul server (contint2001), I am unable to ssh to gerrit-test:" [puppet] - https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[07:50:45] (CR) DannyS712: [C: +1] "LGTM" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/599140 (owner: Ryan Kemper)
[07:51:53] (CR) Kormat: [C: -1] dbproxy1018: Pool db1141 into analytics role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[07:54:12] (PS2) JMeybohm: blubberoid: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/598759 (https://phabricator.wikimedia.org/T253396)
[07:54:29] (PS1) Gehel: Revert "maps: disable tilerator in codfw for data reload" [puppet] - https://gerrit.wikimedia.org/r/599277
[07:54:34] (PS2) Marostegui: dbproxy1018: Pool db1141 into analytics role [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188)
[07:55:22] ACKNOWLEDGEMENT - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi https://phabricator.wikimedia.org/T253833 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:22] ACKNOWLEDGEMENT - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi https://phabricator.wikimedia.org/T253833 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:15] (PS1) Elukey: profile::superset::proxy: add httpd prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/599278
[07:58:46] (CR) Elukey: [C: +2] profile::superset::proxy: add httpd prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/599278 (owner: Elukey)
[07:58:48] (CR) Kormat: [C: +1] dbproxy1018: Pool db1141 into analytics role [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[08:00:09] (CR) Muehlenhoff: [C: +2] Unconditionally enable mod-crs [puppet] - https://gerrit.wikimedia.org/r/598482 (owner: Muehlenhoff)
[08:02:59] Operations, Maps: maps2001 tilerator service failed - https://phabricator.wikimedia.org/T253835 (ayounsi)
[08:03:42] ACKNOWLEDGEMENT - Check systemd state on maps2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi https://phabricator.wikimedia.org/T253835 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:41] Operations, SRE-Access-Requests: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (ayounsi) Open→Resolved a:MBinder_WMF→RLazarus
[08:07:44] (PS1) Elukey: profile::prometheus::ops: add Superset httpd exporter [puppet] - https://gerrit.wikimedia.org/r/599280
[08:10:05] Operations, DC-Ops, SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (ayounsi) a:Jclark-ctr
[08:12:03] (CR) Filippo Giunchedi: [C: +1] Improve performance of datapoint ingestion [software/druid_exporter] - https://gerrit.wikimedia.org/r/599011 (owner: Elukey)
[08:12:46] (CR) Marostegui: [C: +2] dbproxy1018: Pool db1141 into analytics role [puppet] - https://gerrit.wikimedia.org/r/599153 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[08:13:56] !log Pool db1141 into labsdb analytics role - T249188
[08:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:00] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188
[08:14:24] (CR) Dzahn: [C: +2] zuul: use modern [connection] section in config [puppet] - https://gerrit.wikimedia.org/r/598057 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[08:18:44] (PS1) Marostegui: db1141: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/599282 (https://phabricator.wikimedia.org/T249188)
[08:19:21] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:26] (PS8) Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - https://gerrit.wikimedia.org/r/598718 (https://phabricator.wikimedia.org/T233947)
[08:19:51] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[08:20:54] (PS1) Hashar: zuul: specify the driver for gerrit connection [puppet] - https://gerrit.wikimedia.org/r/599283 (https://phabricator.wikimedia.org/T253263)
[08:21:41] ACKNOWLEDGEMENT - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/598057 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:41] ACKNOWLEDGEMENT - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger daniel_zahn https://gerrit.wikimedia.org/r/c/operations/puppet/+/598057 https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[08:22:07] (CR) Dzahn: [C: +2] zuul: specify the driver for gerrit connection [puppet] - https://gerrit.wikimedia.org/r/599283 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[08:23:09] (PS3) Hashar: zuul: add a connection to gerrit-test.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263)
[08:23:19] (PS1) Gilles: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108)
[08:23:43] (CR) Hashar: [C: -1] "I have added a missing 'driver=gerrit'." [puppet] - https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[08:24:14] (CR) Elukey: "Maybe both superset and matomo should be in the analytics instance?" [puppet] - https://gerrit.wikimedia.org/r/599280 (owner: Elukey)
[08:25:30] (CR) Dzahn: [V: +2 C: +2] zuul: specify the driver for gerrit connection [puppet] - https://gerrit.wikimedia.org/r/599283 (https://phabricator.wikimedia.org/T253263) (owner: Hashar)
[08:26:33] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[08:26:42] (CR) Marostegui: [C: +2] db1141: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/599282 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[08:26:48] (CR) jerkins-bot: [V: -1] Remove support for non-Tomcat and non-DEB deployments [puppet] - https://gerrit.wikimedia.org/r/598718 (https://phabricator.wikimedia.org/T233947) (owner: Muehlenhoff)
[08:28:04] (PS2) Gilles: Lossy optimisation of Wikipedia logos static PNGs [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108)
[08:28:25] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:02] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:15] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:39] !log de-pref all OSPF links to cr2-eqord - T243080
[08:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:50] !log deactivate peering/transit on cr2-eqord - T243080
[08:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:15] (PS3) JMeybohm: blubberoid: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/598759 (https://phabricator.wikimedia.org/T253396)
[08:47:34] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22827/conf1004.eqiad.wmnet/index.html LGTM. I will only commit on one server and then tes" [puppet] - https://gerrit.wikimedia.org/r/597806 (https://phabricator.wikimedia.org/T97972) (owner: Giuseppe Lavagetto)
[08:47:37] (CR) Jbond: [C: +1] Enable managed adduser config for ulsfo [puppet] - https://gerrit.wikimedia.org/r/599042 (https://phabricator.wikimedia.org/T235162) (owner: Muehlenhoff)
[08:48:18] (CR) Giuseppe Lavagetto: [C: +2] profile::etcd::tlsproxy: add additional users for pools [puppet] - https://gerrit.wikimedia.org/r/597806 (https://phabricator.wikimedia.org/T97972) (owner: Giuseppe Lavagetto)
[08:48:47] (CR) JMeybohm: [C: +2] "Needed to re-package the tgz to include the fixes in current template version." [deployment-charts] - https://gerrit.wikimedia.org/r/598759 (https://phabricator.wikimedia.org/T253396) (owner: JMeybohm)
[08:49:20] (Merged) jenkins-bot: blubberoid: Update to v0.2 helpers [deployment-charts] - https://gerrit.wikimedia.org/r/598759 (https://phabricator.wikimedia.org/T253396) (owner: JMeybohm)
[08:50:07] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh)
[08:50:15] !log install new Junos on cr2-eqord - T243080
[08:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:35] <_joe_> !log upgrading etcd ACLs (adding new users) to conf1004
[08:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:31] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/598836 (https://phabricator.wikimedia.org/T253666) (owner: Ayounsi)
[08:52:52] (CR) Filippo Giunchedi: "See inline, yeah ops instance is fine here!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599280 (owner: Elukey)
[08:53:49] (CR) Muehlenhoff: [C: +2] Enable managed adduser config for ulsfo [puppet] - https://gerrit.wikimedia.org/r/599042 (https://phabricator.wikimedia.org/T235162) (owner: Muehlenhoff)
[08:54:13] Operations, Release-Engineering-Team, SRE-tools: Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635 (hashar)
[08:54:51] (PS2) Elukey: profile::prometheus::ops: add Superset httpd exporter [puppet] - https://gerrit.wikimedia.org/r/599280
[08:55:25] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02331 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:56:30] _joe_: ^
[08:56:40] Operations, WMF-Legal, observability, Graphite, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (fgiunchedi)
[08:56:49] <_joe_> XioNoX: widespread?
[08:56:51] <_joe_> oh damn
[08:57:18] error on one host is "Error while evaluating a Function Call, Failed to parse template profile/etcd/htpasswd.erb: Filepath: /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb Line: 59 Detail: Could not autoload puppet/parser/functions/htpasswd: cannot load such file -- apr1md5 (file: /etc/puppet/modules ..."
[08:57:32] <_joe_> yeah
[08:57:46] <_joe_> I hoped it would only be a problem on the etcd servers
[08:57:52] I see the chat on -sre now
[08:58:08] <_joe_> anyways, working on it
[08:59:11] (CR) Elukey: [C: +2] profile::prometheus::ops: add Superset httpd exporter [puppet] - https://gerrit.wikimedia.org/r/599280 (owner: Elukey)
[09:05:57] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[09:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:20] !log restart cr2-eqord for upgrade - T243080 [09:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:42] no OOB there, should be back in 10min :) [09:11:19] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:11:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:11:29] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:13:41] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:13:54] (03PS1) 10Hashar: Gemfile: fix pry / pry-byebug incompatibility [puppet] - 10https://gerrit.wikimedia.org/r/599288 (https://phabricator.wikimedia.org/T253635) [09:14:52] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: fix pry / pry-byebug incompatibility [puppet] - 10https://gerrit.wikimedia.org/r/599288 (https://phabricator.wikimedia.org/T253635) (owner: 10Hashar) [09:14:59] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:15:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:13] and it's back! [09:16:21] !log rollback cr2-eqord ospf/bgp - T243080 [09:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:25] (03CR) 10Elukey: [V: 03+2 C: 03+2] Improve performance of datapoint ingestion [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/599011 (owner: 10Elukey) [09:18:27] (03CR) 10Jbond: [C: 03+1] "LGTM you will need to bump the docker image as well. 
CI error seems unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/599288 (https://phabricator.wikimedia.org/T253635) (owner: 10Hashar) [09:18:35] (03PS1) 10JMeybohm: blubberoid: Use chart defaults for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/599289 (https://phabricator.wikimedia.org/T253396) [09:18:45] everything back up, removing downtime [09:18:54] 10Operations, 10observability, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10fgiunchedi) Definitely +1 to alert on sth like this, it is yet unclear to me if we can alert in a platform-independent way. Anyways see also "prior art" in {T214516} where clearly... [09:22:13] !log install new Junos on cr2-eqdfw - T243080 [09:22:16] 10Operations, 10observability: Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084 (10fgiunchedi) Probably "all errors" is quite ambitious (see T253810) but a way to detect whether there's warnings with the server (e.g. detect if the amber led or equivalent would be on) would... 
[09:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:19] (03PS2) 10MSantos: Revert "maps: disable tilerator in codfw for data reload" [puppet] - 10https://gerrit.wikimedia.org/r/599277 (owner: 10Gehel) [09:22:27] (03CR) 10MSantos: [C: 03+1] Revert "maps: disable tilerator in codfw for data reload" [puppet] - 10https://gerrit.wikimedia.org/r/599277 (owner: 10Gehel) [09:22:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] blubberoid: Use chart defaults for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/599289 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [09:23:28] (03CR) 10Gehel: [C: 03+2] Revert "maps: disable tilerator in codfw for data reload" [puppet] - 10https://gerrit.wikimedia.org/r/599277 (owner: 10Gehel) [09:25:21] 10Operations, 10observability, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10MoritzMuehlenhoff) >>! In T253810#6171818, @fgiunchedi wrote: > Definitely +1 to alert on sth like this, it is yet unclear to me if we can alert in a platform-independent way. It... 
[09:25:26] RECOVERY - Check systemd state on maps2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:25:41] <_joe_> !log updating ACLs on all etcd servers [09:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:34] RECOVERY - Check systemd state on maps2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:40] RECOVERY - Check systemd state on maps2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:12] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:21] (03PS9) 10Jbond: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [09:30:26] !log deactivate peering/transit on cr2-eqdfw - T243080 [09:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:33] 10Operations, 10Maps: maps2001 tilerator service failed - https://phabricator.wikimedia.org/T253835 (10Gehel) 05Open→03Resolved a:03Gehel data reload is completed, tilerator re-activated [09:31:00] (03PS2) 10Hashar: Gemfile: fix pry / pry-byebug incompatibility [puppet] - 10https://gerrit.wikimedia.org/r/599288 (https://phabricator.wikimedia.org/T253635) [09:31:01] (03PS1) 10Hashar: Run rubocop on Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/599290 [09:31:04] (03CR) 10Jbond: [C: 03+1] "LGTM (i also fixed the spec test)" [puppet] - 10https://gerrit.wikimedia.org/r/598718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [09:32:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/599290 (owner: 10Hashar) [09:32:54] 10Operations, 
10Gerrit, 10Patch-For-Review: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) After talking with Manuel i realized finally the ro password was in pwstore, not in the passwords module in the private puppet repo. The one from the pwstore... [09:33:23] !log restart cr2-eqdfw for upgrade - T243080 [09:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:20] (03CR) 10Hashar: "The comment rubocop:disable Metrics/LineLength is to disable the line length check which is set at 159 characters." [puppet] - 10https://gerrit.wikimedia.org/r/599290 (owner: 10Hashar) [09:35:22] !log restarting gerrit on gerrit1002 after fixing db_pass to the readonly one (T243800) [09:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:26] T243800: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 [09:36:30] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:36:30] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:36:34] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:36:53] expected ^ [09:37:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 57 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:39:46] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:08] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 
4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:12] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:41:19] !log re-activate peering/transit on cr2-eqdfw - T243080 [09:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:25] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:48:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:50:16] 10Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (10MoritzMuehlenhoff) [09:51:04] !log failover VRRP in ulsfo [09:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:37] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) 05Resolved→03Open Was blocked by T243800 which is now fixed. Though gerrit service does not keep running due to other issues, like file permissions. 
Reopening [09:52:54] 10Operations, 10Gerrit, 10Patch-For-Review: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) 05Open→03Resolved [09:53:16] (03PS1) 10JMeybohm: tls_helper: Add envoy readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/599293 [09:54:38] (03PS1) 10Marostegui: monitor_readonly.pp: Add a more specific page for this alert [puppet] - 10https://gerrit.wikimedia.org/r/599294 (https://phabricator.wikimedia.org/T253832) [09:57:15] (03CR) 10Dzahn: [C: 03+1] monitor_readonly.pp: Add a more specific page for this alert [puppet] - 10https://gerrit.wikimedia.org/r/599294 (https://phabricator.wikimedia.org/T253832) (owner: 10Marostegui) [09:58:17] (03CR) 10Marostegui: [C: 03+2] "Looks good: https://puppet-compiler.wmflabs.org/compiler1001/22833/" [puppet] - 10https://gerrit.wikimedia.org/r/599294 (https://phabricator.wikimedia.org/T253832) (owner: 10Marostegui) [09:59:42] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Marostegui) [10:02:39] !log gerrit1002 (test server) - chown -R gerrit2:gerrit2 /var/lib/gerrit/review_site ; restarted gerrit service, now the service is not in restart loop anymore, gerrit-ssh is listening too, just not accepting publickey (T239151) [10:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:43] T239151: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 [10:04:15] 10Operations, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10MoritzMuehlenhoff) [10:05:15] 10Operations, 10Tor: Retire the Tor relay - https://phabricator.wikimedia.org/T243288 (10MoritzMuehlenhoff) 05Open→03Resolved This is done for a while now. 
[10:06:20] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) 05Open→03Resolved ` ● gerrit.service - Gerrit code review tool Loaded: loaded (/lib/systemd/system/gerrit.service; enabled; vendor preset: enabled) Active: active (running)... [10:07:25] 10Operations, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None [10:08:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:09:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tls_helper: Add envoy readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/599293 (owner: 10JMeybohm) [10:09:58] (03PS2) 10Ayounsi: Cumin: add network devices support [puppet] - 10https://gerrit.wikimedia.org/r/596389 [10:10:07] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:10:26] (03CR) 10JMeybohm: [C: 03+2] tls_helper: Add envoy readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/599293 (owner: 10JMeybohm) [10:10:31] (03CR) 10Ayounsi: "Thanks!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/596389 (owner: 10Ayounsi) [10:10:36] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) Hashar confirmed things are working now... 
[10:10:51] (03Merged) 10jenkins-bot: tls_helper: Add envoy readiness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/599293 (owner: 10JMeybohm) [10:13:30] (03PS3) 10Ayounsi: Cumin: add network devices support [puppet] - 10https://gerrit.wikimedia.org/r/596389 [10:14:57] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/22835/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596389 (owner: 10Ayounsi) [10:15:01] (03CR) 10Ayounsi: [C: 03+2] Cumin: add network devices support [puppet] - 10https://gerrit.wikimedia.org/r/596389 (owner: 10Ayounsi) [10:17:30] (03PS2) 10JMeybohm: blubberoid: Use chart defaults for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/599289 (https://phabricator.wikimedia.org/T253396) [10:19:27] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1081 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/599155 (https://phabricator.wikimedia.org/T253808) (owner: 10Marostegui) [10:19:54] (03CR) 10Dzahn: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [10:20:08] (03CR) 10Kormat: [C: 03+1] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/599156 (https://phabricator.wikimedia.org/T253808) (owner: 10Marostegui) [10:20:53] (03PS4) 10Dzahn: zuul: add a connection to gerrit-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [10:22:14] (03PS10) 10Muehlenhoff: Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (https://phabricator.wikimedia.org/T233947) [10:23:37] 10Operations, 10homer: Homer: add show support - https://phabricator.wikimedia.org/T250413 (10ayounsi) 05Open→03Resolved a:03ayounsi Done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/596389 Now commands like `$ sudo cumin D{cr4-ulsfo.wikimedia.org} 'show version'` 
are possible. Multiple co... [10:24:25] (03CR) 10Dzahn: [C: 03+1] zuul: add a connection to gerrit-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/598058 (https://phabricator.wikimedia.org/T253263) (owner: 10Hashar) [10:24:34] (03CR) 10Jbond: [C: 03+2] profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [10:24:52] (03CR) 10Jbond: [C: 03+2] apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [10:25:04] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-08-19 09:53:05 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [10:25:44] RECOVERY - HTTPS-planet on en.planet.wikimedia.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-08-19 09:53:05 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [10:25:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/598983 (owner: 10Muehlenhoff) [10:27:24] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for non-Tomcat and non-DEB deployments [puppet] - 10https://gerrit.wikimedia.org/r/598718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [10:27:55] (03CR) 10JMeybohm: [C: 03+2] blubberoid: Use chart defaults for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/599289 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [10:28:20] (03Merged) 10jenkins-bot: blubberoid: Use chart defaults for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/599289 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [10:30:29] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[10:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:25] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [10:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:21] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [10:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:23] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete pdns3hack class [puppet] - 10https://gerrit.wikimedia.org/r/598983 (owner: 10Muehlenhoff) [10:41:15] 10Operations, 10MediaWiki-General, 10serviceops, 10Service-Architecture: Create a grafana dashboard to monitor services proxied via envoy - https://phabricator.wikimedia.org/T247388 (10JMeybohm) Just to have the reference here. I guess it's: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry [10:42:34] (03CR) 10Muehlenhoff: [C: 03+2] profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [10:45:24] (03PS7) 10Jbond: profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [10:47:41] (03PS1) 10Filippo Giunchedi: prometheus: include ::profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) [10:47:43] (03PS1) 10Filippo Giunchedi: conftool: bail on confctl not found [puppet] - 10https://gerrit.wikimedia.org/r/599299 (https://phabricator.wikimedia.org/T253840) [10:51:10] (03PS1) 10Muehlenhoff: Unconditionally apply SubjectAltNameWarning [puppet] - 10https://gerrit.wikimedia.org/r/599300 [10:52:56] 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: CAS build as a deb - https://phabricator.wikimedia.org/T233947 (10MoritzMuehlenhoff) Puppet is now deployed as a deb. 
[10:55:28] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat Waiting on a review https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:57:37] (03PS8) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [10:58:54] (03CR) 10jerkins-bot: [V: 04-1] apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [10:59:26] (03PS9) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS.
[11:01:07] (03PS1) 10Kormat: mariadb: Enable notifications for db2138 [puppet] - 10https://gerrit.wikimedia.org/r/599304 (https://phabricator.wikimedia.org/T252985) [11:01:50] (03CR) 10Marostegui: [C: 03+1] mariadb: Enable notifications for db2138 [puppet] - 10https://gerrit.wikimedia.org/r/599304 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [11:02:22] (03CR) 10Kormat: [C: 03+2] mariadb: Enable notifications for db2138 [puppet] - 10https://gerrit.wikimedia.org/r/599304 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [11:03:34] !log kormat@cumin1001 dbctl commit (dc=all): 'Add db2138 to s2+s4 T252985', diff saved to https://phabricator.wikimedia.org/P11330 and previous config saved to /var/cache/conftool/dbconfig/20200528-110333-kormat.json [11:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:38] T252985: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 [11:04:57] (03CR) 10Hnowlan: "lgtm within my limited understanding of this but I don't think I'm qualified to even +1 this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [11:05:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:09:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:13:44] (03PS1) 10Muehlenhoff: Add component/idp-test and enable on the staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) [11:14:50] (03CR) 10jerkins-bot: [V: 04-1] Add component/idp-test and enable on the staging IDPs [puppet] - 
10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [11:20:19] (03CR) 10Jbond: Add component/idp-test and enable on the staging IDPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [11:23:23] (03CR) 10Dzahn: [C: 03+1] sre.hosts.decommission: check repositories [cookbooks] - 10https://gerrit.wikimedia.org/r/598065 (owner: 10Volans) [11:23:53] (03PS1) 10Esanders: Install DiscussionTools on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) [11:24:15] (03CR) 10Esanders: [C: 04-2] "Awaiting sign-off" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599307 (https://phabricator.wikimedia.org/T252264) (owner: 10Esanders) [11:26:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: Short-circuit urlproxy lookups against canonical_domain [puppet] - 10https://gerrit.wikimedia.org/r/599139 (https://phabricator.wikimedia.org/T253816) (owner: 10BryanDavis) [11:29:00] (03CR) 10Arturo Borrero Gonzalez: "Thanks! I tested this in toolsbeta to the extend it's possible. 
The k8s ingress (inside the cluster) has little support for using a domain" [puppet] - 10https://gerrit.wikimedia.org/r/599139 (https://phabricator.wikimedia.org/T253816) (owner: 10BryanDavis) [11:38:59] (03PS2) 10Muehlenhoff: Add component/idp-test and enable on the staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) [11:49:00] !log installing unbound security updates [11:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:48] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [11:53:58] 10Operations, 10Mail, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10MoritzMuehlenhoff) [12:01:59] 10Operations, 10serviceops: decom old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Dzahn) [12:02:53] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:08:33] (03CR) 10Jbond: "i think i may have add more confusion with the last comment" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [12:11:54] (03PS2) 10Filippo Giunchedi: prometheus: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) [12:11:55] (03PS2) 10Filippo Giunchedi: conftool: bail on confctl not found [puppet] - 10https://gerrit.wikimedia.org/r/599299 (https://phabricator.wikimedia.org/T253840) [12:14:43] 10Operations, 10DC-Ops, 10serviceops: decom 36 old appservers in eqiad (onsite, dcops) - 
https://phabricator.wikimedia.org/T253856 (10Dzahn) [12:15:46] (03PS1) 10Kormat: mariadb: Add db2139 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/599313 (https://phabricator.wikimedia.org/T252985) [12:16:14] 10Operations, 10DC-Ops, 10serviceops: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Dzahn) a:05Dzahn→03Cmjohnson Hi Chris, this is a new ticket I made in response to T218751#6134304. [12:17:13] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Dzahn) >>! In T218751#6134304, @Cmjohnson wrote: > there are more because of the mw's that need to be decom'd. I did not see a decommission task for them. @Cmjohnson The decom ticket for old eqiad hosts... [12:17:17] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:18:07] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Dzahn) [12:18:08] (03PS3) 10Filippo Giunchedi: conftool: bail on confctl not found [puppet] - 10https://gerrit.wikimedia.org/r/599299 (https://phabricator.wikimedia.org/T253840) [12:18:10] 10Operations, 10DC-Ops, 10serviceops: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Dzahn) [12:18:13] (03PS3) 10Filippo Giunchedi: prometheus: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) [12:19:32] 10Operations, 10DC-Ops, 10serviceops: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10Dzahn) p:05Triage→03Medium [12:22:16] (03PS1) 10Arturo Borrero Gonzalez: toolforge: urlproxy: refresh code comments [puppet] - 10https://gerrit.wikimedia.org/r/599315 (https://phabricator.wikimedia.org/T253816) [12:22:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [12:23:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: urlproxy: refresh code comments [puppet] - 10https://gerrit.wikimedia.org/r/599315 (https://phabricator.wikimedia.org/T253816) (owner: 10Arturo Borrero Gonzalez) [12:25:49] (03CR) 10Marostegui: [C: 03+1] mariadb: Add db2139 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/599313 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:26:20] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10Dzahn) @MBeat33 You're welcome and it sounds good to me. Should we keep this ticket open for more discussion with additional stakeholders or should we close it as it answered the original question? @MBe... [12:26:32] (03CR) 10Kormat: [C: 03+2] mariadb: Add db2139 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/599313 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:33:15] (03CR) 10Muehlenhoff: Add component/idp-test and enable on the staging IDPs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [12:35:06] (03PS4) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [12:35:21] (03CR) 10Dzahn: "needed manual rebase. modules/profile/files/mediawiki/web_testing/tests/test_wikimania_wikimedia has meanwhile been deleted. 
instead edite" [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [12:37:54] (03CR) 10Jbond: [C: 03+1] "lgtm" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [12:38:36] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/22837/" [puppet] - 10https://gerrit.wikimedia.org/r/599298 (https://phabricator.wikimedia.org/T253840) (owner: 10Filippo Giunchedi) [12:49:02] (03PS3) 10Muehlenhoff: Add component/idp-test and enable on the staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) [12:58:18] (03PS2) 10Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) [12:59:07] 10Operations, 10Discovery-Search: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540 (10MoritzMuehlenhoff) This is now a parameter of the java profile, so once Elastic* are migrated to that, it can easily be flipped on via Hiera for a single host. 
[12:59:49] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [13:00:10] 10Operations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10MoritzMuehlenhoff) [13:04:13] (03PS11) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [13:04:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Run rubocop on Vagrantfile [puppet] - 10https://gerrit.wikimedia.org/r/599290 (owner: 10Hashar) [13:05:26] (03PS1) 10Ema: 0.8: add new prometheus metrics [software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) [13:06:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] Gemfile: fix pry / pry-byebug incompatibility [puppet] - 10https://gerrit.wikimedia.org/r/599288 (https://phabricator.wikimedia.org/T253635) (owner: 10Hashar) [13:06:45] (03CR) 10jerkins-bot: [V: 04-1] 0.8: add new prometheus metrics [software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [13:07:52] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:13:17] akosiaris: \o/ indeed! now to actually enable upload only for that profile [13:18:21] (03PS1) 10Dzahn: ATS: add planet.wikimedia.org to also map to its backend [puppet] - 10https://gerrit.wikimedia.org/r/599323 [13:19:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/599293 (owner: 10JMeybohm) [13:21:55] akosiaris: thank you ;) [13:24:10] (03CR) 10Ema: "Tests are green, jenkins is blue." 
[software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [13:30:04] (03PS1) 10Andrew Bogott: Prepare cloudservices2002-dev for debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/599324 (https://phabricator.wikimedia.org/T253780) [13:31:30] (03CR) 10Elukey: [C: 03+1] 0.8: add new prometheus metrics [software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [13:31:48] (03PS3) 10JMeybohm: charts: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [13:32:52] (03CR) 10jerkins-bot: [V: 04-1] 0.8: add new prometheus metrics [software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [13:33:26] (03CR) 10Andrew Bogott: [C: 03+2] M5 grants: allow designate access via ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/599137 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [13:33:47] (03CR) 10Andrew Bogott: [C: 03+2] "These rules have now been applied to m5 on the cli." 
[puppet] - 10https://gerrit.wikimedia.org/r/599137 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [13:34:34] (03CR) 10Andrew Bogott: [C: 03+2] Prepare cloudservices2002-dev for debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/599324 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [13:35:43] (03PS12) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [13:35:58] (03PS2) 10JMeybohm: cxserver: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) [13:36:49] !log Restarting CI Jenkins for plugin rollback [13:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:35] (03PS3) 10Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) [13:39:15] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [13:39:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:39:57] (03PS2) 10JMeybohm: wikifeeds: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598774 (https://phabricator.wikimedia.org/T253396) [13:40:45] (03PS13) 10Filippo Giunchedi: prometheus: enable Thanos upload for Prometheus k8s-staging [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) [13:41:02] restarting jenkins [13:41:04] (03PS2) 10JMeybohm: chromium-render: Update to v0.2 helpers [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) [13:41:21] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:44:17] PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% [13:45:47] (03PS4) 10JMeybohm: charts: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [13:46:32] (03CR) 10Ema: "recheck" [software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [13:46:35] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC's happy now https://puppet-compiler.wmflabs.org/compiler1003/22843/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/598712 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:46:42] (03PS5) 10JMeybohm: charts: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [13:47:26] (03PS3) 10JMeybohm: cxserver: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) [13:49:45] !log roll-restart prometheus k8s-staging to enable thanos upload - T252186 [13:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:50] T252186: Deploy Thanos (Prometheus long-term storage) stateful components - https://phabricator.wikimedia.org/T252186 [13:49:50] (03CR) 10Ema: [C: 03+2] 0.8: add new prometheus metrics [software/atskafka] - 10https://gerrit.wikimedia.org/r/599322 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [13:54:03] (03PS3) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [13:54:05] (03PS1) 10Alexandros Kosiaris: Reorder YAML Service definitions [deployment-charts] - 
10https://gerrit.wikimedia.org/r/599327 [13:54:07] (03PS1) 10Alexandros Kosiaris: Probes: If guard them in all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599328 [13:54:10] (03PS1) 10Alexandros Kosiaris: eventstreams: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599329 [13:54:11] (03PS1) 10Alexandros Kosiaris: debug: Don't pass nodePort: null across all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/599330 [13:54:13] (03PS1) 10Alexandros Kosiaris: all charts: Only declare nodePort if specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599331 [13:54:16] (03PS1) 10Alexandros Kosiaris: all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 [13:54:18] (03PS1) 10Alexandros Kosiaris: chromium-render: Move ports to the debug pattern [deployment-charts] - 10https://gerrit.wikimedia.org/r/599333 [13:54:20] (03PS1) 10Alexandros Kosiaris: eventgate: Use gotemplate comment syntax [deployment-charts] - 10https://gerrit.wikimedia.org/r/599334 [13:54:22] (03PS1) 10Alexandros Kosiaris: zotero: ifguard deployment annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/599335 [13:54:24] (03PS1) 10Alexandros Kosiaris: eventgate: Port tls.upstream_timeout in values.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/599336 [13:54:26] (03PS1) 10Alexandros Kosiaris: zotero: if guard the volumes stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/599337 [13:54:27] (03PS1) 10Alexandros Kosiaris: blubberoid: ifguard volumes stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/599338 [13:54:34] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install thanos-be100[123] - https://phabricator.wikimedia.org/T251618 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson name rack_name position asset_tag switchport thanos-be1001 A2 37 WMF4821 37 thanos-be1002 C2 32 WMF4820 31 thanos-be1003 C4 3 WMF... 
[13:54:39] (03PS1) 10Filippo Giunchedi: thanos: fix sidecar objstore file permissions [puppet] - 10https://gerrit.wikimedia.org/r/599339 [13:54:55] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Jclark-ctr) name rack_name position asset_tag switchport thanos-fe1001 A2 35 WMF5100 35 thanos-fe1002 A4 22 WMF5101 38 thanos-fe1003 C2 31 WMF5102 30 [13:55:15] (03CR) 10Ottomata: [C: 03+1] eventgate: Port tls.upstream_timeout in values.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/599336 (owner: 10Alexandros Kosiaris) [13:55:29] 10Operations, 10ops-eqiad, 10DC-Ops: (NEED BY: ASAP) rack/setup/install thanos-fe100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T251620 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [13:55:33] PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:18] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install thanos-be100[123] - https://phabricator.wikimedia.org/T251618 (10Jclark-ctr) [13:56:28] ^ ns-recursor0.openstack.codfw1dev.wikimediacloud.org. (can we add the hostname instead of the IP?) 
[13:56:40] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: fix sidecar objstore file permissions [puppet] - 10https://gerrit.wikimedia.org/r/599339 (owner: 10Filippo Giunchedi) [13:57:13] PROBLEM - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9907 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [13:57:25] restart is known ^ [13:58:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Jclark-ctr) [13:58:51] jouncebot: next [13:58:51] In 2 hour(s) and 1 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T1600) [13:59:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TDB) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Jclark-ctr) [14:01:04] !log atskafka 0.8 uploaded to buster-wikimedia T253551 [14:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:07] T253551: atskafka: expose rdkafka metrics to prometheus - https://phabricator.wikimedia.org/T253551 [14:05:35] (03PS1) 10Ema: prometheus: whitelist atskafka-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/599341 (https://phabricator.wikimedia.org/T253551) [14:07:01] PROBLEM - Prometheus prometheus1004/k8s-staging restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9907 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [14:07:09] PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - 
Packet loss = 100% [14:08:07] (03CR) 10Ppchelko: [C: 04-1] "If this is the absolutely only way, so be it. But this is a last resort in my opinion - having this in puppet repo will severely cripple o" [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:08:09] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: whitelist atskafka-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/599341 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [14:10:10] (03PS2) 10Ema: prometheus: whitelist atskafka-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/599341 (https://phabricator.wikimedia.org/T253551) [14:12:01] (03CR) 10Ema: [C: 03+2] prometheus: whitelist atskafka-specific metrics [puppet] - 10https://gerrit.wikimedia.org/r/599341 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [14:17:38] (03CR) 10Muehlenhoff: [C: 03+2] Add component/idp-test and enable on the staging IDPs [puppet] - 10https://gerrit.wikimedia.org/r/599306 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [14:18:05] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Jclark-ctr) corrected user name. 
Jcrespo confirmed able to log in [14:19:36] (03PS4) 10Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) [14:21:00] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10jcrespo) a:05Jclark-ctr→03jcrespo [14:21:17] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [14:21:29] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Jclark-ctr) fedex tracking says parts to arrive friday 5/29 @Marostegui would you want to do this tomorrow 3-4pm est. I would prefer... [14:24:01] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7291 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:24:03] (03PS1) 10Filippo Giunchedi: prometheus: move analytics to profile [puppet] - 10https://gerrit.wikimedia.org/r/599342 (https://phabricator.wikimedia.org/T252186) [14:25:13] (03PS1) 10Kormat: transfer.py: Enforce mariadb version match for xtrabackup. [puppet] - 10https://gerrit.wikimedia.org/r/599343 [14:25:51] (03PS2) 10Kormat: transfer.py: Enforce mariadb version match for xtrabackup. 
[puppet] - 10https://gerrit.wikimedia.org/r/599343 [14:27:51] hello memcached [14:28:35] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5519 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:29:31] RECOVERY - Prometheus prometheus1004/k8s-staging restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [14:29:47] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 573.9 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:29:56] (will continue in #sre) [14:30:02] (03PS1) 10Hoo man: Enable propagateChangeVisibility for testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599344 [14:30:35] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:59] (03CR) 10Alexandros Kosiaris: "> 1. We enable HTTP API on k8s" [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:32:22] (03CR) 10Daniel Kinzler: [C: 03+1] "I agree with the intent as stated in the commit message. 
These fields need to be dropped, they are going to be removed from the production" [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) (owner: 10Bstorm) [14:33:08] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:27] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10Jclark-ctr) Just delivered to storage room [14:33:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of comments, otherwise LGTM" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [14:34:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] chromium-render: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [14:34:57] PROBLEM - Host 208.80.153.78 is DOWN: PING CRITICAL - Packet loss = 100% [14:35:27] that's ns-recursor0.openstack.codfw1dev.wikimediacloud.org. [14:35:58] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10jbond) @HMarcus I set up a trial GSuite account today to try and get this working and i have finally managed to get my... 
[14:37:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] wikifeeds: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598774 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [14:37:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] cxserver: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [14:38:13] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10Jclark-ctr) @fgiunchedi 2 ssd drives installed in each host [14:39:31] !log milimetric@deploy1001 Started deploy [analytics/refinery@203d182]: Three hotfixes [analytics/refinery@203d182] [14:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:24] (03CR) 10Elukey: [C: 03+2] Add page_restrictions to analytics sqooped tables [puppet] - 10https://gerrit.wikimedia.org/r/599112 (https://phabricator.wikimedia.org/T251749) (owner: 10Joal) [14:42:39] (03CR) 10Ppchelko: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:43:13] RECOVERY - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [14:46:58] (03PS6) 10JMeybohm: citoid: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [14:51:47] RECOVERY - Host 208.80.153.78 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [14:58:19] (03CR) 10Hnowlan: "> Patch Set 2: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:58:55] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:59:16] (03PS2) 10JMeybohm: mobileapps: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598779 (https://phabricator.wikimedia.org/T253396) [14:59:49] (03PS4) 10JMeybohm: recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) [14:59:51] (03CR) 10Muehlenhoff: [C: 03+1] profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [15:00:01] (03PS3) 10JMeybohm: chromium-render: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) [15:02:22] (03PS5) 10Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) [15:02:30] !log installing exim4 security updates on jessie (stretch/buster already fixed) [15:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:12] 10Operations, 
10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10fgiunchedi) @Jclark-ctr thank you! I'll add them to the LVM and resolve the task [15:04:10] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [15:05:30] !log milimetric@deploy1001 Finished deploy [analytics/refinery@203d182]: Three hotfixes [analytics/refinery@203d182] (duration: 25m 59s) [15:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi All done, the procedure is the same as https://phabricator.wikimedia.org/T251622#6148780 [15:09:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] citoid: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [15:09:56] (03PS2) 10Elukey: Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) [15:11:45] (03CR) 10jerkins-bot: [V: 04-1] Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:16:05] (03PS3) 10Elukey: Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) [15:16:19] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/599354 (https://phabricator.wikimedia.org/T253780) [15:16:49] (03PS2) 10Ssingh: wikidough: allow traffic to tcp/443 (DoH port) [puppet] - 
10https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) [15:17:26] (03CR) 10jerkins-bot: [V: 04-1] Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:17:28] (03CR) 10jerkins-bot: [V: 04-1] labs-ip-alias-dump.py: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/599354 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [15:18:48] Retrying fetcher due to error (2/4): Errno::EADDRNOTAVAIL Failed to open TCP connection to index.rubygems.org:443 [15:18:50] lovely [15:19:06] (03PS2) 10Filippo Giunchedi: prometheus: move analytics to profile [puppet] - 10https://gerrit.wikimedia.org/r/599342 (https://phabricator.wikimedia.org/T252186) [15:19:14] 10Operations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Updated java security policy in OpenJDK 8 u252 - https://phabricator.wikimedia.org/T251493 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [15:19:18] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:19:36] 10Operations, 10Discovery-Search, 10User-MoritzMuehlenhoff: Also use java::security on elasticsearch/relforge - https://phabricator.wikimedia.org/T251540 (10MoritzMuehlenhoff) [15:19:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Johan) @Jclark-ctr @Marostegui Does this mean you're not doing the 05:00 UTC window tomorrow? We've been informing the communities abo... [15:20:30] 10Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10MoritzMuehlenhoff) 05Open→03Stalled a:05MoritzMuehlenhoff→03None Setting this as stalled until we can use a version of Kibana with integrated SAML/SSO support. 
[15:20:38] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/599354 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [15:20:56] 10Operations, 10User-MoritzMuehlenhoff: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112 (10MoritzMuehlenhoff) [15:21:02] (03CR) 10Andrew Bogott: "retest" [puppet] - 10https://gerrit.wikimedia.org/r/599354 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [15:21:27] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10jcrespo) @Johan plan continues as usual- @Jclark-ctr information is unrelated to the user impacting maintenance. [15:21:49] 10Operations, 10User-MoritzMuehlenhoff: Track services without a native systemd unit - https://phabricator.wikimedia.org/T240843 (10MoritzMuehlenhoff) [15:22:13] (03CR) 10Andrew Bogott: [C: 03+2] labs-ip-alias-dump.py: upgrade to python3 [puppet] - 10https://gerrit.wikimedia.org/r/599354 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [15:22:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10Johan) OK, great, thanks. [15:22:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (10jcrespo) I talked to @Jclark-ctr on IRC, hw replacement will likely happen on Tuesday next week. Sw emergency maintenance (read only)... 
[15:23:48] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22856/" [puppet] - 10https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:23:51] (03CR) 10Ssingh: [C: 03+2] wikidough: allow traffic to tcp/443 (DoH port) [puppet] - 10https://gerrit.wikimedia.org/r/599063 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:25:01] andrewbogott: elukey: ok to merge your changes? [15:25:12] sukhe: yes please [15:25:17] +1 [15:25:53] 10Operations, 10serviceops, 10Kubernetes, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. - https://phabricator.wikimedia.org/T236017 (10thcipriani) >>! In T236017#6171471, @Joe wrote: > Picking this up again - we already migrated the CDN to use https - do we need to do something for CI? Fr... [15:26:29] "Aborting merge.problems merging production" oops [15:26:45] (03CR) 10Elukey: "Pcc looks good, the -1 from jenkins seems unrelated:" [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:27:03] can you try perhaps? [15:27:08] (03PS1) 10Muehlenhoff: Enable managed adduser config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/599358 (https://phabricator.wikimedia.org/T235162) [15:27:10] (03PS1) 10Muehlenhoff: Enable managed adduser config fleet-wide [puppet] - 10https://gerrit.wikimedia.org/r/599359 (https://phabricator.wikimedia.org/T235162) [15:27:12] sukhe: I'm trying [15:27:19] seems ok [15:27:30] ha ok great [15:27:46] sukhe: I'd open a task if you have the logs [15:27:55] looks really weird [15:28:02] elukey: just this error message and nothing else [15:28:19] sukhe: from puppet-merge? [15:28:24] yep [15:28:26] wow [15:28:41] maybe it is a corner case in the script? 
[15:28:52] I tried "yes" as the response, "y", and even "multiple" [15:29:00] Elukey: Add page_restrictions to analytics sqooped tables (47327234b9) [15:29:03] WARNING: Revision range includes commits from multiple committers! [15:29:05] Merge these changes? (multiple/no)? yes [15:29:08] Aborting merge. [15:29:10] problems merging production [15:29:35] ah ok this expected if you don't put "multiple" [15:29:51] ha I see. this is the first time I ran into this :) [15:30:24] but you put also multiple in another run IIUC ? [15:30:30] that should have worked [15:30:54] "yes" doesn't work since you might type it by accident and merge something unwanted from other people etc.. [15:31:11] oh did you mean in the response? yes, I did. I ran "puppet-merge" and then entered "multiple" and it gave me the same error [15:31:16] I thought you meant as an argument to puppet-merge [15:31:22] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10Milimetric) p:05Medium→03High [15:31:46] sukhe: ah nono in prompt! [15:31:54] (03CR) 10Ottomata: [C: 03+1] Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:31:54] yep, I did put "multiple" [15:32:03] then sorry it must be a corner case [15:32:08] never seen it [15:33:22] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:33:47] elukey: ok! I thought I did something wrong :D [15:34:19] sukhe: nono it's me not getting the problem that you reported, faulty brain, be patient please :D [15:35:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM. Hadoop also uses java::security, but can be switched over in a followup patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:36:04] 10Operations, 10Analytics, 10Traffic: Remove ganglia leftovers from ops/puppet - https://phabricator.wikimedia.org/T253555 (10Milimetric) p:05Low→03Medium a:03Ottomata [15:37:34] !log milimetric@deploy1001 Started deploy [analytics/refinery@203d182] (thin): Three hotfixes (THIN) [analytics/refinery@203d182] [15:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:43] !log milimetric@deploy1001 Finished deploy [analytics/refinery@203d182] (thin): Three hotfixes (THIN) [analytics/refinery@203d182] (duration: 00m 10s) [15:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:09] (03CR) 10Elukey: "> LGTM. Hadoop also uses java::security, but can be switched over in" [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:45:09] 10Operations, 10ops-codfw: Degraded RAID on restbase2009 - https://phabricator.wikimedia.org/T253715 (10Papaul) a:03hnowlan @hnowlan This server is out of warranty but we do hve some 1.6TB SSD's on site. I replaced disk 4 which was bad with one we have on site ` Version : 1.2 Creation Time : Fri Sep 13 10:... [15:45:11] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Jdforrester-WMF) [15:45:59] (03PS1) 10Cicalese: added MediaModeration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 [15:47:09] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [15:47:22] (03CR) 10Jforrester: [C: 04-2] "You can't write patches like this. It'll blow up on deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (owner: 10Cicalese) [15:50:30] (03PS2) 10Cicalese: added MediaModeration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) [15:51:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598779 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [15:52:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [15:52:57] (03CR) 10Alexandros Kosiaris: "@Tarrow, @WMDE-Leszek, any objections to this?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598055 (owner: 10JMeybohm) [15:53:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] _scaffold: add image_version to tls_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/598782 (owner: 10JMeybohm) [15:53:32] (03PS6) 10Giuseppe Lavagetto: profile::conftool::client: allow overriding the user root can access [puppet] - 10https://gerrit.wikimedia.org/r/598415 (https://phabricator.wikimedia.org/T97972) [15:53:37] (03PS2) 10JMeybohm: _scaffold: add image_version to tls_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/598782 [15:54:42] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: upgrade docker-ce not docker [puppet] - 10https://gerrit.wikimedia.org/r/599367 (https://phabricator.wikimedia.org/T250867) [15:55:13] (03CR) 10Bstorm: [C: 03+1] kubeadm: wmcs-k8s-node-upgrade.py: upgrade docker-ce not docker [puppet] - 10https://gerrit.wikimedia.org/r/599367 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:56:11] (03PS1) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/599368 (https://phabricator.wikimedia.org/T251466) 
[15:56:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: decom 36 old appservers in eqiad (onsite, dcops) - https://phabricator.wikimedia.org/T253856 (10wiki_willy) [15:56:48] 10Operations, 10netops: Faulty port cr2-eqord:xe-0/1/1 - https://phabricator.wikimedia.org/T252988 (10ayounsi) Note that cr2-eqord has been upgraded and rebooted. [15:56:53] (03CR) 10JMeybohm: [C: 03+2] _scaffold: add image_version to tls_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/598782 (owner: 10JMeybohm) [15:57:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: upgrade docker-ce not docker [puppet] - 10https://gerrit.wikimedia.org/r/599367 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [15:57:22] (03Merged) 10jenkins-bot: _scaffold: add image_version to tls_enabled fixture [deployment-charts] - 10https://gerrit.wikimedia.org/r/598782 (owner: 10JMeybohm) [15:58:04] (03PS3) 10Jforrester: Install MediaModeration extension - III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [15:58:06] (03PS1) 10Jforrester: Install MediaModeration extension - I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) [15:58:11] (03PS5) 10JMeybohm: recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) [15:58:13] (03PS1) 10Jforrester: Install MediaModeration extension - II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599370 (https://phabricator.wikimedia.org/T247943) [16:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:08] PROBLEM - DPKG on deneb is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:01:28] (03CR) 10Cwhite: "PCC checks out https://puppet-compiler.wmflabs.org/compiler1003/22859/" [puppet] - 10https://gerrit.wikimedia.org/r/599368 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:01:56] (03CR) 10Cicalese: "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [16:02:17] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/599342 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:02:19] (03CR) 10JMeybohm: [C: 03+2] recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:02:52] (03CR) 10Cwhite: [C: 03+1] grafana: provision Thanos datasource [puppet] - 10https://gerrit.wikimedia.org/r/599059 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [16:02:59] (03Merged) 10jenkins-bot: recommendation-api: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598780 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:03:24] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: add more force options to apt-get install [puppet] - 10https://gerrit.wikimedia.org/r/599371 (https://phabricator.wikimedia.org/T250867) [16:04:05] (03CR) 10Bstorm: [C: 03+1] kubeadm: wmcs-k8s-node-upgrade.py: add more force options to apt-get install [puppet] - 10https://gerrit.wikimedia.org/r/599371 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:05:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: add more force options to apt-get install [puppet] - 10https://gerrit.wikimedia.org/r/599371 
(https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:05:43] (03PS3) 10JMeybohm: mobileapps: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598779 (https://phabricator.wikimedia.org/T253396) [16:06:43] (03CR) 10Jforrester: [C: 04-2] "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599363 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [16:07:11] (03CR) 10Jforrester: [C: 04-2] "Wait for only production branches to be wmf.36+." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599369 (https://phabricator.wikimedia.org/T247943) (owner: 10Jforrester) [16:07:24] (03CR) 10Elukey: [C: 03+2] Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/598675 (https://phabricator.wikimedia.org/T253553) (owner: 10Elukey) [16:07:51] (03CR) 10JMeybohm: [C: 03+2] mobileapps: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598779 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:08:21] (03Merged) 10jenkins-bot: mobileapps: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598779 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:08:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Changes LGTM, minor comment about possible some stray helmfile.d changes." (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/598076 (owner: 10JMeybohm) [16:10:57] (03CR) 10Tarrow: "I'm not sure we have specific objections. 
I think y'all have a better idea of how we should properly do versioning of charts but I can try" [deployment-charts] - 10https://gerrit.wikimedia.org/r/598055 (owner: 10JMeybohm) [16:12:07] (03PS7) 10JMeybohm: citoid: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) [16:13:24] (03CR) 10JMeybohm: [C: 03+2] citoid: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:13:47] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: install mtail 3.0.0~rc35 from component in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/599368 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:13:51] (03Merged) 10jenkins-bot: citoid: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598760 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:15:05] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: provision Thanos datasource [puppet] - 10https://gerrit.wikimedia.org/r/599059 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [16:15:09] (03PS4) 10JMeybohm: chromium-render: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) [16:15:13] (03CR) 10JMeybohm: [C: 03+2] chromium-render: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:16:09] (03Merged) 10jenkins-bot: chromium-render: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598777 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:16:36] (03PS4) 10JMeybohm: cxserver: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) [16:17:51] (03CR) 10JMeybohm: [C: 03+2] cxserver: Update to v0.2 helpers 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:18:03] (03Merged) 10jenkins-bot: cxserver: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598766 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:18:28] (03PS3) 10JMeybohm: wikifeeds: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598774 (https://phabricator.wikimedia.org/T253396) [16:19:30] (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598774 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:20:04] (03Merged) 10jenkins-bot: wikifeeds: Update to v0.2 helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/598774 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [16:22:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10Jclark-ctr) [16:22:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10Jclark-ctr) will update ticket when i get home and close procurement [16:23:50] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade: don't skip the pause stage [puppet] - 10https://gerrit.wikimedia.org/r/599374 (https://phabricator.wikimedia.org/T250867) [16:24:37] (03PS1) 10Elukey: Revert "Swap profile::java::analytics with profile::java" [puppet] - 10https://gerrit.wikimedia.org/r/599375 [16:25:19] (03CR) 10Elukey: [C: 03+2] "The defaults for the profile::hadoop::common seem not right, will revert and work on it again.." 
[puppet] - 10https://gerrit.wikimedia.org/r/599375 (owner: 10Elukey) [16:28:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade: don't skip the pause stage [puppet] - 10https://gerrit.wikimedia.org/r/599374 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:28:18] PROBLEM - Host mw2180.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:29:56] OK, I'm going to try to unblock the train. [16:31:56] 10Operations, 10netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down (May 2020) - https://phabricator.wikimedia.org/T253610 (10ayounsi) [16:32:58] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.34/extensions/Wikibase: T253804 Use ThrowingEntityTermStoreWriter when writers shouldn't be called (duration: 01m 15s) [16:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:02] T253804: Exception from TermStoreWriterFactory : Local entity source does not have items. - https://phabricator.wikimedia.org/T253804 [16:33:29] twentyafterfour: OK, train should be unblocked again. Should we roll forward to group1? [16:33:39] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: wmcs-k8s-node-upgrade.py: force certificate renewal with kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/599380 (https://phabricator.wikimedia.org/T250867) [16:34:04] PROBLEM - Host mw2183.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:34:39] (03CR) 10Bstorm: [C: 03+1] "Looks like the thing they said to do!" 
[puppet] - 10https://gerrit.wikimedia.org/r/599380 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:35:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: wmcs-k8s-node-upgrade.py: force certificate renewal with kubeadm [puppet] - 10https://gerrit.wikimedia.org/r/599380 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [16:37:45] 10Operations, 10netops: Zayo link eqiad-codfw (OGYX/120003//ZYO) down (May 2020) - https://phabricator.wikimedia.org/T253610 (10ayounsi) 05Open→03Resolved a:03ayounsi Link is back up. > We found a defective fiber between nodes. > Currently seeing service restored. > Please advise the ONCC of you sta... [16:38:12] mdholloway: Hi! [16:38:29] joal: hello! [16:38:30] mdholloway: I'm joal from analytics and I see you running queries on the hadoop cluster :) [16:38:41] yes, everything ok? [16:39:24] mdholloway: it is :) I have an improvement for your requests - Your queries are about text/content (not images/media), therefore addin [16:39:41] mdholloway: adding webrequest_source = 'text' will reduce the data you'll need to read [16:39:56] RECOVERY - Host mw2183.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.75 ms [16:40:00] mdholloway: more than 1/3 of the data is not to be read [16:40:02] joal: ah, cool, thanks :) [16:40:07] will do! [16:40:10] RECOVERY - Host mw2180.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.80 ms [16:40:39] np mdholloway - When I see queries running, I have a quick look every now and then and ping people about improvement - nothing personal ;) [16:41:23] thanks mdholloway :) [16:42:23] joal: thanks for the tip!
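Editor's note on joal's tip above: a minimal sketch of what "adding webrequest_source = 'text'" could look like when a query is assembled in code. The table name `wmf.webrequest`, the `year`/`month`/`day` partition columns, and the helper `build_webrequest_query` are assumptions for illustration; only the `webrequest_source = 'text'` predicate comes from the conversation.

```python
def build_webrequest_query(columns, year, month, day,
                           source="text", table="wmf.webrequest"):
    """Build a SQL string that always filters on the webrequest_source partition.

    Hypothetical helper: the table name and the year/month/day columns are
    assumptions, not taken from the log.
    """
    # Refuse anything that is not a plain partition value (no quotes, spaces).
    if not source.isalnum():
        raise ValueError("source must be a plain partition value")

    predicates = [
        f"webrequest_source = '{source}'",  # the filter joal recommends
        f"year = {int(year)}",
        f"month = {int(month)}",
        f"day = {int(day)}",
    ]
    return (
        f"SELECT {', '.join(columns)}\n"
        f"FROM {table}\n"
        f"WHERE " + "\n  AND ".join(predicates)
    )


if __name__ == "__main__":
    print(build_webrequest_query(["uri_host", "uri_path"], 2020, 5, 28))
```

Putting the partition predicate first in a shared helper is only a convenience; the point from the log is that text/content queries which omit it read data they do not need.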
[16:43:28] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 54 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:45:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 54 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:46:12] (03PS1) 10Hnowlan: changeprop-jobqueue: enable all high-traffic jobs, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/599383 (https://phabricator.wikimedia.org/T220399) [16:48:16] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: enable all high-traffic jobs, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/599383 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:48:55] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @jbond thanks for following up, that service account should already have access to the specific scopes per this... [16:55:45] James_F: yeah I think so [16:56:01] Cool. 
[16:57:04] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 46 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:57:47] (03PS1) 1020after4: group1 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599386 [16:57:49] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599386 (owner: 1020after4) [16:59:19] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599386 (owner: 1020after4) [17:00:04] halfak and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T1700). [17:00:44] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 47 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:02:42] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.34 refs T253022 [17:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:46] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [17:03:49] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.34 refs T253022 (duration: 01m 06s) [17:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:10] (03PS1) 10Elukey: Revert "Revert "Swap profile::java::analytics with profile::java"" [puppet] - 10https://gerrit.wikimedia.org/r/599389 [17:08:34] (03PS1) 10Ssingh: dnsdist: allow DoT (DNS-over-TLS) 
[puppet] - 10https://gerrit.wikimedia.org/r/599390 (https://phabricator.wikimedia.org/T252132) [17:11:50] twentyafterfour: Hurrah. [17:12:38] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/22861/" [puppet] - 10https://gerrit.wikimedia.org/r/599390 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [17:12:51] dashboards all look clear [17:12:58] 10Operations: Failed to SSH - https://phabricator.wikimedia.org/T253820 (10jwang) 05Open→03Resolved Thank you @ayounsi. With your recommended setting, I can ssh now. So close this ticket. [17:16:50] +1 [17:19:33] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738 (10Jgreen) 05Open→03Declined Closing this as wontfix because it appears to be a larger project than we want to take on due to prometheus's design limitations--both in terms of the d... [17:19:36] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562 (10Jgreen) [17:25:17] (03PS2) 10Elukey: Swap profile::java::analytics with profile::java [puppet] - 10https://gerrit.wikimedia.org/r/599389 [17:29:47] (03CR) 10Elukey: [C: 04-1] "Doesn't work: https://puppet-compiler.wmflabs.org/compiler1002/22862/stat1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/599389 (owner: 10Elukey) [17:32:14] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:50] erm [17:43:36] o_o [17:53:35] (03CR) 10Ryan Kemper: [C: 03+2] debian/changelog: Make spaces match previous entry [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/599140 (owner: 10Ryan Kemper) [17:54:22] (03PS1) 10CRusnov: rotatedump: Call correct dumpbackup script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/599396 (https://phabricator.wikimedia.org/T253833) [17:54:37] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/599396 (https://phabricator.wikimedia.org/T253833) (owner: 10CRusnov) [17:56:22] (03CR) 10CRusnov: [C: 03+2] rotatedump: Call correct dumpbackup script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/599396 (https://phabricator.wikimedia.org/T253833) (owner: 10CRusnov) [17:59:56] (03PS1) 10Mstyles: Auth for SeFC [puppet] - 10https://gerrit.wikimedia.org/r/599399 (https://phabricator.wikimedia.org/T251500) [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:00:31] (03CR) 10jerkins-bot: [V: 04-1] Auth for SeFC [puppet] - 10https://gerrit.wikimedia.org/r/599399 (https://phabricator.wikimedia.org/T251500) (owner: 10Mstyles) [18:04:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:42] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:10] (03PS5) 10EBernhardson: Rename role::wdqs to role::wdqs::public [puppet] - 10https://gerrit.wikimedia.org/r/598884 [18:20:06] (03PS2) 10EBernhardson: query_service: Move shared config into common file [puppet] - 10https://gerrit.wikimedia.org/r/599145 [18:30:17] (03PS2) 10EBernhardson: WIP: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [18:34:55] (03PS3) 10EBernhardson: WIP: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [18:38:50] (03PS4) 10EBernhardson: WIP: Consolidate query_service profile duplication [puppet] - 10https://gerrit.wikimedia.org/r/599146 [18:51:15] (03CR) 10Herron: [C: 03+1] hiera: install mtail 3.0.0~rc35 from component in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/599368 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [18:53:02] (03CR) 10Cwhite: [C: 03+2] hiera: install mtail 3.0.0~rc35 from component in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/599368 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [18:59:55] * James_F is ready to train. [19:00:04] twentyafterfour and James_F: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T1900). 
[19:00:55] !log milimetric@deploy1001 Started deploy [analytics/refinery@f6d73c8]: Hotfix #2 today: forgot jars [analytics/refinery@f6d73c8] [19:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:15] !log restart varnishmtail and atsmtail on cp5001.eqsin.wmnet [19:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:35] James_F: here [19:07:08] Things look quiet on group1. [19:07:10] Let's roll? [19:08:08] James_F: just as soon as I can get logged in to deploy1001 [19:08:54] (03PS1) 1020after4: all wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599406 [19:08:56] (03CR) 1020after4: [C: 03+2] all wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599406 (owner: 1020after4) [19:10:11] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.34 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599406 (owner: 1020after4) [19:14:37] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.34 refs T253022 [19:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:41] T253022: 1.35.0-wmf.34 deployment blockers - https://phabricator.wikimedia.org/T253022 [19:15:23] twentyafterfour: Looks OK to me. [19:15:49] so far so good [19:15:53] https://phabricator.wikimedia.org/T253905 [19:15:56] doesn't look good to me [19:16:29] eww [19:17:00] James_F: you've broken something [19:17:03] confirm, the language selector is eating the globe [19:17:09] Krinkle: I've seen that from before a few times. [19:17:14] Not sure if it's new in wmf.34. [19:17:19] Sidebar CSS is b0rked by the looks of it [19:17:25] tabs too [19:17:28] Looks fine to me when I load. [19:17:28] someone reported an issue in the general chat on telegram too [19:17:30] Mentioned in here too seemingly [19:17:32] Cache issue? [19:17:34] Meh. [19:17:39] Does it fix on reload? 
[19:17:39] oh, Krinkle's task is it. [19:17:41] so we roll back? [19:17:43] Lemme see. [19:17:44] swat issue or some other task [19:17:44] Let's rollback before it escalates? [19:17:49] !log milimetric@deploy1001 Finished deploy [analytics/refinery@f6d73c8]: Hotfix #2 today: forgot jars [analytics/refinery@f6d73c8] (duration: 16m 54s) [19:17:50] Fixed on reload [19:17:50] Rollback might break it worse? [19:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:08] I don't see it either tbh [19:18:12] fixed now [19:18:15] (for me) [19:18:19] Cache issue I guess. Thanks all. :) [19:18:20] and me [19:18:21] twentyafterfour: Your call. [19:18:29] apparently was temporary? [19:18:30] still broken for me [19:18:38] I'm not seeing any issues logged out either. [19:18:38] rollback should be fine. The new HTML rolls out slowly through the CDN. [19:18:43] and now fixed [19:18:46] so majority is still fine [19:18:47] heh [19:19:10] Sounds like waiting is a good idea [19:19:29] reverting [19:19:34] Aha, loading logged out on non-refreshed pages looks broken, but as soon as I trigger a render (logged in page view) it's broken once and then fixed thereafter. [19:19:52] * James_F sighs.
[19:19:56] This is related to https://phabricator.wikimedia.org/T253691 (I've pinged the team) [19:20:00] !log milimetric@deploy1001 Started deploy [analytics/refinery@f6d73c8] (thin): Hotfix #2 today (thin): forgot jars [analytics/refinery@f6d73c8] [19:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:10] !log milimetric@deploy1001 Finished deploy [analytics/refinery@f6d73c8] (thin): Hotfix #2 today (thin): forgot jars [analytics/refinery@f6d73c8] (duration: 00m 09s) [19:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:17] sorry not reverting but syncing group2 back to wmf.32 [19:20:39] !log group2 back to wmf.32 due to T253905 [19:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:43] T253905: Vector page layout corrupted on enwiki - https://phabricator.wikimedia.org/T253905 [19:20:45] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: roll back the train due to T253905 [19:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:08] ok that was fun [19:21:14] :) [19:21:28] indeed it was :) [19:21:29] * James_F sighs at Vector. [19:21:57] 10Operations, 10Privacy Engineering, 10Research, 10Traffic, 10Privacy: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (10JFishback_WMF) Looks good to me WRT #privacy. I think @Reedy also wanted to take a look at this for the... 
[19:21:59] (03PS1) 1020after4: group2 wikis to 1.35.0-wmf.32 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599411 [19:22:01] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.35.0-wmf.32 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599411 (owner: 1020after4) [19:22:12] ^ this is just to get the repo to match reality [19:22:33] (since I bypassed CR for the rollback) [19:22:40] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:22:47] uh [19:22:49] (03Merged) 10jenkins-bot: group2 wikis to 1.35.0-wmf.32 refs T253022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599411 (owner: 1020after4) [19:22:53] Don't worry about scandium. [19:22:56] It's a testing server. [19:23:01] No production traffic hits it. [19:23:42] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10JGulingan) Hi All, Just to clarify, IT does not manage donate@. Can you clarify if this points to Fundraising's zendesk email? It would look similar how techsupport@ is an alias for our zendesk ticketing... [19:24:50] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:25:37] we should possibly disable the opcache health alerts on nonproduction servers, "no production traffic" means 1) it doesn't super matter and 2) it's a lot more prone to firing there in the first place [19:26:10] cdanis: Yeah, maybe. Though it could be a sign that the opcache flush has stopped working, ahead of it hitting the problem in the real world? [19:26:39] leave the fullness alert, then, but remove the hit rate one? [19:26:41] cdanis: are the cronjob opcache fixers not enabled on these two? [19:26:49] cdanis: That could work. 
[19:26:55] oh this is cache hit rate [19:26:56] right [19:27:02] not cache size [19:27:04] Krinkle: Some of the deploys to scandium aren't by scap but manual git pulls by the Parsing team. [19:27:10] yeah, and it's hit rate for a cache that's populated on-demand [19:27:21] it's not surprising it fires on mwdebug*/scandium/etc all the time [19:27:31] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10MBeat33) donate@ is an address that is set up as a support address in Zendesk, so anything sent to it goes to the DS team [19:27:41] yeah, the traffic isn't regular so cache misses are almost all the traffic there is. [19:30:36] PROBLEM - Long running screen/tmux on mw1320 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 24806, 1740804s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:30:38] (03PS5) 10BPirkle: Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [19:31:54] (03CR) 10Ppchelko: [C: 03+1] Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [19:35:05] Krinkle: follow up question on https://phabricator.wikimedia.org/T253905 [19:35:23] i dont really understand how this could have happened. I've checked out master and that CSS rule shouldn't ever show up in legacy mode. [19:39:16] Jdlrobson: CSS is cached too [19:39:20] replied on task [19:39:30] the css is different though? [19:39:34] do you mean local storage? [19:39:36] no [19:39:44] i thought css had a cache life of 5 mins? [19:40:41] startup module has a consistently forward-moving guruantee (which does have edge cases, but in general not something devs should care about). 
[19:40:44] for CSS that's not the cse [19:40:46] case* [19:40:59] the stylesheet is indeed generally blindly cached for 5-10 min [19:41:26] but you can't generate new HTML that is incompatible with current/previous styles. [19:41:29] ok so to clarify i understand correctly, this would resolve itself in 5-10 mins, but obviously that's not acceptable [19:41:30] Can be removed next week though [19:41:53] I'm not entirely sure, but yes, if that's the only factor, then that sounds right. [19:41:55] so likely i need to make a hot fix for the branch to remove that rule (it's no longer needed) [19:41:58] Yeah, a wmf.34-only backport should be fine. [19:42:11] k i'll get that to you by end of day [19:42:25] Jdlrobson: there is also the patch to remove that class which in itself is also part of the regression [19:42:34] (adding it to elements that don't need/want it) [19:42:49] did that land in master? [19:43:03] which class? [19:43:29] vector-menu is applied to all classes. The issue is the layout rules in cached CSS were moved [19:43:33] class=vector-menu from legacy personal being added to all menus/portlets/tabs which was an unintended side effect I mentioned last week [19:43:45] if you are saying that CSS is cached indefinitely however, im gonna have to get creative about how to fix that [19:43:55] afaik you or Volker wrote a patch to undo that, limiting it to p-personal where it is needed. [19:43:59] vector-menu class is added intentionally to all menus [19:44:25] that class name though was previously only used for p-personal, that seems risky to re-use [19:44:45] it's a new class that was only recently added [19:44:50] I don't know why that previously had pos-absolute on it though [19:44:56] you might be confusing it with vectorMenu ?
[19:44:56] https://github.com/wikimedia/mediawiki-skins-Vector/blob/master/resources/skins.vector.styles/layout.less#L135 [19:44:58] (in legacy) [19:45:05] Jdlrobson: almost, I confused it with class=menu [19:45:07] ^ the issue is the absolute positioning was moved out of the component to here [19:45:11] that one was fixed indeed. [19:45:21] as it obviously shouldn't apply to everything [19:45:56] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:46:45] https://github.com/wikimedia/mediawiki-skins-Vector/commit/4dfe4a97c93759b439c27fe61d72e4afba0a6ace#diff-03b88f64ef575795a765ab9077b07a4a [19:46:48] SO I guess i need to patch 1.35.0-wmf.33 so it doesn't appear in https://en.wikipedia.org/w/load.php?modules=skins.vector.styles.legacy&debug=true&only=styles [19:47:03] then once that is done, we can wait 10 mins then push forward with deployment? or is that not acceptable either? [19:47:24] I think the maxage is 30min on this one before required http revalidation [19:47:36] but yeah, removing all vector-menu styles ahead of it should do the trick [19:48:38] so it was temporarily added to p-personal as vector-menu-default and then changed to vector-menu and then removed. [19:48:45] and the branch was cut after that second step [19:50:58] * Krinkle just realised we don't have a WMF deployed codesearch that reflects the latest wmf branch. [19:53:04] Krinkle: https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/599413 HOTFIX: Do not apply p-personal absolute positioning to all menus [19:53:14] so i think we need to hot fix 1.35.0-wmf.32 [19:53:20] let the caches update [19:53:33] then 1.35.0-wmf.33 should roll out safely [19:53:57] if the maxage is 30 mins hopefully can be deployed 30 mins after [19:54:00] sound good? 
[19:54:12] sounds good [19:54:24] (03PS1) 10Ottomata: Remove ganglia configs from cdh and jmxtrans modules [puppet] - 10https://gerrit.wikimedia.org/r/599415 (https://phabricator.wikimedia.org/T253555) [19:55:26] Jdlrobson: this essentially backports part of this one, right? https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/589395/ [19:55:54] Krinkle: yup https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/589395/39/resources/skins.vector.styles/Menu.less [19:56:24] im just checking if the padding-left is a problem [19:56:29] actually I'll include that just to be safe [19:56:37] (wmf.34 btw) [19:56:42] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/22872/" [puppet] - 10https://gerrit.wikimedia.org/r/599415 (https://phabricator.wikimedia.org/T253555) (owner: 10Ottomata) [19:56:43] wmf.33 didn't exist [19:58:20] @Krinkle when you have a minute can you take a look at the patches for T235944? I found that I enjoy working to clean up ResourceLoader modules [19:58:21] T235944: Reduce number of modules registered by MassMessage - https://phabricator.wikimedia.org/T235944 [19:58:38] (03PS4) 10Ppchelko: Enable kafka purges production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599150 (https://phabricator.wikimedia.org/T250781) [20:00:23] Krinkle: ok new patch up [20:00:42] DannyS712: I don't know much about that extension. I assume review for that will require knowledge of the extension more than RL/Perf in general, so would rather not dive too much into those - as much as I like that you're working on that!
[20:01:46] (03PS1) 10Rush: peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599417 (https://phabricator.wikimedia.org/T251784) [20:02:39] (03CR) 10jerkins-bot: [V: 04-1] peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599417 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [20:04:10] (03PS2) 10Rush: peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599417 (https://phabricator.wikimedia.org/T251784) [20:04:53] (03CR) 10jerkins-bot: [V: 04-1] peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599417 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [20:07:00] (03PS2) 10Rush: peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599103 (https://phabricator.wikimedia.org/T251784) [20:08:52] Krinkle, Jdlrobson: OK, change looks reasonable; want me to deploy it? [20:08:56] (03PS3) 10Rush: peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599103 (https://phabricator.wikimedia.org/T251784) [20:08:58] (03Abandoned) 10Rush: peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599417 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [20:09:42] James_F: yes please pending Krinkle 's okay [20:10:04] (03CR) 10Rush: [C: 03+2] peek: update dependencies [puppet] - 10https://gerrit.wikimedia.org/r/599103 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [20:11:09] OK [20:11:21] Ack [20:11:30] !log restart ncredirmtail on ncredir5001 [20:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:09] Krinkle: thanks for the lesson on caching. I thought I knew it all but apparently not. 
:) [20:13:08] (03PS1) 10Rush: peek: fix require reference for /etc/peek [puppet] - 10https://gerrit.wikimedia.org/r/599424 (https://phabricator.wikimedia.org/T253901) [20:13:45] CI says no [20:14:01] 21:11:59 resources/skins.vector.styles/Menu.less [20:14:02] 21:11:59 14✖ Unexpected empty line before rule rule-empty-line-before [20:14:15] Already pushed a fix. [20:14:15] (03CR) 10Rush: [C: 03+2] peek: fix require reference for /etc/peek [puppet] - 10https://gerrit.wikimedia.org/r/599424 (https://phabricator.wikimedia.org/T253901) (owner: 10Rush) [20:14:21] thanks James_F ! [20:14:41] CI has opinions [20:15:03] sometimes it's opinions we tell it to have [20:15:08] Mostly. [20:15:10] other times it does what it wants [20:15:14] It's like a child [20:15:23] Useful when it behaves. Not so much when it has a tantrum [20:16:18] (03PS1) 10QChris: Allow “Gerrit Managers” to import history [software/transferpy] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/599427 [20:16:21] (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [software/transferpy] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/599427 (owner: 10QChris) [20:16:59] (03PS1) 10QChris: Import done. Revoke import grants [software/transferpy] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/599428 [20:17:02] (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [software/transferpy] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/599428 (owner: 10QChris) [20:17:17] * James_F idly wonders if the group2 roll/rollback/roll will mean we smear the INSERT flood from RL out over longer and so have less of an impact. [20:17:26] How long is the table cache kept around? 
[20:22:01] (03PS1) 10Rush: peek: fix paths to conf files for cron [puppet] - 10https://gerrit.wikimedia.org/r/599430 (https://phabricator.wikimedia.org/T253901) [20:22:19] !log restart varnishmtail and atsmtail eqsin [20:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:05] (03CR) 10Rush: [C: 03+2] peek: fix paths to conf files for cron [puppet] - 10https://gerrit.wikimedia.org/r/599430 (https://phabricator.wikimedia.org/T253901) (owner: 10Rush) [20:29:38] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [20:32:31] Finally. [20:34:32] Jdlrobson, Krinkle: Live on mwdebug1001 if you want to check. [20:36:19] Looks OK to me. [20:37:30] OK, deploying now. [20:37:43] James_F: yup looks good here too [20:38:10] OK, so, after this we should wait for … 10 mins? 20? and then re-roll the train? [20:38:35] (03PS1) 10Rush: peek: add python3-asana dependency [puppet] - 10https://gerrit.wikimedia.org/r/599433 (https://phabricator.wikimedia.org/T253901) [20:38:40] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.32/skins/Vector/resources/skins.vector.styles: T253905 HOTFIX: Do not apply p-personal absolute positioning to all menus (duration: 01m 07s) [20:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:44] T253905: Vector page layout corrupted on cached pages - https://phabricator.wikimedia.org/T253905 [20:44:29] Hmm. There's no nice way to roll the train to only a debug server. [20:45:12] pulling the wikiversions change to mwdebug should do the trick. [20:45:37] but it won't help see the cache issue since that bypasses caching [20:45:41] Need to bump both wikiversions.json and wikiversions.php, though. [20:45:58] And .php is normally generated as part of the scap command.
[20:46:14] The X-Debug header skips CDNs, of course. [20:46:27] checking now to see if the new styles are applied [20:46:36] on an anon session with old cache [20:47:50] also comparing between enwiki wmf.32 and hewiki wmf.34 visually [20:47:50] https://he.wikipedia.org/?uselang=en&safemode=on [20:47:56] (hewiki is group1) [20:49:08] adding vector-menu or removing it from either seems to cause no visual change [20:49:24] so.. yeah, I'd say roll again in 10min? [20:49:30] Kk [20:49:42] btw, given we rolled back so quickly - did we look at logstash yet for php stuff? [20:49:50] Yes. [20:49:56] V. quiet. [20:50:14] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 305 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:50:25] Timeouts went up but nothing else. [20:50:38] Krinkle: ^^ That you? [20:51:00] nope [20:51:06] * James_F re-pulls. [20:53:48] Reedy: possibly from the visibility changes? https://phabricator.wikimedia.org/T253922 [20:55:50] Possible [20:55:54] I'll have a look in a minute [20:56:02] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:56:08] Reedy: Want me to take your landing one and deploy it? [20:56:15] Please :) [20:56:18] Cool. [20:56:23] Now you're free to look at the next one. ;-) [20:57:51] James as backup makes train conducting too easy... I think I owe you a train James_F [20:58:37] (03CR) 10Herron: [C: 03+1] prometheus: move analytics to profile [puppet] - 10https://gerrit.wikimedia.org/r/599342 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [20:58:48] $callback = [ $this, 'passThrough' ]; [20:59:03] * James_F grins at twentyafterfour. [20:59:50] I guess this is what happens when we pass callbacks outside a class.. [21:00:11] Lovely. 
[21:00:52] I'm sure we'll come across some more crappy stuff like this [21:01:33] Yeah. [21:01:46] * James_F mumbles to himself about porting MW to a proper language. [21:03:56] https://gerrit.wikimedia.org/r/599441 [21:05:00] C+2'ed. [21:09:52] (03PS1) 10Jforrester: [Beta Cluster] Stop setting wgMFPhotoUploadEndpoint, ignored. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599443 [21:11:22] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Stop setting wgMFPhotoUploadEndpoint, ignored. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599443 (owner: 10Jforrester) [21:12:09] (03Merged) 10jenkins-bot: [Beta Cluster] Stop setting wgMFPhotoUploadEndpoint, ignored. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599443 (owner: 10Jforrester) [21:12:17] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.34/includes/specials/SpecialUserrights.php: T253909 Restore visibility (previously implicitely public) (duration: 01m 06s) [21:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:21] T253909: Global group membership cannot be changed - https://phabricator.wikimedia.org/T253909 [21:16:03] (03CR) 10Thcipriani: [C: 03+1] "Working well, zomg new features!" [puppet] - 10https://gerrit.wikimedia.org/r/593936 (https://phabricator.wikimedia.org/T242882) (owner: 10Brennen Bearnes) [21:26:10] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.34/includes/filerepo/FileRepo.php: T253922 Mark two FileRepo functions public (duration: 01m 07s) [21:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:14] T253922: PHP "cannot access private method LocalRepo::passThrough" from SpecialUpload->processUpload - https://phabricator.wikimedia.org/T253922 [21:29:00] twentyafterfour: OK, we should be good to re-roll the train, assuming that Krinkle intentionally didn't mark T253927 as a train blocker. 
[21:29:00] T253927: "PHP Notice: Undefined offset: 0" from HTMLMultiSelectField.php - https://phabricator.wikimedia.org/T253927 [21:31:22] Indeed [21:31:35] twentyafterfour: Want to do the honours? :-) [21:31:59] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [21:34:00] (03PS1) 10Andrew Bogott: designate pools.yaml: Fix some hard-coded eqiad things [puppet] - 10https://gerrit.wikimedia.org/r/599448 (https://phabricator.wikimedia.org/T253780) [21:34:02] (03PS1) 10Andrew Bogott: Designate: designate doesn't need to write to the pdns db anymore [puppet] - 10https://gerrit.wikimedia.org/r/599449 (https://phabricator.wikimedia.org/T253780) [21:34:04] (03PS1) 10Andrew Bogott: designate.conf.erb: remove pdns pool config [puppet] - 10https://gerrit.wikimedia.org/r/599450 (https://phabricator.wikimedia.org/T253780) [21:36:02] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [21:39:11] (03PS2) 10Andrew Bogott: designate pools.yaml: Fix some hard-coded eqiad things [puppet] - 10https://gerrit.wikimedia.org/r/599448 (https://phabricator.wikimedia.org/T253780) [21:39:13] (03PS2) 10Andrew Bogott: Designate: designate doesn't need to write to the pdns db anymore [puppet] - 10https://gerrit.wikimedia.org/r/599449 (https://phabricator.wikimedia.org/T253780) [21:39:15] (03PS2) 10Andrew Bogott: designate.conf.erb: remove pdns pool config [puppet] - 10https://gerrit.wikimedia.org/r/599450 (https://phabricator.wikimedia.org/T253780) [21:42:48] (03PS1) 10Andrew Bogott: Add a dummy ldap sync pass [labs/private] - 10https://gerrit.wikimedia.org/r/599453 [21:42:57] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add a dummy ldap sync pass [labs/private] - 10https://gerrit.wikimedia.org/r/599453 (owner: 10Andrew Bogott) [21:45:05] OK, let's roll the train. [21:46:02] (03PS1) 10Jforrester: all wikis to 1.35.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599454 [21:46:04] (03CR) 10Jforrester: [C: 03+2] all wikis to 1.35.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599454 (owner: 10Jforrester) [21:46:47] (03PS1) 10Andrew Bogott: Add another ldap dummy pass [labs/private] - 10https://gerrit.wikimedia.org/r/599455 [21:46:50] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.34 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599454 (owner: 10Jforrester) [21:46:53] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add another ldap dummy pass [labs/private] - 10https://gerrit.wikimedia.org/r/599455 (owner: 10Andrew Bogott) [21:48:32] (03CR) 10Andrew Bogott: [C: 03+2] designate pools.yaml: Fix some hard-coded eqiad things [puppet] - 10https://gerrit.wikimedia.org/r/599448 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [21:48:39] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 
1.35.0-wmf.34 [21:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:57] doh just missed it [21:49:58] OK, quiet so far it seems. [21:50:07] Sorry twentyafterfour, I'm stealing your thunder. :-) [21:50:26] I just wanted to give us as much runway as possible if we need to re-revert. [21:50:30] I'll keep an eye on things [21:51:57] (03PS3) 10Andrew Bogott: designate.conf.erb: remove pdns pool config [puppet] - 10https://gerrit.wikimedia.org/r/599450 (https://phabricator.wikimedia.org/T253780) [21:51:59] (03PS3) 10Andrew Bogott: Designate: designate doesn't need to write to the pdns db anymore [puppet] - 10https://gerrit.wikimedia.org/r/599449 (https://phabricator.wikimedia.org/T253780) [21:52:26] DB INSERT rate looks normal, unless I'm missing something. [21:52:46] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen&panelId=2&from=now-3h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1083&var-port=9104&refresh=1m [21:55:45] yeah looks fine to me [21:56:44] Excellent. Congratulations. [21:58:05] does it not take a while to ramp up? [21:59:03] Possibly, but that doesn't make much sense given they're blaming a ResourceLoader stampede (which is a one-and-done deploy-time change). 
[21:59:36] (03CR) 10Andrew Bogott: [C: 03+2] designate.conf.erb: remove pdns pool config [puppet] - 10https://gerrit.wikimedia.org/r/599450 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [21:59:47] (03CR) 10Andrew Bogott: [C: 03+2] Designate: designate doesn't need to write to the pdns db anymore [puppet] - 10https://gerrit.wikimedia.org/r/599449 (https://phabricator.wikimedia.org/T253780) (owner: 10Andrew Bogott) [22:02:12] (03PS3) 10Jforrester: Remove PHP version if around $wgOverrideUcfirstCharacters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599113 (owner: 10Reedy) [22:02:26] (03CR) 10Jforrester: [C: 03+2] Remove PHP version if around $wgOverrideUcfirstCharacters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599113 (owner: 10Reedy) [22:03:30] (03Merged) 10jenkins-bot: Remove PHP version if around $wgOverrideUcfirstCharacters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599113 (owner: 10Reedy) [22:05:37] (03PS3) 10Jforrester: testNoAmbiguouslyTaggedSettings: Re-work to identify the dblists and values at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599122 [22:05:43] (03CR) 10Jforrester: [C: 03+2] testNoAmbiguouslyTaggedSettings: Re-work to identify the dblists and values at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599122 (owner: 10Jforrester) [22:06:19] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Remove version wrapper around wgOverrideUcfirstCharacters; always true (duration: 00m 59s) [22:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:30] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:06:31] (03Merged) 10jenkins-bot: testNoAmbiguouslyTaggedSettings: Re-work to identify the dblists and values at fault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599122 (owner: 10Jforrester) [22:06:42] 
(03PS2) 10Jforrester: Group CheckUser rights together in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594904 (owner: 10Tchanders) [22:06:58] (03CR) 10Jforrester: [C: 03+2] Group CheckUser rights together in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594904 (owner: 10Tchanders) [22:07:48] (03Merged) 10jenkins-bot: Group CheckUser rights together in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594904 (owner: 10Tchanders) [22:09:16] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Move one CheckUser right change next to the other (duration: 00m 57s) [22:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:32] (03PS1) 10Mholloway: MachineVision: Update blocklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599462 (https://phabricator.wikimedia.org/T253821) [22:18:35] (03CR) 10jerkins-bot: [V: 04-1] MachineVision: Update blocklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599462 (https://phabricator.wikimedia.org/T253821) (owner: 10Mholloway) [22:18:55] (03PS2) 10Mholloway: MachineVision: Update blocklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599462 (https://phabricator.wikimedia.org/T253821) [22:19:07] mdholloway: Need that slung out right now? [22:19:18] James_F: yes, please! [22:20:32] (03CR) 10Jforrester: [C: 03+2] MachineVision: Update blocklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599462 (https://phabricator.wikimedia.org/T253821) (owner: 10Mholloway) [22:21:26] that adds more phpcs issues ;p [22:21:41] (03Merged) 10jenkins-bot: MachineVision: Update blocklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599462 (https://phabricator.wikimedia.org/T253821) (owner: 10Mholloway) [22:22:55] Reedy: If we cared, we'd enforce, right? 
[22:24:29] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T253821 Update MachineVision block list for 2020-05-27 (duration: 00m 57s) [22:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:33] T253821: CAT blacklist update, 2020-05-27 - https://phabricator.wikimedia.org/T253821 [22:24:48] mdholloway: Done. [22:24:59] I guess fix all the issues first [22:25:00] https://github.com/wikimedia/operations-mediawiki-config/blob/master/phpcs.xml#L15 [22:25:15] Yeah. :-( [22:25:24] Or just wait for me to move it to YAML? ;) [22:26:58] Thanks for deploying, James_F. I can do another quick patch to fix the spacing. [22:27:18] No, don't worry about it. [22:28:29] k. thanks again! :) [22:29:29] Hmm, looks like the wikibugs test suite (at least) can't talk to Phabricator any more because of an error about the TLS version? [22:38:24] (03PS1) 10BryanDavis: wmcs: Add db1141.eqiad.wmnet to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/599466 (https://phabricator.wikimedia.org/T253930) [22:41:09] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.34/extensions/WikibaseMediaInfo/src/Services/FilePageLookup.php: T253792 Follow-up 1827c7a: Ensure inNamespace() is called only on Title object (duration: 00m 58s) [22:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:14] T253792: Fatal error: Call to a member function inNamespace() on null (from FilePageLookup.php) - https://phabricator.wikimedia.org/T253792 [22:41:57] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1002/22881/labstore1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/599466 (https://phabricator.wikimedia.org/T253930) (owner: 10BryanDavis) [22:42:02] (03CR) 10Bstorm: [C: 03+2] wmcs: Add db1141.eqiad.wmnet to maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/599466 (https://phabricator.wikimedia.org/T253930) (owner: 10BryanDavis) [23:00:04] RoanKattouw, 
Niharika, and Urbanecm: May I have your attention please! Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200528T2300) [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:06:06] I have a last minute SWAT addition [23:06:12] (03PS2) 10Aaron Schulz: Enable "coalesceKeys"="non-global" for WANCache on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/598852 [23:06:12] pending some review to unblock https://phabricator.wikimedia.org/T253912 [23:07:59] RoanKattouw: Urbanecm James_F any chance you can help me swat https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/Vector/+/599469/ ? [23:08:09] i can put it on wikitech deployments if so [23:08:31] master patch isn't reviewed yet? [23:08:48] James_F: stephen is doing it as we speak [23:08:57] he's almost done [23:09:04] You're just removing the `display: none` from `.vector-menu-empty`? [23:09:17] basically yeh :) [23:09:48] addPortletLink apparently needs the emptyPortlet class to reveal previously empty menus [23:09:58] so if the vector-menu-empty class is there the portal never gets revealed [23:10:04] But that's unchanged? [23:10:42] So an empty portal has classes emptyPortlet AND vector-menu-empty [23:10:48] if the portal is empty [23:10:55] Right [23:10:57] if a gadget calls `mw.util.addPortletLink('p-cactions', '#', 'yooooo')` [23:11:07] it removes emptyPortlet but not vector-menu-empty to reveal it [23:11:11] so this rule is tripping up that logic [23:11:18] Oh. [23:11:20] this is the hot fix - i'll future proof this later [23:11:22] * James_F sighs. [23:11:34] Moving Vector out of core but leaving half the JS behind wasn't a great idea. 
[23:11:39] there was no comment in Vector before saying "if you change this class change it in addPortletLink or don't change" [23:11:42] yeh [23:11:55] i've added it to my skin sanity todo list :) [23:12:13] James_F: anyway apparently some critical wikidata gadgets are broken [23:12:33] thanks for the merge... want me to add on wikitech:deployments for posterity? [23:12:41] Nah. [23:12:41] it's been a loonngggg day :) [23:13:00] Very much so. :-) [23:19:59] Krinkle: Aha! https://gerrit.wikimedia.org/r/c/mediawiki/core/+/598117 fails on tests with "1) ChangesListSpecialPageTest::testRcNsFilterMultipleAssociated – InvalidArgumentException: NamespaceInfo::isTalk called with non-integer (string) namespace '1'". [23:20:09] Should we just silently try intval()? [23:35:35] hm. in general we support int/string interchangeably I think for ns ids [23:35:46] because they're often from array keys or user input [23:35:51] Should we? [23:36:03] well, today, yes [23:36:16] tomorrow we can start doing better [23:36:37] (03PS1) 10Bstorm: toolforge-k8s: proposing removing hostkey checking for the upgrades [puppet] - 10https://gerrit.wikimedia.org/r/599472 (https://phabricator.wikimedia.org/T246122) [23:37:31] James_F: try is_numeric [23:37:38] the php warning should not happen for int-ish strings [23:37:59] the problem in the task will be due to a worse kind of string (or different type altogether, not sure) [23:38:00] I think [23:38:04] (03PS1) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in uslfo and codfw [puppet] - 10https://gerrit.wikimedia.org/r/599473 (https://phabricator.wikimedia.org/T251466) [23:38:16] (03PS2) 10Bstorm: toolforge-k8s: proposing removing hostkey checking for the upgrades [puppet] - 10https://gerrit.wikimedia.org/r/599472 (https://phabricator.wikimedia.org/T246122) [23:38:44] James_F: https://3v4l.org/Wg2GA [23:38:51] Ack.
[23:39:37] (03PS1) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in esams and eqiad [puppet] - 10https://gerrit.wikimedia.org/r/599474 (https://phabricator.wikimedia.org/T251466) [23:39:55] (03PS2) 10Cwhite: hiera: install mtail 3.0.0~rc35 from component in ulsfo and codfw [puppet] - 10https://gerrit.wikimedia.org/r/599473 (https://phabricator.wikimedia.org/T251466) [23:41:01] James_F: actually ctype_digit might be better. is_numeric also accepts float-like and various non-base10/unconventional number-like things [23:41:10] there's probably an example of this somewhere for a similar use case [23:41:13] I gotta go now though :) [23:41:14] o/ [23:43:03] Krinkle: I'm just wrapping in intval(). [23:45:41] (03CR) 10Cwhite: "This change is ready for review." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [23:47:02] James_F: it's a warning, not exception right now. T253098 means it is already implicitly casting the garbage to 0 as intval would. [23:47:03] T253098: "PHP Warning: A non-numeric value encountered" from NamespaceInfo.php via SpecialRecentChanges.php - https://phabricator.wikimedia.org/T253098 [23:47:15] I don't think we should remove our ability to find that bug. [23:48:35] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.34/skins/Vector/resources/skins.vector.styles/Menu.less: T253912 Hotfix: Cannot rename emptyPortlet to empty-portlet yet (duration: 00m 59s) [23:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:39] T253912: addPortletLink doesn't reveal hidden menus in Vector any more [causes disappearance of merge datas on Wikidata] - https://phabricator.wikimedia.org/T253912 [23:49:10] Hmm. We could wfDeprecated() warn if `intval(foo) != foo`, without complaining that it's not an int. [23:49:41] Jdlrobson: Deployed, finally.