[00:00:56] (03PS1) 10Dzahn: tcpircbot: remove "tin" IP from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635106 [00:02:13] (03PS1) 10Dzahn: tcpircbot: add deploy1002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635107 (https://phabricator.wikimedia.org/T265963) [00:03:00] (03PS1) 10Dzahn: tcpircbot: remove deploy1001 from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635108 (https://phabricator.wikimedia.org/T265963) [00:04:59] (03PS1) 10Dzahn: scap/dsh: add deploy1002 to mediawiki_installation hosts [puppet] - 10https://gerrit.wikimedia.org/r/635109 (https://phabricator.wikimedia.org/T265963) [00:07:01] (03PS1) 10Dzahn: cumin: remove hardcoded hostname from comments [puppet] - 10https://gerrit.wikimedia.org/r/635110 (https://phabricator.wikimedia.org/T265963) [00:10:46] (03PS4) 10Dzahn: puppetmaster: pass $servers parameter to gitclone class [puppet] - 10https://gerrit.wikimedia.org/r/634368 [00:10:48] (03PS2) 10Dzahn: hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) [00:10:50] (03PS1) 10Dzahn: common/scap/DHCP: remove deploy1001 from scap hosts and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/635111 (https://phabricator.wikimedia.org/T265963) [00:12:09] (03PS1) 10Dzahn: site: remove deploy1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/635112 (https://phabricator.wikimedia.org/T265963) [00:13:45] (03PS1) 10Dzahn: switch deployment CNAME from deploy1001 to deploy1002 [dns] - 10https://gerrit.wikimedia.org/r/635113 (https://phabricator.wikimedia.org/T265963) [00:14:34] (03PS1) 10Dzahn: remove deploy1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/635114 (https://phabricator.wikimedia.org/T265963) [00:39:27] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23405408 and 0 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:39:27] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37636264 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:09] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 44800 and 27 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:09] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 976 and 27 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:02] (03PS1) 10Andrew Bogott: cloud-vps ceph backups: Standardize on the hiera key 'profile::wmcs::backy2::backup_time' [puppet] - 10https://gerrit.wikimedia.org/r/635117 [00:42:21] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps ceph backups: Standardize on the hiera key 'profile::wmcs::backy2::backup_time' [puppet] - 10https://gerrit.wikimedia.org/r/635117 (owner: 10Andrew Bogott) [00:43:55] (03PS2) 10Andrew Bogott: cloud-vps ceph backups: use hiera key 'profile::wmcs::backy2::backup_time' [puppet] - 10https://gerrit.wikimedia.org/r/635117 [00:44:32] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps ceph backups: use hiera key 'profile::wmcs::backy2::backup_time' [puppet] - 10https://gerrit.wikimedia.org/r/635117 (owner: 10Andrew Bogott) [02:07:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.14 [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635121 [02:22:24] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1019/1020: Move to Buster on next reimage [puppet] - 10https://gerrit.wikimedia.org/r/635023 (https://phabricator.wikimedia.org/T263677) (owner: 10Andrew Bogott) [04:30:48] 10Operations, 10MediaWiki-General, 10Platform Engineering: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (10tstarling) The way it worked in T37378 was to 
include cl_collation in the primary key. So the data in the table was duplicated, but with cl_sortkey depending on... [04:55:37] 10Operations, 10MediaWiki-General, 10Platform Engineering: Allow easier ICU transitions in MediaWiki - https://phabricator.wikimedia.org/T263437 (10tstarling) We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU. That would make the T37378 approach more feasibl... [05:02:51] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Pablo-WMDE) Works for me now (but my reproduction seemed to be much less compelling than those from other reporters). Thanks! [05:17:59] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Marostegui) p:05Triage→03Medium [05:39:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:40:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:58:03] 10Operations, 10Cloud-Services: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823 (10Marostegui) 05Open→03Declined I am going to close this as per T163823#3214418 as it is hard to reproduce and/or never happened agai... 
[05:58:06] 10Operations, 10Cloud-Services: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402 (10Marostegui) [06:21:23] (03CR) 10Elukey: [C: 03+1] hue: switch from nginx to envoy for tls [puppet] - 10https://gerrit.wikimedia.org/r/634660 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:21:44] (03CR) 10Elukey: [C: 03+1] turnilo: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634661 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:22:04] (03CR) 10Elukey: [C: 03+1] superset: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634662 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:22:49] (03CR) 10Elukey: [C: 03+1] piwik: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634664 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:23:17] (03CR) 10Elukey: [C: 03+1] stats: Add envoy on port 8443 alongside nginx [puppet] - 10https://gerrit.wikimedia.org/r/634667 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:24:17] (03CR) 10Elukey: stats: temporarily switch analytics sites to port 8443 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634669 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [06:24:29] 10Operations, 10DBA: Rename be_x_oldwiki database to be_taraskwiki - https://phabricator.wikimedia.org/T127570 (10Marostegui) [06:25:26] 10Operations, 10DBA, 10Wikimedia-Site-requests: script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609 (10Marostegui) 05Stalled→03Declined I am going to close this. I don't think we'll really work on this. First of all, renaming a database isn't possible on MySQL (there is no `... 
[06:27:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [06:29:00] 10Operations, 10Patch-Needs-Improvement, 10User-fgiunchedi: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673 (10Marostegui) @fgiunchedi can this task be closed or is this still an issue? [06:30:48] 10Operations, 10Analytics, 10SRE-Access-Requests: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) [06:31:05] 10Operations, 10Analytics, 10SRE-Access-Requests: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10elukey) [06:31:34] 10Operations, 10SRE-tools, 10Traffic, 10netbox, and 4 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10ayounsi) [06:32:48] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635110 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [06:32:50] (03PS1) 10Elukey: Add sbisson to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/635227 (https://phabricator.wikimedia.org/T265969) [06:32:51] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [06:33:43] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - 
https://phabricator.wikimedia.org/T265969 (10elukey) @Nuria can you review/approve? I'll then merge and create the kerberos identity :) [06:37:15] (03PS1) 10Elukey: Decommission analytics1056 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/635235 (https://phabricator.wikimedia.org/T255140) [06:37:59] (03CR) 10Elukey: [C: 03+2] Decommission analytics1056 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/635235 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [06:51:41] (03PS2) 10Alexandros Kosiaris: sretest: Experiment with preserving docker rules [puppet] - 10https://gerrit.wikimedia.org/r/634192 [06:56:05] (03PS2) 10Nikerabbit: Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634224 (https://phabricator.wikimedia.org/T264158) [06:57:18] (03PS3) 10Nikerabbit: Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634224 [07:00:23] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10Marostegui) p:05Triage→03Medium a:03elukey [07:02:04] (03PS1) 10Nikerabbit: Disable registrations stat on Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635237 (https://phabricator.wikimedia.org/T264158) [07:16:07] (03PS3) 10Alexandros Kosiaris: sretest: Experiment with preserving docker rules [puppet] - 10https://gerrit.wikimedia.org/r/634192 [07:17:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC at https://puppet-compiler.wmflabs.org/compiler1001/25989/sretest1001.eqiad.wmnet/fulldiff.html, seems to do what intended and even ha" [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [07:21:16] (03CR) 10Volans: "Some questions inline for related parties, if you see your name in a 
comment it's a question for you." (036 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [07:21:31] 10Operations, 10patch-welcome: smokeping config puppetization issue? - https://phabricator.wikimedia.org/T131326 (10Marostegui) 05Open→03Resolved a:03Dzahn This looks resolved: ` root@netmon1002:/etc/smokeping# puppet agent -t Info: Using configured environment 'production' Info: Retrieving pluginfacts I... [07:21:51] (03CR) 10Volans: [C: 04-1] "Some questions inline for related parties, if you see your name in a comment it's a question for you." (0310 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [07:22:39] (03CR) 10Nikerabbit: wgSkipSkins: Exclude contenttranslation skin from skin options for users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628065 (https://phabricator.wikimedia.org/T263093) (owner: 10Santhosh) [07:27:04] (03CR) 10Alexandros Kosiaris: "That's a setting directly from express. See https://expressjs.com/en/guide/behind-proxies.html" [puppet] - 10https://gerrit.wikimedia.org/r/635094 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [07:27:54] (03CR) 10Alexandros Kosiaris: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/635098 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [07:32:28] (03PS1) 10Alexandros Kosiaris: service: Fix termbox/wikifeeds typo [puppet] - 10https://gerrit.wikimedia.org/r/635239 [07:33:20] (03CR) 10JMeybohm: [C: 03+2] Enable atomic helm upgrades for admin deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/634954 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [07:33:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] service: Fix termbox/wikifeeds typo [puppet] - 10https://gerrit.wikimedia.org/r/635239 (owner: 10Alexandros Kosiaris) [07:35:49] (03Merged) 10jenkins-bot: Enable atomic helm upgrades for admin deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/634954 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [07:36:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Aside from the tabs vs spaces thing, LGTM, this is the default anyway" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [07:36:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] "tabs vs spaces, but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635091 (owner: 10Dzahn) [07:41:36] (03CR) 10Ayounsi: netbox: Move eqiad private to automation (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [07:46:56] (03CR) 10Ayounsi: netbox: Move eqiad public to automation (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [07:52:12] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10hashar) Thanks for adding the Prometheus probe, looks like it could be helpful in the future :) I guess raising `commitRateLimiting` addressed it. Once upstream release a new version... 
[07:52:49] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add alerts for Thanos sidecar not uploading or failing to do so [puppet] - 10https://gerrit.wikimedia.org/r/634475 (https://phabricator.wikimedia.org/T265632) (owner: 10Filippo Giunchedi) [07:57:58] (03PS1) 10JMeybohm: Fix json syntax [software/heptiolabs/eventrouter] - 10https://gerrit.wikimedia.org/r/635240 [07:58:00] (03PS1) 10JMeybohm: Add vendor [software/heptiolabs/eventrouter] - 10https://gerrit.wikimedia.org/r/635241 [07:58:59] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix json syntax [software/heptiolabs/eventrouter] - 10https://gerrit.wikimedia.org/r/635240 (owner: 10JMeybohm) [08:07:55] (03Abandoned) 10JMeybohm: Add vendor [software/heptiolabs/eventrouter] - 10https://gerrit.wikimedia.org/r/635241 (owner: 10JMeybohm) [08:09:06] (03PS1) 10JMeybohm: Fix json syntax [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/635242 [08:09:08] (03PS1) 10JMeybohm: Add vendor [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/635243 [08:12:28] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix json syntax [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/635242 (owner: 10JMeybohm) [08:12:47] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add vendor [software/heptiolabs/eventrouter] (v0.3-wmf) - 10https://gerrit.wikimedia.org/r/635243 (owner: 10JMeybohm) [08:19:33] (03PS1) 10Phuedx: SearchSatisfaction: Set isAnon field [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635030 (https://phabricator.wikimedia.org/T259250) [08:26:50] PROBLEM - Thanos sidecar is failing to upload blocks on alert1001 is CRITICAL: cluster=prometheus instance={prometheus1003,prometheus1004} job=thanos-sidecar prometheus=ops site=eqiad https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar [08:43:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci: Install 
docker-credential-environment credHelper [puppet] - 10https://gerrit.wikimedia.org/r/634316 (https://phabricator.wikimedia.org/T265177) (owner: 10Dduvall) [08:50:21] (03PS2) 10Alexandros Kosiaris: admin/akosiaris: Switch to manually controlling proxies [puppet] - 10https://gerrit.wikimedia.org/r/634320 [08:51:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin/akosiaris: Switch to manually controlling proxies [puppet] - 10https://gerrit.wikimedia.org/r/634320 (owner: 10Alexandros Kosiaris) [08:58:24] (03PS1) 10JMeybohm: eventgate-analytics-external: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635246 (https://phabricator.wikimedia.org/T264157) [08:58:26] (03PS1) 10JMeybohm: eventgate-logging-external: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635247 (https://phabricator.wikimedia.org/T264157) [08:58:28] (03PS1) 10JMeybohm: eventstreams: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635248 (https://phabricator.wikimedia.org/T264157) [09:02:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:02:24] (03CR) 10JMeybohm: [C: 03+2] eventgate-analytics-external: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635246 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:02:40] apergos: lots of errors coming from snapshot1009, not sure if that's "normal" [09:03:08] apergos: all for nlwiktionary from what I can see [09:03:30] https://logstash.wikimedia.org/goto/4e7be79a21fe0321b262f03f89bd5e2d [09:03:50] RECOVERY - Prometheus jobs reduced availability 
on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:04:16] marostegui: that's not usual, no [09:04:24] nothing has changed recently on my end [09:04:29] what sort of errors? [09:04:29] (03CR) 10Klausman: [C: 03+2] Assing role::analytics_cluster::coordinator::query to an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/635000 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [09:05:07] (03Merged) 10jenkins-bot: eventgate-analytics-external: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635246 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [09:06:10] (03CR) 10Kormat: "Adding Brooke here instead of me; i didn't touch this hiera() calls in my CR as i don't know if WMCS is ever relying on the default `unset" [puppet] - 10https://gerrit.wikimedia.org/r/634387 (https://phabricator.wikimedia.org/T256972) (owner: 10Dzahn) [09:06:38] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [09:06:42] apergos: 90% of previous cases it is mw "breaking" (softly) dump process, or heavy concurrent mw maintenance running at the same time [09:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:37] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [09:08:37] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[09:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:40] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:10:19] that's me ^
[09:10:56] we're in eqiad so I would expect it not to be other maintenance scripts running
[09:11:04] but let's see what marostegui says
[09:11:06] true
[09:11:22] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:23] curious to see what nlwiktionary has that nothing else does
[09:12:27] apergos: Nothing on my end running on eqiad either
[09:12:46] what exactly are you seeing?
[09:12:58] I don't even see tables dumped for that wiki, that's odd
[09:13:15] WMFTimeoutException ? marostegui
[09:13:16] but other wikis are in that state, so maybe that's nbd
[09:13:34] it's very early in the new run..
[09:13:34] let me help with this so you can focus on your clinic duty
[09:14:04] apergos: https://phabricator.wikimedia.org/P13027
[09:14:04] I see also memory exhaustion, but that is "normal"
[09:14:43] (PS1) Filippo Giunchedi: thanos: disable compaction check in sidecar [puppet] - https://gerrit.wikimedia.org/r/635249 (https://phabricator.wikimedia.org/T261281)
[09:14:46] crap ok that looks "broken" legitimately in mw someplace
[09:14:53] :-(
[09:14:56] (CR) jerkins-bot: [V: -1] thanos: disable compaction check in sidecar [puppet] - https://gerrit.wikimedia.org/r/635249 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[09:15:07] apergos: https://logstash.wikimedia.org/goto/d3cac7d001fadb59c1adb65ea0edab1b
[09:15:24] (PS2) Filippo Giunchedi: thanos: disable compaction check in sidecar [puppet] - https://gerrit.wikimedia.org/r/635249 (https://phabricator.wikimedia.org/T261281)
[09:15:36] uff
[09:16:12] let me see if anything is being written out
[09:16:26] maybe some errors are being exposed in a different way that were previously buried...and maybe not
[09:17:16] apergos: please share any findings to discard data corruption on my side, and that it is only code path issues
[09:18:04] yeah
[09:18:13] e.g. which ids are failing to query the db for them
[09:18:21] ids/rows
[09:20:48] ok so first off, we are generating metadata. there should be no circumstance in which we try to get content for these revs
[09:20:52] so that's broken
[09:21:16] (CR) Vgutierrez: [C: +1] acmechief: Also allow ldap-replica2003/2004 [puppet] - https://gerrit.wikimedia.org/r/634974 (https://phabricator.wikimedia.org/T264388) (owner: Muehlenhoff)
[09:21:24] i think that is relatively good news Re:data corruption
[09:21:25] now let me see about the specific revs/slots
[09:21:30] well no.
[09:21:43] ah, it still could be metadata corruption
[09:21:45] this is in the lookup of the sha1 so I need to see if that entry is missing and if so, why
[09:21:49] uh yeah
[09:23:28] we don't have the specific revid (of course) so I need to do this the hard way, trying to narrow down the range. it may be a little while.
[09:23:46] yeah, a range would work, too
[09:25:19] (PS1) Kormat: mariadb: Add profile:: prefix to mariadb::mysql_role [puppet] - https://gerrit.wikimedia.org/r/635253 (https://phabricator.wikimedia.org/T247956)
[09:25:19] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[09:25:19] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[09:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:10] Operations, serviceops, vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (Marostegui)
[09:26:15] (CR) Filippo Giunchedi: [C: +2] thanos: disable compaction check in sidecar [puppet] - https://gerrit.wikimedia.org/r/635249 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[09:26:34] I'm going to start testing these on snapshot1005, jynus, so if you see errors there, that's what it is
[09:26:50] (CR) JMeybohm: [C: +2] eventgate-logging-external: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - https://gerrit.wikimedia.org/r/635247 (https://phabricator.wikimedia.org/T264157) (owner: JMeybohm)
[09:27:00] Operations, Traffic, Performance-Team (Radar): Consider collecting more timestamp milestones from ATS-TLS - https://phabricator.wikimedia.org/T265869 (ema) p:Triage→Medium
[09:27:49] apergos: thanks for the heads up
[09:28:21] (CR) Kormat: "PCC no changes: https://puppet-compiler.wmflabs.org/compiler1001/25990/" [puppet] - https://gerrit.wikimedia.org/r/635253 (https://phabricator.wikimedia.org/T247956) (owner: Kormat)
[09:29:54] (Merged) jenkins-bot: eventgate-logging-external: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - https://gerrit.wikimedia.org/r/635247 (https://phabricator.wikimedia.org/T264157) (owner: JMeybohm)
[09:30:10] ok so the good news is I ran a little batch and I saw 5 errors from mw but the stub file was produced, so it's not breaking the dumps.
[09:30:24] now let me narrow it down a bit
[09:31:58] (PS1) Elukey: role::analytics_cluster::coordinator: force presto to use puppet tls certs [puppet] - https://gerrit.wikimedia.org/r/635255 (https://phabricator.wikimedia.org/T253957)
[09:32:42] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: force presto to use puppet tls certs [puppet] - https://gerrit.wikimedia.org/r/635255 (https://phabricator.wikimedia.org/T253957) (owner: Elukey)
[09:33:22] Operations, serviceops, vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (akosiaris) LGTM, perhaps old do codfw as well since you are at it to have a fallback/backup?
[09:33:32] RECOVERY - Thanos sidecar is failing to upload blocks on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar
[09:34:00] Operations, serviceops, vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (Marostegui) No, no need for codfw for now, we are still on super early stages.
[09:35:55] (PS1) JMeybohm: Initial commit of eventrouter chart from stable/charts [deployment-charts] - https://gerrit.wikimedia.org/r/635258 (https://phabricator.wikimedia.org/T262675)
[09:35:59] (PS1) JMeybohm: admin: deploy eventrouter to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675)
[09:37:01] (PS1) Klausman: analytics_cluster: Add rocm38 group for driver update testing [puppet] - https://gerrit.wikimedia.org/r/635260 (https://phabricator.wikimedia.org/T264408)
[09:37:04] (CR) jerkins-bot: [V: -1] admin: deploy eventrouter to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[09:37:12] (CR) jerkins-bot: [V: -1] Initial commit of eventrouter chart from stable/charts [deployment-charts] - https://gerrit.wikimedia.org/r/635258 (https://phabricator.wikimedia.org/T262675) (owner: JMeybohm)
[09:37:54] Operations, Puppet, observability, User-fgiunchedi, User-jbond: PuppetDB grafana graphs not matching logs - https://phabricator.wikimedia.org/T265649 (fgiunchedi) Looking at the panel's query, AFAICT that's the rate of `replace facts` operations: ` rate(puppetlabs_puppetdb_mq_replace_facts_5...
[09:41:33] (PS2) JMeybohm: Initial commit of eventrouter chart from stable/charts [deployment-charts] - https://gerrit.wikimedia.org/r/635258 (https://phabricator.wikimedia.org/T262675)
[09:41:35] (PS2) JMeybohm: admin: deploy eventrouter to all clusters [deployment-charts] - https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675)
[09:42:10] apergos: there was one error of that kind on dewiktionary
[09:42:23] they will be the same sort of thing
[09:42:24] not that 1 error is important, but probably relevant for investigation
[09:42:33] I guess we will see them on other wikis soon enough
[09:42:37] https://logstash.wikimedia.org/goto/177c76bfb351c806628a27a098bba4de
[09:42:40] still trying to narrow down the range
[09:42:44] thanks
[09:43:14] when I have the exact revision I can go look at all the fields and see what's up
[09:43:17] also check previous dumps etc
[09:46:26] (PS2) Klausman: analytics_cluster: set one machine to receive rocm38 [puppet] - https://gerrit.wikimedia.org/r/635260 (https://phabricator.wikimedia.org/T264408)
[09:47:04] (CR) Elukey: [C: +1] analytics_cluster: set one machine to receive rocm38 [puppet] - https://gerrit.wikimedia.org/r/635260 (https://phabricator.wikimedia.org/T264408) (owner: Klausman)
[09:47:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:47:40] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[09:47:41] (03PS1) 10Elukey: role::analytics_cluster::presto::server: force presto to use puppet TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/635261 (https://phabricator.wikimedia.org/T253957) [09:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:09] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::presto::server: force presto to use puppet TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/635261 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [09:48:45] 10Operations, 10Cloud-Services, 10Traffic, 10cloud-services-team (Kanban): cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10Marostegui) [09:48:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:50] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Marostegui) p:05Triage→03Low [09:50:18] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [09:50:18] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [09:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:30] (03PS3) 10Jbond: diffscan: switch to new refactored diffscan [puppet] - 10https://gerrit.wikimedia.org/r/634566 [09:51:40] 10Operations, 10netops, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10aborrero) No problems on my side. 
Probably the smart thing to do is to clearly define the semantics of that data file, so we can safely add/remove stuff from th... [09:51:46] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime [09:51:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:13] (03CR) 10Klausman: [C: 03+2] analytics_cluster: set one machine to receive rocm38 [puppet] - 10https://gerrit.wikimedia.org/r/635260 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [09:54:22] (03PS2) 10Jbond: sre.pdus.rotate-password: fix TypeError: 'tuple' object does not support item assignment [cookbooks] - 10https://gerrit.wikimedia.org/r/629074 [09:55:50] (03CR) 10Jbond: [C: 03+2] sre.pdus.rotate-password: fix TypeError: 'tuple' object does not support item assignment [cookbooks] - 10https://gerrit.wikimedia.org/r/629074 (owner: 10Jbond) [09:57:01] (03Merged) 10jenkins-bot: sre.pdus.rotate-password: fix TypeError: 'tuple' object does not support item assignment [cookbooks] - 10https://gerrit.wikimedia.org/r/629074 (owner: 10Jbond) [09:59:10] (03PS1) 10Filippo Giunchedi: puppetdb: don't export metrics for host-specific mbeans [puppet] - 10https://gerrit.wikimedia.org/r/635264 (https://phabricator.wikimedia.org/T265649) [09:59:35] !log T255399: resuming wdqs-data-reload manually from chunk no 776 on wdqs1009 [09:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:43] T255399: Prepare wdqs1009 to run the streaming updater - https://phabricator.wikimedia.org/T255399 [10:00:08] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: don't export metrics for host-specific mbeans [puppet] - 10https://gerrit.wikimedia.org/r/635264 (https://phabricator.wikimedia.org/T265649) (owner: 10Filippo Giunchedi) [10:01:11] (03PS3) 10Arturo Borrero Gonzalez: network: 
constants: add cloud floating IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/634050 [10:02:38] (03CR) 10Arturo Borrero Gonzalez: "OK the patch is back to the original version. Hope this works now." [puppet] - 10https://gerrit.wikimedia.org/r/634050 (owner: 10Arturo Borrero Gonzalez) [10:03:23] 10Operations, 10Documentation: document debian packaging guidelines - https://phabricator.wikimedia.org/T115757 (10Marostegui) 05Open→03Resolved Going to consider this as fixed for now with: https://wikitech.wikimedia.org/wiki/Package_management https://wikitech.wikimedia.org/wiki/Backport_packages https:/... [10:04:01] !log swift codfw-prod: bump object weight for ms-be2057 - T261633 [10:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:07] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [10:05:06] (03CR) 10Jbond: "looks good: minor nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) (owner: 10Kormat) [10:05:29] (03CR) 10Filippo Giunchedi: "Build failure is due to commit message:" [puppet] - 10https://gerrit.wikimedia.org/r/635264 (https://phabricator.wikimedia.org/T265649) (owner: 10Filippo Giunchedi) [10:06:55] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6538699, @Gilles wrote: > I've captured 30 minutes of data using varnishlog simultaneously on cp3052 and cp3054, using 4... 
[10:07:15] 10Operations, 10netops, 10Patch-For-Review, 10Security, 10User-jbond: Review default ferm INPUT policy - https://phabricator.wikimedia.org/T264888 (10jbond) [10:07:55] (03PS1) 10Elukey: presto: revert TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/635267 (https://phabricator.wikimedia.org/T253957) [10:08:30] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10Marostegui) [10:08:33] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976 (10Marostegui) [10:08:43] (03CR) 10Elukey: [C: 03+2] presto: revert TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/635267 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [10:08:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:08:56] 10Operations, 10DBA, 10Traffic, 10Sustainability: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141 (10Marostegui) 05Stalled→03Declined Closing this as we won't be really working on this anymore, but on deprecating tendril in favour of something e... [10:09:18] (03CR) 10JMeybohm: [C: 03+2] eventstreams: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635248 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:09:27] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [10:09:27] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[10:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:11:43] (03PS3) 10Kormat: mariadb: Convert role::mariadb::misc to profile. [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) [10:11:45] (03PS1) 10Kormat: mariadb: Move passwords include to profiles [puppet] - 10https://gerrit.wikimedia.org/r/635268 (https://phabricator.wikimedia.org/T256972) [10:11:54] (03Merged) 10jenkins-bot: eventstreams: Update envoy to 1.15.1-2 See: Id8dfd7c5002cfd2c71b7f0aac4f21902035cc150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/635248 (https://phabricator.wikimedia.org/T264157) (owner: 10JMeybohm) [10:12:21] (03CR) 10Kormat: mariadb: Convert role::mariadb::misc to profile. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) (owner: 10Kormat) [10:13:14] great, over 1200 revisions to look at for the single page [10:16:08] (03CR) 10Jbond: parsoid: add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634385 (owner: 10Dzahn) [10:18:29] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [10:20:11] I see. these are old text entries that are stored directly in the text table (in theory), and the entry is supposedly deflated, but inflation fails on the entry. probably garbage in there. 
[10:20:19] I expect there are a number of these across the old wikis [10:20:26] I'll create a task so we can track this [10:20:43] jynus: ^^ [10:21:17] (03CR) 10Jbond: "Looks good see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634363 (owner: 10Dzahn) [10:24:39] 10Operations, 10observability, 10User-fgiunchedi: rsyslog occasional segfault on centrallog hosts - https://phabricator.wikimedia.org/T259780 (10fgiunchedi) [10:25:21] 10Operations, 10Cloud-Services: Add lock_wait_timeout to maintain_views and maintain-meta_p - https://phabricator.wikimedia.org/T160412 (10Marostegui) 05Open→03Resolved It is already there, it is set to 60 seconds, which is good enough. [10:25:57] (03CR) 10Jbond: [C: 03+1] puppetmaster: pass $servers parameter to gitclone class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634368 (owner: 10Dzahn) [10:30:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_restbase_esams} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:02] (03Abandoned) 10Kormat: mariadb: Add profile::mariadb::common [puppet] - 10https://gerrit.wikimedia.org/r/622578 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [10:31:04] 10Operations, 10SRE-swift-storage: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040 (10Marostegui) ok to close this task? I don't think this is reproducible anymore, and eventually will be fixed once the epic T160229 is completed. 
[10:32:00] 10Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966 (10Marostegui) 05Open→03Resolved Closing this per the last comment (also baham.wikimedia.org doesn't exist anymore) [10:32:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:32:25] jynus: https://phabricator.wikimedia.org/T265989 [10:32:31] (03CR) 10Jbond: [C: 03+1] "LGTM <3" [puppet] - 10https://gerrit.wikimedia.org/r/635253 (https://phabricator.wikimedia.org/T247956) (owner: 10Kormat) [10:32:36] looking [10:33:01] title may have been cut accidentally? [10:33:24] (03CR) 10Kormat: "Brooke: Adding you here as an FYI." [puppet] - 10https://gerrit.wikimedia.org/r/635253 (https://phabricator.wikimedia.org/T247956) (owner: 10Kormat) [10:33:27] (03CR) 10Kormat: [C: 03+2] mariadb: Add profile:: prefix to mariadb::mysql_role [puppet] - 10https://gerrit.wikimedia.org/r/635253 (https://phabricator.wikimedia.org/T247956) (owner: 10Kormat) [10:33:54] yes indeed [10:33:55] fixing [10:34:38] what is the context for the "Info from metadata dump:" comment, is that the range you believe is failing? [10:35:00] it is known to be failing [10:35:02] ah, I see, it is 0 size, correct? [10:35:07] not normal, right? [10:35:11] it is we can't decode the revision [10:35:18] there's no sha1, that's the first place it runs into trouble [10:35:22] yeah, but I can query [10:35:27] so then it says 'let me just get the content for that' [10:35:30] to see if a regular api call [10:35:35] also fails [10:35:40] look below at the sql [10:35:53] those are the 4 revisions as they are in the revision table [10:36:08] so there is some data?
[10:36:10] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635264 (https://phabricator.wikimedia.org/T265649) (owner: 10Filippo Giunchedi) [10:36:18] ok so [10:36:20] ah [10:36:22] the revisions have no sha1 [10:36:30] because the entry in the text table is garbage [10:36:37] so they can't actually generate a sha1... [10:36:41] I see [10:36:44] no sha1 = let's look up the content [10:36:56] oh it's garbage, it's gzip but I can't inflate... whine [10:37:04] 10Operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627 (10Marostegui) 05Open→03Resolved Closing per: T116627#3452914 [10:37:08] that's almost certainly all 1k of those other errors too [10:37:13] (03PS1) 10Ema: ATS: add metric trafficserver_tls_client_total_time [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) [10:37:15] what is the right way to fix, to try to regenerate the sha1 [10:37:16] I mean, we could check them all, but... [10:37:21] or remove them? [10:37:23] the text is bad. can't be done [10:37:38] the right way is to have daniel look at running the 'bad text entries' maintenance script I guess [10:37:42] so unlink/modify content to say "this revision was lost"? [10:37:48] cool [10:37:53] add him to the ticket [10:37:58] I've tossed it into the clinic duty pile [10:38:08] cool [10:38:08] from there it will get to him or to whoever should pick it up [10:38:11] marostegui thanks you [10:38:12] is this something new? [10:38:21] the cpt clinic duty thing?
mm [10:38:22] no [10:38:27] o, the error [10:38:27] the corruption [10:38:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) (owner: 10Kormat) [10:38:35] no these are all from 2004 [10:38:38] as it says on the task [10:38:40] (03CR) 10Jbond: [C: 03+1] "thx LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635268 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [10:38:55] (03CR) 10Kormat: [C: 03+2] mariadb: Convert role::mariadb::misc to profile. [puppet] - 10https://gerrit.wikimedia.org/r/634971 (https://phabricator.wikimedia.org/T265900) (owner: 10Kormat) [10:39:00] what's likel new is mw logging the error in this way [10:39:04] *likely [10:39:05] (03CR) 10Kormat: [C: 03+2] mariadb: Move passwords include to profiles [puppet] - 10https://gerrit.wikimedia.org/r/635268 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [10:39:17] I'd have to hunt around, which I can do if you think it's useful [10:40:05] apergos: I see [10:40:07] ok [10:40:13] no worth [10:40:17] (03PS2) 10Kormat: mariadb: Move passwords include to profiles [puppet] - 10https://gerrit.wikimedia.org/r/635268 (https://phabricator.wikimedia.org/T256972) [10:40:21] (03CR) 10Jbond: diffscan: switch to new refactored diffscan (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634566 (owner: 10Jbond) [10:40:32] 👍 [10:40:53] for now, ignore ignore ignore and sorry for the noise [10:41:09] if you do see any errors that are not this sha1 thing, feel free to poke me though [10:41:26] (03PS5) 10Jbond: labstore::nfs_mount: drop support for empty string share_path [puppet] - 10https://gerrit.wikimedia.org/r/630589 [10:45:02] 10Operations, 10Commons, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229 (10jcrespo) Backup of commons files is a part of the more ambitious: "Backup al wikis media files" 
project being worked curr... [10:45:51] 10Operations, 10Commons, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229 (10jcrespo) [10:45:53] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [10:45:58] PROBLEM - MariaDB Replica Lag: s6 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1039.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:46:09] ^ looking [10:47:16] uh, that could be me [10:47:20] ah, a dbstore host [10:47:32] I am running the checks for a task [10:47:35] jynus: i'm willing to accept that :) [10:47:41] but I didn't expect the maintenance to create lag [10:47:46] (it didn't for others) [10:47:51] I will downtime all now [10:48:43] jynus: maybe the logical dumps are also running? 
[10:48:50] they shouldn't [10:49:12] I can confirm they aren't now [10:49:27] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Convert role::mariadb::misc to profile - https://phabricator.wikimedia.org/T265900 (10Kormat) 05Open→03Resolved [10:49:27] something something about ruwiki is special [10:49:29] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (10Kormat) [10:49:50] as not even enwiki lags :-) [10:50:10] could also be the activity, we are getting peak writes and I did the other check at night [10:50:48] (03CR) 10Jbond: [C: 03+1] "This should be good to deploy now, @Bstorm feel free to deploy your self or let me know when is a good time thanks" [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [10:51:22] (03PS2) 10Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) [10:52:07] should be acked now, thanks and sorry, kormat [10:52:23] I will downtime the other s2 host just in case [10:52:38] jynus: np [10:53:14] (03PS1) 10Klausman: aptrepo: Add missing rocm38 deps [puppet] - 10https://gerrit.wikimedia.org/r/635279 (https://phabricator.wikimedia.org/T264408) [10:53:49] 10Operations, 10DBA, 10User-Kormat: Puppetize orchestrator - https://phabricator.wikimedia.org/T265990 (10Kormat) p:05Triage→03Medium [10:54:06] (03PS3) 10Giuseppe Lavagetto: Add apache httpd base image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) [10:54:09] 10Operations, 10DBA, 10User-Kormat: Puppetize orchestrator - https://phabricator.wikimedia.org/T265990 (10Kormat) [10:54:12] 10Operations, 10serviceops, 10vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Kormat) [10:54:35] (03CR) 10Vgutierrez: ATS: add metric 
trafficserver_tls_client_total_time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: 10Ema) [10:54:53] (03CR) 10Elukey: [C: 03+1] aptrepo: Add missing rocm38 deps [puppet] - 10https://gerrit.wikimedia.org/r/635279 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:55:38] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:48] RECOVERY - MariaDB Replica Lag: s6 on db1139 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:56:21] 10Operations, 10Cloud-Services: Icinga alert for labnet1001 for conntrack saturation graphite check - https://phabricator.wikimedia.org/T101980 (10Marostegui) 05Open→03Resolved a:03Marostegui This host is no more: {T221818} [10:56:38] (03CR) 10Klausman: [C: 03+2] aptrepo: Add missing rocm38 deps [puppet] - 10https://gerrit.wikimedia.org/r/635279 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [10:58:19] 10Operations, 10Analytics-Clusters: Rename an-scheduler1001 to an-coord1002 - https://phabricator.wikimedia.org/T265620 (10Marostegui) p:05Triage→03Medium [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201020T1100). [11:00:05] Lucas_WMDE and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:11] o/ [11:00:18] still in a meeting for a few minutes [11:04:42] o/ okay I’m here now [11:04:46] O/ [11:04:56] o/ even [11:05:18] (03PS2) 10Lucas Werkmeister (WMDE): Remove noratelimit from Wikidata bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634938 (https://phabricator.wikimedia.org/T258354) [11:05:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove noratelimit from Wikidata bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634938 (https://phabricator.wikimedia.org/T258354) (owner: 10Lucas Werkmeister (WMDE)) [11:06:30] (03Merged) 10jenkins-bot: Remove noratelimit from Wikidata bot group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634938 (https://phabricator.wikimedia.org/T258354) (owner: 10Lucas Werkmeister (WMDE)) [11:06:52] pulled to mwdebug2001, testing [11:07:40] looks fine, no unexpected changes in the $wgGroupPermissions per eval.php [11:07:46] testing a manual edit [11:08:13] all seems to work fine [11:09:56] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:634938|Remove noratelimit from Wikidata bot group (T258354)]] (duration: 00m 56s) [11:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:04] T258354: remove noratelimit from bot group for Wikidata - https://phabricator.wikimedia.org/T258354 [11:10:16] 10Operations, 10SRE-swift-storage: ms-be2023 unresponsive while rebuilding one disk - https://phabricator.wikimedia.org/T185306 (10Marostegui) 05Open→03Resolved Unlikely it can be reproduced again, closing! 
Reopen if you feel it still needs work [11:10:18] (03PS6) 10Lucas Werkmeister (WMDE): Set Wikidata MF to collapse sections by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634039 (https://phabricator.wikimedia.org/T239195) (owner: 10Itamar Givon) [11:10:38] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set Wikidata MF to collapse sections by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634039 (https://phabricator.wikimedia.org/T239195) (owner: 10Itamar Givon) [11:10:59] (03PS1) 10Klausman: analytics_cluster: Revert rocm38 change for an-worker1101 [puppet] - 10https://gerrit.wikimedia.org/r/635282 (https://phabricator.wikimedia.org/T264408) [11:11:22] (03Merged) 10jenkins-bot: Set Wikidata MF to collapse sections by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634039 (https://phabricator.wikimedia.org/T239195) (owner: 10Itamar Givon) [11:11:50] (03CR) 10Lucas Werkmeister (WMDE): SearchSatisfaction: Set isAnon field (031 comment) [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635030 (https://phabricator.wikimedia.org/T259250) (owner: 10Phuedx) [11:12:02] testing on mwdebug2001 [11:12:35] (03CR) 10Klausman: [C: 03+2] analytics_cluster: Revert rocm38 change for an-worker1101 [puppet] - 10https://gerrit.wikimedia.org/r/635282 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman) [11:13:38] I think it’s working, syncing [11:15:13] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:634039|Set Wikidata MF to collapse sections by default (T239195)]] (duration: 00m 56s) [11:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:19] T239195: Wikidata.org: On mobile, sections stays expanded even it's disabled in settings - 
https://phabricator.wikimedia.org/T239195 [11:15:46] okay, I think I’m done with my changes [11:15:59] (03CR) 10Phuedx: SearchSatisfaction: Set isAnon field (031 comment) [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635030 (https://phabricator.wikimedia.org/T259250) (owner: 10Phuedx) [11:17:19] 10Operations, 10Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432 (10Marostegui) 05Open→03Resolved Per the last two comments, looks like this is fixed. [11:17:29] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] SearchSatisfaction: Set isAnon field (031 comment) [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635030 (https://phabricator.wikimedia.org/T259250) (owner: 10Phuedx) [11:18:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "(I verified that a backport to wmf.14 is not necessary because the master change was merged before the wmf.14 branch cut.)" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635030 (https://phabricator.wikimedia.org/T259250) (owner: 10Phuedx) [11:19:57] phuedx: will you be able to test your change on mwdebug? [11:20:38] (03Merged) 10jenkins-bot: SearchSatisfaction: Set isAnon field [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635030 (https://phabricator.wikimedia.org/T259250) (owner: 10Phuedx) [11:20:47] Lucas_WMDE: Possibly. The instrumentation has a sample size of 1% [11:21:09] ... and there's currently no way of enabling it for this kind of testing [11:21:34] okay, the change should be live on mwdebug2001 now [11:23:44] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Good starting point, inline comments" (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [11:27:43] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1020 - firmware upgrade: (was: host went down) - https://phabricator.wikimedia.org/T234698 (10Marostegui) Is this firmware upgrade still needed? [11:29:58] phuedx: any luck so far? [11:30:11] Lucas_WMDE: Unfortunately not [11:30:31] I am confident in the change but it's always good to verify [11:30:50] 10Operations, 10ops-eqiad, 10decommission-hardware: Return sulfur to spares - https://phabricator.wikimedia.org/T224475 (10Marostegui) [11:30:52] how bad would it be if it didn’t work? is it likely to break horribly? [11:31:20] the change looks fairly harmless to me so I’d also be okay with just syncing it, I think [11:31:33] (afk for 2min) [11:32:04] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:43] (back) [11:33:03] Lucas_WMDE: Nothing would break horribly if the change didn't work. We'd see `null` in the associated Hive table [11:33:13] ok then I’ll just sync it [11:33:37] unless you want to wait longer [11:34:04] 10Operations: Why do we have 2 sets of squid proxies? - https://phabricator.wikimedia.org/T254011 (10Marostegui) @Dzahn good to close after Alex's answer? [11:34:07] scap running now [11:34:27] Alright. I've tried about 30x to get sampled. 
I'll be sure to review the change to add the testing querystring parameter :) [11:34:30] (03PS2) 10Matthias Mullie: [cirrus] cleanup mediasearch commons A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634991 (owner: 10DCausse) [11:35:02] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/WikimediaEvents/: Backport: [[gerrit:635030|SearchSatisfaction: Set isAnon field (T259250)]] (duration: 00m 57s) [11:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:08] T259250: A/B test setup for search changes - https://phabricator.wikimedia.org/T259250 [11:35:22] Thanks Lucas_WMDE [11:35:27] np [11:35:41] (03CR) 10Lars Wirzenius: [C: 03+2] Branch commit for wmf/1.36.0-wmf.14 [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635121 (owner: 10TrainBranchBot) [11:35:42] looks like there’s nothing else in the deployment calendar [11:35:54] !log EU backport/config window done [11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:08] !log 1.36.0-wmf.14 was branched at 1b7b5f716015f9303d37158820dadf759e8db707 for T263180 [11:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:14] T263180: 1.36.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T263180 [11:38:01] 10Operations, 10observability: Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084 (10Marostegui) [11:38:04] 10Operations, 10observability, 10User-MoritzMuehlenhoff, 10Wikimedia-Incident: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Marostegui) [11:38:54] 10Operations, 10Icinga, 10SRE-tools, 10observability: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (10Marostegui) p:05Medium→03Low [11:39:12] (03PS1) 10Jbond: facter: ipaddress6 prefer none SLACC addresses [puppet] - 10https://gerrit.wikimedia.org/r/635283 
(https://phabricator.wikimedia.org/T265904) [11:40:15] (03CR) 10jerkins-bot: [V: 04-1] facter: ipaddress6 prefer none SLACC addresses [puppet] - 10https://gerrit.wikimedia.org/r/635283 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [11:41:06] 10Operations, 10serviceops, 10Performance-Team (Radar): Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) - https://phabricator.wikimedia.org/T243149 (10Marostegui) 05Open→03Resolved a:03Marostegui I am going to close this, as there is not much else we can really do here a... [11:41:56] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) >>! In T265904#6560485, @Volans wrote: > Do you think we could trick facter into reporting the non-SLAAC address as primary? > > ` > $ sudo facter -p interface_primary > priv... [11:43:35] 10Operations, 10serviceops: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10Marostegui) @jijiki what do you want to do with this task? [11:50:16] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, and 2 others: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (10Kormat) [11:51:31] (03CR) 10Matthias Mullie: [C: 03+1] [cirrus] cleanup mediasearch commons A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634991 (owner: 10DCausse) [11:51:51] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] puppetdb: don't export metrics for host-specific mbeans [puppet] - 10https://gerrit.wikimedia.org/r/635264 (https://phabricator.wikimedia.org/T265649) (owner: 10Filippo Giunchedi) [11:52:21] 10Operations, 10Cloud-Services, 10observability: Setting up grafana should also setup Anonymous read-only access for the default org - https://phabricator.wikimedia.org/T143556 (10Marostegui) 05Stalled→03Resolved Per our IRC chat, let's close this for now! 
[12:01:16] (03PS1) 10Jcrespo: admin: Change defaults on Jaime's bashrc [puppet] - 10https://gerrit.wikimedia.org/r/635285 [12:03:05] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.14 [core] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635121 (owner: 10TrainBranchBot) [12:04:12] (03PS2) 10Jbond: facter: ipaddress6 prefer none SLACC addresses [puppet] - 10https://gerrit.wikimedia.org/r/635283 (https://phabricator.wikimedia.org/T265904) [12:05:43] (03CR) 10Jcrespo: "CC for the my() function fixes/adaptations originally created by Volans." [puppet] - 10https://gerrit.wikimedia.org/r/635285 (owner: 10Jcrespo) [12:06:01] (03CR) 10Jcrespo: [C: 03+2] admin: Change defaults on Jaime's bashrc [puppet] - 10https://gerrit.wikimedia.org/r/635285 (owner: 10Jcrespo) [12:06:41] (03PS1) 10Marostegui: Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/635038 [12:11:57] (03PS1) 10Lars Wirzenius: testwikis wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635287 [12:11:59] (03CR) 10Lars Wirzenius: [C: 03+2] testwikis wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635287 (owner: 10Lars Wirzenius) [12:12:44] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635287 (owner: 10Lars Wirzenius) [12:12:59] I'm deploying to testwikis (sorry, I'm late, I didn't know I'd be doing train until an hour ago) [12:13:06] 10Operations, 10Patch-For-Review: sshd stretch puppet support - https://phabricator.wikimedia.org/T170298 (10Marostegui) On buster `UsePrivilegeSeparation` is deprecated [12:13:36] !log liw@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.14 [12:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:30] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[12:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:30] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [12:16:30] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [12:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:32] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:34] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [12:24:34] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [12:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:46] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [12:25:46] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[12:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:40] (03CR) 10JMeybohm: Add apache httpd base image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [12:31:32] (03CR) 10Gilles: ATS: add metric trafficserver_tls_client_total_time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: 10Ema) [12:32:02] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:56] 10Operations, 10Patch-For-Review: sshd stretch puppet support - https://phabricator.wikimedia.org/T170298 (10jcrespo) https://www.openssh.com/txt/release-7.5: > This release deprecates the sshd_config UsePrivilegeSeparation > option, thereby making privilege separation mandatory. Privilege > separation has bee... 
[12:36:19] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/633836 (owner: 10Dzahn) [12:37:03] (03CR) 10JMeybohm: [C: 03+1] "PCC still shows no change (https://puppet-compiler.wmflabs.org/compiler1003/25991/)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633835 (owner: 10Dzahn) [12:45:42] 10Operations, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10jbond) I had a look at upstreaming this to facter v3 but I didn't see an obvious fix, and as facter v4 is moving back to ruby I'm not sure it's worth the effort to fix this in facter v3 [12:46:08] (03PS1) 10Jcrespo: ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) [12:46:17] 10Operations, 10netops: Upgrade Routinator 3000 to 0.8.0 - https://phabricator.wikimedia.org/T266001 (10ayounsi) p:05Triage→03Medium [12:47:04] (03CR) 10jerkins-bot: [V: 04-1] ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: 10Jcrespo) [12:48:48] (03CR) 10Jcrespo: "While I uploaded this patch (untested) I think it would be better if we just removed the option, based on https://www.openssh.com/txt/rele" [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: 10Jcrespo) [12:52:41] (03CR) 10Marostegui: [C: 03+2] Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/635038 (owner: 10Marostegui) [12:54:01] 10Operations, 10DBA, 10User-Kormat: orchestrator: integrate promotion rules into puppet - https://phabricator.wikimedia.org/T266002 (10Kormat) [12:54:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 20%: Slowly repool db2125 after checking tables ', diff saved to https://phabricator.wikimedia.org/P13029 and previous config saved
to /var/cache/conftool/dbconfig/20201020-125423-root.json [12:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:41] (03PS2) 10Jcrespo: ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) [12:54:46] (03PS1) 10Elukey: presto: use puppet TLS certificates instead of the self signed ones [puppet] - 10https://gerrit.wikimedia.org/r/635289 (https://phabricator.wikimedia.org/T253957) [12:55:28] 10Operations, 10DBA, 10User-Kormat: orchestrator: integrate promotion rules into puppet - https://phabricator.wikimedia.org/T266002 (10Marostegui) p:05Triage→03Medium It is especially important to specify hosts that should never be masters [12:56:54] 10Operations, 10DBA, 10User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (10Kormat) [13:03:26] (03CR) 10Ayounsi: "a" [puppet] - 10https://gerrit.wikimedia.org/r/635283 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:04:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:05:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:05:57] (03PS2) 10Elukey: presto: use puppet TLS certificates instead of the self signed ones [puppet] - 10https://gerrit.wikimedia.org/r/635289 (https://phabricator.wikimedia.org/T253957) [13:06:30] (03CR) 10Ayounsi: [C: 03+1] "s/slacc/slaac/" [puppet] - 10https://gerrit.wikimedia.org/r/635283 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:07:05] 10Operations, 10DBA, 10User-Kormat: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 (10Kormat) [13:07:14] 10Operations, 10DBA, 10User-Kormat: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 (10Kormat) p:05Triage→03Medium [13:08:59] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6538699, @Gilles wrote: > ` > SELECT event.responsestart - event.fetchstart FROM event.navigationtiming WHERE year = 2020... 
[13:09:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 40%: Slowly repool db2125 after checking tables ', diff saved to https://phabricator.wikimedia.org/P13030 and previous config saved to /var/cache/conftool/dbconfig/20201020-130926-root.json [13:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:20] !log liw@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.14 (duration: 58m 03s) [13:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:52] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [13:13:46] (03PS1) 10Lars Wirzenius: group0 wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635291 [13:13:48] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635291 (owner: 10Lars Wirzenius) [13:14:40] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635291 (owner: 10Lars Wirzenius) [13:15:01] (03PS3) 10Elukey: presto: use puppet TLS certificates instead of the self signed ones [puppet] - 10https://gerrit.wikimedia.org/r/635289 (https://phabricator.wikimedia.org/T253957) [13:16:25] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.14 [13:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:32] (03PS4) 10Elukey: presto: use puppet TLS certificates instead of the self signed ones [puppet] - 10https://gerrit.wikimedia.org/r/635289 (https://phabricator.wikimedia.org/T253957) [13:19:02] !log install routinator 3000 0.8.0 on rpki2001 - T266001 [13:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:07] T266001: Upgrade Routinator 3000 to 0.8.0 - 
https://phabricator.wikimedia.org/T266001 [13:24:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 60%: Slowly repool db2125 after checking tables ', diff saved to https://phabricator.wikimedia.org/P13031 and previous config saved to /var/cache/conftool/dbconfig/20201020-132430-root.json [13:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:56] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25995/" [puppet] - 10https://gerrit.wikimedia.org/r/635289 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [13:25:58] (03PS1) 10Urbanecm: Set originalRequest (incl. X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635039 (https://phabricator.wikimedia.org/T265810) [13:26:32] (03PS1) 10Urbanecm: Set originalRequest (incl. X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635040 (https://phabricator.wikimedia.org/T265810) [13:28:55] (03PS1) 10KartikMistry: Enable ContentTranslation in 5 Wikipedias as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635294 (https://phabricator.wikimedia.org/T264737) [13:29:42] (03PS1) 10Filippo Giunchedi: prometheus: add Pushgateway profile and module [puppet] - 10https://gerrit.wikimedia.org/r/635295 (https://phabricator.wikimedia.org/T249311) [13:29:44] (03PS1) 10Filippo Giunchedi: prometheus: add pushgateway profile [puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) [13:31:11] (03PS3) 10Jbond: facter: ipaddress6 prefer none SLAAC addresses [puppet] - 10https://gerrit.wikimedia.org/r/635283 (https://phabricator.wikimedia.org/T265904) [13:32:00] liw: I see patch for T265994 has been approved, and I'd like to deploy a backport too (see few lines above), should we deploy both "now"? 
[13:32:01] T265994: PHP Warning: mb_stripos(): Empty delimiter - https://phabricator.wikimedia.org/T265994 [13:32:04] (03CR) 10Vgutierrez: [C: 03+1] ATS: add metric trafficserver_tls_client_total_time [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: 10Ema) [13:33:56] Urbanecm, backporting T265994 would be good - if you can do it right now, that'd be OK by me, I've just gotten train to group0 and there's about 90 minutes of the train window left [13:34:28] liw: I'll do it together with my thing - thanks! [13:34:35] Urbanecm, cool, thank you [13:34:36] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be20[58-61] - https://phabricator.wikimedia.org/T265419 (10fgiunchedi) [13:34:38] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-be106[0-3] - https://phabricator.wikimedia.org/T265093 (10fgiunchedi) [13:34:42] 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) [13:34:48] * liw afk for fifteen minutes [13:36:22] (03CR) 10Urbanecm: [C: 03+2] Set originalRequest (incl. X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635039 (https://phabricator.wikimedia.org/T265810) (owner: 10Urbanecm) [13:36:37] (03CR) 10Urbanecm: [C: 03+2] Set originalRequest (incl. 
X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635040 (https://phabricator.wikimedia.org/T265810) (owner: 10Urbanecm) [13:37:22] (03PS1) 10Urbanecm: Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635042 (https://phabricator.wikimedia.org/T265994) [13:37:48] (03PS1) 10Ema: varnish: fix websockets on 6.x [puppet] - 10https://gerrit.wikimedia.org/r/635298 (https://phabricator.wikimedia.org/T264398) [13:38:09] (03PS1) 10Urbanecm: Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635043 (https://phabricator.wikimedia.org/T265994) [13:38:17] (03CR) 10Urbanecm: [C: 03+2] Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635043 (https://phabricator.wikimedia.org/T265994) (owner: 10Urbanecm) [13:38:34] (03CR) 10Urbanecm: [C: 03+2] Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635042 (https://phabricator.wikimedia.org/T265994) (owner: 10Urbanecm) [13:39:30] (03PS1) 10Elukey: presto: remove unused code [puppet] - 10https://gerrit.wikimedia.org/r/635299 (https://phabricator.wikimedia.org/T253957) [13:39:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 80%: Slowly repool db2125 after checking tables ', diff saved to https://phabricator.wikimedia.org/P13032 and previous config saved to /var/cache/conftool/dbconfig/20201020-133933-root.json [13:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:39] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) >>! 
In T264398#6564065, @ema wrote: >>>! In T264398#6538699, @Gilles wrote: >> ` >> SELECT event.responsestart... [13:42:03] (03CR) 10Vgutierrez: [C: 03+1] varnish: fix websockets on 6.x [puppet] - 10https://gerrit.wikimedia.org/r/635298 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [13:42:07] (03CR) 10Elukey: [C: 03+2] "a nice no-op https://puppet-compiler.wmflabs.org/compiler1001/25996/" [puppet] - 10https://gerrit.wikimedia.org/r/635299 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [13:48:51] (03CR) 10Gilles: ATS: add metric trafficserver_tls_client_total_time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: 10Ema) [13:50:16] (03CR) 10Vgutierrez: [C: 04-1] ATS: add metric trafficserver_tls_client_total_time (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635276 (https://phabricator.wikimedia.org/T265869) (owner: 10Ema) [13:54:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2125 (re)pooling @ 100%: Slowly repool db2125 after checking tables ', diff saved to https://phabricator.wikimedia.org/P13033 and previous config saved to /var/cache/conftool/dbconfig/20201020-135436-root.json [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:53] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640 (10fgiunchedi) [13:54:59] 10Operations, 10Patch-Needs-Improvement, 10User-fgiunchedi: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673 (10fgiunchedi) 05Open→03Resolved Yes let's resolve, thanks @Marostegui [13:56:39] (03Merged) 10jenkins-bot: Set originalRequest (incl. 
X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635039 (https://phabricator.wikimedia.org/T265810) (owner: 10Urbanecm) [13:58:59] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10nshahquinn-wmf) If approval from @SBisson's manager is needed, that would be @Arrbee. [14:00:35] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1020 - firmware upgrade: (was: host went down) - https://phabricator.wikimedia.org/T234698 (10fgiunchedi) 05Open→03Declined Not really, host is going to be decom'd soon. [14:00:36] (03PS1) 10Vgutierrez: vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/635302 (https://phabricator.wikimedia.org/T258405) [14:05:49] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:31] (03CR) 10Hashar: [C: 03+1] "I have no idea how systemd timers work. But surely please do ;)" [puppet] - 10https://gerrit.wikimedia.org/r/633857 (owner: 10Dzahn) [14:09:49] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:20] (03CR) 10jerkins-bot: [V: 04-1] Set originalRequest (incl. 
X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635040 (https://phabricator.wikimedia.org/T265810) (owner: 10Urbanecm) [14:11:02] (03CR) 10jerkins-bot: [V: 04-1] Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635042 (https://phabricator.wikimedia.org/T265994) (owner: 10Urbanecm) [14:11:14] damn it, I hate browser tests [14:11:22] (03CR) 10Urbanecm: [C: 03+2] Set originalRequest (incl. X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635040 (https://phabricator.wikimedia.org/T265810) (owner: 10Urbanecm) [14:12:25] (03CR) 10Urbanecm: [C: 03+2] "failed browser test, try again" [extensions/AbuseFilter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635042 (https://phabricator.wikimedia.org/T265994) (owner: 10Urbanecm) [14:13:29] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:20] !log [urbanecm@deploy1001 /srv/mediawiki-staging (master u=)]$ sudo /usr/local/sbin/fix-staging-perms [14:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:38] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/FileImporter/: 5f8d3de14c116b618f5226419082d5c9a07766fb: Set originalRequest (incl. X-Forwarded-For) for remote edits (T265810) (duration: 01m 09s) [14:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:00] liw: status report 1/4 done, because of a failed browser test :/. Waiting for CI again. 
[14:17:50] 10Operations, 10DBA, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10Kormat) [14:21:33] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:11] Urbanecm, ack [14:22:29] (03CR) 10Hashar: [C: 03+1] "Filed removal of mediawiki-extensions.txt as T266024 but that should not block the migration toward systemd timers." [puppet] - 10https://gerrit.wikimedia.org/r/633857 (owner: 10Dzahn) [14:26:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] "aphlict and etherpad have had their websocket support broken recently, this would explain it. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/635298 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [14:27:05] 10Operations, 10serviceops: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10jijiki) 05Open→03Resolved a:03jijiki Resolve it since it has not been updated for so long :) [14:31:31] (03Merged) 10jenkins-bot: Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635043 (https://phabricator.wikimedia.org/T265994) (owner: 10Urbanecm) [14:31:52] 10Operations, 10Traffic, 10observability: prometheus-varnish-exporter@frontend.service: Unit entered failed state - invalid character 'C' - https://phabricator.wikimedia.org/T203191 (10ema) 05Open→03Resolved a:03ema The following now returns nothing: ` cumin 'A:cp' 'journalctl -u prometheus-varnish-e... 
[14:33:15] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:36] (03CR) 10Ema: [C: 03+2] varnish: fix websockets on 6.x [puppet] - 10https://gerrit.wikimedia.org/r/635298 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [14:33:55] (03PS10) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [14:34:57] (03CR) 10Jbond: "Thanks updated" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [14:38:43] PROBLEM - Disk space on sretest1001 is CRITICAL: DISK CRITICAL - /var/lib/docker/overlay2/f17bd20f2acde2f8f40c9a6364472e317d93b16b0206be825278eb3558751c83/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [14:41:04] (03Merged) 10jenkins-bot: Prevent uncaught warnings/exception on Special:AbuseFilter [extensions/AbuseFilter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635042 (https://phabricator.wikimedia.org/T265994) (owner: 10Urbanecm) [14:41:07] (03Merged) 10jenkins-bot: Set originalRequest (incl. X-Forwarded-For) for remote edits [extensions/FileImporter] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635040 (https://phabricator.wikimedia.org/T265810) (owner: 10Urbanecm) [14:48:03] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/FileImporter/: 5eee9b773338e5181867cabec9faefbdeacf67ca: Set originalRequest (incl. 
X-Forwarded-For) for remote edits (T265810) (duration: 01m 06s) [14:48:06] (03PS1) 10Ssingh: dnsdist: set and increase the value of setMaxTCPClientThreads [puppet] - 10https://gerrit.wikimedia.org/r/635309 [14:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:26] (03CR) 10Jbond: sretest: Experiment with preserving docker rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [14:55:57] (03PS2) 10Ottomata: Remove eventlogging-valid-mixed output for eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/634317 (https://phabricator.wikimedia.org/T265651) [14:56:46] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/AbuseFilter/includes/Views/AbuseFilterViewList.php: 00ef00f59fd2a7a1366161ccc66c260be20e3e50: Prevent uncaught warnings/exception on Special:AbuseFilter (T265994) (duration: 01m 01s) [14:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:53] T265994: PHP Warning: mb_stripos(): Empty delimiter - https://phabricator.wikimedia.org/T265994 [14:58:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:58:31] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/AbuseFilter/includes/Views/AbuseFilterViewList.php: fee2d3be13ae14d7ea51ff2db42090a1c27819bf: Prevent uncaught warnings/exception on Special:AbuseFilter (T265994) (duration: 01m 03s) [14:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:03] liw: done :) [14:59:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:59:37] Urbanecm, thank you kindly [14:59:57] No problem [15:00:04] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25998/eventlog1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/634317 (https://phabricator.wikimedia.org/T265651) (owner: 10Ottomata) [15:00:19] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/25997/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635309 (owner: 10Ssingh) [15:00:57] liw: the logspam issue seems to be fixed when I tried at mwdebug, didn't look at the real prod data though. [15:01:16] Urbanecm, ack [15:01:45] Urbanecm, if not, we can file another task, not worried about that now [15:02:02] 10Operations, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10JMeybohm) [15:03:29] is there a working dashboard for job queue backlog/processing rate? [15:03:55] Nikerabbit, do you mean https://integration.wikimedia.org/zuul/ ?
[15:04:06] liw: mediawiki jobs [15:04:54] then I don't know [15:05:23] there is one in grafana, but everything except insert rates seems broken [15:05:53] (03PS2) 10BryanDavis: Create wiki replica views for MachineVision extension tables [puppet] - 10https://gerrit.wikimedia.org/r/623775 (https://phabricator.wikimedia.org/T238574) (owner: 10Cparle) [15:07:13] (03CR) 10RLazarus: gerrit: replace cron jobs with systemd timers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/633857 (owner: 10Dzahn) [15:07:58] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:08:30] (03CR) 10Nuria: [C: 03+1] Remove eventlogging-valid-mixed output for eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/634317 (https://phabricator.wikimedia.org/T265651) (owner: 10Ottomata) [15:09:10] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:11:30] (03PS2) 10Razzi: stats: switch analytics sites to use Envoy on port 8443 [puppet] - 10https://gerrit.wikimedia.org/r/634669 (https://phabricator.wikimedia.org/T240439) [15:11:38] (03CR) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [15:13:39] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime [15:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:13] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm) [15:15:16] (03PS1) 10Andrew Bogott: wmcs-backup-images.py: use admin rather than observer credentials [puppet] - 10https://gerrit.wikimedia.org/r/635312 (https://phabricator.wikimedia.org/T265843) [15:15:38] (03PS1) 10DCausse: [cirrus] A/B test perfield build on
spaceless languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635313 (https://phabricator.wikimedia.org/T266027) [15:15:39] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup-images.py: use admin rather than observer credentials [puppet] - 10https://gerrit.wikimedia.org/r/635312 (https://phabricator.wikimedia.org/T265843) (owner: 10Andrew Bogott) [15:23:31] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) Here's the same data collected with commands like the following (using `Process`), over a 30 minute period. ` varnishlog -n frontend... [15:25:41] (03PS4) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [15:26:02] (03CR) 10jerkins-bot: [V: 04-1] get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) (owner: 10ArielGlenn) [15:26:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:56] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns recursor: allow querys from extra CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/635314 (https://phabricator.wikimedia.org/T261724) [15:27:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:27:31] (03PS5) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [15:32:10] (03PS2) 10Arturo Borrero Gonzalez: openstack: pdns recursor: allow querys from extra CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/635314 (https://phabricator.wikimedia.org/T261724) [15:34:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:34:48] 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10jcrespo) @Marostegui 2 questions: * When you said: > the disk went full only full in activity, not on disk space, right? * You only saw this happen once on... [15:34:58] (03CR) 10Andrew Bogott: [C: 03+1] "If the pcc likes this then it's fine with me!" [puppet] - 10https://gerrit.wikimedia.org/r/635314 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:35:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:35:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26004/" [puppet] - 10https://gerrit.wikimedia.org/r/635314 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:36:51] (03PS3) 10Arturo Borrero Gonzalez: openstack: pdns recursor: allow querys from extra CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/635314 (https://phabricator.wikimedia.org/T261724) [15:38:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: pdns recursor: allow querys from extra CIDRs [puppet] - 10https://gerrit.wikimedia.org/r/635314 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [15:44:29] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [15:44:57] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, and 2 others: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10jijiki) [15:50:02] 10Operations, 10Wikimedia-Mailing-lists: Bot unable to send messages to wikipedia-fr-wikimag - https://phabricator.wikimedia.org/T265844 (10Orlodrim) I tried to send the issue of the week in plain text format, and it worked. This confirms that the e-mail is filtered due to its content (at least partially). Is... 
[15:52:59] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@629e8bc]: search satisfaction: remove unused y/m/d cli args [15:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:30] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@629e8bc]: search satisfaction: remove unused y/m/d cli args (duration: 01m 31s) [15:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:57:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:37] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns recursor: allow monitoring hosts in the ACL [puppet] - 10https://gerrit.wikimedia.org/r/635317 [16:00:05] jbond42 and cdanis: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201020T1600). Please do the needful. 
[16:00:11] 10Operations, 10Platform Engineering, 10serviceops: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki) [16:00:14] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [16:01:30] 10Operations, 10Platform Engineering, 10serviceops, 10User-jijiki: Upgrade MediaWiki's Redis cluster to Debian Buster - https://phabricator.wikimedia.org/T265643 (10jijiki) [16:03:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26005/" [puppet] - 10https://gerrit.wikimedia.org/r/635317 (owner: 10Arturo Borrero Gonzalez) [16:05:18] 10Operations, 10DBA: Puppetize grants for mysql hosts that are the source of recovery (dbstore, passive misc) - https://phabricator.wikimedia.org/T111929 (10jcrespo) @LSobanski I think Manuel and/or I requested to document what grants are needed to setup a backup host. The problems is there is no good way to... [16:05:22] 10Operations, 10Traffic: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10BBlack) [16:05:59] 10Operations, 10Traffic: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10BBlack) p:05Triage→03Medium [16:06:13] (03PS1) 10BBlack: VCL: use hfm for large_objects_cutoff [puppet] - 10https://gerrit.wikimedia.org/r/635318 (https://phabricator.wikimedia.org/T266040) [16:07:45] 10Operations, 10DBA: Puppetize grants for mysql hosts that are the source of recovery (dbstore, passive misc) - https://phabricator.wikimedia.org/T111929 (10jcrespo) In other works this is a subtask of bigger issue T146149, specific to the backup-related hosts. 
[16:08:34] (03PS2) 10BBlack: VCL: use hfm for large_objects_cutoff [puppet] - 10https://gerrit.wikimedia.org/r/635318 (https://phabricator.wikimedia.org/T266040) [16:09:22] (03PS1) 10Elukey: Add Thanos Swift endpoints to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/635319 (https://phabricator.wikimedia.org/T246004) [16:10:07] (03PS2) 10Elukey: Add Thanos Swift endpoints to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/635319 (https://phabricator.wikimedia.org/T246004) [16:10:43] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1587080795(60.833333333333336gb), commonswiki_file_1595354515(54.510416666666664gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [16:14:25] dcausse, ryankemper --^ [16:14:38] (probably already known) [16:14:46] elukey: thanks, yes this is known :/ [16:15:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:17:02] 10Operations, 10Cloud-Services, 10Traffic, 10cloud-services-team (Kanban): cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10nskaggs) p:05Triage→03Medium [16:17:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:17:58] 10Operations, 10Advanced-Search, 10Discovery-Search, 10Traffic, and 3 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (10jcrespo) 05Open→... 
[16:22:22] 10Operations, 10Traffic, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10RLazarus) [16:22:27] 10Operations, 10Wikidata, 10serviceops: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10RLazarus) [16:23:12] (03CR) 10Effie Mouzeli: Add apache httpd base image (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634924 (https://phabricator.wikimedia.org/T265324) (owner: 10Giuseppe Lavagetto) [16:28:56] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10BBlack) I stumbled on T266040 while looking at something unrelated, but now I'm remembering that earlier in this ticket, there was some menti... [16:31:58] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10aborrero) That's fair. I will try proposing a new date tomorrow. 
[16:34:53] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10aborrero) New proposed date: 2020-11-03, [16:35:45] (03CR) 10Ottomata: [C: 03+1] Add Thanos Swift endpoints to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/635319 (https://phabricator.wikimedia.org/T246004) (owner: 10Elukey) [16:44:01] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [16:45:41] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack [16:47:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 49.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:50:22] (03PS1) 10Cicalese: Components: Handle missing special pages [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635329 (https://phabricator.wikimedia.org/T266021) [16:51:15] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 74.03 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:53:15] (03PS1) 10Arturo Borrero Gonzalez: openstack: pdns recursor: allow openstack control plane to query the server [puppet] - 10https://gerrit.wikimedia.org/r/635321 [16:54:37] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) 
[16:54:56] 10Operations, 10Data-Persistence-Backup, 10observability, 10Goal, and 2 others: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) 05Open→03Resolved I am going to consider this resolved- there is monitoring, and we have a dashboard and tooling for it (command line an... [16:55:19] (03PS2) 10DannyS712: Components: Handle missing special pages [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635329 (https://phabricator.wikimedia.org/T266021) (owner: 10Cicalese) [16:58:12] (03PS2) 10Arturo Borrero Gonzalez: openstack: pdns recursor: allow openstack control plane to query the server [puppet] - 10https://gerrit.wikimedia.org/r/635321 [17:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201020T1700). [17:01:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/26007/" [puppet] - 10https://gerrit.wikimedia.org/r/635321 (owner: 10Arturo Borrero Gonzalez) [17:03:48] 10Operations, 10observability, 10Patch-For-Review, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10bd808) [17:06:43] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:08:50] elukey: hmm, that critical should not have gone off earlier, since the new thresholds (80GB warn 100 GB critical) should be deployed now [17:08:59] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) Seeing that for some reason in my 30 minute test 
cp3054 was getting significantly more miss and pass requests than cp3052, I've just... [17:09:01] re >PROBLEM - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1587080795(60.833333333333336gb), commonswiki_file_1595354515(54.510416666666664gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [17:12:33] (03PS1) 10Catrope: StartEditingDialog: Prevent scrolling in non-modal mode [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635330 (https://phabricator.wikimedia.org/T265751) [17:12:57] (03PS1) 10Catrope: Show homepage discovery popup in variant C/D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635331 (https://phabricator.wikimedia.org/T265754) [17:15:10] 10Operations, 10Data-Persistence-Backup: Setup an Offsite backup infrastructure - https://phabricator.wikimedia.org/T85278 (10jcrespo) This is most likely delayed to Q3 or even if we setup an alternative backup method to bacula. [17:21:37] 10Operations, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) ES are backed up, but currently only locally. We need to finish the cross-dc backup, hopfully on Q3. 
[17:21:39] (03PS1) 10Cmjohnson: Adding new mac address for cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/635324 (https://phabricator.wikimedia.org/T243414) [17:21:48] 10Operations, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) p:05High→03Medium [17:24:00] (03CR) 10Cmjohnson: [C: 03+2] Adding new mac address for cloudvirt1013 [puppet] - 10https://gerrit.wikimedia.org/r/635324 (https://phabricator.wikimedia.org/T243414) (owner: 10Cmjohnson) [17:25:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Cmjohnson) [17:25:38] ACKNOWLEDGEMENT - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1587080795(60.833333333333336gb), commonswiki_file_1595354515(54.510416666666664gb) Ryan Kemper this was supposed to have been fixed looking into if a value was missed in the puppet repo somewhere https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [17:26:05] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: total VRPs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:26:14] 10Operations, 10DBA, 10Sustainability (Incident Followup), 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) As a last comment, I thought at first it was 1, but after some analysis, I believe there are more cha... 
[17:26:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Cmjohnson) @Andrew The server didn't have to move locations, added 10G cables, fixed the network switch, updated dhcpd file with new mac address. Veri... [17:27:51] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) Comparing percentages for that 30-minute test, which was only looking at hit-front/hit-local/miss/pass for requests to /wiki/ URLs (a... [17:28:19] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 381 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:29:59] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 50 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:35:54] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [17:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:21] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) I've found the dashboard for total objects, and it seems like as many objects are stored now as there were before the Varnish 6 deplo... 
[17:37:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:22] (03PS1) 10Andrew Bogott: cloudvirt1020: update nic names for Buster [puppet] - 10https://gerrit.wikimedia.org/r/635325 (https://phabricator.wikimedia.org/T263677) [17:38:52] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1020: update nic names for Buster [puppet] - 10https://gerrit.wikimedia.org/r/635325 (https://phabricator.wikimedia.org/T263677) (owner: 10Andrew Bogott) [17:44:03] 10Operations, 10Traffic, 10netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (10Nemo_bis) > We'll prepare at least a lightweight incident report in the coming days. Did this happen? I couldn't find it. Sorry if I looked in the wrong places. (... [17:44:10] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:46:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:47:35] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Mholloway) [17:47:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:48:48] !log depooling mw2328 - T266052 [17:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:55] T266052: Interface 'MediaWiki\EditPafe\IEditObject' not found - https://phabricator.wikimedia.org/T266052 [17:50:03] Thanks effie [17:50:04] (03PS1) 10Andrew Bogott: Revert "cloudvirt1020: update nic names for Buster" [puppet] - 10https://gerrit.wikimedia.org/r/635346 (https://phabricator.wikimedia.org/T263677) [17:50:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "cloudvirt1020: update nic names for Buster" [puppet] - 10https://gerrit.wikimedia.org/r/635346 (https://phabricator.wikimedia.org/T263677) (owner: 10Andrew Bogott) [17:51:18] (03PS2) 10Andrew Bogott: Revert "cloudvirt1020: update nic names for Buster" [puppet] - 10https://gerrit.wikimedia.org/r/635346 (https://phabricator.wikimedia.org/T263677) [17:52:05] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloudvirt1020: update nic names for Buster" [puppet] - 10https://gerrit.wikimedia.org/r/635346 (https://phabricator.wikimedia.org/T263677) (owner: 10Andrew Bogott) [17:52:07] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10CDanis) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201020T1800) [18:08:51] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) >>! In T265490#6562801, @Pablo-WMDE wrote: > Works for me now >>! In T265490#6563057, @hashar wrote: > I guess raising `commitRateLimiting` addressed it. Yay, thanks! If peo... 
[18:10:42] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) Let's keep it open for a few more days I guess. Reports still welcome if it's gone or still happening for others during certain events. [18:13:37] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) >>! In T165885#3397920, @Kelson wrote: > If I have a look to the solution Mozilla has implemented, this sounds quite trivial... [18:15:42] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1412 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:17:48] (03CR) 10Ryan Kemper: [C: 03+2] Revert "cirrus: temporarily disable saneitizer" [puppet] - 10https://gerrit.wikimedia.org/r/635047 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [18:17:57] (03CR) 10Ryan Kemper: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/26008/" [puppet] - 10https://gerrit.wikimedia.org/r/635047 (https://phabricator.wikimedia.org/T263073) (owner: 10Ebernhardson) [18:19:30] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:19:58] (03CR) 10Dzahn: "Thank you very much, I had noticed aphlict (Phabricator realtime notifications) stopped working and the reason was that the upgrade to web" [puppet] - 10https://gerrit.wikimedia.org/r/635298 (https://phabricator.wikimedia.org/T264398) (owner: 10Ema) [18:21:12] RECOVERY - 
recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:21:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) a:05Cmjohnson→03Andrew Thanks! I'll re-image and see what I can see. [18:23:08] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) >>! In T265490#6543576, @mmodell wrote: > Phabricator's websockets have also recently stopped working. May be related? Probably not related but Aphlict has been fixed now an... [18:25:30] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:25:58] (03PS1) 10Jbond: new module: debian [puppet] - 10https://gerrit.wikimedia.org/r/635356 [18:26:24] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:26:28] (03CR) 10jerkins-bot: [V: 04-1] new module: debian [puppet] - 10https://gerrit.wikimedia.org/r/635356 (owner: 10Jbond) [18:27:14] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:29:52] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article 
title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:29:55] (03PS2) 10Jbond: new module: debian [puppet] - 10https://gerrit.wikimedia.org/r/635356 [18:30:56] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:31:18] (03CR) 10Jbond: "I have created a new CR that introduces a new `debian` module which implements some of the ideas proposed in this CR" [puppet] - 10https://gerrit.wikimedia.org/r/626723 (owner: 10Jbond) [18:31:20] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:36] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:31:36] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:31:43] (03PS8) 10Jbond: wmflib::debian::version: update the os_version [puppet] - 10https://gerrit.wikimedia.org/r/626723 [18:32:25] (03PS3) 10Jbond: new module: debian [puppet] - 10https://gerrit.wikimedia.org/r/635356 [18:32:38] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:34:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:44] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:52] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:36:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:36:47] (03PS4) 10Jbond: new module: debian [puppet] - 10https://gerrit.wikimedia.org/r/635356 [18:37:38] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:34] PROBLEM - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:06] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:43:26] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:07] (03PS3) 10Dzahn: etherpad: explicitly use the colibris skin and default variant [puppet] - 10https://gerrit.wikimedia.org/r/635087 [18:45:38] (03CR) 10Dzahn: etherpad: explicitly use the colibris skin and default variant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [18:46:18] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:46:37] (03CR) 10Dzahn: "> Patch Set 2: Code-Review-1" (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [18:46:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:46:58] ACKNOWLEDGEMENT - Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper wdqs updater failed investigating why https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:58] ACKNOWLEDGEMENT - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper wdqs updater failed investigating why https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:46:58] ACKNOWLEDGEMENT - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ryan Kemper wdqs updater failed investigating why https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:12] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:02] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:30] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:23] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:53:06] (03CR) 10Dzahn: [C: 03+2] etherpad: explicitly use the colibris skin and default variant [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [18:53:12] (03PS4) 10Dzahn: etherpad: explicitly use the colibris skin and default variant [puppet] - 10https://gerrit.wikimedia.org/r/635087 [18:53:25] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:09] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:55] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:56] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:15] (03PS1) 10Ppchelko: Bata: enable ParserCache JSON serialization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635359 (https://phabricator.wikimedia.org/T263579) [18:57:27] (03PS5) 10Dzahn: etherpad: explicitly use the colibris skin and default variant [puppet] - 10https://gerrit.wikimedia.org/r/635087 [18:58:32] (03CR) 10Dzahn: [C: 03+2] etherpad: explicitly use the colibris skin and default variant [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [18:58:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] longma and liw: That opportune time is upon us again. Time for a Mediawiki train - American+European Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201020T1900). [19:01:22] (03CR) 10Dzahn: "gaah.. so much about "it's the default" ?:)" [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [19:02:17] (03CR) 10Dzahn: "well.. scratch the last comment.. 
it's gone after hard refresh" [puppet] - 10https://gerrit.wikimedia.org/r/635087 (owner: 10Dzahn) [19:10:53] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:11:28] (03PS2) 10Dzahn: etherpad: activate shortcut keys [puppet] - 10https://gerrit.wikimedia.org/r/635091 [19:13:25] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:21] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:15:57] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:09] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:16:45] (03PS3) 10Dzahn: etherpad: activate shortcut keys [puppet] - 10https://gerrit.wikimedia.org/r/635091 [19:17:09] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:17:37] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:17:37] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:20:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:20:51] 10Operations, 10vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (10razzi) [19:20:53] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:21:21] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:03] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:03] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:15] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:43] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm [19:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:51] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:27:23] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:31] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:27:45] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:28:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={jmx_wdqs_updater,swagger_check_restbase_esams} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:28:11] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:28:30] (03PS2) 10Ppchelko: Beta: enable ParserCache JSON serialization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635359 (https://phabricator.wikimedia.org/T263579) [19:28:57] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 500 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne [19:28:57] status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:00] (03CR) 10Dzahn: [C: 03+2] etherpad: activate shortcut keys [puppet] - 10https://gerrit.wikimedia.org/r/635091 (owner: 10Dzahn) [19:31:23] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:45] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:34:56] (03CR) 10Ppchelko: [C: 03+2] Beta: enable ParserCache JSON serialization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635359 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [19:35:27] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/635094 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [19:35:31] (03Abandoned) 10Dzahn: etherpad: add 'trustProxy' config setting and enable it [puppet] - 10https://gerrit.wikimedia.org/r/635094 (https://phabricator.wikimedia.org/T265490) (owner: 10Dzahn) [19:35:43] (03Merged) 10jenkins-bot: Beta: enable ParserCache JSON serialization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635359 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [19:37:15] (03PS1) 10Andrew Bogott: cloudvirt1013: switch nova to use the 10G nics [puppet] - 10https://gerrit.wikimedia.org/r/635370 (https://phabricator.wikimedia.org/T243414) [19:37:39] (03PS1) 10Catrope: GrowthExperiments: Make variant D the default everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635371 (https://phabricator.wikimedia.org/T265556) [19:37:48] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1013: switch nova to use the 10G nics [puppet] - 10https://gerrit.wikimedia.org/r/635370 (https://phabricator.wikimedia.org/T243414) (owner: 10Andrew Bogott) [19:39:48] (03CR) 10Dzahn: [C: 03+2] cumin: remove hardcoded hostname from comments [puppet] - 10https://gerrit.wikimedia.org/r/635110 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [19:40:45] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:23] PROBLEM - Prometheus jobs reduced 
availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:42:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:45:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,service=canary [19:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:51] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:46:53] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:47:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [19:47:43] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [19:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [19:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:05] RECOVERY - Check systemd state on ldap-replica2003 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:31] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:50:13] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:53] what's the deal with recommendations? [19:53:13] PROBLEM - Check systemd state on ldap-replica2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:10] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 500 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne [19:54:10] status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:00] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:10] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:52] I think I broke beta sites. 
[19:57:18] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:58:06] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:58:20] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:18] (03PS1) 10Razzi: Add an-test-client1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/635379 (https://phabricator.wikimedia.org/T266064) [19:59:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [19:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:28] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:00:36] (03CR) 10Ottomata: [C: 03+1] Add an-test-client1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/635379 (https://phabricator.wikimedia.org/T266064) (owner: 10Razzi) [20:01:05] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10wiki_willy) a:05Jclark-ctr→03RobH Hi @RobH - since John is still out and Chris is knee deep with installs, can you see if you're able to work with HP remotely, in getting a replacement... 
[20:01:32] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:03:40] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:03:40] (03CR) 10Razzi: [C: 03+2] Add an-test-client1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/635379 (https://phabricator.wikimedia.org/T266064) (owner: 10Razzi) [20:04:23] (03PS3) 10Ppchelko: Enable warn+ logging for ParserCache channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) [20:06:04] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:06:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [20:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:40] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:12:20] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:12:30] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 500 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne [20:12:30] status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:24] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:15:34] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:42] PROBLEM - Ensure local MW versions match expected deployment on wtp2020 is CRITICAL: CRITICAL: 131 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:18:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:18:29] ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on wtp2020 is CRITICAL: CRITICAL: 131 mismatched wikiversions daniel_zahn decom https://wikitech.wikimedia.org/wiki/Application_servers [20:18:29] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wtp2020 is CRITICAL: Host wtp2020 is not in mediawiki-installation dsh group daniel_zahn decom https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:18:42] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:18:58] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10CDanis) >>! In T264398#6565366, @Gilles wrote: > I'm not sure how frontend servers are picked to serve requests (hashed by IP? URL?), but thi... [20:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [20:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:14] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:04] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:23:58] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:46] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:25:42] RECOVERY - Disk space on sretest1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=sretest1001&var-datasource=eqiad+prometheus/ops [20:25:49] !log mforns@deploy1001 Started deploy [analytics/refinery@e4d16f0]: Regular analytics weekly train [analytics/refinery@e4d16f08a96b6f65447fcdc6c9e8945724a89f54] [20:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:10] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:26:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) [20:27:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:28:14] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:28:58] Looks like wdqs updater across all instances keeps trying to do its work, hitting an unexpected character and bombing out, then starting over again when systemd eventually restarts the failed unit [20:29:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:18] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:30:40] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:31:07] (03PS3) 10Dzahn: site: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634362 (https://phabricator.wikimedia.org/T265558) [20:31:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [20:31:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) 05Open→03Resolved [20:31:39] 
10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [20:31:49] !log [Temporarily] disabled notifications for all wdqs hosts while we figure out how to unstick the updater process. Impact is that new updates will be delayed, but queries will still keep serving as normal, so fixing this is a priority but note that there's no availability outage [20:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:59] !log mforns@deploy1001 Finished deploy [analytics/refinery@e4d16f0]: Regular analytics weekly train [analytics/refinery@e4d16f08a96b6f65447fcdc6c9e8945724a89f54] (duration: 08m 10s) [20:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:31] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:37] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 500 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne [20:34:37] status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:36:09] 10Operations, 10serviceops, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Dzahn) parsoid: WIP in 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/634383 / T257906 [20:36:47] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:37:51] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:38:08] (03PS1) 10Ppchelko: Enable ParserCache JSON serialization on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635382 (https://phabricator.wikimedia.org/T263579) [20:39:45] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:39:49] !log doing some manual testing on mw2221, depooled and puppet disabled [20:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:46] 10Operations, 10Gerrit, 10Phabricator, 10Traffic, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655 (10Dzahn) I think this is done meanwhile. Both Phabricator and Gerrit do not show generic 503 error... 
[20:41:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/26010/" [puppet] - 10https://gerrit.wikimedia.org/r/634362 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [20:43:25] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_restbase_esams} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:44:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:45:04] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10jijiki) [20:45:06] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) 05Stalled→03Open [20:45:09] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10jijiki) [20:45:11] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10jijiki) [20:45:44] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10jijiki) [20:46:00] 10Operations, 10Platform Engineering, 10serviceops, 10Performance-Team (Radar), and 2 others: Upgrade memcached cluster to Debian Stretch/Buster - 
https://phabricator.wikimedia.org/T213089 (10jijiki) [20:46:27] (03PS2) 10Dzahn: remove wtp2001.codfw.wmnet through wtp2020.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/634129 (https://phabricator.wikimedia.org/T265558) [20:46:50] (03PS3) 10Dzahn: remove wtp2001.codfw.wmnet through wtp2020.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/634129 (https://phabricator.wikimedia.org/T265558) [20:47:30] (03CR) 10Dzahn: [C: 03+2] remove wtp2001.codfw.wmnet through wtp2020.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/634129 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [20:49:25] (03PS1) 10Ayounsi: Update AssignIPs to handle switch port and cable [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635385 (https://phabricator.wikimedia.org/T265339) [20:51:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:52:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:53:05] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:54:46] (03PS6) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [20:56:54] !log mforns@deploy1001 Started deploy [analytics/refinery@e4d16f0] (thin): Regular analytics weekly train THIN [analytics/refinery@e4d16f08a96b6f65447fcdc6c9e8945724a89f54] [20:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:02] !log mforns@deploy1001 Finished deploy [analytics/refinery@e4d16f0] (thin): Regular analytics weekly train THIN [analytics/refinery@e4d16f08a96b6f65447fcdc6c9e8945724a89f54] (duration: 00m 08s) [20:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:21] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:00:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:00:49] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10wiki_willy) a:05wiki_willy→03Papaul Thanks @Dzahn . 
Just a quick reminder to add the "DC-Ops" and "ops-codfw" project tags, once it's... [21:01:09] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:01:28] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Dzahn) Ah, yes, i meant to do that but was an oversight. Thanks! [21:02:33] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Dzahn) [21:03:13] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:03:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:03:52] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 (10Dzahn) [21:04:23] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:06:53] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:07:26] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) [21:08:29] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:11:02] (03CR) 10Dzahn: [C: 03+1] "While this is an existing group on a new role it seems obvious the intention has always been that this VM was for the security-team to use" [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [21:12:01] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) This task has a number of issues, starting with: * There has been a [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ | hardware troubleshooting form available ]] on... 
[21:12:08] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) a:RobH→jcrespo
[21:13:16] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH)
[21:18:09] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:18:46] !log ✔️ cdanis@mw2252.codfw.wmnet ~ 🕠🍺 sudo depool
[21:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:19] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[21:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:45] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[21:21:23] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:23:47] Operations: Why do we have 2 sets of squid proxies? - https://phabricator.wikimedia.org/T254011 (Dzahn) Open→Resolved a:Dzahn Thanks for the detailed answer @akosiaris Up to the "different config" part I would have thought about unifying them. But if there are differences in use cases as you ex...
[21:28:54] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) I'm waiting on the very slow HPE site upload to parse the AHS file I downloaded for this, and I also noticed that via https interface (https://db1139.mgmt.eqiad.wmnet/) that it has a...
[21:30:01] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 5) is the actual failure from the log. Once the HPE site parses, I'll try to get a new dimm dispatched. They will likel...
[21:35:18] Operations: Why do we have 2 sets of squid proxies? - https://phabricator.wikimedia.org/T254011 (Dzahn) Updated docs: https://wikitech.wikimedia.org/w/index.php?title=Obsolete%3ASquids&type=revision&diff=1885659&oldid=1820051 https://wikitech.wikimedia.org/w/index.php?title=HTTP_proxy&type=revision&diff=188...
[21:54:28] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) Case ID: 5350976764 opened, requesting a new mainboard and any/all migration directions to be dispatched to eqiad to @Cmjohnson's attention. (He is currently out sick, but is projec...
[21:58:56] (PS4) Jbond: diffscan: switch to new refactored diffscan [puppet] - https://gerrit.wikimedia.org/r/634566
[22:00:32] (CR) Jbond: [C: +1] "This has been running now for the last few days and seems to give the same results as the current script i recommend we merge and fix forw" [puppet] - https://gerrit.wikimedia.org/r/634566 (owner: Jbond)
[22:00:52] (PS5) Jbond: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572
[22:00:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:02:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:08:47] (PS6) Jbond: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572
[22:14:55] (PS7) Jbond: diffscan: pyhotnify [puppet] - https://gerrit.wikimedia.org/r/634572
[22:15:25] (CR) Jbond: "updated: ready for review" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/634572 (owner: Jbond)
[22:29:43] (CR) Dzahn: "I think this is not actually used anywhere.. the classes on registry1001 are the "ha" variant of the role/profile but not this one." [puppet] - https://gerrit.wikimedia.org/r/633835 (owner: Dzahn)
[22:30:44] (CR) Dzahn: "and the cloud docker registry host is broken: https://puppet-compiler.wmflabs.org/compiler1003/26013/toolsbeta-docker-registry-01.toolsbet" [puppet] - https://gerrit.wikimedia.org/r/633835 (owner: Dzahn)
[22:34:46] (PS11) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397)
[22:38:41] Operations, Puppet, cloud-services-team (Kanban): Using $facts['networking']['ip'] breaks puppet on cloud hosts - https://phabricator.wikimedia.org/T266075 (Dzahn)
[22:41:34] (CR) Dzahn: "yea..so ... this is not used in prod but it is used on https://openstack-browser.toolforge.org/puppetclass/role::wmcs::toolforge::docker::" [puppet] - https://gerrit.wikimedia.org/r/633835 (owner: Dzahn)
[22:44:34] (PS2) Dzahn: docker: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/633836
[22:44:46] (PS3) Dzahn: docker: hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/633836
[22:45:27] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26014/" [puppet] - https://gerrit.wikimedia.org/r/633836 (owner: Dzahn)
[22:47:13] (CR) Dzahn: "ack, thanks. it makes sense to me" [puppet] - https://gerrit.wikimedia.org/r/634368 (owner: Dzahn)
[22:47:28] (PS5) Dzahn: puppetmaster: pass $servers parameter to gitclone class [puppet] - https://gerrit.wikimedia.org/r/634368
[22:50:08] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/26015/puppetmaster2001.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/634368 (owner: Dzahn)
[22:58:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201020T2300).
[23:00:04] RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:03:39] (PS12) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397)
[23:08:34] (PS1) Dzahn: docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399
[23:09:23] (CR) Dzahn: "Brooke, do you think these are populated anywhere in cloud?" [puppet] - https://gerrit.wikimedia.org/r/633838 (owner: Dzahn)
[23:09:35] (CR) jerkins-bot: [V: -1] docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[23:09:46] (PS13) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397)
[23:11:18] (CR) Dzahn: "joe: I think this was your TODO from some time ago" [puppet] - https://gerrit.wikimedia.org/r/633853 (owner: Dzahn)
[23:12:10] (PS14) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397)
[23:13:48] (CR) Jbond: "ready for another review" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[23:14:13] (CR) Dzahn: "@BBlack This domain is: registrant: WMNL name servers: WMF Would we add any domains to ncredir that have this kind of status?" [dns] - https://gerrit.wikimedia.org/r/634928 (https://phabricator.wikimedia.org/T257536) (owner: Ladsgroup)
[23:14:36] Operations, DBA, Wikimedia-Site-requests: create script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609 (Dcljr)
[23:20:18] Operations, DBA, Wikimedia-Site-requests: create script & docs to rename wiki databases - https://phabricator.wikimedia.org/T83609 (Dcljr) Took the liberty of changing the task name, to try to clarify what exactly was being discussed here. It would be nice to have a better task description, but maybe...
[23:20:45] (CR) Dzahn: "Lucas' idea to limit it to just at the beginning seems good to me. I don't know if there was a reason to exclude _ though I'll leave that " [puppet] - https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: Ladsgroup)
[23:24:48] (CR) Dzahn: "Not sure we should change it unless we are actually trying to send toolforge.org mail via prod servers? This seems a bit vague." [puppet] - https://gerrit.wikimedia.org/r/619851 (owner: Andrew Bogott)
[23:30:01] (PS2) Dzahn: profile: apply ipsec monitoring where enabled with ipsec_exporter [puppet] - https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: Cwhite)
[23:30:41] (CR) Dzahn: profile: apply ipsec monitoring where enabled with ipsec_exporter (3 comments) [puppet] - https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: Cwhite)
[23:31:43] (CR) Dzahn: "amended to fix jenkins-bot vote" [puppet] - https://gerrit.wikimedia.org/r/632738 (https://phabricator.wikimedia.org/T148976) (owner: Cwhite)
[23:34:07] (PS2) Dzahn: docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399
[23:34:32] (CR) Dzahn: "Now _this_ is the one that is actually used on registry1001 in prod, unlike the previous patch." [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[23:35:10] (CR) jerkins-bot: [V: -1] docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[23:36:21] (PS3) Dzahn: docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399
[23:37:16] (CR) jerkins-bot: [V: -1] docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[23:37:47] (PS4) Dzahn: docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399
[23:39:47] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/633838 (owner: Dzahn)
[23:40:20] (PS5) Dzahn: docker_registry_ha: hiera()->lookup(), add data types [puppet] - https://gerrit.wikimedia.org/r/635399
[23:43:01] (CR) Bstorm: [C: +1] "> Patch Set 5: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/630589 (owner: Jbond)
[23:44:00] (CR) Dzahn: [V: +1] "compiles now: (and because it failed before it also proof registry1001 is actually a host using it)" [puppet] - https://gerrit.wikimedia.org/r/635399 (owner: Dzahn)
[23:48:11] (CR) Dzahn: cassandra: add data types, hiera->lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/634363 (owner: Dzahn)
[23:50:13] (CR) Dzahn: cassandra: add data types, hiera->lookup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/634363 (owner: Dzahn)
[23:51:48] (CR) Dzahn: cassandra: add data types, hiera->lookup (2 comments) [puppet] - https://gerrit.wikimedia.org/r/634363 (owner: Dzahn)
[23:53:12] (PS2) Dzahn: cassandra: add data types, hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/634363
[23:57:39] (CR) Bstorm: [C: +1] "LGTM" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) (owner: Legoktm)
[23:58:09] (CR) Dzahn: [V: +1] parsoid: add data types (2 comments) [puppet] - https://gerrit.wikimedia.org/r/634385 (owner: Dzahn)
[23:59:59] (PS6) Dzahn: parsoid: add data types [puppet] - https://gerrit.wikimedia.org/r/634385