[00:00:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,webperf_arclamp} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I ♥ Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:05:59] hm, I wonder if the prometheus alert is related [00:06:38] (03PS1) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) [00:11:49] The Prometheus failure for arclamp should be transient, I think it's due to having logs now without any graphs. [00:11:53] (03CR) 10Bstorm: "This code basically works (unless I introduced a mistake between my last test and this version...which is likely).
I am currently storing " [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [00:13:18] ack [00:16:08] 10SRE, 10Kubernetes, 10Release-Engineering-Team (Pipeline): Helm install fails in CI namespace - https://phabricator.wikimedia.org/T273563 (10jeena) [00:19:38] https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-01_23.excimer-buster.index.svgz woot [00:20:47] I'll wait for another hour to actually look at it since that'll be a full hour's worth of data [00:24:35] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [00:27:35] 10SRE, 10Kubernetes, 10Release-Engineering-Team (Pipeline): Helm install fails in CI namespace: apparmor failed to apply profile - https://phabricator.wikimedia.org/T273563 (10jeena) [00:57:04] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) [00:57:18] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) [01:03:56] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10RobH) [01:04:16] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10RobH) [01:04:33] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10RobH) a:03Papaul [01:10:55] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1252061144 and 110 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:11] RECOVERY - Postgres 
Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1048 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:56:38] (03PS1) 10Legoktm: logos: Update nlwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660973 [01:56:40] (03PS1) 10Legoktm: logos: Update eswiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660974 [01:56:42] (03PS1) 10Legoktm: logos: Update ptwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660975 [01:56:44] (03PS1) 10Legoktm: logos: Update ruwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660976 [01:56:46] (03PS1) 10Legoktm: logos: Update svwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660977 [01:56:48] (03PS1) 10Legoktm: logos: Remove TODO for pngout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) [02:04:30] (03PS1) 10Legoktm: logos: Update zhwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660979 [02:07:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.29 [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/660980 [02:07:56] (03PS2) 10DannyS712: Branch commit for wmf/1.36.0-wmf.29 [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/660980 (https://phabricator.wikimedia.org/T271343) (owner: 10TrainBranchBot) [02:33:35] RECOVERY - dump of matomo in eqiad on alert1001 is OK: Last dump for matomo at eqiad (db1108.eqiad.wmnet:3351) taken on 2021-02-02 02:25:41 (1 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:19:17] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:21:01] !log `sudo systemctl restart wdqs-blazegraph` on `wdqs1006` [03:21:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:43] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.073 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:28] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.62`. Pre-deploy tests passing on canary `wdqs1003` [03:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:48] 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10drochford) Apologies for the tardiness @pkang - Following up with Nasma (Ops Manager). Will revert then. [03:29:03] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@ad9db35]: 0.3.62 [03:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:46] !log [WDQS Deploy] Tests passing following deploy of `0.3.62` on canary `wdqs1003`; proceeding to rest of fleet [03:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:02] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@ad9db35]: 0.3.62 (duration: 06m 59s) [03:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:15] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [03:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:20] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [03:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:27] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 
45 && pool'` [03:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:07] PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:55:21] RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:56:01] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f1adc828518: Failed to establish a new connection: [Errno 111] Connection [03:56:01] ://wikitech.wikimedia.org/wiki/Search%23Administration [03:56:05] PROBLEM - Check systemd state on logstash2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:37] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2005 is OK: OK - elasticsearch status production-logstash-codfw: relocating_shards: 0, active_shards_percent_as_number: 100.0, number_of_in_flight_fetch: 0, active_primary_shards: 456, number_of_nodes: 6, initializing_shards: 0, timed_out: False, cluster_name: production-logstash-codfw, task_max_waiting_in_queue_millis: 0, status: green, delayed_unassigned_shards [04:09:37] nding_tasks: 0, number_of_data_nodes: 3, unassigned_shards: 0, active_shards: 862 https://wikitech.wikimedia.org/wiki/Search%23Administration [04:09:37] RECOVERY - Check systemd state on logstash2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:12:18] !log [WDQS Deploy] Deploy complete. 
Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [04:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:21] (03CR) 10Legoktm: [C: 04-1] "Need to handle language variants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660979 (owner: 10Legoktm) [05:06:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:38:45] (03PS1) 10Legoktm: noc: Publicly expose logos/config.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660986 (https://phabricator.wikimedia.org/T273330) [05:41:12] (03CR) 10Legoktm: "The plan is to have the autoprotection bot download config.yaml from noc.wikimedia.org and parse it that way." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660986 (https://phabricator.wikimedia.org/T273330) (owner: 10Legoktm) [06:04:19] (03PS1) 10Marostegui: clouddb*: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/660989 (https://phabricator.wikimedia.org/T267090) [06:06:17] (03CR) 10Marostegui: [C: 03+2] clouddb*: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/660989 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui) [06:12:28] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Marostegui) [06:12:58] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Marostegui) [06:19:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 T266483', diff saved to https://phabricator.wikimedia.org/P14113 and previous config saved to /var/cache/conftool/dbconfig/20210202-062303-marostegui.json [06:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:09] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483 [06:24:01] !log Restart mysql on es1022 [06:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repool es1022 after a restart', diff saved to https://phabricator.wikimedia.org/P14114 and previous config saved to /var/cache/conftool/dbconfig/20210202-063050-root.json [06:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:38] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repool es1022 after a restart', diff saved to https://phabricator.wikimedia.org/P14115 and previous config saved to /var/cache/conftool/dbconfig/20210202-064553-root.json [06:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:49] 10SRE: ping servers running out of disk - https://phabricator.wikimedia.org/T273509 (10MoritzMuehlenhoff) We certainly should automate the removal of obsolete kernels in a better way, but with only 3G on the root partition that would happen again anyway (there will always be two kernels installed in any case), 3... 
[07:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repool es1022 after a restart', diff saved to https://phabricator.wikimedia.org/P14116 and previous config saved to /var/cache/conftool/dbconfig/20210202-070057-root.json [07:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:57] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Joe) @Legoktm ran some tests, analogous to the one we... [07:05:34] ... why does the stack trace on T273242 have to be so unhelpful? I still haven't managed to reproduce :/ [07:05:35] T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242 [07:06:49] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Legoktm) >>! In T273312#6795003, @Joe wrote: > * The... [07:07:35] hmm [07:07:45] we need to know the name of the deferred update that's triggering it [07:08:47] Majavah: how are you trying to reproduce it? [07:09:31] legoktm: given the url suggests it's in FeaturedFeeds, I setup that and memcached locally and am playing around with it and looking at the logs [07:09:46] that's a red herring unfortunately [07:09:56] the bug is in a deferredupdate [07:10:00] which could be from anywhere [07:10:39] https://codesearch.wmcloud.org/deployed/?q=addCallableUpdate&i=nope&files=&excludeFiles=&repos= [07:10:58] so "there's a bug somewhere, good luck finding it"? 
[07:12:06] well it's in WANObjectCache [07:12:42] I think it's https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/libs/objectcache/wancache/WANObjectCache.php#2568 [07:12:56] which matches the stacktrace [07:14:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [07:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:50] (03PS1) 10Urbanecm: SpecialHomepage: Do not load start-startediting if SE aren't enabled [extensions/GrowthExperiments] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/660937 (https://phabricator.wikimedia.org/T273243) [07:16:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repool es1022 after a restart', diff saved to https://phabricator.wikimedia.org/P14117 and previous config saved to /var/cache/conftool/dbconfig/20210202-071602-root.json [07:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:21] (03PS1) 10Marostegui: mariadb: Decommission db1089 [puppet] - 10https://gerrit.wikimedia.org/r/660994 (https://phabricator.wikimedia.org/T273417) [07:18:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1089 [puppet] - 10https://gerrit.wikimedia.org/r/660994 (https://phabricator.wikimedia.org/T273417) (owner: 10Marostegui) [07:19:01] Majavah: yeah I'm totally lost in all the indirection and abstraction [07:19:51] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10Marostegui) a:05Marostegui→03wiki_willy [07:19:53] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10Marostegui) [07:20:06] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10Marostegui) Ready for DC-Ops! 
[07:20:16] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10Marostegui) [07:21:07] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [07:21:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [07:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:37] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1089.eqiad.wmnet` - db1089.eqiad.wmnet (**PASS**) - Downtimed host on Icinga -... [07:28:07] Majavah: I left a comment [07:28:56] legoktm: ty, looking at another blocker atm since I have some idea how to fix it (and possibly caused it too) [07:29:04] those are the best :p [07:30:00] Using PermissionManager in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/651945 already caused last week's reverts, we suspect it's causing more breakage [07:31:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repool es1022 after a restart', diff saved to https://phabricator.wikimedia.org/P14118 and previous config saved to /var/cache/conftool/dbconfig/20210202-073105-root.json [07:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:09] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10MoritzMuehlenhoff) All architectual mitigations are a... 
[07:39:39] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10MoritzMuehlenhoff) In fact, we still have an API serv... [07:45:00] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1405.eqiad.wmnet [07:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:05] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1403.eqiad.wmnet [07:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:48] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1381.eqiad.wmnet [07:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:27] !log depooled mw1381.eqiad.wmnet for perf testing (T273312) [08:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:31] T273312: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 [08:03:14] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/661042/ might solve some train blockers, unfortunately I need to go afk for some time now [08:20:06] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) @herron in T255973 @razzi is moving partitions to new Kafka Jumbo brokers, and the... [08:23:17] (03PS1) 10Hashar: Revert "Bring back jsonevent-layout library" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661048 (https://phabricator.wikimedia.org/T268020) [08:25:04] (03CR) 10Hashar: "Thank you Daniel! 
Can confirm some amount of logs are still digested by logstash and are no more broken than they used to be :]" [puppet] - 10https://gerrit.wikimedia.org/r/660030 (https://phabricator.wikimedia.org/T141324) (owner: 10Hashar) [08:26:39] (03PS1) 10Hashar: Merge tag 'v3.2.7' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661050 [08:29:12] (03PS1) 10Elukey: Decommission an-worker1117 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/661051 (https://phabricator.wikimedia.org/T260411) [08:30:53] !log swift eqiad-prod: add weight back to sdg on ms-be1054 - T273582 [08:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:57] T273582: sdg1 on ms-be1054 is not rebuilding after replacement - https://phabricator.wikimedia.org/T273582 [08:32:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27799/console" [puppet] - 10https://gerrit.wikimedia.org/r/661051 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [08:34:19] (03CR) 10Hashar: [C: 03+2] Revert "Bring back jsonevent-layout library" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661048 (https://phabricator.wikimedia.org/T268020) (owner: 10Hashar) [08:34:50] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.2.7' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661050 (owner: 10Hashar) [08:35:39] (03CR) 10Elukey: [V: 03+1 C: 03+2] Decommission an-worker1117 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/661051 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [08:39:44] 10SRE, 10Analytics-Clusters: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (10fgiunchedi) Thank you @elukey, I don't remember this issue being reported, did the reimage go as expected ? 
If there are other similar hosts to be reimaged/installed we should definitely keep an e... [08:40:05] (03Merged) 10jenkins-bot: Revert "Bring back jsonevent-layout library" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661048 (https://phabricator.wikimedia.org/T268020) (owner: 10Hashar) [08:40:07] (03Merged) 10jenkins-bot: Merge tag 'v3.2.7' into wmf/stable-3.2 [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661050 (owner: 10Hashar) [08:40:31] 10SRE, 10Analytics-Clusters: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (10elukey) >>! In T273412#6795218, @fgiunchedi wrote: > Thank you @elukey, I don't remember this issue being reported, did the reimage go as expected ? If there are other similar hosts to be reimaged... [08:43:15] (03PS3) 10Jbond: 6.3.1: updated ready for 6.3.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 [08:46:36] (03PS4) 10Jbond: 6.3.1: updated ready for 6.3.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 [08:48:27] 10SRE, 10Analytics-Clusters: rsyslog segfault on an-test-presto1001 - https://phabricator.wikimedia.org/T273412 (10elukey) 05Open→03Resolved a:03elukey After a chat with Filippo we concluded that the issue was originated due to the temporary root partition being full (it happened for a bit due to presto... [08:51:17] (03PS2) 10Legoktm: logos: Update zhwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660979 [08:51:19] (03PS1) 10Legoktm: logos: Redo how variants work [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661052 (https://phabricator.wikimedia.org/T98640) [08:52:27] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:55] (03CR) 10Legoktm: "No one had previously compressed the zhwiki-hans logos :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660979 (owner: 10Legoktm) [08:56:34] !log disable DE-CIX codfw peering session [08:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:49] (03PS1) 10Filippo Giunchedi: interfaces: allow setting queues on i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661053 (https://phabricator.wikimedia.org/T271415) [08:58:51] (03PS1) 10Filippo Giunchedi: swift: apply interface::rps to i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415) [09:00:59] (03CR) 10David Caro: [C: 03+1] Revert "dumps: fail over dumps web" [dns] - 10https://gerrit.wikimedia.org/r/660798 (owner: 10Bstorm) [09:01:17] (03CR) 10David Caro: [C: 03+1] Revert "dumps-dist: fail over labstore1006 to 1007" [puppet] - 10https://gerrit.wikimedia.org/r/660799 (owner: 10Bstorm) [09:02:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27800/console" [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [09:06:13] do the requests on T273242 have any other logs? according to the source code it should debug the cache key to the debug channel and knowing that would be really useful [09:06:13] T273242: MemcachedPeclBagOStuff: Serialization of 'Closure' is not allowed - https://phabricator.wikimedia.org/T273242 [09:08:37] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:16] (03PS5) 10Jbond: 6.3.1: updated ready for 6.3.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 [09:23:29] PROBLEM - SSH on ms-be1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:26:58] (03PS1) 10Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) [09:27:04] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Legoktm) >>! In T273312#6795045, @MoritzMuehlenhoff w... [09:27:28] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1381.eqiad.wmnet [09:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:04] (03PS2) 10Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) [09:30:12] (03PS3) 10Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) [09:30:42] (03CR) 10Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [09:33:53] RECOVERY - Check systemd state on ms-be1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:15] (03CR) 10David Caro: "There's a couple questions, you can safely ignore any comment starting with 'nit' xd" (0310 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [09:34:47] RECOVERY - SSH on ms-be1054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:40:32] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Joe) This kind-of seals the deal. Upgrading the kerne... [09:43:38] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Ladsgroup) Maybe mw devs (including yours truly) can... [09:50:04] 10SRE, 10Anti-Harassment, 10SRE-tools, 10netops: Surprising new svc.eqiad.wmnet dns entry deployed: similar-users on host decommission - https://phabricator.wikimedia.org/T273275 (10jcrespo) 05Open→03Resolved a:03hnowlan I don't think there is further actionables here except for @Volans to read this... [09:51:02] (03PS1) 10Hashar: Gerrit v3.2.7 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661056 (https://phabricator.wikimedia.org/T273223) [09:51:50] 10SRE, 10Anti-Harassment, 10SRE-tools, 10netops: Surprising new svc.eqiad.wmnet dns entry deployed: similar-users on host decommission - https://phabricator.wikimedia.org/T273275 (10Volans) @jcrespo I'm aware of this conversation, I just didn't had anything to add as @akosiaris had already gave all the rel... 
[09:51:53] (03CR) 10Jbond: [C: 04-1] "See inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/660950 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [09:53:00] (03CR) 10Hashar: [C: 03+2] "Uploaded to Archiva" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661056 (https://phabricator.wikimedia.org/T273223) (owner: 10Hashar) [09:53:17] (03CR) 10Hashar: [V: 03+2 C: 03+2] Gerrit v3.2.7 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/661056 (https://phabricator.wikimedia.org/T273223) (owner: 10Hashar) [09:54:09] Going to upgrade Gerrit starting with replica first [09:56:11] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Joe) >>! In T273312#6795425, @Ladsgroup wrote: > Mayb... [09:56:15] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 20.31 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [09:56:44] !log hashar@deploy1001 Started deploy [gerrit/gerrit@c3cd63b]: Gerrit replica on gerrit2001 to v3.2.7 T273223 [09:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:48] T273223: Upgrade Gerrit to 3.2.7 - https://phabricator.wikimedia.org/T273223 [09:56:56] !log hashar@deploy1001 Finished deploy [gerrit/gerrit@c3cd63b]: Gerrit replica on gerrit2001 to v3.2.7 T273223 (duration: 00m 12s) [09:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:58] 10SRE, 10serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10jcrespo) > If it is too impacting we could try to figure out a workaround for these nodes :( How bad 
would it be to disable monitoring of backups (and backups to fail) of these (but keeping the backups of con... [10:00:02] !log Restarted Gerrit replica on gerrit2001 # T273223 [10:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:46] !log hashar@deploy1001 Started deploy [gerrit/gerrit@c3cd63b]: Gerrit primary on gerrit1001 to v3.2.7 T273223 [10:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:55] !log hashar@deploy1001 Finished deploy [gerrit/gerrit@c3cd63b]: Gerrit primary on gerrit1001 to v3.2.7 T273223 (duration: 00m 09s) [10:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:40] !log Restarted Gerrit primary on gerrit1001 # T273223 [10:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:45] T273223: Upgrade Gerrit to 3.2.7 - https://phabricator.wikimedia.org/T273223 [10:03:07] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:07:51] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:08:31] (03PS1) 10Marostegui: mariadb: Productionize db1174 [puppet] - 10https://gerrit.wikimedia.org/r/661066 (https://phabricator.wikimedia.org/T258361) [10:08:52] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [10:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1094 to clone db1174 - T258361', diff saved to https://phabricator.wikimedia.org/P14121 and previous config saved to /var/cache/conftool/dbconfig/20210202-100859-marostegui.json [10:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:04] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:10:28] (03PS1) 10Elukey: Add conftool data for eventstreams-internal (new VIP) [puppet] - 10https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) [10:12:50] (03PS7) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [10:12:52] (03PS2) 10Jcrespo: jessie: Revert openssl conf on director/storage to package defaults [puppet] - 10https://gerrit.wikimedia.org/r/660856 (https://phabricator.wikimedia.org/T273182) [10:12:54] (03PS2) 10Jcrespo: jessie: Remove old openssl override after revert to package version [puppet] - 10https://gerrit.wikimedia.org/r/660857 (https://phabricator.wikimedia.org/T273182) [10:12:56] (03PS1) 10Jcrespo: backup-sources: Productionize db1171 to substitute db1095 [puppet] - 10https://gerrit.wikimedia.org/r/661069 (https://phabricator.wikimedia.org/T258361) [10:14:27] (03PS1) 10Ladsgroup: tlsproxy::localssl hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/661070 (https://phabricator.wikimedia.org/T209953) [10:16:28] 
(03PS2) 10Ladsgroup: tlsproxy::localssl hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/661070 (https://phabricator.wikimedia.org/T209953) [10:17:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:44] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661070 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [10:19:06] (03PS1) 10Elukey: Add eventstreams-internal to service_catalog [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) [10:19:49] PROBLEM - SSH on mw2249.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:22:12] (03PS1) 10Elukey: role::kubernetes::worker: add empty stanza for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) [10:23:16] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) I have followed https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_bal... 
[10:26:42] (03PS1) 10Ladsgroup: ipsec: Clean up the parts related to cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) [10:28:20] (03PS2) 10Ladsgroup: ipsec: Clean up the parts related to cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) [10:29:08] (03PS3) 10Ladsgroup: ipsec: Clean up the parts related to cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) [10:30:54] !log re-enable DE-CIX codfw peering sessions [10:30:55] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) (owner: 10Ladsgroup) [10:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:28] (03PS1) 10Jcrespo: install_server: Reimage db1095 into stretch so we still have 10.1 backups [puppet] - 10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) [10:33:15] (03PS2) 10Jcrespo: install_server: Reimage db1171 into stretch so we still have 10.1 backups [puppet] - 10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) [10:34:31] (03CR) 10Marostegui: "I believe it must still be on the allowed reformat hosts, but probably worth double checking" [puppet] - 10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [10:34:39] (03PS3) 10Jcrespo: install_server: Reimage db1171 into stretch so we still have 10.1 backups [puppet] - 10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) [10:34:57] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [10:35:17] (03CR) 10Marostegui: [C: 03+1] install_server: Reimage db1171 into stretch so we still have 10.1 backups [puppet] - 
10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [10:37:54] (03PS3) 10Filippo Giunchedi: swift: limit rsync service memory [puppet] - 10https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) [10:37:56] (03PS4) 10Filippo Giunchedi: swift: limit rsync to 10% memory in codfw [puppet] - 10https://gerrit.wikimedia.org/r/660855 (https://phabricator.wikimedia.org/T221904) [10:38:06] (03CR) 10Filippo Giunchedi: swift: limit rsync service memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [10:38:48] (03CR) 10Jcrespo: [C: 03+2] install_server: Reimage db1171 into stretch so we still have 10.1 backups [puppet] - 10https://gerrit.wikimedia.org/r/661074 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [10:43:29] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:43:49] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:46:09] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/660085 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [10:48:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 (owner: 10Jbond) [10:48:12] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10jcrespo) @wiki_willy As promised, we sped up the decommissioning of eqiad hw, this should free 3Us of space. No blocker on us, but I though... 
[10:49:10] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) (owner: 10Ladsgroup) [10:51:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/659004 (owner: 10Jbond) [10:52:12] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` db1171.eqiad.wmnet ` The log can be foun... [10:53:39] (03PS2) 10Jcrespo: backup-sources: Productionize db1171 to substitute db1095 [puppet] - 10https://gerrit.wikimedia.org/r/661069 (https://phabricator.wikimedia.org/T258361) [10:54:29] (03CR) 10Jcrespo: [C: 03+2] backup-sources: Productionize db1171 to substitute db1095 [puppet] - 10https://gerrit.wikimedia.org/r/661069 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [10:54:45] 10SRE, 10Continuous-Integration-Infrastructure, 10puppet-compiler: Puppet compiler running out of space - https://phabricator.wikimedia.org/T273599 (10Ladsgroup) [10:54:55] (03PS2) 10Elukey: archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) [10:56:21] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:49] (03PS3) 10Elukey: archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) [10:57:07] (03CR) 10Elukey: "Just rebased, going to check comments now" [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:57:18] 10SRE, 10Continuous-Integration-Infrastructure, 10puppet-compiler: Puppet compiler running out of space on compiler1002.puppet-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T273599 (10hashar) [11:00:13] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [11:00:41] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10User-Smalyshev, 10cloud-services-team (Kanban): Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10aborrero) [11:01:00] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/660037 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [11:02:12] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) (owner: 10Ladsgroup) [11:02:41] (03PS1) 10Muehlenhoff: Disable LDAP auth in debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/661078 [11:03:15] (03CR) 10Volans: [C: 03+1] "LGTM, I leave it to Ryan or Guillaume to check it as they're the owners and users of the module." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [11:03:41] 10SRE, 10Continuous-Integration-Infrastructure, 10puppet-compiler: Puppet compiler running out of space on compiler1002.puppet-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T273599 (10hashar) The instance `compiler1002.puppet-diffs.eqiad.wmflabs` is full :/ ` name=df -h -t ext4 Filesystem... [11:04:05] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1171.eqiad.wmnet with reason: REIMAGE [11:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:08] (03PS4) 10Elukey: archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) [11:04:17] (03CR) 10Elukey: archiva::proxy: allow nginx to serve content from repositories (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [11:05:21] (03CR) 10Elukey: archiva::proxy: allow nginx to serve content from repositories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [11:05:42] (03PS1) 10Marostegui: mariadb: Productionize db1174 [puppet] - 10https://gerrit.wikimedia.org/r/661079 (https://phabricator.wikimedia.org/T258361) [11:06:09] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: REIMAGE [11:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:32] (03PS5) 10Elukey: archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) [11:07:03] (03PS2) 10Marostegui: mariadb: Productionize db1174 [puppet] - 10https://gerrit.wikimedia.org/r/661079 (https://phabricator.wikimedia.org/T258361) [11:07:11] (03PS3) 10Arturo Borrero 
Gonzalez: [DONT MERGE] cloud: drop NAT exceptions for dumps NFS [puppet] - 10https://gerrit.wikimedia.org/r/657152 (https://phabricator.wikimedia.org/T272397) [11:07:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1174 [puppet] - 10https://gerrit.wikimedia.org/r/661079 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [11:12:04] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Miriam) [11:12:16] (03PS6) 10Elukey: archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) [11:14:15] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1171.eqiad.wmnet'] ` and were **ALL** successful. [11:16:20] (03PS1) 10Jcrespo: install_server: Reenable notifications and disable disk format for db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) [11:16:32] (03PS2) 10Jcrespo: install_server: Reenable notifications and disable disk format for db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) [11:16:41] (03CR) 10Jcrespo: [C: 04-1] "Not yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [11:17:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27802/console" [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [11:17:27] 10SRE, 10Continuous-Integration-Infrastructure, 10puppet-compiler: Puppet compiler running out of space on compiler1002.puppet-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T273599 (10hashar) 05Open→03Resolved a:03hashar I have deleted bunch of old builds under `/srv/jenkins-workspace/puppet-... [11:20:36] RECOVERY - SSH on mw2249.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:21:10] (03CR) 10Ladsgroup: "PCC on all of the cluster seems to be noop." [puppet] - 10https://gerrit.wikimedia.org/r/661070 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [11:23:50] RECOVERY - Check systemd state on ms-be1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:25] (03PS1) 10JMeybohm: k8s::kubelet: Ensure apparmor is purged on old k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/661083 (https://phabricator.wikimedia.org/T228967) [11:30:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] k8s::kubelet: Ensure apparmor is purged on old k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/661083 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:41:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27803/console" [puppet] - 10https://gerrit.wikimedia.org/r/661083 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:47:00] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s::kubelet: Ensure apparmor is 
purged on old k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/661083 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [11:55:44] (03PS1) 10Urbanecm: Banner module: Switch to using activated/unactivated for state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/661086 (https://phabricator.wikimedia.org/T273084) [11:57:12] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Daimona) >>! In T273312#6795006, @Legoktm wrote: >>>!... [12:00:04] (03PS2) 10Esanders: DiscussionTools: Enable new topic tool by default on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656572 (https://phabricator.wikimedia.org/T272077) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T1200). [12:00:04] Urbanecm: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:11] I can deploy today [12:00:22] ok [12:00:30] (03CR) 10Urbanecm: [C: 03+2] Banner module: Switch to using activated/unactivated for state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/661086 (https://phabricator.wikimedia.org/T273084) (owner: 10Urbanecm) [12:00:32] (03CR) 10Urbanecm: [C: 03+2] SpecialHomepage: Do not load start-startediting if SE aren't enabled [extensions/GrowthExperiments] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/660937 (https://phabricator.wikimedia.org/T273243) (owner: 10Urbanecm) [12:02:57] (03CR) 10Ladsgroup: "PCC is happy fleet-wide." 
[puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) (owner: 10Ladsgroup) [12:11:59] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test1001.wikimedia.org [12:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test2001.wikimedia.org [12:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:43] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp2001.wikimedia.org [12:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:46] !log upload cas_6.3 package [12:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1001.wikimedia.org [12:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:33] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host apt2001.wikimedia.org [12:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:05] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2001.wikimedia.org [12:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:15] (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.3.1: updated ready for 6.3.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 (owner: 10Jbond) [12:15:37] (03PS6) 10Jbond: 6.3.1: updated ready for 6.3.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 [12:15:43] (03Merged) 10jenkins-bot: Banner module: Switch to using activated/unactivated for state [extensions/GrowthExperiments] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/661086 (https://phabricator.wikimedia.org/T273084) (owner: 
10Urbanecm) [12:15:45] (03Merged) 10jenkins-bot: SpecialHomepage: Do not load start-startediting if SE aren't enabled [extensions/GrowthExperiments] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/660937 (https://phabricator.wikimedia.org/T273243) (owner: 10Urbanecm) [12:15:52] qo/ [12:15:56] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2001.wikimedia.org [12:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.3.1: updated ready for 6.3.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/658983 (owner: 10Jbond) [12:16:39] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2001.wikimedia.org [12:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:33] (03CR) 10Jbond: [C: 03+2] apereo: update tomcat proxy setting post 6.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/659004 (owner: 10Jbond) [12:17:39] (03PS3) 10Jbond: apereo: update tomcat proxy setting post 6.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/659004 [12:18:18] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-etcd1001.eqiad.wmnet [12:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:53] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/GrowthExperiments/includes/Specials/SpecialHomepage.php: 18c59d018b6ef72c750e25588518d2df6f492db3: SpecialHomepage: Do not load start-startediting if SE arent enabled (T273243) (duration: 01m 01s) [12:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:57] T273243: [wmf.28-regression] frwiktionary - SE module displayed on Homepage with errors - https://phabricator.wikimedia.org/T273243 [12:18:57] (03PS1) 10Elukey: profile::hadoop::master: add more specific alert runbook [puppet] - 
10https://gerrit.wikimedia.org/r/661109 [12:19:19] (03PS1) 10Jbond: apt, idp: failover to codfw for reboot [dns] - 10https://gerrit.wikimedia.org/r/661110 (https://phabricator.wikimedia.org/T273278) [12:20:26] (03CR) 10Urbanecm: [C: 03+2] "sounds reasonable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660986 (https://phabricator.wikimedia.org/T273330) (owner: 10Legoktm) [12:20:36] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master: add more specific alert runbook [puppet] - 10https://gerrit.wikimedia.org/r/661109 (owner: 10Elukey) [12:20:38] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/GrowthExperiments/includes/HomepageModules/Banner.php: da8f328640ca5c46385a57e706cd76614bbfdc7a: Banner module: Switch to using activated/unactivated for state (T273084) (duration: 00m 58s) [12:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:42] T273084: HomepageModule events with validation errors - https://phabricator.wikimedia.org/T273084 [12:20:59] (03CR) 10Jbond: [C: 03+2] apt, idp: failover to codfw for reboot [dns] - 10https://gerrit.wikimedia.org/r/661110 (https://phabricator.wikimedia.org/T273278) (owner: 10Jbond) [12:21:22] (03Merged) 10jenkins-bot: noc: Publicly expose logos/config.yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660986 (https://phabricator.wikimedia.org/T273330) (owner: 10Legoktm) [12:22:26] (03PS1) 10Elukey: profile::hadoop::master: better naming for the alert runbook [puppet] - 10https://gerrit.wikimedia.org/r/661112 [12:22:44] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2001.codfw.wmnet [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master: better naming for the alert runbook [puppet] - 10https://gerrit.wikimedia.org/r/661112 (owner: 10Elukey) [12:23:07] !log urbanecm@deploy1001 Synchronized docroot/noc/conf/logos-config.yaml: 
210647e915c91a4bddf0407d05436a9e231d3f29: noc: Publicly expose logos/config.yaml (1/2; T273330) (duration: 00m 57s) [12:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:10] T273330: Publish logos.php at noc.wikimedia.org - https://phabricator.wikimedia.org/T273330 [12:26:21] !log urbanecm@deploy1001 Synchronized docroot/noc/createTxtFileSymlinks.sh: 210647e915c91a4bddf0407d05436a9e231d3f29: noc: Publicly expose logos/config.yaml (2/2; T273330) (duration: 00m 55s) [12:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:25] * Urbanecm done [12:27:04] (03PS1) 10Jbond: Revert "apt, idp: failover to codfw for reboot" [dns] - 10https://gerrit.wikimedia.org/r/661087 [12:29:13] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org [12:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:15] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp1001.wikimedia.org [12:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:20] (03CR) 10Hashar: [C: 03+2] Branch commit for wmf/1.36.0-wmf.29 [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/660980 (https://phabricator.wikimedia.org/T271343) (owner: 10TrainBranchBot) [12:30:43] !log klausman@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-etcd2001.codfw.wmnet [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:53] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2001.codfw.wmnet [12:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:16] (03PS1) 10Urbanecm: noc: yaml files may be published w/o .txt extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661113 [12:32:18] (03CR) 10Urbanecm: [C: 03+2] noc: yaml files may be published w/o .txt extension [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/661113 (owner: 10Urbanecm) [12:33:05] (03Merged) 10jenkins-bot: noc: yaml files may be published w/o .txt extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661113 (owner: 10Urbanecm) [12:33:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org [12:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:20] (03CR) 10Jbond: [C: 03+2] Revert "apt, idp: failover to codfw for reboot" [dns] - 10https://gerrit.wikimedia.org/r/661087 (owner: 10Jbond) [12:34:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1001.wikimedia.org [12:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:45] !log urbanecm@deploy1001 Synchronized docroot/noc/conf/index.php: 995649efafc2f5a44824af1e96128baaf15ef928: noc: yaml files may be published w/o .txt extension (duration: 00m 57s) [12:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:59] !log klausman@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-etcd2001.codfw.wmnet [12:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:09] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install1003.wikimedia.org [12:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:47] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install2003.wikimedia.org [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:20] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install3001.wikimedia.org [12:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:01] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install4001.wikimedia.org [12:37:04] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:38] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host install5001.wikimedia.org [12:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:03] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install1003.wikimedia.org [12:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:39] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki1001.eqiad.wmnet [12:38:41] (03CR) 10Volans: "LGTM, minor nits inline. Just missing the unit tests." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [12:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3001.wikimedia.org [12:39:27] Hmmm. When the VM creation cookbook fails (due to running in parallel with another one), I get leftover IPs in the repo when re-running it (after the other cookbook has completed DNS stuff). Is there some cleanup step? 
[12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:01] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host pki2001.codfw.wmnet [12:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install2003.wikimedia.org [12:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:59] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2001.codfw.wmnet [12:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:06] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [12:41:07] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd1001.eqiad.wmnet [12:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5001.wikimedia.org [12:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4001.wikimedia.org [12:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:31] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [12:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [12:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:18] !log jbond@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host pki1001.eqiad.wmnet [12:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:25] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-etcd1002.eqiad.wmnet [12:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:14] !log klausman@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-etcd2001.codfw.wmnet [12:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:20] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [12:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:24] !log klausman@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ml-etcd1002.eqiad.wmnet [12:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:43] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2001.codfw.wmnet [12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:57] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on cescout1001.eqiad.wmnet with reason: rebooting for kernel update [12:46:58] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on cescout1001.eqiad.wmnet with reason: rebooting for kernel update [12:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:04] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:05:00 on malmok.wikimedia.org with reason: rebooting for kernel update [12:50:05] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on malmok.wikimedia.org with reason: rebooting for kernel update 
[12:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:05] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-etcd1002.eqiad.wmnet [12:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:11] !log klausman@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) for new host ml-etcd1002.eqiad.wmnet [12:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:16] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-etcd1002.eqiad.wmnet [12:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:06] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.29 [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/660980 (https://phabricator.wikimedia.org/T271343) (owner: 10TrainBranchBot) [12:55:41] (03PS1) 10Jbond: ipd: failover to codfw for cas upgrade [dns] - 10https://gerrit.wikimedia.org/r/661115 [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T1300) [13:03:17] (03PS1) 10Jbond: stdlib: update to 6.6.0 [puppet] - 10https://gerrit.wikimedia.org/r/661118 [13:05:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org [13:05:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org [13:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:58] (03CR) 10Jbond: "PCC (running): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27804" [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond) [13:08:08] (03CR) 10Jbond: [C: 03+2] ipd: failover to codfw for cas 
upgrade [dns] - 10https://gerrit.wikimedia.org/r/661115 (owner: 10Jbond) [13:08:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org [13:08:23] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org [13:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:48] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host irc2001.wikimedia.org [13:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host failoid2001.codfw.wmnet [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2001.codfw.wmnet [13:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:19] (03CR) 10Jbond: "may want to wait for the following to get merged" [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond) [13:11:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2001.wikimedia.org [13:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:45] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd1002.eqiad.wmnet [13:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host doc1002.eqiad.wmnet [13:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host doc2001.codfw.wmnet [13:12:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:34] (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661120 [13:12:36] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661120 (owner: 10Hashar) [13:13:18] !log klausman@cumin1001 START - Cookbook sre.ganeti.makevm for new host ml-etcd1003.eqiad.wmnet [13:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:21] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661120 (owner: 10Hashar) [13:13:42] !log hashar@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.29 [13:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:36] hasharLunch: are you planning to leave testwikis/group0 on .29? [13:15:50] (03PS1) 10Jbond: Revert "ipd: failover to codfw for cas upgrade" [dns] - 10https://gerrit.wikimedia.org/r/661088 [13:16:16] Majavah: I guess yes [13:16:23] the whole train is blocked anyway [13:17:08] ack, thanks [13:17:23] (03CR) 10Jbond: [C: 03+2] Revert "ipd: failover to codfw for cas upgrade" [dns] - 10https://gerrit.wikimedia.org/r/661088 (owner: 10Jbond) [13:17:23] maybe we'll be able to get logs using x-wm-d for https://phabricator.wikimedia.org/T273242 if there's at least some traffic [13:17:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc1002.eqiad.wmnet [13:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:49] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [13:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:16] Majavah: that one is puzzling :/ [13:18:30] if only we could find the code that introduces the issue and revert it.. 
[13:18:37] or it is related to some user data maybe [13:19:02] that stack trace is so misleading, it debug logs the memcached key but debug logs are disabled by default on production [13:20:04] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1174 is now replicating. Not pooling until tomorrow. [13:22:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/661078 (owner: 10Muehlenhoff) [13:24:23] (03CR) 10Jbond: Disable LDAP auth in debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661078 (owner: 10Muehlenhoff) [13:27:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc2001.codfw.wmnet [13:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:43] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd1003.eqiad.wmnet [13:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:09] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2001.codfw.wmnet [13:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:13] (03CR) 10Jbond: puppet: add puppetmaster retrieval (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [13:33:34] PROBLEM - SSH on ms-be1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:34:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 10%: Repool db1094 after cloning another host', diff saved to https://phabricator.wikimedia.org/P14128 
and previous config saved to /var/cache/conftool/dbconfig/20210202-133936-root.json [13:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:54] RECOVERY - SSH on ms-be1054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:43:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:46:01] (03CR) 10Jbond: "> Patch Set 7:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:48:14] PROBLEM - Check systemd state on doc2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:59] (03PS6) 10Jbond: customscripts/interface_automation: skip slaac addresses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654439 (https://phabricator.wikimedia.org/T265904) [13:49:07] !log klausman@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2001.codfw.wmnet [13:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:52] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2002.codfw.wmnet [13:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:11] (03CR) 10Volans: [C: 03+1] "Thanks for the data, LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:51:09] (03CR) 10Volans: "typo inline" (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [13:54:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 25%: Repool db1094 after cloning another host', diff saved to https://phabricator.wikimedia.org/P14132 and previous config saved to /var/cache/conftool/dbconfig/20210202-135439-root.json [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:03] (03PS8) 10Jbond: interface_automation: update is_primary logic. 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) [13:56:20] 10SRE, 10Traffic, 10Wikisource: Error when trying to create new page on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (10Majavah) [13:56:45] (03CR) 10Jbond: "Sorry fixed now, thx" (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [14:00:05] hashar and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T1400). [14:00:54] 10SRE, 10vm-requests: eqiad: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273074 (10klausman) `$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 eqiad_B ml-etcd1001.eqiad.wmnet` IPv4: `10.64.16.200` IPv6: `2620:0:861:102:10:64:16:200` MAC: `aa:00:00:ef:5f:2d` `$ sudo cookb... 
[14:02:29] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/654435 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [14:04:00] (03CR) 10Volans: [C: 04-1] Disable LDAP auth in debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661078 (owner: 10Muehlenhoff) [14:06:11] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host deploy1002.eqiad.wmnet [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 50%: Repool db1094 after cloning another host', diff saved to https://phabricator.wikimedia.org/P14133 and previous config saved to /var/cache/conftool/dbconfig/20210202-140943-root.json [14:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:44] !log klausman@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2002.codfw.wmnet [14:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:42] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:14:46] (03PS3) 10David Caro: last-puppet-run: don't crash if puppet has not run yet [puppet] - 10https://gerrit.wikimedia.org/r/641207 [14:14:48] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:15:29] (03CR) 10David Caro: "Applied requested changes" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/641207 (owner: 10David Caro) [14:15:48] (03CR) 10Volans: [C: 04-1] "See few possible improvements inline. Need also tests." 
(035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [14:21:01] !log klausman@cumin2001 START - Cookbook sre.ganeti.makevm for new host ml-etcd2003.codfw.wmnet [14:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:12] (03PS6) 10David Caro: puppet: add ca_server retrieval [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 [14:21:19] (03CR) 10David Caro: puppet: add ca_server retrieval (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [14:21:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1002.eqiad.wmnet [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:34] PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [14:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 75%: Repool db1094 after cloning another host', diff saved to https://phabricator.wikimedia.org/P14134 and previous config saved to /var/cache/conftool/dbconfig/20210202-142446-root.json [14:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet [14:26:32] !log hashar@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.29 (duration: 73m 10s) [14:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:42] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet [14:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:52] RECOVERY - Keyholder SSH agent on deploy1002 is OK: OK: Keyholder is armed with all 
configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [14:29:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet [14:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:18] PROBLEM - SSH on ms-be1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:35:37] !log klausman@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2003.codfw.wmnet [14:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet [14:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host failoid1001.eqiad.wmnet [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:22] RECOVERY - SSH on ms-be1054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:38:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1001.eqiad.wmnet [14:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:59] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10klausman) `$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 codfw_B ml-etcd2001.codfw.wmnet` IPv4: `10.192.16.44/22` IPv6: `2620:0:860:102:10:192:16:44/64` MAC: `aa:00:00:71:6a:f3` `$ sudo... 
[14:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1094 (re)pooling @ 100%: Repool db1094 after cloning another host', diff saved to https://phabricator.wikimedia.org/P14135 and previous config saved to /var/cache/conftool/dbconfig/20210202-143950-root.json [14:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:54] PROBLEM - SSH on ms-be1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:46:50] RECOVERY - SSH on ms-be1054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:54:16] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.829 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [15:07:21] (03CR) 10Hashar: "This change is no more needed, we have since removed our custom log4j configuration that required jsonevent-layout and have switched to us" [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/617206 (owner: 10QChris) [15:10:54] (03PS1) 10Esanders: Make DiscussionTools' newtopictool available on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 (https://phabricator.wikimedia.org/T272077) [15:16:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet [15:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host miscweb2002.codfw.wmnet [15:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:25] (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/661131 [15:18:43] (03CR) 10Gehel: [C: 03+1] "LGTM. 
I have some doubts that this will have noticeable performance improvement, but I'm happy to be proven wrong!" [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [15:19:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet [15:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb2002.codfw.wmnet [15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:45] (03CR) 10Gehel: [C: 04-1] "I think the issue is raising a real problem, we should fix the production code, not the tests. See comment inline." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [15:29:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/661118 (owner: 10Jbond) [15:32:42] (03CR) 10Arturo Borrero Gonzalez: cloud-vps instances: add a helper script to format & mount a cinder volume (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [15:37:12] (03CR) 10Elukey: [V: 03+1 C: 03+2] archiva::proxy: allow nginx to serve content from repositories [puppet] - 10https://gerrit.wikimedia.org/r/608812 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [15:37:52] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 but make sure to upload and submit the dns repo change first." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [15:38:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add conftool data for eventstreams-internal (new VIP) [puppet] - 10https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [15:38:14] (03PS8) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [15:40:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Make sure to merge along with the parent commit." [puppet] - 10https://gerrit.wikimedia.org/r/661072 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [15:42:06] more blockers :/ [15:42:15] (03PS1) 10Bstorm: wikireplicas-proxy: front proxy should not keep connections down [puppet] - 10https://gerrit.wikimedia.org/r/661132 (https://phabricator.wikimedia.org/T272523) [15:42:44] so four open, with two that have patches and one which has a patch to revert if necessary [15:44:17] (03PS1) 10Andrew Bogott: profile::wmcs::instance: remove needless requirement on lvm2 [puppet] - 10https://gerrit.wikimedia.org/r/661133 [15:45:14] (03CR) 10Bstorm: "Adding CC's for the FYI. Also open to the idea of this layer just always keeping the connection open if I can do that. 
This probably fixes" [puppet] - 10https://gerrit.wikimedia.org/r/661132 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [15:45:36] (03CR) 10Bstorm: "This only affects the VM proxies" [puppet] - 10https://gerrit.wikimedia.org/r/661132 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [15:45:51] (03CR) 10Bstorm: [C: 03+2] wikireplicas-proxy: front proxy should not keep connections down [puppet] - 10https://gerrit.wikimedia.org/r/661132 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [15:46:24] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::instance: remove needless requirement on lvm2 [puppet] - 10https://gerrit.wikimedia.org/r/661133 (owner: 10Andrew Bogott) [15:46:44] (03PS9) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [15:47:30] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) [15:48:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add BGP configuration for the new ML Serve eqiad/codfw clusters (036 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [15:50:08] subtitle: Luca please do not work on BGP [15:50:11] :D [15:51:02] I still think we should get all those lists from netbox :D [15:51:03] fwiw [15:52:26] (03PS1) 10Elukey: archiva: add missing ; to nginx config [puppet] - 10https://gerrit.wikimedia.org/r/661135 [15:53:24] (03CR) 10Elukey: [C: 03+2] archiva: add missing ; to nginx config [puppet] - 10https://gerrit.wikimedia.org/r/661135 (owner: 10Elukey) [15:56:28] volans: yes yes always netbox [15:56:30] :D [16:05:49] 10SRE, 10Analytics, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10elukey) @hashar I applied the nginx change to bypass Jetty, can 
you test again? [16:09:59] (03PS2) 10Giuseppe Lavagetto: Remove the build image functionality [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660852 [16:10:01] (03PS1) 10Giuseppe Lavagetto: Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) [16:10:22] (03PS4) 10Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) [16:11:01] (03CR) 10Elukey: "I love it when I don't understand half of what I have done, hopefully this time is a little better. Thanks for the review :(" (035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [16:11:14] 10SRE, 10Orchestrator: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10Kormat) [16:11:16] (03CR) 10jerkins-bot: [V: 04-1] Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [16:12:18] (03PS5) 10Elukey: Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) [16:14:39] 10Puppet, 10SRE, 10Orchestrator, 10CAS-SSO: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond) p:05Triage→03Medium [16:19:35] (03CR) 10Elukey: Add eventstreams-internal to service_catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661071 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [16:21:41] (03PS1) 10Klausman: install_server: Add DHCP, partman and manifest entries for ml-etcd* [puppet] - 10https://gerrit.wikimedia.org/r/661139 [16:21:54] (03CR) 10Alexandros Kosiaris: [C: 
03+1] Remove the build image functionality [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660852 (owner: 10Giuseppe Lavagetto) [16:22:03] 10SRE, 10Analytics, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) Fetching https://archiva.wikimedia.org/repository/mirrored/junit/junit/4.13.1/junit-4.13.1.jar it still takes a while until the transfer starts: | time... [16:22:57] (03CR) 10Ottomata: [C: 03+1] Add conftool data for eventstreams-internal (new VIP) [puppet] - 10https://gerrit.wikimedia.org/r/661067 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [16:23:16] (03CR) 10Elukey: install_server: Add DHCP, partman and manifest entries for ml-etcd* (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661139 (owner: 10Klausman) [16:23:54] (03CR) 10Jbond: [C: 03+1] "LGTM thanks 😊" [puppet] - 10https://gerrit.wikimedia.org/r/641207 (owner: 10David Caro) [16:25:07] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [16:25:19] (03PS1) 10BryanDavis: domainproxy: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/661140 (https://phabricator.wikimedia.org/T120486) [16:26:03] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) That's really exciting! Yes I'd love to see this happen as well, and am on board... [16:26:59] 10SRE, 10Analytics, 10Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10elukey) I think that we should make tests inside the wikimedia network, testing from home is not reliable (as you said there are too many variables, one above a... 
[16:28:19] (03PS3) 10Jbond: (WIP): add script to copy ldap entries to a local db [puppet] - 10https://gerrit.wikimedia.org/r/660869 [16:29:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add the 'uid' template helper (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: 10Giuseppe Lavagetto) [16:29:51] (03PS2) 10Jbond: stdlib: update to 6.6.0 [puppet] - 10https://gerrit.wikimedia.org/r/661118 [16:29:53] (03PS10) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [16:30:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host auth2001.codfw.wmnet [16:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host auth1002.eqiad.wmnet [16:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:34] (03CR) 10jerkins-bot: [V: 04-1] Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [16:31:43] (03PS2) 10CRusnov: modules/network: Break out management networks by DC label [puppet] - 10https://gerrit.wikimedia.org/r/660950 (https://phabricator.wikimedia.org/T271583) [16:31:52] (03CR) 10Razzi: [C: 03+2] sre.kafka.reboot-workers: Properly format arguments in log message [cookbooks] - 10https://gerrit.wikimedia.org/r/660037 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [16:33:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host auth1002.eqiad.wmnet [16:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:36] (03CR) 10CRusnov: "Thanks 😊" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/660950 
(https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:34:47] (03Merged) 10jenkins-bot: sre.kafka.reboot-workers: Properly format arguments in log message [cookbooks] - 10https://gerrit.wikimedia.org/r/660037 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [16:36:35] (03PS11) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [16:36:48] (03PS2) 10Klausman: install_server: Add DHCP, partman and manifest entries for ml-etcd* [puppet] - 10https://gerrit.wikimedia.org/r/661139 [16:36:57] (03CR) 10Klausman: install_server: Add DHCP, partman and manifest entries for ml-etcd* (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661139 (owner: 10Klausman) [16:37:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host auth2001.codfw.wmnet [16:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27809/console" [puppet] - 10https://gerrit.wikimedia.org/r/660950 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:38:00] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/660950 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:40:49] (03CR) 10JMeybohm: [C: 03+1] Remove the build image functionality [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660852 (owner: 10Giuseppe Lavagetto) [16:42:06] (03PS12) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [16:45:40] (03CR) 10Volans: "LGTM, just missing the tests." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/659008 (owner: 10David Caro) [16:46:13] (03PS13) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [16:48:12] (03PS1) 10MSantos: mobileapps: bump to 2021-02-01-143328-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/661144 [16:48:25] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1001/27811/" [puppet] - 10https://gerrit.wikimedia.org/r/661140 (https://phabricator.wikimedia.org/T120486) (owner: 10BryanDavis) [16:51:38] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-02-01-143328-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/661144 (owner: 10MSantos) [16:51:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] domainproxy: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/661140 (https://phabricator.wikimedia.org/T120486) (owner: 10BryanDavis) [16:53:09] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-02-01-143328-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/661144 (owner: 10MSantos) [16:54:21] (03CR) 10CRusnov: [C: 03+2] modules/network: Break out management networks by DC label [puppet] - 10https://gerrit.wikimedia.org/r/660950 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [16:58:55] 10SRE, 10ops-eqiad, 10Data-Persistence-Backup, 10decommission-hardware: decommission helium.eqiad.wmnet and helium-array - https://phabricator.wikimedia.org/T273049 (10wiki_willy) Thanks a lot @jcrespo, it's much appreciated! >>! In T273049#6795619, @jcrespo wrote: > @wiki_willy As promised, we sped up th... [16:59:10] 10SRE, 10vm-requests: codfw: 3 VM request for ML team etcd - https://phabricator.wikimedia.org/T273075 (10Dzahn) Looks good to me! 
From here you can look at the "Planned -> Staged" step in the Server Lifecycle page: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Planned_-%3E_Staged You will need to ad... [17:00:04] jbond42 and cdanis: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T1700). Please do the needful. [17:00:04] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:08] o/ [17:00:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1089.eqiad.wmnet - https://phabricator.wikimedia.org/T273417 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks @Marostegui, this helps us a lot! >>! In T273417#6795017, @Marostegui wrote: > Ready for DC-Ops! [17:00:54] (03CR) 10Lucas Werkmeister (WMDE): "(Note, the gui-deploy config was updated in Id834a3a1b0.)" [puppet] - 10https://gerrit.wikimedia.org/r/659242 (https://phabricator.wikimedia.org/T267656) (owner: 10Lucas Werkmeister (WMDE)) [17:01:36] 10SRE: ping servers running out of disk - https://phabricator.wikimedia.org/T273509 (10Dzahn) Thank you. This solution seems good to me. Should we just close this again then? Or we can recycle/rename it to "upgrade ping servers to bullseye" :p [17:03:12] (03PS1) 10MSantos: proton: bump to 2021-02-02-122014-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/661145 [17:03:35] (03CR) 10Ayounsi: [C: 04-1] "Please document ASNs in https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations as well." (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [17:03:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Haven't looked at the tests, but the code lgtm. Couple of pedantic comment related asks." 
(032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [17:03:53] Lucas_WMDE: just looking one sec (sorry got tied up today) [17:04:06] 10ops-eqiad: unplug old zayo links at dmarc - https://phabricator.wikimedia.org/T273647 (10RobH) p:05Triage→03High [17:04:12] 10ops-eqiad: unplug old zayo links at dmarc - https://phabricator.wikimedia.org/T273647 (10RobH) [17:04:27] alright thanks [17:04:34] (03CR) 10DannyS712: [C: 03+1] query_service: add Special:MyLanguage to copyright URLs [puppet] - 10https://gerrit.wikimedia.org/r/659242 (https://phabricator.wikimedia.org/T267656) (owner: 10Lucas Werkmeister (WMDE)) [17:04:57] Lucas_WMDE: lgtm anything specific you want me to do after the merge? [17:05:17] Amir/Ladsgroup said the wdqs part should be a no-op anyways [17:05:23] not sure if anything would need to be done to update wcqs [17:05:24] (03PS14) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [17:05:28] (03CR) 10MSantos: [C: 03+2] proton: bump to 2021-02-02-122014-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/661145 (owner: 10MSantos) [17:05:49] Lucas_WMDE: in that case I'll just let it roll out naturally (which will take about 30 mins) [17:05:56] alright, sounds good [17:05:59] it’s not urgent :) [17:06:14] (03CR) 10Jbond: [C: 03+2] query_service: add Special:MyLanguage to copyright URLs [puppet] - 10https://gerrit.wikimedia.org/r/659242 (https://phabricator.wikimedia.org/T267656) (owner: 10Lucas Werkmeister (WMDE)) [17:06:39] (03CR) 10Elukey: install_server: Add DHCP, partman and manifest entries for ml-etcd* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661139 (owner: 10Klausman) [17:06:43] 10SRE: ping servers running out of disk - https://phabricator.wikimedia.org/T273509 (10jcrespo) +1 to rename and stall on
bullseye being ready. [17:06:55] (03Merged) 10jenkins-bot: proton: bump to 2021-02-02-122014-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/661145 (owner: 10MSantos) [17:07:03] (03CR) 10jerkins-bot: [V: 04-1] Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [17:08:20] Lucas_WMDE: merged and deployed to wdqs1009 with no issues [17:08:40] (03PS15) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [17:08:57] great, thanks! [17:09:08] np [17:10:39] (03PS1) 10BryanDavis: wmcs: Force HTTPS with 366 day HSTS header with profile::wmcs::proxy::static [puppet] - 10https://gerrit.wikimedia.org/r/661147 (https://phabricator.wikimedia.org/T273648) [17:13:50] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1001/27813/" [puppet] - 10https://gerrit.wikimedia.org/r/661147 (https://phabricator.wikimedia.org/T273648) (owner: 10BryanDavis) [17:17:57] (03CR) 10Andrew Bogott: [C: 03+1] wmcs: Force HTTPS with 366 day HSTS header with profile::wmcs::proxy::static [puppet] - 10https://gerrit.wikimedia.org/r/661147 (https://phabricator.wikimedia.org/T273648) (owner: 10BryanDavis) [17:18:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] releases: Provide remaining pipelinelib dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659437 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [17:23:17] thanks jbond42! and Lucas_WMDE if you need something further feel free to ping me -- starting to get a bit late in the day for jbond :) [17:23:36] I think that was all!
getting a bit late for me too :) [17:24:29] (03PS1) 10Thcipriani: Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) [17:25:28] (03CR) 10Andrew Bogott: [C: 03+2] domainproxy: Perform HTTPS redirects unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/661140 (https://phabricator.wikimedia.org/T120486) (owner: 10BryanDavis) [17:25:39] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: Force HTTPS with 366 day HSTS header with profile::wmcs::proxy::static [puppet] - 10https://gerrit.wikimedia.org/r/661147 (https://phabricator.wikimedia.org/T273648) (owner: 10BryanDavis) [17:26:50] cdanis: are you updating https://people.wikimedia.org/~cdanis/sremap/, btw? [17:27:20] Urbanecm: it is on my list... [17:27:28] cool :) [17:27:31] hasn't been updated in probably nine months at this point sadly :) [17:27:56] just wondered whether that was an experiment with idp auth, or a place that should eventually have up2date info :) [17:28:21] I got halfway through updating its internal schema (so SREs could manually override what Google Calendar said their timezone was), then didn't get around to updating the data-scrape script, and also got sidetracked on making some other code improvements [17:28:38] and now it's been so long that it would take a little time just to find some of those pieces, very embarrassing :) [17:28:46] hehe [17:28:55] thought that was mostly manually-created database [17:29:14] (03PS1) 10BryanDavis: toolforge: Force HTTPS with 366 day HSTS header in profile::toolforge::static [puppet] - 10https://gerrit.wikimedia.org/r/661153 (https://phabricator.wikimedia.org/T273651) [17:30:07] (03PS3) 10Klausman: install_server: Add DHCP, partman and manifest entries for ml-etcd* [puppet] - 10https://gerrit.wikimedia.org/r/661139 [17:42:01] (03CR) 10Elukey: [C: 03+1] install_server: Add DHCP, partman and manifest entries for ml-etcd* [puppet] - 10https://gerrit.wikimedia.org/r/661139 (owner: 
10Klausman) [17:48:29] (03CR) 10Zfilipin: [C: 03+1] Offboard zfilipin from Release Engineering [puppet] - 10https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: 10Thcipriani) [17:49:53] (03CR) 10Klausman: [C: 03+2] install_server: Add DHCP, partman and manifest entries for ml-etcd* [puppet] - 10https://gerrit.wikimedia.org/r/661139 (owner: 10Klausman) [17:53:12] (03PS1) 10Klausman: install-server: Fix overly generous RE for ml-etcd machines [puppet] - 10https://gerrit.wikimedia.org/r/661157 [17:57:19] (03CR) 10BryanDavis: "PCC compilation is failing for missing facts: https://integration.wikimedia.org/ci/view/Ops/job/operations-puppet-catalog-compiler/27817/c" [puppet] - 10https://gerrit.wikimedia.org/r/661153 (https://phabricator.wikimedia.org/T273651) (owner: 10BryanDavis) [17:58:52] (03PS1) 10Bstorm: wikireplicas-proxy: tune the main haproxy config for databases [puppet] - 10https://gerrit.wikimedia.org/r/661158 (https://phabricator.wikimedia.org/T272523) [18:00:05] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T1800). 
[18:01:01] (03CR) 10Elukey: [C: 03+1] "Ok this is why I was confused earlier on, I thought I had seen a 1004/2004 :D" [puppet] - 10https://gerrit.wikimedia.org/r/661157 (owner: 10Klausman) [18:01:25] (03CR) 10Klausman: [C: 03+2] install-server: Fix overly generous RE for ml-etcd machines [puppet] - 10https://gerrit.wikimedia.org/r/661157 (owner: 10Klausman) [18:01:51] (03CR) 10BryanDavis: "PCC errors are different when using equad1.wikimedia.cloud hostnames: https://puppet-compiler.wmflabs.org/compiler1002/27818/" [puppet] - 10https://gerrit.wikimedia.org/r/661153 (https://phabricator.wikimedia.org/T273651) (owner: 10BryanDavis) [18:02:57] 10SRE, 10Cloud-Services, 10Cloud-VPS, 10Quarry, and 3 others: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627 (10bd808) [18:03:01] (03CR) 10Bstorm: [C: 03+2] wikireplicas-proxy: tune the main haproxy config for databases [puppet] - 10https://gerrit.wikimedia.org/r/661158 (https://phabricator.wikimedia.org/T272523) (owner: 10Bstorm) [18:03:41] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:19] (03PS1) 10Bstorm: wikireplicas-proxy: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/661163 [18:07:36] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[18:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:17] (03CR) 10Bstorm: [C: 03+2] wikireplicas-proxy: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/661163 (owner: 10Bstorm) [18:12:56] (03CR) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [18:17:07] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:09] hmm, mousing over the "Move v" label doesn't reveal the dropdown anymore on the beta cluster - did something change? [18:22:39] !log milimetric@deploy1001 Started deploy [analytics/turnilo/deploy@052348b]: (no justification provided) [18:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:49] !log milimetric@deploy1001 deploy aborted: (no justification provided) (duration: 00m 10s) [18:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:32] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.017 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:23:47] !log milimetric@deploy1001 Started deploy [analytics/turnilo/deploy@052348b]: (no justification provided) [18:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:50] !log milimetric@deploy1001 Finished deploy [analytics/turnilo/deploy@052348b]: (no justification provided) (duration: 00m 03s) [18:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:25] (03CR) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas (033 comments) 
[puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [18:27:37] (03PS2) 10Esanders: Make DiscussionTools' newtopictool available on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 [18:28:52] (03CR) 10Esanders: "This patch should make the newtopictool available as a beta feature on testwiki. To that end it is +1'd from Peter." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 (owner: 10Esanders) [18:29:02] (03CR) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [18:30:05] 10SRE, 10ops-eqiad, 10Analytics-Radar: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10wiki_willy) a:03Cmjohnson [18:30:48] (03CR) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [18:32:15] (03PS1) 10Alex Paskulin: labs: Remove redundant apiportal config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) [18:34:15] (03PS1) 10Addshore: Add wiki ID to WikiPageEntityDataLoader [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661091 (https://phabricator.wikimedia.org/T273622) [18:34:21] (03PS1) 10Addshore: Pass $databaseName into WikiPageEntityDataLoader [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661092 (https://phabricator.wikimedia.org/T273622) [18:40:50] hashar: around by any chance? 
[18:42:15] (03PS1) 10ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - 10https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [18:42:23] (03PS1) 10Andrew Bogott: Cloud instances: add duplicate hiera settings for profile::base::labs:: settings [puppet] - 10https://gerrit.wikimedia.org/r/661171 [18:42:25] elukey: dinner time overdue sorry :-\\\\ [18:42:57] hasharDinner: ack enjoy :) [18:43:41] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [18:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:15] (03CR) 10Cicalese: [C: 03+1] labs: Remove redundant apiportal config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [18:48:40] (03PS7) 10Razzi: analytics_cluster/turnilo: Configure url shortner [puppet] - 10https://gerrit.wikimedia.org/r/622600 (https://phabricator.wikimedia.org/T233336) (owner: 10Milimetric) [18:48:49] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . [18:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:07] (03CR) 10Razzi: [C: 03+2] analytics_cluster/turnilo: Configure url shortner [puppet] - 10https://gerrit.wikimedia.org/r/622600 (https://phabricator.wikimedia.org/T233336) (owner: 10Milimetric) [18:51:46] (03CR) 10Andrew Bogott: [C: 03+2] "It's going to be a while before I can get useful output from the pcc, so let's just merge this and see how it goes." [puppet] - 10https://gerrit.wikimedia.org/r/661153 (https://phabricator.wikimedia.org/T273651) (owner: 10BryanDavis) [19:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T1900). 
[19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:57] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [19:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:17] (03PS1) 10Andrew Bogott: Revert "toolforge: Force HTTPS with 366 day HSTS header in profile::toolforge::static" [puppet] - 10https://gerrit.wikimedia.org/r/661095 [19:02:24] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Revert "toolforge: Force HTTPS with 366 day HSTS header in profile::toolforge::static" [puppet] - 10https://gerrit.wikimedia.org/r/661095 (owner: 10Andrew Bogott) [19:07:08] (03CR) 10Dzahn: [C: 03+2] "this is on the "builder" host/role. noop on deneb" [puppet] - 10https://gerrit.wikimedia.org/r/660951 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:11:44] (03CR) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [19:12:29] (03PS2) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) [19:14:04] (03PS3) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) [19:14:28] is anyone deploying right now? 
otherwise I’d like to do two backports for wmf.29 [19:14:36] cc RoanKattouw Niharika Urbanecm [19:16:04] (03PS4) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) [19:16:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add wiki ID to WikiPageEntityDataLoader [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661091 (https://phabricator.wikimedia.org/T273622) (owner: 10Addshore) [19:16:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Pass $databaseName into WikiPageEntityDataLoader [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661092 (https://phabricator.wikimedia.org/T273622) (owner: 10Addshore) [19:16:48] ^ I’ll add those two to the calendar [19:16:57] 😀 [19:17:54] (03CR) 10Bstorm: wikireplicas: deploy a cloud-based query sampler for the replicas (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [19:18:55] (03CR) 10Bstorm: "Tested this version of the script on the host to make sure it still behaves right. I also removed some typing because mypy HATES pymysql (" [puppet] - 10https://gerrit.wikimedia.org/r/660960 (https://phabricator.wikimedia.org/T272723) (owner: 10Bstorm) [19:22:23] ah, mypy and pymysql :3 [19:22:44] IIRC I ended up completely replacing the upstream type stubs with my own ^^ [19:24:38] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/658360 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:25:28] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:27:22] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27823/mc2026.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/659392 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:28:05] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/658414 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:29:34] Lucas_WMDE: not deploying right now [19:29:39] alright, thanks [19:29:43] waiting for gate-and-submit now [19:29:49] but I can if you wish me to :D [19:29:52] (and in the meantime trying to reproduce the error that the backports will hopefully fix) [19:30:04] I already opened the SSH windows :P [19:30:12] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/658415 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:31:01] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/658427 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:32:48] okay, I can reproduce the error, yay [19:33:41] (go to https://test.wikipedia.org/wiki/Wikidata?action=purge, enable X-Wikimedia-Debug with verbose logging, purge the page, visit the logstash link in the footer, hide debug messages) [19:34:19] (03CR) 10Jforrester: "Hmm. Is there an alternative that provides the extra compression? (Was there any extra compression or were we merely cargo-culting and re-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: 10Legoktm) [19:38:42] how is merging taking so long... 
:/ [19:40:06] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2162635180552 and 3092340 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:41:36] Majavah: probably doesn’t help that Zuul seems to be pretty full at the moment :| [19:41:48] * Lucas_WMDE peeks at the chain [19:45:31] (03PS1) 10Ayounsi: Alert manager, fix DCops email [puppet] - 10https://gerrit.wikimedia.org/r/661178 [19:46:41] (03CR) 10CDanis: [C: 03+1] swift: apply interface::rps to i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661054 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [19:47:20] (03CR) 10CDanis: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/660854 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [19:47:33] (03CR) 10CDanis: [C: 03+1] interfaces: allow setting queues on i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661053 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [19:48:39] (03Merged) 10jenkins-bot: Add wiki ID to WikiPageEntityDataLoader [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661091 (https://phabricator.wikimedia.org/T273622) (owner: 10Addshore) [19:49:39] (03PS1) 10Krinkle: wikimedia.org: TXT entry for GitHub domain verified profile [dns] - 10https://gerrit.wikimedia.org/r/661180 [19:50:08] (03PS3) 10CDanis: geoip VCL: init/free functions are now reusable [puppet] - 10https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496) [19:50:18] (03PS4) 10CDanis: geoip VCL: add a 'which' param to get_geo_xcip [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) [19:50:26] (03PS6) 10CDanis: VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) [19:50:37] pulled the first backport to 
mwdebug1001, testing… [19:51:07] (03CR) 10Krinkle: "Ref https://docs.github.com/en/github/setting-up-and-managing-organizations-and-teams/verifying-your-organizations-domain" [dns] - 10https://gerrit.wikimedia.org/r/661180 (owner: 10Krinkle) [19:51:29] looks good [19:51:35] I’ll wait for the second backport to merge before syncing both [19:51:52] I think it’s fine if the first is deployed on its own, but I’ll feel better leaving a shorter gap before the second [19:52:12] (03CR) 10CDanis: [C: 03+2] geoip VCL: init/free functions are now reusable [puppet] - 10https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [19:52:14] (03CR) 10CDanis: [C: 03+2] geoip VCL: add a 'which' param to get_geo_xcip [puppet] - 10https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [19:52:35] (it’ll still be two scaps, I just want to do them more quickly after another) [19:52:46] (03CR) 10Legoktm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: 10Legoktm) [19:52:50] (03CR) 10Dzahn: "This looks good, you don't want to give them .conf.erb file names though?" 
[puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [19:52:52] !log ❌cdanis@cumin1001.eqiad.wmnet ~ 🕒☕ sudo cumin A:cp 'disable-puppet "cdanis deploying I7003b7b6 and Idd0e124f5 T263496"' [19:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:57] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [19:54:31] (03CR) 10Dzahn: [C: 03+1] mediawiki::prod_sites: move to dumb templates [puppet] - 10https://gerrit.wikimedia.org/r/659327 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [19:54:46] (03Merged) 10jenkins-bot: Pass $databaseName into WikiPageEntityDataLoader [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/661092 (https://phabricator.wikimedia.org/T273622) (owner: 10Addshore) [19:57:13] (03CR) 10Jforrester: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: 10Legoktm) [19:57:14] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/Wikibase/: Backport: [[gerrit:661091|Add wiki ID to WikiPageEntityDataLoader (T273622)]] (duration: 01m 25s) [19:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:19] T273622: Deprecation warning: Expected RevisionRecord to belong to ... 
- https://phabricator.wikimedia.org/T273622 [19:58:46] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/WikibaseMediaInfo/: Backport: [[gerrit:661092|Pass $databaseName into WikiPageEntityDataLoader (T273622)]] (duration: 01m 07s) [19:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:27] (03PS1) 10Andrew Bogott: OpenStack Nova: use /var/lib/nova as $home for nova [puppet] - 10https://gerrit.wikimedia.org/r/661181 (https://phabricator.wikimedia.org/T273421) [20:00:04] hashar and dancy: #bothumor I � Unicode. All rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210202T2000). [20:00:12] (03CR) 10Jforrester: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660979 (owner: 10Legoktm) [20:00:17] !log Morning backport window done [20:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:26] sorry, didn’t realize it was getting so close to the next window! [20:00:51] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Nova: use /var/lib/nova as $home for nova [puppet] - 10https://gerrit.wikimedia.org/r/661181 (https://phabricator.wikimedia.org/T273421) (owner: 10Andrew Bogott) [20:02:44] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Legoktm) >>! In T273312#6795794, @Daimona wrote: >>>!... 
[20:03:26] PROBLEM - nova-compute proc minimum on cloudvirt1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:31] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:32] PROBLEM - nova-compute proc minimum on cloudvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:36] PROBLEM - nova-compute proc minimum on cloudvirt1030 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:37] PROBLEM - nova-compute proc minimum on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:42] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:45] PROBLEM - nova-compute proc minimum on cloudvirt1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:42] RECOVERY - nova-compute proc minimum on cloudvirt1021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:48] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process 
with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:49] RECOVERY - nova-compute proc minimum on cloudvirt1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:52] RECOVERY - nova-compute proc minimum on cloudvirt1030 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:54] RECOVERY - nova-compute proc minimum on cloudvirt1024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:05:58] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:06:01] RECOVERY - nova-compute proc minimum on cloudvirt1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:45] (03CR) 10Dzahn: mediawiki: use a data structure to define all virtualhosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [20:12:18] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕒☕ sudo cumin A:cp 'enable-puppet "cdanis deploying I7003b7b6 and Idd0e124f5 T263496"' # test on cp2027 looks good, perhaps slightly-increased Varnish CPU consumption but hard to be sure [20:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:23] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [20:12:44] 10SRE, 10Analytics, 10Traffic: 
Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10hashar) From my connection something else is broken download a 2.13M [[ https://archiva.wikimedia.org/repository/releases/com/googlesource/gerrit/plugins/javame... [20:13:23] elukey: I have replied on the archiva task with some more tests [20:13:54] elukey: but it is nowhere near a high priority :] I just felt that something might be off with Archiva and we might want to check whether it can be improved [20:14:25] hashar: reading, thanks a lot, even if I think I have to revert the last nginx config, there is a corner case of the new config that makes the maven refinery build fail :( [20:14:39] elukey: be bold!!!!! :] [20:15:08] the curl command showing the starttransfer timing is interesting (900ms delay on the server side) [20:15:23] maybe that is not the case if hitting the archiva backend directly (instead of nginx) [20:15:34] but yeah do revert if that causes any issue [20:15:47] I'll see if I find a workaround :) [20:16:03] pretty sure disabling the buffering did improve things for me but I don't think I captured that on the task [20:16:23] anyway, it is not a blocker [20:16:43] but maybe that can make maven slightly faster as an outcome.
So might be worth digging into it [20:20:19] (03CR) 10Dzahn: kubernetes::deployment_server: add yaml to configure MediaWiki sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [20:21:44] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661184 [20:26:10] (03PS1) 10Elukey: Revert "archiva: add missing ; to nginx config" [puppet] - 10https://gerrit.wikimedia.org/r/661097 [20:26:27] (03PS1) 10Elukey: Revert "archiva::proxy: allow nginx to serve content from repositories" [puppet] - 10https://gerrit.wikimedia.org/r/661098 [20:26:42] (03CR) 10jerkins-bot: [V: 04-1] Revert "archiva::proxy: allow nginx to serve content from repositories" [puppet] - 10https://gerrit.wikimedia.org/r/661098 (owner: 10Elukey) [20:27:06] (03Abandoned) 10Elukey: Revert "archiva::proxy: allow nginx to serve content from repositories" [puppet] - 10https://gerrit.wikimedia.org/r/661098 (owner: 10Elukey) [20:27:14] (03CR) 10Elukey: [C: 03+2] Revert "archiva: add missing ; to nginx config" [puppet] - 10https://gerrit.wikimedia.org/r/661097 (owner: 10Elukey) [20:27:37] (03PS1) 10Elukey: Revert "archiva::proxy: allow nginx to serve content from repositories" [puppet] - 10https://gerrit.wikimedia.org/r/661099 [20:28:50] (03CR) 10Elukey: [C: 03+2] Revert "archiva::proxy: allow nginx to serve content from repositories" [puppet] - 10https://gerrit.wikimedia.org/r/661099 (owner: 10Elukey) [20:29:59] hashar: for sure! I'll keep commenting in the task [20:36:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:37:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:48:34] (03CR) 10Ottomata: [C: 03+1] burrow/check_kafka_consumer_lag.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [21:04:55] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) [21:05:31] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) [21:05:41] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) previously happened: https://gerrit.wikimedia.org/r/q/topic:%22cron-timer%22+(status:open%20OR%20status:merged) [21:05:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1336.eqiad.wmnet with reason: REIMAGE [21:05:52] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) p:05Triage→03Medium [21:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1335.eqiad.wmnet with reason: REIMAGE [21:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1336.eqiad.wmnet with reason: REIMAGE [21:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:40] (03PS1) 10Dzahn: debmonitor::client: 
replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) [21:09:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1335.eqiad.wmnet with reason: REIMAGE [21:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:51] (03CR) 10Bartosz Dziewoński: [C: 03+1] Make DiscussionTools' newtopictool available on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 (owner: 10Esanders) [21:20:10] (03PS1) 10RobH: adding sku 389-DSXN [software] - 10https://gerrit.wikimedia.org/r/661195 [21:20:53] (03CR) 10RobH: [C: 03+2] adding sku 389-DSXN [software] - 10https://gerrit.wikimedia.org/r/661195 (owner: 10RobH) [21:21:27] (03Merged) 10jenkins-bot: adding sku 389-DSXN [software] - 10https://gerrit.wikimedia.org/r/661195 (owner: 10RobH) [21:21:31] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Dzahn) Hi @Miriam is Aiko going to get a -ctr@wikimedia email address from ITS? For contractors we require an expiry_date and expiry_contact (assum... [21:30:16] 10SRE, 10serviceops, 10Performance-Team (Radar), 10Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (10Joe) >>! In T273312#6797450, @Legoktm wrote: > Not su... [21:30:26] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Miriam) Hi @Dzahn! Aiko is not getting a @wikimedia email address, unless needed. She will be working with us until June 30th, so if possible, she sh... [21:33:41] (03CR) 10CRusnov: [C: 03+2] "tested on alert1001, appears working." 
[puppet] - 10https://gerrit.wikimedia.org/r/655731 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [21:34:48] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Dzahn) @Miriam Alright, just wanted to say w often see contractors with special address with the -ctr@ suffix but as far as I can tell it's not a... [21:35:16] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [21:36:45] (03PS1) 10Dzahn: installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) [21:37:21] (03CR) 10jerkins-bot: [V: 04-1] installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:41:02] (03PS2) 10Dzahn: installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) [21:41:55] (03CR) 10CRusnov: [C: 03+2] check_graphite_freshness.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/655733 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [21:42:45] (03CR) 10jerkins-bot: [V: 04-1] installserver::proxy: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661198 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:45:08] (03PS1) 10Dzahn: logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) [21:45:35] (03CR) 10jerkins-bot: [V: 04-1] logging::mediawiki::udp2log: replace cron with timer [puppet] - 10https://gerrit.wikimedia.org/r/661200 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:46:05] (03PS1) 10Ahmon Dancy: temp changes while experimenting 
[mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661201 [21:47:31] (03CR) 10jerkins-bot: [V: 04-1] temp changes while experimenting [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661201 (owner: 10Ahmon Dancy) [21:49:19] (03CR) 10CRusnov: [C: 03+2] "tested, works." [puppet] - 10https://gerrit.wikimedia.org/r/655734 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [21:50:49] (03CR) 10CRusnov: [C: 03+2] modules/interface/files/interface-rps.py: Adapt for Python3 [puppet] - 10https://gerrit.wikimedia.org/r/652575 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [21:52:22] dancy: not sure if you know, but you can create a sandbox branch, see https://www.mediawiki.org/wiki/Gerrit/personal_sandbox [21:53:11] ooh, thanks legoktm! [21:53:38] :) [21:54:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1335.eqiad.wmnet'] ` an... [21:54:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1336.eqiad.wmnet'] ` an... 
[21:55:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1336.eqiad.wmnet [21:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1335.eqiad.wmnet [21:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:22] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01208 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:56:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1336.eqiad.wmnet [21:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1335.eqiad.wmnet [21:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:13] (03CR) 10Legoktm: "> If we re-sequence to run lossless compression first does that change anything?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660978 (https://phabricator.wikimedia.org/T273380) (owner: 10Legoktm) [21:57:22] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token - https://phabricator.wikimedia.org/T273681 (10dduvall) [21:58:54] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token - https://phabricator.wikimedia.org/T273681 (10Dzahn) a:03Dzahn [22:00:19] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token - https://phabricator.wikimedia.org/T273681 (10Dzahn) 21:56 <+icinga-wm> PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01208 ge 0... 
[22:03:36] (03CR) 10Dzahn: "this is causing puppet failures and alerts" [puppet] - 10https://gerrit.wikimedia.org/r/659437 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [22:06:02] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10K6ka) The bot has now been unblocked and is editing again. I will report here if the bot is seen blanking WP:CHUS again. [22:06:36] PROBLEM - Host mw1300 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:38] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token - https://phabricator.wikimedia.org/T273681 (10Dzahn) I fixed the first issue by adding the "kubernetes_config::token" to the releases role in the private... [22:09:01] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token - https://phabricator.wikimedia.org/T273681 (10Dzahn) ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not... [22:09:42] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10CDanis) a:03AikoChou Waiting on @AikoChou to complete prerequisites and also for @Ottomata to approve from Analytics. 
[22:09:56] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T273681 - added the missing token in private repo but there is another issue behind that:" [puppet] - 10https://gerrit.wikimedia.org/r/659437 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [22:12:07] 10SRE, 10MW-on-K8s, 10serviceops, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (10Dzahn) [22:12:26] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Ottomata) Approved from Analytics. Not sure what the requirement for 'manager approval' is for contractors, but perhaps in this case we should get @... [22:35:08] (03PS1) 10Bstorm: wikireplicas-proxy: add commented examples of depoolings for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/661206 (https://phabricator.wikimedia.org/T271476) [22:35:48] legoktm: I think the sandbox branches may only apply to mediawiki/* [22:36:09] hmm, I've never tried outside that [22:36:41] (03PS1) 10Ahmon Dancy: temp changes while experimenting [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661207 [22:37:00] (03Abandoned) 10Ahmon Dancy: temp changes while experimenting [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661201 (owner: 10Ahmon Dancy) [22:37:14] https://gerrit.wikimedia.org/r/admin/repos/All-Projects,access sandbox is implemented globally [22:37:25] hmmmmm [22:37:47] if I had to guess, ops/mw-config has some paranoia rule that overrides it [22:38:20] (03CR) 10jerkins-bot: [V: 04-1] temp changes while experimenting [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661207 (owner: 10Ahmon Dancy) [22:38:24] https://gerrit.wikimedia.org/r/admin/repos/operations/mediawiki-config,access don't see any exclusive stuff though 
[22:38:55] Indeed. I'll try a bit harder next time around to figure out what blocked my last attempt. [22:48:35] (03PS1) 10Razzi: presto: require partitions predicate [puppet] - 10https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) [22:48:38] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) Yes it turns out the bot had robust error handling for prod errors. It didn’t have robust handling for reused OAuth nonces. For some reason it got... [22:49:10] (03CR) 10jerkins-bot: [V: 04-1] presto: require partitions predicate [puppet] - 10https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [22:50:41] (03PS1) 10Dzahn: profile::ci::kubernetes_config: ensure /etc/kubernetes exists [puppet] - 10https://gerrit.wikimedia.org/r/661211 (https://phabricator.wikimedia.org/T273681) [22:52:07] (03PS2) 10Razzi: presto: require partitions predicate [puppet] - 10https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) [22:53:30] 10SRE, 10Traffic: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Reedy) >>! In T273003#6798157, @Cyberpower678 wrote: > Yes it turns out the bot had robust error handling for prod errors. It didn’t have robust handling for reuse... [22:55:09] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004834 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:01:12] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10leila) makes sense to me. Approved. And thank you! 
[23:05:35] (03CR) 10Dzahn: [C: 04-1] "this would work but break it on contint masters with" [puppet] - 10https://gerrit.wikimedia.org/r/661211 (https://phabricator.wikimedia.org/T273681) (owner: 10Dzahn) [23:06:56] 10SRE, 10ops-eqiad: unplug old zayo links at dmarc - https://phabricator.wikimedia.org/T273647 (10RobH) [23:15:22] (03PS2) 10Dzahn: profile::ci::kubernetes_config: ensure /etc/kubernetes exists [puppet] - 10https://gerrit.wikimedia.org/r/661211 (https://phabricator.wikimedia.org/T273681) [23:18:48] (03PS1) 10Dzahn: releases: add fake ci::kubernetes_config::token to make compiler work [labs/private] - 10https://gerrit.wikimedia.org/r/661215 (https://phabricator.wikimedia.org/T273681) [23:19:14] (03CR) 10Dzahn: [V: 03+2 C: 03+2] releases: add fake ci::kubernetes_config::token to make compiler work [labs/private] - 10https://gerrit.wikimedia.org/r/661215 (https://phabricator.wikimedia.org/T273681) (owner: 10Dzahn) [23:21:38] (03CR) 10Dzahn: [V: 03+1] "after https://gerrit.wikimedia.org/r/c/labs/private/+/661215 this can compile now" [puppet] - 10https://gerrit.wikimedia.org/r/661211 (https://phabricator.wikimedia.org/T273681) (owner: 10Dzahn) [23:21:58] (03CR) 10Dzahn: [V: 03+1 C: 03+2] profile::ci::kubernetes_config: ensure /etc/kubernetes exists [puppet] - 10https://gerrit.wikimedia.org/r/661211 (https://phabricator.wikimedia.org/T273681) (owner: 10Dzahn) [23:22:15] 10SRE, 10ops-eqiad: unplug old zayo links at dmarc - https://phabricator.wikimedia.org/T273647 (10RobH) [23:22:41] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/661211" [puppet] - 10https://gerrit.wikimedia.org/r/659437 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [23:25:53] marxarelli: fixed issue 1, then issue 2, but now there is issue 3 [23:26:06] the good part is puppet can at least finish the run [23:26:33] though with errors, but better than not at all [23:30:34] !log powercycling crashed m1300.eqiad.wmnet [23:30:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:07] RECOVERY - Host mw1300 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [23:33:11] (03CR) 10Jeena Huneidi: [C: 03+2] "adds ability to target apt releases when installing packages" [deployment-charts] - 10https://gerrit.wikimedia.org/r/661184 (owner: 10PipelineBot) [23:33:21] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (10Dzahn) ^ Mostly fixed the puppet runs on releases*... [23:33:57] 10SRE, 10ops-eqiad: unplug old zayo links at dmarc - https://phabricator.wikimedia.org/T273647 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr unplug both of the zayo links from the DMARC panel in the cage PP:0000:103234 ports 3/4 PP:0000:103234 ports 5/6 [23:34:21] mutante: c'est la vie. what's issue 3? [23:35:02] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team-TODO: Puppet failing on releases hosts due to missing profile::ci::kubernetes_config::token, dependency issue in kubeconfig.pp - https://phabricator.wikimedia.org/T273681 (10Dzahn) a:05Dzahn→03None At this point it is ge... [23:35:05] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661184 (owner: 10PipelineBot) [23:35:16] marxarelli: code expects an admin group that only applies on contint masters [23:35:47] grr. hmm. i thought i added that in a previous patch [23:35:57] marxarelli: well, at least: [23:36:02] Notice: Applied catalog in 62.79 seconds [23:36:14] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10wiki_willy) a:03Jclark-ctr [23:36:20] Notice: /Stage[main]/Helm/Systemd::Timer::Job[helm3-repo-update]/ ... 
etc [23:36:31] marxarelli: the helm stuff got pulled, timers created [23:36:54] config file exists in /etc/kubernetes [23:37:23] k [23:37:28] thanks for fixing! [23:37:35] i'll take a look at the admin group issue [23:37:42] ok, cool [23:39:32] marxarelli: I am wondering if the stuff in profile::kubernetes::deployment_server is needed on releases* [23:39:45] that is the class that contint masters have that created /etc/kubernetes there [23:40:10] but it also creates more configs [23:40:26] yeah i looked at that but thought it was overkill for what we need on releases [23:41:00] ok, good. I also opted for just doing the minimal thing at first.. create that dir..then see what else [23:41:08] do the contints apply that? [23:41:43] modules/role/manifests/ci/master.pp: include ::profile::kubernetes::deployment_server [23:41:46] yea [23:42:12] found that when I wanted to know where the missing dir came from on contint [23:43:17] so my first attempt to fix it with a simple file{} would have fixed it on releases* as well.. but broken it on contint [23:43:30] because then that would have a duplicate definition of the same thing [23:43:46] right [23:43:47] then i used the "ensure_resource" trick to avoid that [23:43:56] that is "only if needed" basically [23:46:31] well, puppet does run, nothing broken, contint was noop, i think it is slightly lower priority now [23:47:32] i think we can avoid adding contint-admins to releases by parameterizing the k8s::kubeconfig group in profile::ci::kubernetes_config [23:47:44] the token in the private repo was copied from CI [23:47:52] it would be great to avoid that really. we want releases-jenkins to be as locked down as possible [23:48:14] that sounds good if that works [23:48:53] I was thinking you'd have to change which group it expects, contint-admins or releasers-mediawiki, based on role.. 
at first [23:50:24] i think i'll add parameters for config owner and group, and then add entries to the hiera data for roles/common/ci/master.yaml and roles/common/releases.yaml [23:50:27] does that sound right? [23:51:25] yea, it does [23:52:15] role/common/releases.yaml is also the place I used in the private hiera to put the token [23:52:33] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [23:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:39] right on. makes sense to me [23:53:44] !log mw1300 - scap pull (it crashed earlier but is back after powercycling) [23:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:51] (03PS1) 10Dduvall: releases: Parameterize profile::ci::kubernetes_config owner/group [puppet] - 10https://gerrit.wikimedia.org/r/661224 (https://phabricator.wikimedia.org/T273681)
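
Editor's note (not part of the channel log): the "ensure_resource" trick mentioned at 23:43 can be sketched roughly as follows. This is a toy Python model, not Puppet code and not the actual patch; the Catalog class, DuplicateDeclaration, and build_catalog are hypothetical names made up for illustration. It assumes only what the conversation states: a plain file{} declaration fails when another class has already declared the same resource, while ensure_resource declares it "only if needed". The model simplifies by ignoring the case where an existing resource has different parameters.

```python
class DuplicateDeclaration(Exception):
    """Raised when the same resource is declared twice (toy model)."""


class Catalog:
    """Hypothetical stand-in for a compiled catalog, for illustration only."""

    def __init__(self):
        self.resources = {}

    def declare(self, rtype, title, **params):
        # Models a plain `file { ... }` declaration: a second
        # declaration of the same type and title is an error.
        key = (rtype, title)
        if key in self.resources:
            raise DuplicateDeclaration(f"{rtype}[{title}] is already declared")
        self.resources[key] = params

    def ensure_resource(self, rtype, title, **params):
        # Models the "only if needed" behaviour: declare the resource
        # unless something else has already declared it.
        key = (rtype, title)
        if key not in self.resources:
            self.resources[key] = params
        return self.resources[key]


def build_catalog(include_deployment_server):
    """Build a toy catalog for a host, with or without the class that
    already creates /etc/kubernetes (as on the contint masters)."""
    catalog = Catalog()
    if include_deployment_server:
        catalog.declare("file", "/etc/kubernetes", ensure="directory")
    # Safe on both kinds of host: a no-op where the directory is
    # already declared, a fresh declaration where it is not.
    catalog.ensure_resource("file", "/etc/kubernetes", ensure="directory")
    return catalog


contint = build_catalog(include_deployment_server=True)
releases = build_catalog(include_deployment_server=False)
```

In this sketch both catalogs end up with exactly one /etc/kubernetes directory resource, whereas swapping the ensure_resource call for a second declare would raise on the contint-style host, which mirrors the duplicate-definition problem described in the log.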