[00:00:01] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1004 is CRITICAL: 5.768e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [00:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210205T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:00:10] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1008 is CRITICAL: 5.585e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [00:02:26] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10EBernhardson) The logical conclusion seems to be that these errors are coming from something other than EDAC (ECC). Poking around at suspicious... [00:16:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1278.eqiad.wmnet'] ` an... 
[00:21:42] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@9858513]: transfer_to_es: Wait for link reco, and write to weighted_tags as well [00:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:26] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@9858513]: transfer_to_es: Wait for link reco, and write to weighted_tags as well (duration: 02m 43s) [00:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:17] (03PS1) 10Cwhite: profile: temporarily extend w3creportingapi retention period [puppet] - 10https://gerrit.wikimedia.org/r/661823 [00:34:25] (03CR) 10Cwhite: [C: 03+2] profile: temporarily extend w3creportingapi retention period [puppet] - 10https://gerrit.wikimedia.org/r/661823 (owner: 10Cwhite) [00:35:30] !log enabled remote IPMI access on mw1349.mgmt.eqiad.wmnet and mw1380.mgmt.eqiad.wmnet [00:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:16] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1278.eqiad.wmnet [00:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:53] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@85713c1]: restore data range specifier in extract job partition spec [00:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:05] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@85713c1]: restore data range specifier in extract job partition spec (duration: 01m 12s) [01:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:48] PROBLEM - Device not healthy -SMART- on mw1278 is CRITICAL: cluster=api_appserver device={sdc,sdd,sde,sdf,sdg,sdh} instance=mw1278 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1278&var-datasource=eqiad+prometheus/ops [01:43:24] (03PS1) 10Wugapodes: Revert "Change EnWiki logo for Wikipedia 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661656 [01:43:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "Change EnWiki logo for Wikipedia 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661656 (owner: 10Wugapodes) [01:53:56] (03CR) 10Krinkle: "LGTM. I'm not sure whether we want/need both mid-long term so rather than renaming 'excimer' we might also later promote 'real' to 'excime" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: 10Ori.livneh) [02:03:18] !log krinkle@mwmaint1002 Prune globalimagelinks references on s4 database for the deleted ukwikimedia wiki, ref T218170. [02:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:23] T218170: Finish removal of ukwikimedia wiki - https://phabricator.wikimedia.org/T218170 [02:15:26] (03CR) 10Wugapodes: [C: 04-1] "I'm working on the rebase" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661656 (owner: 10Wugapodes) [02:19:04] (03CR) 10Aaron Schulz: [C: 03+2] Enable "coalesceKeys" for global keys for WANCache (III) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658372 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [02:20:14] (03Merged) 10jenkins-bot: Enable "coalesceKeys" for global keys for WANCache (III) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658372 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [02:44:47] (03PS1) 10Aaron Schulz: rdbms: fix bogus read-only mode bug in LoadBalancer [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/661832 (https://phabricator.wikimedia.org/T252564) [02:47:15] (03CR) 10Krinkle: [C: 03+2] rdbms: fix bogus read-only mode bug in LoadBalancer [core] (wmf/1.36.0-wmf.27) - 
10https://gerrit.wikimedia.org/r/661832 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [03:07:05] RECOVERY - Device not healthy -SMART- on mw1278 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1278&var-datasource=eqiad+prometheus/ops [03:13:40] (03CR) 10DannyS712: "already done at T272108?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661656 (owner: 10Wugapodes) [03:14:14] (03PS1) 10Wugapodes: logos: Revert enwiki logo to standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661833 [03:14:56] (03Abandoned) 10Wugapodes: Revert "Change EnWiki logo for Wikipedia 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661656 (owner: 10Wugapodes) [03:15:49] (03Abandoned) 10Wugapodes: logos: Revert enwiki logo to standard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661833 (owner: 10Wugapodes) [03:16:39] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:27] (03Merged) 10jenkins-bot: rdbms: fix bogus read-only mode bug in LoadBalancer [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/661832 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [03:34:02] !log aaron@deploy1001 Synchronized php-1.36.0-wmf.27/includes/libs/rdbms: 4b386661a9820a002b43bfcef3e18241ea883870 (duration: 01m 12s) [03:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:05] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 7.110 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:42:35] !log aaron@deploy1001 Synchronized wmf-config/mc.php: af5b0effb5e88ac4ca4a06c2c409d303ec405305 (duration: 01m 06s) [03:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:46:23] 
PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:52:16] (03PS4) 10Hamish: Allow sysop to add/remove transwiki for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660795 (https://phabricator.wikimedia.org/T273405) [03:57:06] (03CR) 10Ori.livneh: "> Patch Set 3:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: 10Ori.livneh) [03:57:21] (03PS4) 10Ori.livneh: wall-clock excimer profiling for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) [03:59:07] (03CR) 10jerkins-bot: [V: 04-1] wall-clock excimer profiling for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: 10Ori.livneh) [04:00:57] (03PS5) 10Ori.livneh: wall-clock excimer profiling for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) [04:06:48] (03CR) 10Krinkle: "@Dave we just did a doubling for buster, are we good for another doubling?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: 10Ori.livneh) [04:33:41] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [04:39:55] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [04:46:21] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1008 is OK: (C)5e+06 ge (W)1e+06 ge 7.494e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [04:54:33] (03PS1) 10Aaron Schulz: Revert "SqlBlobStore HOT FIX: remove caching from getBlobBatch" [core] (wmf/1.35.0-wmf.2) - 10https://gerrit.wikimedia.org/r/661659 [04:55:02] (03CR) 10jerkins-bot: [V: 04-1] Revert "SqlBlobStore HOT FIX: remove caching from getBlobBatch" [core] (wmf/1.35.0-wmf.2) - 10https://gerrit.wikimedia.org/r/661659 (owner: 10Aaron Schulz) [05:05:12] (03Abandoned) 10Aaron Schulz: Revert "SqlBlobStore HOT FIX: remove caching from getBlobBatch" [core] (wmf/1.35.0-wmf.2) - 10https://gerrit.wikimedia.org/r/661659 (owner: 10Aaron Schulz) [05:08:05] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1001 is OK: (C)5e+06 ge (W)1e+06 ge 8.055e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [05:21:03] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1004 is OK: (C)5e+06 ge (W)1e+06 ge 7.306e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [06:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 T258361', diff saved to https://phabricator.wikimedia.org/P14212 and previous config saved to /var/cache/conftool/dbconfig/20210205-063554-marostegui.json [06:35:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:59] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:36:31] !log Stop MySQL on db1075 to clone db1157 T258361 [06:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:09] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) 05Open→03Resolved [07:28:30] !log oblivian@cumin1001 START - Cookbook sre.network.cf [07:28:30] !log oblivian@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [07:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:48] (03PS1) 10Marostegui: mariadb: Productionize db1157 [puppet] - 10https://gerrit.wikimedia.org/r/661836 (https://phabricator.wikimedia.org/T258361) [07:37:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1157 [puppet] - 10https://gerrit.wikimedia.org/r/661836 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [07:44:25] PROBLEM - SSH on mw2249.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:47:10] !log depooling wdqs1013 and restarting blazegraph [07:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:49] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:49:39] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.077 second response time 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:51:03] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:55:06] !log cleanup of left over ttl dumps on wdqs1009 and wdqs1010 [07:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210205T0800) [08:16:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [08:21:20] 10SRE: Modernise memcached systemd unit / sync to current buster setup - https://phabricator.wikimedia.org/T273950 (10MoritzMuehlenhoff) [08:21:31] (03CR) 10Muehlenhoff: [C: 03+1] "Opened https://phabricator.wikimedia.org/T273950 for this" [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [08:22:26] 10SRE, 10serviceops, 10User-jijiki: Modernise memcached systemd unit / sync to current buster setup - https://phabricator.wikimedia.org/T273950 (10jijiki) p:05Triage→03Medium [08:29:05] !log reloading categories from scratch on wdqs1009 [08:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:50] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:32:30] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:42:56] 10SRE, 10serviceops: Kubernetes hosts raid check make facter fail - https://phabricator.wikimedia.org/T237197 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I'm going to close this as we don't have this particular problem anymore (AFAICT) with mdadm checks being spread out across the first week of each month. [08:43:00] 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10JMeybohm) [08:46:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1094 T273710', diff saved to https://phabricator.wikimedia.org/P14214 and previous config saved to /var/cache/conftool/dbconfig/20210205-084625-marostegui.json [08:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:31] T273710: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 [08:50:50] (03PS1) 10Marostegui: db1094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/661890 (https://phabricator.wikimedia.org/T273710) [08:51:52] (03CR) 10Marostegui: [C: 03+2] db1094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/661890 (https://phabricator.wikimedia.org/T273710) (owner: 10Marostegui) [08:55:19] (03CR) 10JMeybohm: [C: 04-1] Calculator Service second try (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/661770 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek) [09:00:42] (03CR) 10DCausse: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [09:14:03] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:14:28] 10SRE, 10DBA, 
10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:18:42] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) [09:33:10] (03PS7) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [09:38:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 5%: Slowly pooling db1075 after cloning db1157', diff saved to https://phabricator.wikimedia.org/P14217 and previous config saved to /var/cache/conftool/dbconfig/20210205-093827-root.json [09:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:05] !log reloading categories from scratch on wdqs1010 [09:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:59] (03CR) 10Gehel: [C: 04-1] "Looks almost good, one last inline comment. Feel free to merge once this is resolved." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) (owner: 10Ryan Kemper) [09:53:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 10%: Slowly pooling db1075 after cloning db1157', diff saved to https://phabricator.wikimedia.org/P14218 and previous config saved to /var/cache/conftool/dbconfig/20210205-095331-root.json [09:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:52] 10SRE: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10Aklapper) [10:06:47] !log repooling wdqs1013 - caught up on lag [10:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:59] ryankemper: ^ all good for wdqs1013 [10:08:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 25%: Slowly pooling db1075 after cloning db1157', diff saved to https://phabricator.wikimedia.org/P14219 and previous config saved to /var/cache/conftool/dbconfig/20210205-100834-root.json [10:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:18] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1157 is now replicating, I will start pooling it on Monday. 
[10:23:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 50%: Slowly pooling db1075 after cloning db1157', diff saved to https://phabricator.wikimedia.org/P14220 and previous config saved to /var/cache/conftool/dbconfig/20210205-102338-root.json [10:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:27:15] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [10:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:00] sigh :) [10:32:13] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [10:32:14] (03Abandoned) 10Hashar: Add jsonevent-layout to lib, managed by maven [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/659289 (https://phabricator.wikimedia.org/T268020) (owner: 10Hashar) [10:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:18] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [10:33:42] (03Abandoned) 10Hashar: Merge tag 'v2.15.14' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/595228 (owner: 10Thcipriani) [10:33:44] (03Abandoned) 10Hashar: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/593836 (owner: 10Paladox) [10:33:46] (03Abandoned) 10Hashar: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/549901 (owner: 10Paladox) [10:33:49] (03Abandoned) 10Hashar: Add websession-flatfile plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/543940 (https://phabricator.wikimedia.org/T222472) (owner: 10Paladox) [10:33:52] (03Abandoned) 10Hashar: 
Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/518447 (owner: 10Paladox) [10:33:55] (03Abandoned) 10Hashar: Update image-diff plugin [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/498427 (owner: 10Paladox) [10:33:58] (03Abandoned) 10Hashar: Add "multi-site" plugin so gerrit can have multi masters [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494865 (owner: 10Paladox) [10:36:54] (03PS15) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [10:37:14] (03PS4) 10Jbond: CAS style changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 [10:38:19] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27881/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [10:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 75%: Slowly pooling db1075 after cloning db1157', diff saved to https://phabricator.wikimedia.org/P14221 and previous config saved to /var/cache/conftool/dbconfig/20210205-103841-root.json [10:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:50] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/27881/mw1261.eqiad.wmnet/index.html shows no actual configuration differences. 
I think I'" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [10:42:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [10:43:31] (03CR) 10Jbond: "updated" (033 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [10:44:27] (03Merged) 10jenkins-bot: Add an check for numeric USER instruction in Dockerfile [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660784 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [10:45:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add the 'uid' template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: 10Giuseppe Lavagetto) [10:47:32] RECOVERY - SSH on mw2249.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:47:47] (03Merged) 10jenkins-bot: Add the 'uid' template helper [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660851 (https://phabricator.wikimedia.org/T228967) (owner: 10Giuseppe Lavagetto) [10:47:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Remove the build image functionality [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660852 (owner: 10Giuseppe Lavagetto) [10:48:14] (03CR) 10Filippo Giunchedi: [C: 03+1] scap::target: drop array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661780 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:48:52] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: update profiles to drop array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661785 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:50:31] (03Merged) 10jenkins-bot: Remove the build image 
functionality [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/660852 (owner: 10Giuseppe Lavagetto) [10:53:10] PROBLEM - NFS Share Volume Space /srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1259132 MB (15% inode=80%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [10:53:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 100%: Slowly pooling db1075 after cloning db1157', diff saved to https://phabricator.wikimedia.org/P14222 and previous config saved to /var/cache/conftool/dbconfig/20210205-105345-root.json [10:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [10:54:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] scap::target: drop array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661780 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:56:05] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus: update profiles to drop array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661785 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:56:25] (03Merged) 10jenkins-bot: Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [10:56:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1002.eqiad.wmnet [10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:47] (03CR) 10Jbond: [C: 03+2] wmflib: drop conflicts method [puppet] - 10https://gerrit.wikimedia.org/r/661794 
(https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:59:53] (03CR) 10Jbond: [C: 03+2] wmflib: drop conftool funtion [puppet] - 10https://gerrit.wikimedia.org/r/661795 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [11:01:00] <_joe_> jbond42: I know it wasn't used... but I think it wasn't strictly a bad idea :) [11:01:09] jouncebot: which one? [11:01:15] _joe_: ? [11:01:23] <_joe_> actually I'm surprised we don't have that in the language [11:01:28] <_joe_> the conflicts function [11:02:31] I'm in two minds as to whether it's useful enough to warrant its own function vs just doing if defined(foo) { fail() } [11:02:53] <_joe_> oh sure it's just syntactic sugar [11:02:59] my gut feeling is that this was much more useful in the earlier days of puppet when there were a lot of cross-competing modules or where there was a lot more embedding of code [11:03:04] <_joe_> but it's nice for people used to the debian Conflicts: model [11:03:12] <_joe_> jbond42: that too yes [11:03:31] <_joe_> jbond42: I'm not contesting removing it. It's unused, it should go [11:03:36] <_joe_> we can revive it if we need [11:03:37] anyway as you have shown an interest I'll add it back as a puppet function [11:03:41] :D [11:03:48] <_joe_> ahah no wait [11:03:51] <_joe_> :P [11:03:58] <_joe_> when I need it, I'll do so [11:04:08] ack sounds good :) [11:06:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir1002.eqiad.wmnet [11:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:07] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Marostegui) Thanks @jbond. However, I still get logged out from places like tendril, several times per day - today I think I got logged out twice already :-( Icinga though seems to be working b... 
[11:08:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir1001.eqiad.wmnet [11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:19] (03PS8) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) [11:11:02] (03PS1) 10Muehlenhoff: Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 [11:12:34] (03CR) 10jerkins-bot: [V: 04-1] Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 (owner: 10Muehlenhoff) [11:14:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir1001.eqiad.wmnet [11:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3002.esams.wmnet [11:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:28] (03CR) 10jerkins-bot: [V: 04-1] remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:19:23] (03PS9) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) [11:21:08] 10SRE, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) I had another pass at black and also a long chat with @dcaro about it (thanks for resurfacing this task). #### TL;DR We've decided to test... 
[11:21:42] (03PS1) 10JMeybohm: Bump version to v3.0.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661903 [11:22:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir3002.esams.wmnet [11:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Bump version to v3.0.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661903 (owner: 10JMeybohm) [11:23:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3001.esams.wmnet [11:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:58] (03PS2) 10Muehlenhoff: Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 [11:27:04] (03CR) 10jerkins-bot: [V: 04-1] remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:27:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir3001.esams.wmnet [11:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:55] (03PS1) 10Elukey: turnilo: add two new fields to webrequest_128 [puppet] - 10https://gerrit.wikimedia.org/r/661904 [11:28:34] (03CR) 10jerkins-bot: [V: 04-1] Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 (owner: 10Muehlenhoff) [11:29:35] !log restart acme-chief instances to catch up on kernel upgrades [11:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [11:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:28] (03CR) 10Elukey: [C: 03+2] 
"Tested in staging, it was requested by SRE for some ongoing work so if naming is not ok I'll amend it, going to merge to unblock people wa" [puppet] - 10https://gerrit.wikimedia.org/r/661904 (owner: 10Elukey) [11:34:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [11:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] (03PS1) 10JMeybohm: Release v3.0.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/661905 [11:37:17] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release v3.0.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/661905 (owner: 10JMeybohm) [11:38:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [11:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:33] (03Abandoned) 10Aklapper: Create a FeaturedFeed for the News on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773) (owner: 10Aklapper) [11:39:06] !log jayme@deploy1001 Started deploy [docker-pkg/deploy@7257244]: (no justification provided) [11:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:25] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [11:39:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [11:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:27] (03Abandoned) 10Aklapper: Phab: Use our custom Priority field value in tooltip on Reports page [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) (owner: 10Aklapper) [11:41:03] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 
(10jbond) >>! In T273867#6806145, @Marostegui wrote: > Thanks @jbond. However, I still get logged out from places like tendril, several times per day - today I think I got logged out twice already... [11:41:24] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, and 4 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Daimona) [11:42:55] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10Marostegui) mmm, maybe it is a different browser. I will check! Thanks [11:44:11] (03PS2) 10Aklapper: Phabricator: Uninstall Conpherence application also in default settings [puppet] - 10https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) [11:44:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [11:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:56] !log jayme@deploy1001 Finished deploy [docker-pkg/deploy@7257244]: (no justification provided) (duration: 05m 50s) [11:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:09] (03PS3) 10Muehlenhoff: Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 [11:49:41] (03CR) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [11:50:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet [11:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:47] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@7257244]: (no justification provided) [11:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:12] !log 
oblivian@deploy1001 Finished deploy [docker-pkg/deploy@7257244]: (no justification provided) (duration: 03m 25s) [11:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:15] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@7257244]: (no justification provided) [11:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:19] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@7257244]: (no justification provided) (duration: 04m 04s) [11:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:22] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@7257244]: (no justification provided) [11:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Marostegui) [11:59:59] 10SRE, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10jbond) Thanks for progressing this just wanted to make one note > Investigate if there is a way to customize the Clone with commit-msg hook comman... 
[12:00:22] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@7257244]: (no justification provided) (duration: 01m 00s) [12:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:28] (03PS1) 10Marostegui: site.pp: Add 9 new databases as insetup [puppet] - 10https://gerrit.wikimedia.org/r/661907 (https://phabricator.wikimedia.org/T273566) [12:04:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet [12:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:25] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [12:04:51] (03CR) 10Marostegui: [C: 03+2] site.pp: Add 9 new databases as insetup [puppet] - 10https://gerrit.wikimedia.org/r/661907 (https://phabricator.wikimedia.org/T273566) (owner: 10Marostegui) [12:05:25] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10Marostegui) These hosts have been added to puppet with: `insetup` role and also assigned a `partman` recipe for the installation. The only puppet change ne... 
[12:05:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host parse2001.codfw.wmnet [12:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [12:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host people1002.eqiad.wmnet [12:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [12:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1002.eqiad.wmnet [12:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host parse2001.codfw.wmnet [12:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ldap-corp2001.wikimedia.org [12:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:38] (03CR) 10David Caro: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [12:18:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp2001.wikimedia.org [12:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:41] (03PS3) 10David Caro: style: this introduces black as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 
(https://phabricator.wikimedia.org/T211750) [12:29:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host releases2002.codfw.wmnet [12:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host releases2002.codfw.wmnet [12:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:12] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host releases1002.eqiad.wmnet [12:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host releases1002.eqiad.wmnet [12:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:04] 10SRE, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10dcaro) This addresses the first stage of that plan. 
https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/659785 [12:42:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ldap-corp1001.wikimedia.org [12:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host testreduce1001.eqiad.wmnet [12:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1001.eqiad.wmnet [12:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-corp1001.wikimedia.org [12:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:18] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netflow5001.eqsin.wmnet [12:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:20] !log reset ifup on netflow5001 T273026 [12:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:24] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [12:58:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5001.eqsin.wmnet [12:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netflow4001.ulsfo.wmnet [12:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4001.ulsfo.wmnet [13:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:40] (03PS1) 10Jbond: wmflib:: function to replace 
get_clusters in puppet DSL [puppet] - 10https://gerrit.wikimedia.org/r/661909 [13:07:51] (03Abandoned) 10Hashar: Update translations [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/647053 (https://phabricator.wikimedia.org/T269339) (owner: 10Hashar) [13:08:15] (03PS2) 10Jbond: wmflib:: function to replace get_clusters in puppet DSL [puppet] - 10https://gerrit.wikimedia.org/r/661909 (https://phabricator.wikimedia.org/T273743) [13:08:56] (03Abandoned) 10Jbond: wmflib:: function to replace get_clusters in puppet DSL [puppet] - 10https://gerrit.wikimedia.org/r/661909 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [13:09:34] (03Abandoned) 10Hashar: docker:reporter: drop old images filters [puppet] - 10https://gerrit.wikimedia.org/r/624095 (owner: 10Hashar) [13:09:36] (03Abandoned) 10Hashar: docker:reporter: do include latest images for releng/* [puppet] - 10https://gerrit.wikimedia.org/r/624096 (https://phabricator.wikimedia.org/T261207) (owner: 10Hashar) [13:10:25] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [13:11:00] (03Abandoned) 10Hashar: gerrit: use proper hostname on replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [13:11:16] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) I had a look at porting get_clusters to the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/661909 | puppet DSL]] but i think its better to kee... 
[13:11:57] (03Abandoned) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 (https://phabricator.wikimedia.org/T259611) (owner: 10Hashar) [13:12:06] (03Abandoned) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 (owner: 10Hashar) [13:12:08] (03Abandoned) 10Hashar: .gitignore docker-pkg-build.log [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619759 (owner: 10Hashar) [13:12:40] 10SRE, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services), 10Release-Engineering-Team-TODO: The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10hashar) 05Open→... [13:12:50] (03Abandoned) 10Hashar: python-build: do not archive previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619779 (owner: 10Hashar) [13:16:24] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [13:20:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netflow3001.esams.wmnet [13:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3001.esams.wmnet [13:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netflow2001.codfw.wmnet [13:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:33] (03CR) 10Jbond: "just noticed i also forgot to push the types for this, can always add back if pasting here for posterity" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/661909 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [13:25:33] (03PS1) 10Jbond: wmflib: drop hash(de)select_re functions as puppet has filter now [puppet] - 10https://gerrit.wikimedia.org/r/661910 (https://phabricator.wikimedia.org/T273743) [13:26:41] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [13:28:17] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2001.codfw.wmnet [13:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host netflow1001.eqiad.wmnet [13:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1001.eqiad.wmnet [13:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [13:31:48] (03CR) 10Jbond: [C: 03+2] wmflib: drop hash(de)select_re functions as puppet has filter now [puppet] - 10https://gerrit.wikimedia.org/r/661910 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [13:33:27] (03PS4) 10Filippo Giunchedi: logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) [13:39:19] (03CR) 10Jbond: [C: 03+1] "LGTM, few minor nits feel free to ignore" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661900 (owner: 10Muehlenhoff) [13:48:39] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10fgiunchedi) Limiting the memory of `rsync` (receive side) 
and `swift-object-replicator` (sender side) has helped quite a bit in bounding the read/write latency experience... [13:50:30] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [13:56:02] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [13:59:54] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [14:00:21] (03PS1) 10Giuseppe Lavagetto: Remove the pip upgrade from python-build-buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/661915 [14:00:27] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) [14:01:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) >>! In T266365#6802067, @Cmjohnson wrote: > @Jgreen Do you have an IP identified for these? @Cmjohnson I added the IPs to the task description. 
[14:02:04] (03CR) 10JMeybohm: [C: 03+1] Remove the pip upgrade from python-build-buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/661915 (owner: 10Giuseppe Lavagetto) [14:04:45] (03CR) 10Muehlenhoff: Add a new profile to install OpenLDAP client tools in production (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661900 (owner: 10Muehlenhoff) [14:04:59] (03PS4) 10Muehlenhoff: Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 [14:07:36] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove the pip upgrade from python-build-buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/661915 (owner: 10Giuseppe Lavagetto) [14:16:32] (03PS1) 10Ladsgroup: ldap: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/661916 (https://phabricator.wikimedia.org/T209953) [14:18:42] (03PS2) 10Wolfgang Kandek: Calculator Service second try [deployment-charts] - 10https://gerrit.wikimedia.org/r/661770 (https://phabricator.wikimedia.org/T273151) [14:19:34] (03PS1) 10Jbond: base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 [14:20:05] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [14:20:16] (03CR) 10Jbond: "PCC (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27882" [puppet] - 10https://gerrit.wikimedia.org/r/661917 (owner: 10Jbond) [14:21:55] (03CR) 10jerkins-bot: [V: 04-1] base::service_unit: drop support for sysV and upstart init scripts [puppet] - 10https://gerrit.wikimedia.org/r/661917 (owner: 10Jbond) [14:24:03] (03PS1) 10Giuseppe Lavagetto: Actually use builder methods in the cli build step [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661919 [14:24:39] i have a silly question for an SRE. 
if i wanted to implement a feature for Wikimedia wikis that would depend on a new DB table like pagelinks (similar structure and size), would that be a big deal? [14:26:06] (03CR) 10Gehel: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [14:27:31] (03CR) 10Ladsgroup: "So only Idba8a9e87222efeceeb7edec0816ec2b29262bdc left in puppet. There are some left in hiera files (which is another beast I'm not going" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/659789 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [14:27:33] (03CR) 10JMeybohm: [C: 03+1] Actually use builder methods in the cli build step [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661919 (owner: 10Giuseppe Lavagetto) [14:30:47] (03PS1) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 [14:31:41] (03PS2) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [14:32:14] (03PS1) 10Muehlenhoff: Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 [14:34:19] (03CR) 10jerkins-bot: [V: 04-1] Switch profile::openldap::management to use profile::openldap::client [puppet] - 10https://gerrit.wikimedia.org/r/661922 (owner: 10Muehlenhoff) [14:35:42] (03CR) 10Ottomata: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [14:37:03] (03CR) 10Ottomata: "We should p" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [14:39:04] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: 
Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [14:41:34] (03PS2) 10Giuseppe Lavagetto: Actually use builder methods in the cli build step [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661919 [14:41:56] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Actually use builder methods in the cli build step [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661919 (owner: 10Giuseppe Lavagetto) [14:42:10] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [14:44:02] (03PS1) 10Muehlenhoff: Remove access for aezell [puppet] - 10https://gerrit.wikimedia.org/r/661924 [14:44:28] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@7257244]: (no justification provided) [14:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:54] (03PS1) 10Kormat: tox/unit: Allow unit tests to be indepdenent of env vars [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661925 [14:45:10] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for aezell [puppet] - 10https://gerrit.wikimedia.org/r/661924 (owner: 10Muehlenhoff) [14:45:54] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@7257244]: (no justification provided) (duration: 01m 26s) [14:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:34] (03PS2) 10Kormat: tox/unit: Allow unit tests to be indepdenent of env vars [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661925 [14:48:20] (03CR) 10Jbond: [C: 03+1] Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 (owner: 10Muehlenhoff) [14:49:37] (03PS2) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 [14:49:40] (03PS6) 10David 
Caro: DONOTMERGE wmcs: Move to class-based cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/658631 [14:51:46] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/661770 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek) [14:52:50] (03CR) 10jerkins-bot: [V: 04-1] DONOTMERGE wmcs: Move to class-based cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/658631 (owner: 10David Caro) [14:54:43] (03PS1) 10Hashar: Dummy build for stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/661946 [14:57:38] (03CR) 10Muehlenhoff: CAS style changes (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [14:59:25] (03CR) 10Wolfgang Kandek: [C: 03+2] Calculator Service second try (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/661770 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek) [14:59:53] (03PS4) 10David Caro: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) [15:01:06] (03CR) 10David Caro: tests: Improve the mocking of logs (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659295 (owner: 10David Caro) [15:01:13] (03Merged) 10jenkins-bot: Calculator Service second try [deployment-charts] - 10https://gerrit.wikimedia.org/r/661770 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek) [15:08:05] (03Abandoned) 10Hashar: Dummy build for stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/661946 (owner: 10Hashar) [15:08:37] 10SRE, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10dcaro) A couple more things that we discussed and agreed: * Use isort for import autoformatting (probably disable pylint checks for them). 
* Disabl... [15:09:09] (03CR) 10jerkins-bot: [V: 04-1] style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [15:18:47] (03PS1) 10DCausse: [wdqs] consume the new updater stream from kafka-main [puppet] - 10https://gerrit.wikimedia.org/r/661947 [15:19:57] (03CR) 10DCausse: "@elukey when this is merged I think you can remove wdqs1009 from the machines allowed to contact jumbo" [puppet] - 10https://gerrit.wikimedia.org/r/661947 (owner: 10DCausse) [15:20:29] (03PS5) 10David Caro: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) [15:23:58] (03CR) 10Ottomata: [C: 03+2] [wdqs] consume the new updater stream from kafka-main [puppet] - 10https://gerrit.wikimedia.org/r/661947 (owner: 10DCausse) [15:26:46] (03PS1) 10Giuseppe Lavagetto: Release v3.0.1 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/661950 [15:27:35] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release v3.0.1 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/661950 (owner: 10Giuseppe Lavagetto) [15:28:04] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@6b74e78]: (no justification provided) [15:28:06] 10SRE: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10akosiaris) I think it's not on the first week of each month but on every week. The syntax is ` 57 5 * * <%= @dow %> root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray... 
[15:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:30] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@6b74e78]: (no justification provided) (duration: 00m 26s) [15:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:23] !log replacing optics and fiber on pfw3a-eqiad:xe-0/0/17 and fasw-c1a-eqiad:xe-0/2/0 T271295 [15:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:28] T271295: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 [15:33:31] (03PS2) 10Alexandros Kosiaris: nutcracker: drop use of to_milliseconds function [puppet] - 10https://gerrit.wikimedia.org/r/661414 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [15:36:56] 10SRE, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10hashar) 05Open→03Resolved a:03Paladox AFAIK @Paladox has handled the whole migration. We might have some rewrite rule... [15:41:33] 10SRE: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10JMeybohm) >>! In T273953#6806765, @akosiaris wrote: > I think it's not on the first week of each month but on every week. The syntax is > > ` > 57 5 * * <%= @dow %> root if [ -x /usr/share/mdadm/checkarray ] && [ $(date... [15:41:53] 10SRE, 10Gerrit, 10Phabricator, 10Release-Engineering-Team, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10hashar) 05Open→03Declined Closing this since it is an old task. Either Pherrit got fixed or nobody uses it at all. If there is a nee... [15:47:32] 10SRE: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10akosiaris) >>! In T273953#6806797, @JMeybohm wrote: >>>! 
In T273953#6806765, @akosiaris wrote: >> I think it's not on the first week of each month but on every week. The syntax is >> >> ` >> 57 5 * * <%= @dow %> root if... [15:48:09] 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Cmjohnson) [15:48:58] 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Cmjohnson) [15:49:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) [15:49:05] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) [15:50:09] 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Marostegui) @Cmjohnson this looks goods to me. If possible, can you provide the new IP the host will have before the move so I can set it on the host before powering it off? Thanks! [15:50:21] 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Marostegui) p:05Triage→03Medium [15:51:46] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10jcrespo) @WMDE-leszek don't worry, the SRE that will attend you will make sure to contact legal in the agreed method (not the tag 0:-)). 
[15:52:01] 10SRE, 10Fundraising-Backlog, 10Traffic, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10MBeat33) [15:52:11] 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10Cmjohnson) [15:53:04] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` LDAP group for AGueyte - https://phabricator.wikimedia.org/T273980 (10jcrespo) a:05jcrespo→03None [15:53:08] 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10Cmjohnson) [15:53:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Cmjohnson) [15:53:13] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) [15:54:33] (03PS1) 10Elukey: profile::hadoop::master: add more precise runbooks for alerts [puppet] - 10https://gerrit.wikimedia.org/r/661953 [15:55:58] 10ops-eqiad: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10Cmjohnson) [15:56:50] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master: add more precise runbooks for alerts [puppet] - 10https://gerrit.wikimedia.org/r/661953 (owner: 10Elukey) [15:58:06] 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Cmjohnson) @marostegui: the same IP, I am not changing the vlan [15:58:46] PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:11] 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Marostegui) Excellent! I will have the host ready for you. 
[16:00:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] Allow running tests on an image once it's built [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661138 (https://phabricator.wikimedia.org/T273427) (owner: 10Giuseppe Lavagetto) [16:01:59] 10SRE, 10Gerrit, 10Phabricator, 10Release-Engineering-Team, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10Jdlrobson) FWIW I stopped using Pherrit because I couldn't get it to work without this fix. I don't think that's a great reason to decline... [16:04:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add BGP configuration for the new ML Serve eqiad/codfw clusters [homer/public] - 10https://gerrit.wikimedia.org/r/661055 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [16:04:46] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @wiki_willy, @hnowlan Move tickets have been created for db1111 (T273982), logstash1020 (T273984) and maps1001 (T273983). Francium has been de... 
[16:08:59] 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Cmjohnson) replaced both the optics and fiber...waiting the weekend to see if the CRC errors return [16:09:42] (03PS1) 10Kormat: WIP: dbutil: Handle IP addresses in resolve() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661957 [16:10:48] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27883/console" [puppet] - 10https://gerrit.wikimedia.org/r/661757 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [16:11:59] (03CR) 10Elukey: "I think that we need to make some kind of dependency between the symlink creation and the deploy of the hadoop packages that create the /v" [puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) (owner: 10Razzi) [16:25:17] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) @Cmjohnson wonderful news! I'll follow up in the task to help the owners of the hosts! [16:27:11] 10ops-eqiad: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10elukey) Added Filippo and Cole too for awareness. The idea is to shutdown the node, move it to a different rack within the same row (so no IP/vlan change) and boot it up again. The downtime requested will be around ma... [16:28:08] (03CR) 10Ottomata: [C: 03+1] "Luca we talked about this a ton, and it is complicated! Should we copy this code between both ::master and ::standby? 
More officially, t" [puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) (owner: 10Razzi) [16:29:25] (03CR) 10Klausman: [C: 03+1] WIP: dbutil: Handle IP addresses in resolve() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661957 (owner: 10Kormat) [16:29:31] 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10elukey) Adding @hnowlan to understand if the time window is ok for the host (we briefly had a chat about it on IRC). The idea is to: 1) shutdown the node 2) move it to a different rack within the same row (no ip change... [16:35:16] (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) (owner: 10Razzi) [16:36:05] (03PS1) 10JMeybohm: Demo change on how to support switchable second staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/661961 [16:36:28] 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10herron) [16:36:32] (03PS2) 10JMeybohm: Demo change on how to support switchable second staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/661961 (https://phabricator.wikimedia.org/T269835) [16:38:03] (03CR) 10jerkins-bot: [V: 04-1] Demo change on how to support switchable second staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/661961 (https://phabricator.wikimedia.org/T269835) (owner: 10JMeybohm) [16:46:54] 10Puppet, 10SRE, 10Continuous-Integration-Config, 10Patch-For-Review, and 2 others: Shell/Python/other scripts should not be generated by ERB files; dynamic parts should be a simple ERB config file - https://phabricator.wikimedia.org/T254480 (10crusnov) I'm scoping part of the broader python3 porting to ge... [16:49:26] 10SRE: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10JMeybohm) >>! In T273953#6806822, @akosiaris wrote: > Indeed. 
Perhaps switching to a random day of the month is the solution after all. That's what I was trying to propose in a maybe not so clear way. Sorry for that 😊 [16:52:11] 10SRE, 10serviceops, 10User-jijiki: ifup@eno1.service fails on mc* hosts after 4.19.171-2 upgrade - https://phabricator.wikimedia.org/T273918 (10elukey) ` elukey@mc2029:~$ python3 Python 3.7.3 (default, Jul 25 2020, 13:03:44) [GCC 8.3.0] on linux Type "help", "copyright", "credits" or "license" for more inf... [17:26:07] 10SRE, 10LDAP-Access-Requests: Add STran to `wmf` LDAP group - https://phabricator.wikimedia.org/T267968 (10Tchanders) [17:26:27] 10SRE, 10LDAP-Access-Requests: Add wikitrent to `wmf` LDAP group - https://phabricator.wikimedia.org/T272489 (10Tchanders) [17:26:51] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1095 - https://phabricator.wikimedia.org/T273732 (10wiki_willy) a:05wiki_willy→03Cmjohnson [17:27:52] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) [17:29:33] 10SRE, 10Gerrit, 10LDAP-Access-Requests, 10Phabricator, and 2 others: Duplicate LDAP user for cn=smccandlish - https://phabricator.wikimedia.org/T138672 (10bd808) >>! In T138672#6806956, @hashar wrote: > And as pointed above ( T138672#2408607 ) there are two accounts matching: > ` > $ ldapsearch -x -LL 'cn... [17:29:58] 10SRE, 10Gerrit, 10Phabricator, 10Release-Engineering-Team, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10Tgr) Yeah, I would also use Pherrit if not for this issue. Nevertheless, I don't think a CSP exemption is the right way to fix it.
[17:47:18] 10SRE, 10Gerrit, 10LDAP-Access-Requests, 10Phabricator, and 3 others: Duplicate LDAP user for cn=smccandlish - https://phabricator.wikimedia.org/T138672 (10bd808) 05Open→03Resolved a:03bd808 `name=T138672.ldif dn: uid=smccandlish,ou=people,dc=wikimedia,dc=org changetype: modify replace: cn cn: Smccan... [17:47:40] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10CDanis) Thank you! Verified NDA signed in our tracking spreadsheet. @AikoChou Please also make sure you have read https://wikitech.wikimedia.org/wi... [17:48:42] (03PS1) 10CDanis: Shell account & Analytics access for aikochou [puppet] - 10https://gerrit.wikimedia.org/r/661970 (https://phabricator.wikimedia.org/T273602) [17:48:49] (03PS2) 10Ahmon Dancy: temp changes while experimenting [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661207 [17:48:51] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into dancy-k8s-dev [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661971 [17:48:54] (03PS1) 10Ahmon Dancy: copy wikiversions.json to wikiversions-dev.json [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661972 [17:48:56] (03PS1) 10Ahmon Dancy: Filthy MW_NO_ETCD hack for mergeMessageLists [mediawiki-config] (dancy-k8s-dev) - 10https://gerrit.wikimedia.org/r/661973 [17:49:50] (03PS2) 10CDanis: Shell account & Analytics access for aikochou [puppet] - 10https://gerrit.wikimedia.org/r/661970 (https://phabricator.wikimedia.org/T273602) [17:50:37] 10SRE, 10Gerrit, 10LDAP-Access-Requests, 10Phabricator, and 3 others: Duplicate LDAP user for cn=smccandlish - https://phabricator.wikimedia.org/T138672 (10bd808) [17:55:39] (03PS3) 10CDanis: Shell account & Analytics access for aikochou [puppet] - 10https://gerrit.wikimedia.org/r/661970 (https://phabricator.wikimedia.org/T273602) 
[17:56:07] 10SRE, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10CDanis) Checked in with Miriam on IRC and Turnilo/Superset access isn't needed, but Kerberos is. Doing that now. [17:56:57] 10SRE, 10Gerrit, 10LDAP-Access-Requests, 10Phabricator, and 3 others: Duplicate LDAP user for cn=smccandlish - https://phabricator.wikimedia.org/T138672 (10hashar) Thank you @bd808 and extra kudos for keeping the proper account. @SMcCandlish that should work. I am not sure why it never got noticed before,... [17:57:37] (03CR) 10CDanis: [C: 03+2] Shell account & Analytics access for aikochou [puppet] - 10https://gerrit.wikimedia.org/r/661970 (https://phabricator.wikimedia.org/T273602) (owner: 10CDanis) [17:57:44] (03PS4) 10CDanis: Shell account & Analytics access for aikochou [puppet] - 10https://gerrit.wikimedia.org/r/661970 (https://phabricator.wikimedia.org/T273602) [17:58:46] _joe_: okay to merge your change? [17:59:36] _joe_: 854dc7e652 https://gerrit.wikimedia.org/r/c/operations/puppet/+/661757 [17:59:54] (03PS1) 10Elukey: Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) [18:01:28] (03CR) 10jerkins-bot: [V: 04-1] Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) (owner: 10Elukey) [18:03:43] I don't understand [18:03:55] the patch was merged almost two hours ago? 
[18:03:59] yet no unmerged changes alert [18:03:59] (03PS2) 10Elukey: Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) [18:04:08] puppetmaster1001 thinks it is still not merged there [18:05:34] (03CR) 10jerkins-bot: [V: 04-1] Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) (owner: 10Elukey) [18:05:43] the patch was reviewed by _joe_ and akosiaris, I am merging it I guess, although I wish I understood how we got into this state [18:10:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:11:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:13:19] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10CDanis) 05Open→03Resolved @AikoChou should have shell access within half an hour. Also is an email waiting for them about setting their Kerberos... 
[18:13:27] <_joe_> cdanis: ugh my bad :/ [18:13:38] np :) [18:13:39] <_joe_> I should not do stuff on friday after 5 pm [18:13:45] I'm surprised the alert hadn't fired by now [18:13:53] <_joe_> yup [18:14:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:30] same Telia link that has flapped twice this month [18:15:49] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:15:57] oh, just once; two emails, a report and a followup of "no issues identified" days later heh [18:18:39] (03PS1) 10Elukey: Add fake keytabs for Druid nodes [labs/private] - 10https://gerrit.wikimedia.org/r/661976 [18:18:53] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytabs for Druid nodes [labs/private] - 10https://gerrit.wikimedia.org/r/661976 (owner: 10Elukey) [18:19:11] cdanis: ok if I merge --^ ? [18:19:19] just to avoid spamming people doing puppet-merge [18:19:35] (asking for confirmation after what I've read above) [18:19:36] uhm [18:19:55] elukey: please go ahead, my patch should have been merged already though [18:20:02] is it giving you a 'multiple' prompt? 
[18:20:15] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 2081431 MB (26% inode=80%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [18:20:22] nono I didn't even try, I didn't want to step on your toes if you were still debugging [18:20:28] nope go ahead [18:20:30] ack thanks [18:21:11] (03PS3) 10Elukey: Set Apache Bigtop 1.5 as default hadoop distro [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) [18:25:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27885/console" [puppet] - 10https://gerrit.wikimedia.org/r/661974 (https://phabricator.wikimedia.org/T273711) (owner: 10Elukey) [18:26:39] (03PS1) 10CDanis: new LDAP user agueyte [puppet] - 10https://gerrit.wikimedia.org/r/661977 (https://phabricator.wikimedia.org/T273980) [18:26:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [18:26:46] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [18:26:53] here [18:27:29] hmm, that alert link doesn't help much, does it [18:28:10] https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_80% [18:28:42] rzl: from https://librenms.wikimedia.org/alert-log I see mr1-eqsin, possible that pages?
[18:28:56] no it should just be crs and asws [18:29:05] oh [18:29:07] but it isn't [18:29:13] just got to the same, agree that's what fired [18:29:50] mr1-eqsin what [18:29:55] it seems indeed used https://librenms.wikimedia.org/device/164 [18:30:09] (this only sort of "paged", it dropped the keyword in IRC but didn't talk to victorops) [18:30:27] I think that goes via alertmanager nowadays? [18:31:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [18:31:46] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [18:32:18] cdanis: yeah, and the alert didn't show on alertmanager, so it's possible it's suppressed there but the IRC notification still goes through [18:32:33] I have to confess I don't know what jinxer *is* [18:32:38] ah no, apparently alertmanager talks to jinxer? [18:32:41] haha yeah I just did the same [18:32:43] https://wikitech.wikimedia.org/wiki/Alertmanager [18:32:44] yeah [18:42:58] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to `wmf` LDAP group for AGueyte - https://phabricator.wikimedia.org/T273980 (10CDanis) 05Open→03Resolved a:03CDanis The `wmf` group does not require manager approval -- only verification that staff is staff :) Access granted; welcome to t... [18:43:51] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 173192144 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:44:01] 10SRE, 10LDAP-Access-Requests: Access to Product Superset for Rmurthy - https://phabricator.wikimedia.org/T273813 (10CDanis) a:03jrobell @jrobell Can you please confirm? Thanks!
[18:45:09] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10CDanis) a:03KFrancis @KFrancis Can you please get an NDA signed with this WMDE staff member? Thanks! [18:46:13] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 714408 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:47:33] (03PS2) 10CDanis: new LDAP user agueyte [puppet] - 10https://gerrit.wikimedia.org/r/661977 (https://phabricator.wikimedia.org/T273980) [18:47:37] (03CR) 10CDanis: [C: 03+2] new LDAP user agueyte [puppet] - 10https://gerrit.wikimedia.org/r/661977 (https://phabricator.wikimedia.org/T273980) (owner: 10CDanis) [18:54:12] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10CDanis) Hi, just wanted to check in if anything more was needed here? [18:55:19] hi qchris [18:56:21] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10ppelberg) >>! In T271602#6807375, @CDanis wrote: > Hi, just wanted to check in if anything more was needed here? Yikes – thank you for bumping this @CDanis. I don't think anything more is needed. //See bel... [18:57:28] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10CDanis) 05Open→03Resolved Glad to hear, thanks! [19:27:04] jouncebot: next [19:27:04] In 12 hour(s) and 32 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210206T0800) [19:27:34] 10SRE, 10Instrument-ClientError, 10MediaWiki-extensions-WikimediaEvents, 10observability: Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts - https://phabricator.wikimedia.org/T264665 (10Jdlrobson) I think alerts are the way to go with this. 
The alert would... [19:31:42] Jdlrobson https://phabricator.wikimedia.org/T274000 the logstash link generates "Unable to completely restore the URL, be sure to use the share functionality." alerts for me [19:32:27] RECOVERY - Host cp5007.mgmt is UP: PING WARNING - Packet loss = 75%, RTA = 228.56 ms [19:33:33] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:34:39] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:35:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:37:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:38:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:38:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:39:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:39:32] !log reimaging 2 scap proxies in codfw because there are no deployments today [19:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:18] sigh, another Telia link flap? [19:50:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1395.eqiad.wmnet with reason: REIMAGE [19:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1394.eqiad.wmnet with reason: REIMAGE [19:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1395.eqiad.wmnet with reason: REIMAGE [19:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1394.eqiad.wmnet with reason: REIMAGE [19:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2289.codfw.wmnet with reason: REIMAGE [19:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:02] (03PS1) 10Razzi: wikireplicas: alert via email for analytics wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/661988 
(https://phabricator.wikimedia.org/T269211) [19:59:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2289.codfw.wmnet with reason: REIMAGE [19:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE [20:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE [20:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2254.codfw.wmnet with reason: REIMAGE [20:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2254.codfw.wmnet with reason: REIMAGE [20:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1395.eqiad.wmnet'] ` an... [20:16:35] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1394.eqiad.wmnet'] ` an... 
[20:19:20] (03PS1) 10Razzi: presto: set hive.max-pertitions-per-scan for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/661990 (https://phabricator.wikimedia.org/T273004) [20:19:28] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1395.eqiad.wmnet [20:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1394.eqiad.wmnet [20:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:53] (03CR) 10Ottomata: [C: 03+1] presto: set hive.max-pertitions-per-scan for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/661990 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [20:21:58] (03CR) 10Razzi: [C: 03+2] presto: set hive.max-pertitions-per-scan for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/661990 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [20:22:30] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia outage #01264616 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:22:30] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Telia outage #01264616 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2289.codfw.wmnet'] ` an... 
[20:27:59] (03CR) 10Dzahn: [C: 03+2] installserver::proxy: remove code that absented cron [puppet] - 10https://gerrit.wikimedia.org/r/661533 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:30:17] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2289.codfw.wmnet [20:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1395.eqiad.wmnet [20:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:41] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:38:13] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1394.eqiad.wmnet [20:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:40] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:39:19] (03PS1) 10Cwhite: profile: update w3creportingapi to use 12 weekly indexes [puppet] - 10https://gerrit.wikimedia.org/r/661993 (https://phabricator.wikimedia.org/T274005) [20:42:28] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2289.codfw.wmnet [20:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:50:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1393.eqiad.wmnet with reason: REIMAGE [20:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1307.eqiad.wmnet'] ` an... 
[20:52:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1393.eqiad.wmnet with reason: REIMAGE [20:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1392.eqiad.wmnet with reason: REIMAGE [20:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1397.eqiad.wmnet [20:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1397.eqiad.wmnet [20:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1392.eqiad.wmnet with reason: REIMAGE [20:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet [20:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1307.eqiad.wmnet [20:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:06] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:00:58] 10SRE, 10observability, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) 
- https://phabricator.wikimedia.org/T141520 (10hashar) In a nutshell the first issue is that the first spike of err... [21:01:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2266.codfw.wmnet with reason: REIMAGE [21:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:12] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2254.codfw.wmnet'] ` an... [21:03:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2266.codfw.wmnet with reason: REIMAGE [21:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2254.codfw.wmnet [21:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:00] (03PS1) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) [21:08:52] (03CR) 10jerkins-bot: [V: 04-1] Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [21:08:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2254.codfw.wmnet [21:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on 
cumin1001.eqia... [21:14:54] (03PS2) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) [21:15:46] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1393.eqiad.wmnet'] ` an... [21:16:44] (03CR) 10jerkins-bot: [V: 04-1] Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [21:17:22] (03PS3) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) [21:19:40] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/27888/icinga1001.wikimedia.org/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [21:21:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1392.eqiad.wmnet'] ` an... [21:26:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2266.codfw.wmnet'] ` an... 
[21:26:57] 10SRE, 10observability: update logging ES's template index to type the 'age' field as an integer - https://phabricator.wikimedia.org/T266906 (10colewhite) 05Open→03Resolved a:03colewhite This was resolved in the transition to the custom w3creportingapi index pattern. [21:27:03] 10SRE, 10Product-Data-Infrastructure, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10colewhite) [21:27:05] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [21:28:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1306.eqiad.wmnet with reason: REIMAGE [21:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:28] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2266.codfw.wmnet [21:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1306.eqiad.wmnet with reason: REIMAGE [21:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:43] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1278.eqiad.wmnet [21:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:23] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... 
[21:39:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1269.eqiad.wmnet with reason: REIMAGE [21:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:40] (03PS1) 10Ottomata: Alert if kafka max replica lag is steadily increasing [puppet] - 10https://gerrit.wikimedia.org/r/662005 (https://phabricator.wikimedia.org/T273702) [21:41:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1269.eqiad.wmnet with reason: REIMAGE [21:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:18] (03PS1) 10Dzahn: grafana: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662008 (https://phabricator.wikimedia.org/T209953) [21:48:29] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2266.codfw.wmnet [21:48:33] (03PS1) 10Cwhite: profile: remove deprecated syslog input [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) [21:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:34] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1392.eqiad.wmnet [21:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1392.eqiad.wmnet [21:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1393.eqiad.wmnet [21:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:48] (03PS1) 10Razzi: Fix typo in hive.max-partitions-per-scan [puppet] - 10https://gerrit.wikimedia.org/r/662011 (https://phabricator.wikimedia.org/T273004) [21:52:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1393.eqiad.wmnet [21:52:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:43] (03CR) 10Ottomata: [C: 03+1] Fix typo in hive.max-partitions-per-scan [puppet] - 10https://gerrit.wikimedia.org/r/662011 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [21:56:50] (03CR) 10Razzi: [C: 03+2] Fix typo in hive.max-partitions-per-scan [puppet] - 10https://gerrit.wikimedia.org/r/662011 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [21:56:58] (03PS2) 10Razzi: Fix typo in hive.max-partitions-per-scan [puppet] - 10https://gerrit.wikimedia.org/r/662011 (https://phabricator.wikimedia.org/T273004) [22:01:27] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1285.eqiad.wmnet with reason: REIMAGE [22:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:30] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1285.eqiad.wmnet with reason: REIMAGE [22:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1306.eqiad.wmnet'] ` an... 
[22:16:08] (03PS1) 10Dzahn: netmon: replace hiera within hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662013 (https://phabricator.wikimedia.org/T209953) [22:16:17] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1306.eqiad.wmnet [22:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1306.eqiad.wmnet [22:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:47] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10Krinkle) [22:24:53] (03CR) 10Volans: Revert "elasticsearch: return the cluster name in __str__ for ElasticsearchCluster" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659294 (owner: 10David Caro) [22:26:31] (03CR) 10Dzahn: "How about the alternative solution I have offered to fix this problem." [puppet] - 10https://gerrit.wikimedia.org/r/643919 (owner: 10Hashar) [22:27:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10wiki_willy) a:03Papaul [22:31:41] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1269.eqiad.wmnet'] ` an... 
[22:32:21] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1269.eqiad.wmnet [22:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1269.eqiad.wmnet [22:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:27] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [22:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:29] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:24] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [22:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:26] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:01] !log T267927 `sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927` failing with `ERROR org.wikidata.query.rdf.tool.Munge - Fatal error munging RDF` [22:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:07] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [22:43:37] (03PS1) 10Dzahn: hieradata/common: replace hiera within hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662021 (https://phabricator.wikimedia.org/T209953) [22:43:39] (03PS1) 10Dzahn: netbox: replace hiera inside hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/662022 (https://phabricator.wikimedia.org/T209953) [22:46:10] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [22:46:13] !log 
ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:48] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` LDAP group for AGueyte - https://phabricator.wikimedia.org/T273980 (10Tchanders) Thanks @CDanis [22:50:00] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [22:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:03] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1285.eqiad.wmnet'] ` an... [22:56:51] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [22:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:53] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:48] 10SRE, 10Analytics, 10SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (10CDanis) I contacted Carol on Slack; this request is approved. [23:12:51] 10SRE, 10Analytics, 10SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (10CDanis) [23:13:28] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:19] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1285.eqiad.wmnet [23:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:32] 10SRE, 10Analytics, 10SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (10kzimmerman) Thank you @CDanis! [23:21:00] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:48] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` LDAP group for AGueyte - https://phabricator.wikimedia.org/T273980 (10marcella) Thanks @CDanis for being faster to say that I don't need to approve than I was in actually approving. :) [23:35:16] !log T267927 Re-downloading latest dumps (main database, lexeme) in tmux session `downloads_dumps` on `ryankemper@wdqs1009.eqiad.wmnet` [23:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:20] T267927: Reload wikidata journal from fresh dumps - https://phabricator.wikimedia.org/T267927 [23:37:35] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1285.eqiad.wmnet [23:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:16] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [23:52:14] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1008 is CRITICAL: 6.394e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [23:52:38] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1007 is CRITICAL: 6.538e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007 [23:53:34] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:54:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:56:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:58:55] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...