[00:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:01:53] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1007 is CRITICAL: 2.357e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007
[00:03:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1310.eqiad.wmnet with reason: REIMAGE
[00:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:19] (PS1) Dzahn: installserver::proxy: remove code that absented cron [puppet] - https://gerrit.wikimedia.org/r/661533 (https://phabricator.wikimedia.org/T273673)
[00:04:07] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1003 is CRITICAL: 3.644e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[00:05:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1310.eqiad.wmnet with reason: REIMAGE
[00:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:07] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Legoktm)
[00:08:22] SRE, serviceops, Performance-Team (Radar), Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on
mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (Legoktm) Open→Resolved a:Legoktm Conclusi...
[00:10:54] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2280.codfw.wmnet'] ` an...
[00:10:54] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Legoktm) Stalled→Open
[00:10:54] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Legoktm)
[00:11:29] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2279.codfw.wmnet'] ` an...
[00:11:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2280.codfw.wmnet
[00:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2279.codw.wmnet
[00:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:16] (PS1) Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673)
[00:18:23] PROBLEM - Check systemd state on an-worker1124 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2280.codfw.wmnet
[00:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2279.codfw.wmnet
[00:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:05] SRE, serviceops, Performance-Team (Radar), Release-Engineering-Team (Deployment services), and 2 others: Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade - https://phabricator.wikimedia.org/T273312 (Dzahn) Thank you very much for the flamegraphs and fu...
[00:43:32] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1318.eqiad.wmnet'] ` an...
[00:44:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1318.eqiad.wmnet
[00:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:28] (PS1) Cwhite: profile: onboard icinga alerts to common logging schema [puppet] - https://gerrit.wikimedia.org/r/661539 (https://phabricator.wikimedia.org/T234565)
[00:49:38] (PS2) Cwhite: profile: onboard icinga logging to common logging schema [puppet] - https://gerrit.wikimedia.org/r/661539 (https://phabricator.wikimedia.org/T234565)
[00:51:09] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1310.eqiad.wmnet'] ` an...
[00:51:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1310.eqiad.wmnet
[00:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1318.eqiad.wmnet
[00:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T0100).
[01:02:38] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 630245152 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:10] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2183407656 and 132 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:04] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3394500008 and 196 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1310.eqiad.wmnet
[01:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1891624 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:34] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1790097024 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:10:26] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 263104 and 127
seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:28] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 136040 and 189 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:26] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 60160 and 249 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:47] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@4b4872d]: transfer_to_es: Increase timeout waiting for source data to three hours
[01:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:03] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@4b4872d]: transfer_to_es: Increase timeout waiting for source data to three hours (duration: 01m 16s)
[01:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:45] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1009 is OK: (C)5e+06 ge (W)1e+06 ge 7.745e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009
[01:43:33] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1007 is OK: (C)5e+06 ge (W)1e+06 ge 2.491e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007
[01:50:03] (PS2) Legoktm: logos: Update nowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661350
[01:50:05] (PS2) Legoktm: logos: Update cawiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661351
[01:50:07]
(PS2) Legoktm: logos: Update fiwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661352
[01:50:09] (PS2) Legoktm: logos: Update ukwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661353
[01:50:11] (PS2) Legoktm: logos: Update cswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661354
[01:50:13] (PS2) Legoktm: logos: Update huwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661355
[01:50:15] (PS2) Legoktm: logos: Update trwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661356
[01:52:59] (CR) Legoktm: [C: +2] logos: Update nowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661350 (owner: Legoktm)
[01:53:19] (CR) Legoktm: [C: +2] logos: Update cawiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661351 (owner: Legoktm)
[01:53:35] (CR) Legoktm: [C: +2] logos: Update fiwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661352 (owner: Legoktm)
[01:53:49] (CR) Legoktm: [C: +2] logos: Update ukwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661353 (owner: Legoktm)
[01:54:07] (CR) Legoktm: [C: +2] logos: Update cswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661354 (owner: Legoktm)
[01:54:24] (CR) Legoktm: [C: +2] logos: Update huwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661355 (owner: Legoktm)
[01:54:29] (Merged) jenkins-bot: logos: Update nowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661350 (owner: Legoktm)
[01:54:43] (CR) Legoktm: [C: +2] logos: Update trwiki from Commons and recompress [mediawiki-config] -
https://gerrit.wikimedia.org/r/661356 (owner: Legoktm)
[01:55:14] (Merged) jenkins-bot: logos: Update cawiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661351 (owner: Legoktm)
[01:55:16] (Merged) jenkins-bot: logos: Update fiwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661352 (owner: Legoktm)
[01:55:18] (Merged) jenkins-bot: logos: Update ukwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661353 (owner: Legoktm)
[01:55:57] (Merged) jenkins-bot: logos: Update cswiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661354 (owner: Legoktm)
[01:56:29] (Merged) jenkins-bot: logos: Update huwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661355 (owner: Legoktm)
[01:56:43] (Merged) jenkins-bot: logos: Update trwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661356 (owner: Legoktm)
[01:58:11] (CR) Legoktm: "Even though this was labs-only, please still pull it on deploy1001 and sync it so there aren't undeployed commits waiting for the next per" [mediawiki-config] - https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: Alex Paskulin)
[02:00:38] !log legoktm@deploy1001 Synchronized static/images/project-logos/: Update and recompress logos for nowiki, cawiki, fiwiki, ukwiki, cswiki, huwiki, trwiki (1/2) (duration: 01m 10s)
[02:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:02:52] !log legoktm@deploy1001 Synchronized logos/config.yaml: Update and recompress logos for nowiki, cawiki, fiwiki, ukwiki, cswiki, huwiki, trwiki (2/2) (duration: 01m 06s)
[02:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:37] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1003 is OK: (C)5e+06 ge (W)1e+06 ge
7.95e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[02:29:31] (PS1) Legoktm: logos: Update rowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661545
[02:29:33] (PS1) Legoktm: logos: Update kowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661566
[02:29:35] (PS1) Legoktm: logos: Update eowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661567
[02:29:37] (PS1) Legoktm: logos: Update dawiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661568
[02:29:39] (PS1) Legoktm: logos: Update arwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661569
[02:29:41] (PS1) Legoktm: logos: Update idwiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661570
[02:29:43] (PS1) Legoktm: logos: Update vowiki from Commons and recompress [mediawiki-config] - https://gerrit.wikimedia.org/r/661571
[02:49:02] (CR) Gergő Tisza: [C: +1] [WIP] Enable GrowthExperiments at dawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: Urbanecm)
[02:54:41] ops-esams, DC-Ops: Esams: Delete rack OE10, OE11, OE12 and OE13 from Netbox - https://phabricator.wikimedia.org/T273841 (Papaul)
[03:01:00] ops-esams, DC-Ops: Esams: Delete rack OE10, OE11, OE12 and OE13 from Netbox - https://phabricator.wikimedia.org/T273841 (Papaul) p:Triage→Medium
[03:13:45] PROBLEM - SSH on mw2249.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:18:03] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:14:57] RECOVERY - SSH on mw2249.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:14:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:17:42] (PS1) Marostegui: db1173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/661575 (https://phabricator.wikimedia.org/T258361)
[06:20:47] (CR) Marostegui: [C: +2] db1173: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/661575 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:25:18] (PS1) Marostegui: instances.yaml: Add db1173 to dbctl [puppet] - https://gerrit.wikimedia.org/r/661577 (https://phabricator.wikimedia.org/T258361)
[06:26:04] (CR) Marostegui: [C: +2] instances.yaml: Add db1173 to dbctl [puppet] - https://gerrit.wikimedia.org/r/661577 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:28:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1173 to dbctl - depooled T258361', diff saved to https://phabricator.wikimedia.org/P14179 and previous config saved to /var/cache/conftool/dbconfig/20210204-062836-marostegui.json
[06:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:42] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:30:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 1%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14180 and previous config saved to /var/cache/conftool/dbconfig/20210204-063033-root.json
[06:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:58] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 T266483', diff saved to https://phabricator.wikimedia.org/P14181 and previous config saved to /var/cache/conftool/dbconfig/20210204-064157-marostegui.json
[06:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:04] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[06:42:47] !log Restart mysql on db1137
[06:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: Repool db1137 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14182 and previous config saved to /var/cache/conftool/dbconfig/20210204-064544-root.json
[06:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:51:59] SRE, Cognate, ContentTranslation, DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (Marostegui)
[06:58:23] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[06:58:53] (PS1) Elukey: Reduce the HDFS Namenode fsimage backups retention on Hadoop Backup [puppet] - https://gerrit.wikimedia.org/r/661579
[06:58:55] (PS1) QChris: Add .gitreview [software/benchmw] - https://gerrit.wikimedia.org/r/661580
[06:58:57] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/benchmw] - https://gerrit.wikimedia.org/r/661580 (owner: QChris)
[07:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: Repool db1137 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14183 and previous config saved to /var/cache/conftool/dbconfig/20210204-070047-root.json
[07:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:42] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27853/console" [puppet] - https://gerrit.wikimedia.org/r/661579 (owner: Elukey)
[07:02:11] (CR) Elukey: [V: +1 C: +2] Reduce the HDFS Namenode fsimage backups retention on Hadoop Backup [puppet] - https://gerrit.wikimedia.org/r/661579 (owner: Elukey)
[07:05:21] SRE, ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (ayounsi) WFM, sent you a calendar invitation.
[07:07:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 2%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14184 and previous config saved to /var/cache/conftool/dbconfig/20210204-070726-root.json
[07:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:50] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1117.eqiad.wmnet
[07:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 20%: Repool db1137 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14185 and previous config saved to /var/cache/conftool/dbconfig/20210204-071551-root.json
[07:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:16:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1117.eqiad.wmnet
[07:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:54] (PS1) Elukey: Move an-worker1117 to the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/661581
[07:20:09] (CR) Elukey: [C: +2] Move an-worker1117 to the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/661581 (owner: Elukey)
[07:22:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 3%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14186 and previous config saved to /var/cache/conftool/dbconfig/20210204-072229-root.json
[07:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:49] SRE, serviceops: upgrade conf2* servers to stretch -
https://phabricator.wikimedia.org/T271573 (elukey) If we keep the backups for conf1* it should be fine, conf2* replicates via etcdmirror from conf100*, so if this unblocks you it should be doable. Let's check with @Joe to be sure :)
[07:30:47] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[07:30:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: Repool db1137 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14187 and previous config saved to /var/cache/conftool/dbconfig/20210204-073054-root.json
[07:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:09] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[07:37:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 5%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14188 and previous config saved to /var/cache/conftool/dbconfig/20210204-073733-root.json
[07:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: Repool db1137 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14189 and previous config saved to /var/cache/conftool/dbconfig/20210204-074558-root.json
[07:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:55] (PS1) ArielGlenn: prep for re-install of snapshot1009, 1010 with buster [puppet] - https://gerrit.wikimedia.org/r/661642 (https://phabricator.wikimedia.org/T269377)
[07:49:08] (CR) ArielGlenn: [C: +2] prep for re-install of
snapshot1009, 1010 with buster [puppet] - https://gerrit.wikimedia.org/r/661642 (https://phabricator.wikimedia.org/T269377) (owner: ArielGlenn)
[07:51:05] (CR) JMeybohm: [C: -1] "You can separate the chart and helmfile.d files into different CRs if you want to get the linter to like this change. The helmfile.d chang" (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/661491 (https://phabricator.wikimedia.org/T273151) (owner: Wolfgang Kandek)
[07:52:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 7%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14190 and previous config saved to /var/cache/conftool/dbconfig/20210204-075236-root.json
[07:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:21] SRE, Dumps-Generation, Platform Engineering, serviceops, Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1009.eqiad.wmnet ` The log c...
[07:58:42] SRE, LDAP-Access-Requests, WMF-Legal: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (georginaburnett-wmde)
[08:01:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: Repool db1137 after daemon restart', diff saved to https://phabricator.wikimedia.org/P14191 and previous config saved to /var/cache/conftool/dbconfig/20210204-080101-root.json
[08:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:29] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 57, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:07:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14192 and previous config saved to /var/cache/conftool/dbconfig/20210204-080740-root.json
[08:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:12] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1009.eqiad.wmnet with reason: REIMAGE
[08:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:13] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1009.eqiad.wmnet with reason: REIMAGE
[08:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:17] (CR) Elukey: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/661528 (https://phabricator.wikimedia.org/T269211) (owner: Razzi)
[08:15:47] (CR) Muehlenhoff: [C: +2] Offboard zfilipin from Release Engineering [puppet] - https://gerrit.wikimedia.org/r/661150 (https://phabricator.wikimedia.org/T267313) (owner: Thcipriani)
[08:16:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: Repool db1137 after daemon restart', diff saved to
https://phabricator.wikimedia.org/P14193 and previous config saved to /var/cache/conftool/dbconfig/20210204-081605-root.json
[08:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:38] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/661533 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:21:13] RECOVERY - Check systemd state on xhgui2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:29] RECOVERY - Check systemd state on xhgui1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:03] !log reset failed ifup@ens5 on xhgui2001/xhgui1001 T273026
[08:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:08] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[08:22:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 12%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14194 and previous config saved to /var/cache/conftool/dbconfig/20210204-082243-root.json
[08:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:06] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:29:17] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[08:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:17] SRE, Dumps-Generation, Platform Engineering, serviceops, Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1009.eqiad.wmnet'] ` and were **ALL**
successful.
[08:32:55] RECOVERY - dhclient process on sretest1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:32:55] RECOVERY - configured eth on sretest1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:33:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[08:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 15%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14195 and previous config saved to /var/cache/conftool/dbconfig/20210204-083747-root.json
[08:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 20%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14196 and previous config saved to /var/cache/conftool/dbconfig/20210204-085250-root.json
[08:52:52] SRE, Dumps-Generation, Platform Engineering, serviceops, Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (ArielGlenn) snapshot1009 was idle so I converted it. snapshot1010 should become idle in an hour or two, so I'll be able to do that later tod...
[08:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:58] (PS1) Elukey: superset: disable dashboard caching [puppet] - https://gerrit.wikimedia.org/r/661678 (https://phabricator.wikimedia.org/T273850)
[08:57:09] (PS2) Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in codfw [puppet] - https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904)
[08:58:43] (CR) Filippo Giunchedi: swift: limit rsync and swift-object-replicator memory to 5% in codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi)
[08:59:20] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27854/console" [puppet] - https://gerrit.wikimedia.org/r/661678 (https://phabricator.wikimedia.org/T273850) (owner: Elukey)
[08:59:32] (CR) Elukey: [V: +1 C: +2] superset: disable dashboard caching [puppet] - https://gerrit.wikimedia.org/r/661678 (https://phabricator.wikimedia.org/T273850) (owner: Elukey)
[08:59:40] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27855/console" [puppet] - https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: Filippo Giunchedi)
[09:02:37] !log disable ping offload in codfw - T273278
[09:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ping2001.codfw.wmnet
[09:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2001.codfw.wmnet
[09:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:51] (CR)
10Filippo Giunchedi: [C: 03+2] hieradata: use /monitoring/frontend for Swift's internal svc health checks [puppet] - 10https://gerrit.wikimedia.org/r/661369 (https://phabricator.wikimedia.org/T273453) (owner: 10Filippo Giunchedi) [09:06:57] (03PS2) 10Filippo Giunchedi: hieradata: use /monitoring/frontend for Swift's internal svc health checks [puppet] - 10https://gerrit.wikimedia.org/r/661369 (https://phabricator.wikimedia.org/T273453) [09:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14197 and previous config saved to /var/cache/conftool/dbconfig/20210204-090754-root.json [09:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:02] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:10:29] !log disable ping offload in eqiad (codfw-re-enabled) - T273278 [09:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:31] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ping1001.eqiad.wmnet [09:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:29] !log roll restart lvs low-traffic in codfw/eqiad for swift healthcheck updates [09:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1001.eqiad.wmnet [09:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:59] (03PS4) 10Jbond: zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371 [09:17:12] !log disable ping offload in esams (eqiad re-enabled) - T273278 [09:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:04] 
(03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27856/console" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [09:20:43] (03PS1) 10Marostegui: instances.yaml: Remove db1078 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/661684 (https://phabricator.wikimedia.org/T273597) [09:20:52] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ping3001.esams.wmnet [09:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 30%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14198 and previous config saved to /var/cache/conftool/dbconfig/20210204-092257-root.json [09:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:02] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1078 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/661684 (https://phabricator.wikimedia.org/T273597) (owner: 10Marostegui) [09:23:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping3001.esams.wmnet [09:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1078 from dbctl T273597', diff saved to https://phabricator.wikimedia.org/P14199 and previous config saved to /var/cache/conftool/dbconfig/20210204-092414-marostegui.json [09:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:21] T273597: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 [09:24:27] !log re-enable ping offload in esams - T273278 [09:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:53] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.488 ge 
8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [09:33:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host flowspec1001.eqiad.wmnet [09:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:20] 10SRE, 10SRE-swift-storage, 10Sustainability (Incident Followup), 10User-fgiunchedi: ms-fe.svc.codfw.wmnet paged during Swift rebalance - https://phabricator.wikimedia.org/T273453 (10fgiunchedi) [09:35:11] (03CR) 10Hashar: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27856/ looks good thanks! :] Lets do it!" [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [09:35:26] jbond42: we can merge the "zuul::server: Add types" puppet patch now ^ :-] [09:36:16] hashar: sure ill do now thanks [09:36:39] (03CR) 10Jbond: [C: 03+2] "> For contint2001 it shows a diff for gerrit_event_delay (5 -> 5) which I guess is just an artifact of str > int conversion maybe." 
[puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [09:36:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] zuul::server: Add types [puppet] - 10https://gerrit.wikimedia.org/r/661371 (owner: 10Jbond) [09:37:19] (03CR) 10Vgutierrez: [C: 03+2] tlsproxy::localssl hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/661070 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [09:37:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flowspec1001.eqiad.wmnet [09:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14201 and previous config saved to /var/cache/conftool/dbconfig/20210204-093801-root.json [09:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:18] (03CR) 10Vgutierrez: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/661073 (https://phabricator.wikimedia.org/T241239) (owner: 10Ladsgroup) [09:39:32] (03PS1) 10Elukey: Set the lvs_setup flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661687 (https://phabricator.wikimedia.org/T269160) [09:41:15] !log rebooting mw[2215-2219,2221-2243,2246-2249,2251-2253,2255,2258] for kernel update [09:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:55] hashar: merged and deployed [09:44:40] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 4:00:00 on 37 hosts with reason: reboot [09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 37 hosts with reason: reboot [09:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:13] (03PS1) 10Marostegui: mariadb: Decommission db1078 [puppet] - 10https://gerrit.wikimedia.org/r/661688 (https://phabricator.wikimedia.org/T273597) [09:48:47] jbond42: thx! [09:49:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [09:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:49] (03PS1) 10Jbond: wmflib: Email type [puppet] - 10https://gerrit.wikimedia.org/r/661689 (https://phabricator.wikimedia.org/T273743) [09:50:15] hashar: can you take a look at this ^^^ and point me to a better list of valid/invalid test cases [09:50:44] (03PS7) 10Jbond: wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) [09:51:27] jbond42: ah yeah the email validation. I had some fun with those RFC / W3 specs a while ago [09:52:25] :) yes, I was hoping you already have a list in the MediaWiki unit tests I can use. 
the list I found has valid emails which don't pass and invalid ones which do https://gist.github.com/cjaoude/fd9910626629b53c4d25 [09:52:34] (03CR) 10Vgutierrez: [C: 03+1] Set the lvs_setup flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661687 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [09:52:58] (03CR) 10Elukey: [C: 03+2] Set the lvs_setup flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661687 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [09:53:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 60%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14202 and previous config saved to /var/cache/conftool/dbconfig/20210204-095305-root.json [09:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27857/console" [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [09:55:00] pybal may complain soon, it is me adding a new LVS VIP [09:57:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: drop ensure_service in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661367 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [09:58:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:32] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1078.eqiad.wmnet` - db1078.eqiad.wmnet (**PASS**) - Downtimed host on I... 
[09:58:47] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) a:05Marostegui→03wiki_willy [09:58:58] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) @wiki_willy this is ready for #dc-ops [09:59:03] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10Marostegui) [10:00:07] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:00:41] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:01:19] (03CR) 10Hashar: "Nice! The code I wrote back in 2010 seems to have survived in mediawiki/core and has been improved. 
It can be find in Sanitizer::validateE" [puppet] - 10https://gerrit.wikimedia.org/r/661689 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:01:39] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.35:4992]) https://wikitech.wikimedia.org/wiki/PyBal [10:01:46] yep this is me --^ [10:01:50] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:02:57] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.35:4992]) https://wikitech.wikimedia.org/wiki/PyBal [10:04:10] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:04:30] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:04:51] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 117 connections established with conf1004.eqiad.wmnet:4001 (min=118) https://wikitech.wikimedia.org/wiki/PyBal [10:04:51] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 59 connections established with conf2001.codfw.wmnet:2379 (min=60) https://wikitech.wikimedia.org/wiki/PyBal [10:04:58] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:05:05] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 79 connections established with conf2001.codfw.wmnet:2379 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [10:05:19] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.35:4992]) https://wikitech.wikimedia.org/wiki/PyBal [10:05:39] 10SRE, 10serviceops: 
upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10jcrespo) Let's make sure this is true, and only canonical data will be on conf100* before moving ahead with that, otherwise I prefer to get blocked and speed up the upgrade. [10:05:47] !log restart pybal on lvs2010 (low-traffic standby) to pick up new changes for eventstreams-internal (new VIP) - T269160 [10:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:51] T269160: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 [10:06:27] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.35:4992]) https://wikitech.wikimedia.org/wiki/PyBal [10:07:03] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 69 connections established with conf1004.eqiad.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal [10:08:00] (03PS19) 10Jcrespo: Bacula: Start using new storage/pools for es database content backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) [10:08:02] (03PS1) 10Jcrespo: backups: Enable External Store backups [puppet] - 10https://gerrit.wikimedia.org/r/661691 (https://phabricator.wikimedia.org/T79922) [10:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14203 and previous config saved to /var/cache/conftool/dbconfig/20210204-100808-root.json [10:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:53] !log restart pybal on lvs1016 (low-traffic standby) to pick up new changes for eventstreams-internal (new VIP) - T269160 [10:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:35] 10Puppet, 10SRE, 10Patch-For-Review, 
10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:11:01] 10SRE, 10MediaWiki-Debug-Logger, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10hashar) [10:11:07] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 118 connections established with conf1004.eqiad.wmnet:4001 (min=118) https://wikitech.wikimedia.org/wiki/PyBal [10:11:21] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 80 connections established with conf2001.codfw.wmnet:2379 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [10:11:35] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:11:50] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:12:23] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:13:07] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:13:31] !log restart pybal on lvs2009 (low-traffic active) to pick up new changes for eventstreams-internal (new VIP) - T269160 [10:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:35] T269160: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 [10:15:07] !log restart pybal on lvs1015 (low-traffic active) to pick up new changes for eventstreams-internal (new VIP) - T269160 [10:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:27] RECOVERY - 
PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:17:11] (03PS2) 10Jbond: wmflib: drop ensure_link in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) [10:17:13] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 10.12 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [10:18:09] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 70 connections established with conf1004.eqiad.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal [10:18:09] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 60 connections established with conf2001.codfw.wmnet:2379 (min=60) https://wikitech.wikimedia.org/wiki/PyBal [10:18:11] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:18:14] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27860" [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:22:53] 10SRE, 10MediaWiki-Debug-Logger, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10hashar) -10002 would be: ` # define TRY_AGAIN 2 /* Non-Authoritat... 
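Note on the "PyBal IPVS diff check" alerts above: the alert text reads like a set difference between the services PyBal knows about and the services present in IPVS. A minimal sketch of that comparison follows. It is not the actual monitoring check; the function name and the `10.2.2.1:80` entry are hypothetical, and only `10.2.2.35:4992` and the message wording come from the log.

```python
def ipvs_diff_status(pybal: set[str], ipvs: set[str]) -> str:
    """Return an alert-style status line for services PyBal has but IPVS lacks."""
    missing = sorted(pybal - ipvs)  # set difference: known to PyBal, not to IPVS
    if missing:
        return "CRITICAL: Services known to PyBal but not to IPVS: " + ", ".join(missing)
    return "OK: no difference between hosts in IPVS/PyBal"


# Before the pybal restart: the new VIP is configured but not yet in IPVS.
# 10.2.2.1:80 is a made-up pre-existing service used only for illustration.
before = ipvs_diff_status({"10.2.2.1:80", "10.2.2.35:4992"}, {"10.2.2.1:80"})

# After the restart both sides agree.
after = ipvs_diff_status({"10.2.2.1:80", "10.2.2.35:4992"},
                         {"10.2.2.1:80", "10.2.2.35:4992"})

print(before)
print(after)
```

Under that reading, the alert clears once pybal is restarted on each LVS host and the new VIP appears on both sides.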
[10:23:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Slowly pooling db1173 for the first time in s6', diff saved to https://phabricator.wikimedia.org/P14204 and previous config saved to /var/cache/conftool/dbconfig/20210204-102312-root.json [10:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:03] (03PS1) 10Elukey: Set the monitoring_setup flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661695 (https://phabricator.wikimedia.org/T269160) [10:24:46] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: 10Jcrespo) [10:24:56] (03PS3) 10Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) [10:26:37] (03CR) 10Elukey: [C: 03+2] Set the monitoring_setup flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661695 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [10:29:22] (03CR) 10MSantos: [C: 03+1] conftool: restore maps1009 to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/661420 (owner: 10Hnowlan) [10:30:32] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10Vgutierrez) [10:34:20] (03PS2) 10Jbond: wmflib: Email type [puppet] - 10https://gerrit.wikimedia.org/r/661689 (https://phabricator.wikimedia.org/T273743) [10:34:36] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 4:00:00 on 93 hosts with reason: reboot [10:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:24] !log rebooting mw[2261-2262,2268-2271,2273-2277,2283-2288,2290-2335,2337-2339,2350-2376].codfw.wmnet [10:35:29] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 93 hosts with reason: reboot [10:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:06] (03PS27) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [10:37:50] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [10:41:30] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:44:24] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:47:20] (03PS1) 10Elukey: Set the production flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661697 (https://phabricator.wikimedia.org/T269160) [10:47:51] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [10:48:04] (03CR) 10Elukey: [C: 03+2] Set the production flag for eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/661697 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [10:48:36] (03PS28) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [10:48:46] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/661689 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:49:12] (03CR) 10Jbond: [C: 03+2] wmflib: Email type [puppet] - 10https://gerrit.wikimedia.org/r/661689 
(https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [10:50:22] (03PS1) 10Elukey: Revert "Remove dns-disc config for eventstreams-internal" [dns] - 10https://gerrit.wikimedia.org/r/661649 [10:50:28] (03PS2) 10Elukey: Revert "Remove dns-disc config for eventstreams-internal" [dns] - 10https://gerrit.wikimedia.org/r/661649 [10:51:09] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1078 [puppet] - 10https://gerrit.wikimedia.org/r/661688 (https://phabricator.wikimedia.org/T273597) (owner: 10Marostegui) [10:51:30] jbond42: ok to merge your change? [10:51:41] yes please sorry [10:51:50] (03PS14) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) [10:51:58] jbond42: merging! [10:52:11] thx [10:52:15] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10hashar) [10:53:10] (03PS1) 10Marostegui: install_server: Reimage db1157 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/661698 (https://phabricator.wikimedia.org/T258361) [10:53:37] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10hashar) [10:54:03] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1157 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/661698 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [11:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1100). 
[11:00:13] (03PS15) 10Jbond: utils::audit: add puppet audit script [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) [11:02:10] (03CR) 10Jbond: [C: 03+2] "mergeing this as it doesn't have any direct affect on anything post review/CR's welcome :)" [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond) [11:02:57] (03CR) 10Jbond: [C: 03+2] utils::audit: add puppet audit script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657571 (https://phabricator.wikimedia.org/T272559) (owner: 10Jbond) [11:05:24] (03CR) 10Elukey: [C: 03+2] Revert "Remove dns-disc config for eventstreams-internal" [dns] - 10https://gerrit.wikimedia.org/r/661649 (owner: 10Elukey) [11:05:45] 10SRE, 10CAS-SSO: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) p:05Triage→03Medium [11:07:18] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams-internal [11:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:06] (03PS1) 10Elukey: Set desired state for eventstreams-internal dns-disc [puppet] - 10https://gerrit.wikimedia.org/r/661702 (https://phabricator.wikimedia.org/T269160) [11:10:56] (03CR) 10jerkins-bot: [V: 04-1] Set desired state for eventstreams-internal dns-disc [puppet] - 10https://gerrit.wikimedia.org/r/661702 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [11:11:37] (03CR) 10JMeybohm: [C: 03+1] Set desired state for eventstreams-internal dns-disc [puppet] - 10https://gerrit.wikimedia.org/r/661702 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [11:12:58] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:14:11] (03PS2) 10Elukey: Set desired state for eventstreams-internal dns-disc [puppet] - 
10https://gerrit.wikimedia.org/r/661702 (https://phabricator.wikimedia.org/T269160) [11:16:26] (03CR) 10Elukey: [C: 03+2] Set desired state for eventstreams-internal dns-disc [puppet] - 10https://gerrit.wikimedia.org/r/661702 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [11:20:34] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10Vgutierrez) What's the current DNS query retry policy o... [11:22:31] (03PS1) 10Joal: Bump AQS druid backend datasource to 2021-01 [puppet] - 10https://gerrit.wikimedia.org/r/661703 [11:22:37] elukey: --^ [11:23:35] (03CR) 10Elukey: [C: 03+2] Bump AQS druid backend datasource to 2021-01 [puppet] - 10https://gerrit.wikimedia.org/r/661703 (owner: 10Joal) [11:23:58] joal: I'll ask to test aqs1004 in a min [11:24:05] ack elukey [11:26:27] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [11:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:57] joal: please go ahead :) [11:27:15] (03CR) 10Hnowlan: [C: 03+2] conftool: restore maps1009 to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/661420 (owner: 10Hnowlan) [11:27:18] elukey: Good for me! 
[11:27:47] joal: ack the cookbook is doing the rest [11:29:12] awesome - thanks elukey [11:30:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [11:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:46] (03PS1) 10Elukey: Lower retention of fsimage backups for Hadoop backup [puppet] - 10https://gerrit.wikimedia.org/r/661707 [11:35:19] (03CR) 10Elukey: [C: 03+2] Lower retention of fsimage backups for Hadoop backup [puppet] - 10https://gerrit.wikimedia.org/r/661707 (owner: 10Elukey) [11:40:26] RECOVERY - Check systemd state on an-worker1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:46] (03PS1) 10Mforns: Migrate PrefUpdate from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661709 (https://phabricator.wikimedia.org/T267348) [11:44:04] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: onboard icinga logging to common logging schema [puppet] - 10https://gerrit.wikimedia.org/r/661539 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [11:47:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian-ssl,name=maps1009.eqiad.wmnet [11:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1009.eqiad.wmnet [11:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:27] 10SRE, 10CAS-SSO: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) The following is from idp-test after updateing case in config ` 2021-02-04 11:50:49,536 DEBUG [org.apereo.cas.tic... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1200). [12:00:04] mforns: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:18] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Logs [12:00:33] hey, I'm here [12:03:00] I’m in a meeting, not sure if I run the window :/ [12:03:16] no rush Lucas_WMDE :] [12:06:59] 10SRE, 10CAS-SSO: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) After investigating i notice that even though spring has renamed config properties from camel case (myCoolProperty) to hyphen naming (my-cool-property) it seems old config parameters are silently remapped so... [12:08:45] !log bounce rsyslog on centrallog1001 [12:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:12] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1237 days) https://wikitech.wikimedia.org/wiki/Logs [12:10:07] Is there anyone else that can run the window? I see Amir1, awight and Urbanecm listed as Deployers. If not possible, no problem at all, I can reschedule for next window :] [12:10:27] Lucas and I are in a meeting but I can do it [12:10:45] Amir1: no, don't worry if you're in a meeting, I can wait! 
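Note on the CAS comment above (T273867): the rename described there is from camel case (myCoolProperty) to hyphen naming (my-cool-property). A minimal sketch of that name transformation follows, assuming simple ASCII property names. The function name is hypothetical, and this is not how Spring or CAS maps old names internally; it only illustrates the naming change.

```python
import re


def camel_to_kebab(name: str) -> str:
    """Convert a camelCase property name to hyphen (kebab-case) naming."""
    # Insert a hyphen before each capital that follows a lowercase letter or digit,
    # then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"-\1", name).lower()


print(camel_to_kebab("myCoolProperty"))  # my-cool-property
```

Names that are already hyphenated pass through unchanged, so the helper could be run over a mixed list of old and new property names.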
[12:11:00] well the meeting runs until almost the end of the window ^^ [12:11:12] (03CR) 10Ladsgroup: [C: 03+2] Migrate PrefUpdate from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661709 (https://phabricator.wikimedia.org/T267348) (owner: 10Mforns) [12:11:31] I'll do it, don't worry [12:11:42] ok, thanks a lot! [12:12:53] (03Merged) 10jenkins-bot: Migrate PrefUpdate from EventLogging to EventGate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661709 (https://phabricator.wikimedia.org/T267348) (owner: 10Mforns) [12:16:39] (03PS1) 10Jbond: apereo_cas: rename config properties [puppet] - 10https://gerrit.wikimedia.org/r/661713 (https://phabricator.wikimedia.org/T273867) [12:17:00] !log rebooting mw[1264-1268,1276-1277,1337-1338,1404-1409,1411,1413].eqiad.wmnet for kernel update [12:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:07] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 4:00:00 on 17 hosts with reason: reboot [12:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:19] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: rename config properties [puppet] - 10https://gerrit.wikimedia.org/r/661713 (https://phabricator.wikimedia.org/T273867) (owner: 10Jbond) [12:17:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 17 hosts with reason: reboot [12:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:23] mforns: it's live in mwdebug1002 [12:17:32] Amir1: ok, testing! [12:19:49] Amir1: mforns: sorry, just got here. I see Amir is deploying already, but if I'm needed, I'm around. [12:20:02] no problem at all! 
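Note on the host ranges in the reboot entries above: the bracketed expression is a compact list of numeric ranges, and expanding it gives the host count the downtime cookbook reports (17 for the eqiad batch). A minimal sketch of such an expansion follows, assuming a single bracket group per expression. `expand_hosts` is a hypothetical helper written for this note, not the host selection code the cookbooks actually use.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'mw[1264-1268,1411].eqiad.wmnet' into individual host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat the expression as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a one-element range
        width = len(start)  # keep any zero padding from the range start
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
    return hosts


eqiad = expand_hosts("mw[1264-1268,1276-1277,1337-1338,1404-1409,1411,1413].eqiad.wmnet")
print(len(eqiad))  # 17
```

The same helper gives 93 hosts for the codfw expression logged at 10:35:24, matching the "93 hosts" in the corresponding downtime entry.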
[12:20:50] (03PS2) 10Jbond: apereo_cas: rename config properties [puppet] - 10https://gerrit.wikimedia.org/r/661713 (https://phabricator.wikimedia.org/T273867) [12:21:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27862/console" [puppet] - 10https://gerrit.wikimedia.org/r/661713 (https://phabricator.wikimedia.org/T273867) (owner: 10Jbond) [12:27:34] !log installing libdatetime-timezone-perl updates on Buster [12:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:27] Amir1: I don't see any error, but the events I'm sending through testwiki using mwdebug1002 are not landing in Kafka... so, I guess we can cancel the deployment and I will troubleshoot further. [12:32:02] Amir1: should I revert the change? [12:33:08] mforns: hmm, maybe it's the mwdebug1002 being weird? [12:34:25] I see the events being sent to the correct URL https://intake-analytics.wikimedia.org/v1/events?hasty=true and correct payload [12:34:32] but not landing [12:35:43] okay, shall I revert then? [12:36:45] (03Abandoned) 10Muehlenhoff: Stop using Diamond on Cloud VPS/Toolforge [puppet] - 10https://gerrit.wikimedia.org/r/632471 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [12:36:54] Amir1: yes please! [12:37:50] sure doing [12:38:40] (03PS1) 10Ladsgroup: Revert "Migrate PrefUpdate from EventLogging to EventGate on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661650 [12:38:50] (03CR) 10Ladsgroup: [C: 03+2] Revert "Migrate PrefUpdate from EventLogging to EventGate on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661650 (owner: 10Ladsgroup) [12:39:50] (03Merged) 10jenkins-bot: Revert "Migrate PrefUpdate from EventLogging to EventGate on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661650 (owner: 10Ladsgroup) [12:40:17] thanks a lot Amir1, and sorry for the noise [12:41:29] mforns: all good! 
Sorry I wasn't around [12:41:35] it's done now [12:41:43] ok, thanks :] [12:42:57] does that complete all deployments? (I'd resume reboots, then) [12:44:21] looks like it but Amir1 should confirm [12:44:43] moritzm: Lucas_WMDE yes, we are done [12:44:47] thx [12:48:56] 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops, 10Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` snapshot1010.eqiad.wmnet ` The log c... [12:50:10] apergos: snapshot1010 is still configured to install Stretch in the DHCP config [12:50:21] wait wut [12:50:26] I thought I removed it [12:50:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/661642/1/modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 [12:50:55] sorry nvm, [12:50:56] my bad [12:50:59] whew! [12:51:07] * apergos takes a few deep breaths [12:51:38] that's the last one for this week :-) [12:51:54] 5,6 are to be replaced and the new ones were already due to have arrived [12:52:01] excellent! 
[12:52:16] 8 needs more tests of 'other' dumps before it can go, and that's it [12:52:33] yeah, it will be good to have this done [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1300) [13:02:49] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on snapshot1010.eqiad.wmnet with reason: REIMAGE [13:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:51] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on snapshot1010.eqiad.wmnet with reason: REIMAGE [13:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:31] !log upload cas_6.2.7 to downgrade cas T273867 [13:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:35] T273867: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 [13:10:48] PROBLEM - SSH on logstash2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:16:46] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1157.eqiad.wmnet'] ` The log ca... [13:18:34] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10hashar) On MediaWiki side it uses [[ https://www.php.ne... 
[13:20:04] (03PS1) 10Jbond: idp: failover to downgrade cas [dns] - 10https://gerrit.wikimedia.org/r/661719 [13:21:06] (03CR) 10Jbond: [C: 03+2] idp: failover to downgrade cas [dns] - 10https://gerrit.wikimedia.org/r/661719 (owner: 10Jbond) [13:23:15] 10SRE, 10User-fgiunchedi: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) 05Resolved→03Open Unfortunately I spoke too fast, the bug is still there, e.g. ` 4370 recvfrom(50, 0x7f062021acc3, 53, 0, NULL, NULL) = -1 EAGAIN... [13:25:25] (03PS1) 10Filippo Giunchedi: Revert "toil: remove rsyslog_tls_remedy" [puppet] - 10https://gerrit.wikimedia.org/r/661720 (https://phabricator.wikimedia.org/T199406) [13:25:32] 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops, 10Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['snapshot1010.eqiad.wmnet'] ` and were **ALL** successful. [13:25:56] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10elukey) Tested with `ssh -L 4992:eventstreams-internal.discovery.wmnet:4992 -N mwm... 
[13:26:36] (03CR) 10Filippo Giunchedi: "The bug isn't gone as I thought, put back the bandaid 😞" [puppet] - 10https://gerrit.wikimedia.org/r/661720 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [13:29:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host mwdebug1003.eqiad.wmnet [13:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE [13:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:46] (03PS1) 10Jbond: 6.4.0-RC1: test to see if issue is still present [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661721 (https://phabricator.wikimedia.org/T273867) [13:30:21] (03PS1) 10Jbond: Revert "idp: failover to downgrade cas" [dns] - 10https://gerrit.wikimedia.org/r/661651 [13:31:04] (03CR) 10Jbond: [C: 03+2] Revert "idp: failover to downgrade cas" [dns] - 10https://gerrit.wikimedia.org/r/661651 (owner: 10Jbond) [13:31:39] !log reboot logstash2005.codfw.wmnet, no ssh / stuck [13:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:48] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwdebug1003.eqiad.wmnet [13:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:59] RECOVERY - SSH on logstash2005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:36:38] !log installing openldap security updates on buster (client-side tools/libs only, slapd instance already updated) [13:36:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:44] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) This appears to be a regression related to the KRYO serialisation. As far as I can tell it only has issues after a restart. Early musing wonder if there is some nonce initiated on star... [13:38:02] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1157.eqiad.wmnet'] ` and were **ALL** successful. [13:42:51] (03PS2) 10Jbond: 6.4.0-RC1: test to see if issue is still present [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661721 (https://phabricator.wikimedia.org/T273867) [13:44:22] !log rolling restart of ncredir instances (kernel upgrade) [13:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir5002.eqsin.wmnet [13:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:25] (03PS3) 10Jbond: 6.4.0-RC1: test to see if issue is still present [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661721 (https://phabricator.wikimedia.org/T273867) [13:48:47] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) looks related https://groups.google.com/g/jasig-cas-user/c/v2VTr1y_X8M/m/_gieSp0lDAAJ [13:48:50] (03CR) 10Hashar: "Looks magic :)" [puppet] - 10https://gerrit.wikimedia.org/r/661689 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [13:49:11] (03PS3) 10Urbanecm: [WIP] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) [13:50:20] !log vgutierrez@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir5002.eqsin.wmnet [13:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir5001.eqsin.wmnet [13:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:38] (03PS4) 10Urbanecm: Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) [13:53:46] (03PS14) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [13:55:11] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27863/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [13:55:13] (03PS2) 10Urbanecm: bnwiki: wgGEHelpPanelLinks: Remove text in brackets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661522 (https://phabricator.wikimedia.org/T266020) [13:55:18] (03CR) 10Urbanecm: [C: 03+2] "no-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661522 (https://phabricator.wikimedia.org/T266020) (owner: 10Urbanecm) [13:56:18] (03Merged) 10jenkins-bot: bnwiki: wgGEHelpPanelLinks: Remove text in brackets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661522 (https://phabricator.wikimedia.org/T266020) (owner: 10Urbanecm) [13:57:11] PROBLEM - MariaDB Replica Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 852.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:57:12] 10SRE, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) 
[13:58:30] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: NO-OP: 7c67b2f03cbc27cf9e5f214a6f0ea0856d8c1ae4: bnwiki: wgGEHelpPanelLinks: Remove text in brackets (T266020) (duration: 01m 12s) [13:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:35] T266020: Deploy Growth experiments at Bangla Wikipedia - https://phabricator.wikimedia.org/T266020 [13:59:03] (03CR) 10Jbond: [C: 03+2] wmflib: drop ensure_link in favour of stdlib::ensure [puppet] - 10https://gerrit.wikimedia.org/r/661372 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [13:59:58] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [14:00:04] hashar and dancy: #bothumor I � Unicode. All rise for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1400). [14:03:10] (03CR) 10CDanis: [C: 03+1] swift: limit rsync and swift-object-replicator memory to 5% in codfw [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [14:05:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir5001.eqsin.wmnet [14:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:39] (03PS1) 10DCausse: [wdqs] Add flink sideoutput stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) [14:06:42] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) Similar issue with 6.4.0-RC1 ` java.lang.ClassCastException: class org.apereo.cas.authentication.DefaultAuthenticationHandlerExecutionResult cannot be cast to class org.apereo.cas.ticket... 
[14:07:09] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir4002.ulsfo.wmnet [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:10:11] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: limit rsync and swift-object-replicator memory to 5% in codfw [puppet] - 10https://gerrit.wikimedia.org/r/661408 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [14:10:42] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@46a2eaf]: (no justification provided) [14:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:55] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@46a2eaf]: (no justification provided) (duration: 00m 13s) [14:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:03] RECOVERY - tileratorui on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.026 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [14:11:13] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@0a38bc5]: (no justification provided) [14:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:15] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@0a38bc5]: (no justification provided) (duration: 00m 03s) [14:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:08] !log installing ffmpeg security updates on stretch [14:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4002.ulsfo.wmnet [14:14:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:39] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir4001.ulsfo.wmnet [14:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:25] (03CR) 10Jbond: [V: 03+1 C: 04-1] "This is required for >= 6.3, we had to downgrade to 6.2.7" [puppet] - 10https://gerrit.wikimedia.org/r/661713 (https://phabricator.wikimedia.org/T273867) (owner: 10Jbond) [14:16:43] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@47fc426]: (no justification provided) [14:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:56] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@47fc426]: (no justification provided) (duration: 00m 12s) [14:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:11] 10SRE, 10Security, 10cloud-services-team (Kanban): Implement SSH CA (certificate authority) for host keys? - https://phabricator.wikimedia.org/T268344 (10CDanis) [14:18:28] !log start rolling reboots of mc[2019-2027,2029-2037].codfw.wmnet T273278 [14:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir4001.ulsfo.wmnet [14:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:06] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Platform Team Workboards (Clinic Duty Team), 10Wikimedia-production-error: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10Joe) >>! In T231025#6745199, @holger.knust wrote: > Thi... 
[14:21:10] !log roll-restart rsync/swift-object-replicator in codfw to apply memory limits [14:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti5002.eqsin.wmnet [14:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir2002.codfw.wmnet [14:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir2002.codfw.wmnet [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5002.eqsin.wmnet [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:33] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2019.codfw.wmnet [14:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:35] (03CR) 10Gehel: [C: 04-1] "Nice to see this moving forward! A few comments inline." (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [14:38:42] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [14:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:47] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [14:39:29] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) I think it might be useful to enforce our user-agent policy at least for this image, and see who comes around complaining, given we don't seem to find a... 
[14:41:46] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir2001.codfw.wmnet [14:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:02] (03PS1) 10Giuseppe Lavagetto: Remove a couple of useless DNS lookups from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) [14:43:03] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Logs [14:43:04] !log stop db1095 instance in preparation of its decom T273732 [14:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:08] T273732: decommission db1095 - https://phabricator.wikimedia.org/T273732 [14:44:09] 10SRE, 10Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (10Joe) Given most nodejs applications don't use statsd anymore (in kubernetes we just use the prometheus-statsd exporter), and I have submitted https://gerrit.wikimedi... 
[14:44:22] (03PS1) 10Jbond: CAS style changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 [14:47:03] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 +0000 (expires in 1381 days) https://wikitech.wikimedia.org/wiki/Logs [14:47:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir2001.codfw.wmnet [14:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:41] (03CR) 10Faidon Liambotis: CAS style changes (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [14:53:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2019.codfw.wmnet [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:44] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2020.codfw.wmnet [14:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:05] PROBLEM - Check systemd state on mc2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:06] (03PS4) 10Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) [14:57:25] (03PS5) 10Jcrespo: install_server: Decommission db1095, substitute with db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) [14:57:55] (03CR) 10Jcrespo: [C: 03+1] "db1095 removed from tendril and zarcillo" [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: 10Jcrespo) [14:59:46] (03CR) 10Jcrespo: [C: 03+2] install_server: Decommission db1095, substitute with db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661345 (https://phabricator.wikimedia.org/T273732) (owner: 10Jcrespo) [15:01:16] !log jynus@cumin1001 START - Cookbook sre.hosts.decommission [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:40] (03PS2) 10Jbond: CAS style changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 [15:02:48] (03CR) 10Jbond: "updated" (033 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [15:08:48] 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops, 10Patch-For-Review: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) snapshot1010 is done. I need to do a bunch more testing before I can reimage snapshot1008. 
[15:11:12] (03PS1) 10David Caro: wmcs.backups: Fix missing parameter when backing up a new VM [puppet] - 10https://gerrit.wikimedia.org/r/661739 (https://phabricator.wikimedia.org/T273892) [15:11:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2020.codfw.wmnet [15:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2021.codfw.wmnet [15:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:13] PROBLEM - Check systemd state on mc2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:46] (03PS1) 10Effie Mouzeli: hieradata: Remove mc1024 from config [puppet] - 10https://gerrit.wikimedia.org/r/661740 (https://phabricator.wikimedia.org/T272078) [15:18:33] PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:16] !log draining ganeti3003 for eventual reboot [15:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti3003.esams.wmnet [15:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2021.codfw.wmnet [15:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3003.esams.wmnet [15:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:15] !log draining ganeti3001 for eventual reboot [15:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:37] PROBLEM - Check systemd state on mc2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.backups: Fix missing parameter when backing up a new VM [puppet] - 10https://gerrit.wikimedia.org/r/661739 (https://phabricator.wikimedia.org/T273892) (owner: 10David Caro) [15:33:57] RECOVERY - Check systemd state on ms-be2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:15] (03CR) 10David Caro: [C: 03+2] wmcs.backups: Fix missing parameter when backing up a new VM [puppet] - 10https://gerrit.wikimedia.org/r/661739 (https://phabricator.wikimedia.org/T273892) (owner: 10David Caro) [15:38:59] 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10fgiunchedi) [15:40:35] (03CR) 10Cwhite: [C: 03+1] "😞" [puppet] - 10https://gerrit.wikimedia.org/r/661720 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [15:40:55] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2022.codfw.wmnet [15:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:33] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:44:35] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "toil: remove rsyslog_tls_remedy" [puppet] - 10https://gerrit.wikimedia.org/r/661720 (https://phabricator.wikimedia.org/T199406) (owner: 10Filippo Giunchedi) [15:47:07] (03CR) 10Volans: "As discussed in the meeting, this seems ok as a temporary solution to unblock the usage as non-root although not optimal. To be revisited " (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [15:48:03] (03CR) 10Muehlenhoff: "Looks good, a few nits/questions inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [15:50:19] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti3001.esams.wmnet [15:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:12] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3001.esams.wmnet [15:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2022.codfw.wmnet [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2023.codfw.wmnet [15:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:27] !log failover ganeti master in esams to ganeti3001 [15:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:53] PROBLEM - Check systemd state on mc2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:09] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) Testing different transcoders shows this is only an issue with KRYO: |cas version| transcoder| status | |6.3.1| KRYO | [[ https://phabricator.wikimedia.org/T273867#6803365 | fail ]] |... 
[16:00:40] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) [16:00:56] !log draining ganeti3002 for eventual reboot [16:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:59] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:02:51] 10SRE, 10Security, 10User-MoritzMuehlenhoff, 10cloud-services-team (Kanban): Implement SSH CA (certificate authority) for host keys? - https://phabricator.wikimedia.org/T268344 (10MoritzMuehlenhoff) [16:03:18] (03CR) 10Cwhite: [C: 03+2] profile: onboard icinga logging to common logging schema [puppet] - 10https://gerrit.wikimedia.org/r/661539 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:05:43] 10SRE, 10MediaWiki-Debug-Logger, 10Traffic, 10Patch-For-Review, and 2 others: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002 - https://phabricator.wikimedia.org/T231025 (10hashar) >>! In T231025#6803751, @Joe wrote: > ... > In the specific case, we're trying to resolve `m... [16:05:57] (03CR) 10Hashar: [C: 03+1] Remove a couple of useless DNS lookups from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) (owner: 10Giuseppe Lavagetto) [16:08:30] (03PS1) 10Urbanecm: Remove ruwiki's A/B test for WelcomeSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661744 (https://phabricator.wikimedia.org/T273900) [16:10:28] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:33] (03CR) 10Ottomata: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [16:12:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2023.codfw.wmnet [16:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:28] PROBLEM - Check systemd state on mc2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:50] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Gilles) You could even serve another image in its place to this UA, with some text and an email address to contact. You'd probably find out pretty quickly wh... [16:18:35] effie: there is something a little weird happening on mc20xx nodes [16:18:51] after the reboots? [16:19:06] let me check [16:19:19] yes, the ifup@eno1 units are showing a python error [16:20:34] I thought we had it fixed [16:20:39] sigh [16:20:42] let's see [16:21:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti3002.esams.wmnet [16:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:36] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.6 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:23:34] elukey: it is not something disruptive, but it is different than the other time [16:23:41] thank you! 
[16:25:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3002.esams.wmnet [16:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:02] (03CR) 10Will Doran: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [16:29:02] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.379 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:29:54] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) git diff v6.2.7..v6.3.0-RC3 support/cas-server-support-memcached-core/src/main/java/org/apereo/cas/memcached/kryo (github kept crashing) ` diff --git a/support/cas-server-support-memcac... [16:36:29] (03CR) 10Cicalese: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [16:37:40] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.233 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:41:13] (03PS1) 10Mforns: Migrate PrefUpdate to EventPlatform on testwiki (2nd trial) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661746 (https://phabricator.wikimedia.org/T267348) [16:43:26] (03CR) 10Ottomata: [C: 03+2] Migrate PrefUpdate to EventPlatform on testwiki (2nd trial) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661746 (https://phabricator.wikimedia.org/T267348) (owner: 10Mforns) [16:46:52] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate PrefUpdate schema to Event 
Platform on testwiki - T267348 (duration: 01m 08s) [16:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:57] T267348: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 [16:50:01] (03CR) 10Legoktm: "See https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Problem:_undeployed_code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [16:55:13] (03CR) 10DCausse: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [16:57:35] (03CR) 10Cicalese: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661167 (https://phabricator.wikimedia.org/T270178) (owner: 10Alex Paskulin) [17:00:02] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:04] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1700). 
[17:02:38] (03CR) 10Ottomata: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [17:03:24] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.133 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:07:54] (03CR) 10Reedy: [C: 03+1] Remove a couple of useless DNS lookups from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) (owner: 10Giuseppe Lavagetto) [17:11:06] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10jbond) >>! In T273867#6804120, @jbond wrote: > Testing different transcoders shows this is only an issue with KRYO: > > |cas version| transcoder| status | > |6.3.1| KRYO | [[ https://phabrica...
[17:15:55] (03PS2) 10Urbanecm: Enable GrowthExperiments on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/650012 (https://phabricator.wikimedia.org/T266020) (owner: 10Gergő Tisza) [17:16:40] (03PS1) 10Cwhite: profile: remove type field in icinga ecs compatibility step [puppet] - 10https://gerrit.wikimedia.org/r/661754 [17:16:42] (03PS1) 10Cwhite: profile: fix guard condition to check ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/661755 [17:18:27] (03PS3) 10Krinkle: CAS style changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [17:19:05] (03CR) 10Cwhite: [C: 03+2] profile: remove type field in icinga ecs compatibility step [puppet] - 10https://gerrit.wikimedia.org/r/661754 (owner: 10Cwhite) [17:20:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1078.eqiad.wmnet - https://phabricator.wikimedia.org/T273597 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks @Marostegui >>! In T273597#6803028, @Marostegui wrote: > @wiki_willy this is ready for #dc-ops [17:20:44] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.367 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:21:35] (03PS2) 10Cwhite: profile: fix guard condition to check ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/661755 [17:22:07] (03CR) 10Krinkle: CAS style changes (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [17:23:18] 10ops-esams, 10DC-Ops: Esams: Delete rack OE10, OE11, OE12 and OE13 from Netbox - https://phabricator.wikimedia.org/T273841 (10wiki_willy) Hi @Papaul - thanks for bringing it up. Unless there's some other dependencies that need to be removed beforehand, I think it should be ok. We officially term'd out of th... 
[17:25:00] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.154 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:25:42] legoktm: hey, need help with logos. When I want to revert enwiki back to standard, should I keep `enwiki20a: File:WP20 EnWiki20 SimplifiedLogo BillionEdits fixed.svg` under variants, or remove it? [17:25:59] hey [17:26:30] are you deleting the pngs right now too? or just switching logos.php? I think the variants block should be deleted whenever the pngs are deleted [17:26:42] legoktm: just switching logos.php [17:27:04] so I would say just remove the "selected" block to change logos.php, and then once the pngs are removed, we delete the variants block [17:27:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:27:31] okay, thanks legoktm [17:28:17] (03PS1) 10Urbanecm: Switch enwiki back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661756 (https://phabricator.wikimedia.org/T272108) [17:28:22] legoktm: mind reviewing ^^, please? [17:28:46] (03CR) 10Legoktm: [C: 03+1] Switch enwiki back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661756 (https://phabricator.wikimedia.org/T272108) (owner: 10Urbanecm) [17:28:54] lgtm [17:28:57] thanks! 
[17:30:44] (03CR) 10Cwhite: [C: 03+2] profile: fix guard condition to check ecs.version [puppet] - 10https://gerrit.wikimedia.org/r/661755 (owner: 10Cwhite) [17:31:03] (03PS1) 10Dduvall: docker_registry_ha: Allow docker push from releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/661757 (https://phabricator.wikimedia.org/T271477) [17:32:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:33:19] (03CR) 10Urbanecm: [C: 03+2] Switch enwiki back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661756 (https://phabricator.wikimedia.org/T272108) (owner: 10Urbanecm) [17:33:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:34:04] (03PS1) 10Mforns: Migrate PrefUpdate schema to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661758 (https://phabricator.wikimedia.org/T267348) [17:34:10] PROBLEM - SSH on mw2249.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:34:19] (03Merged) 10jenkins-bot: Switch enwiki back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661756 (https://phabricator.wikimedia.org/T272108) (owner: 10Urbanecm) [17:35:13] 10SRE, 10CAS-SSO, 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10elukey) Probably best to ask to @Gehel :) [17:36:02] (03CR) 10Ottomata: [C: 03+2] Migrate PrefUpdate schema to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661758 (https://phabricator.wikimedia.org/T267348) (owner: 10Mforns) [17:36:12] ottomata: wait before you start your sync please [17:36:47] Urbanecm: can do [17:36:57] Urbanecm: ok if i scap pull on mwdebug1001? [17:36:57] I'm syncing right now, that's why :) [17:37:01] oh ok i'll wait [17:37:07] I'll ping you when done [17:37:19] i've already rebased in mediawiki-staging [17:37:21] should I undo? 
[17:37:28] you might've confused my scap sync-file [17:37:51] I'm changing logos.php and logos/config.yaml only [17:38:04] hm ok, i'm changing initialisesettings.php only [17:38:04] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.079 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:38:07] should be ok [17:38:07] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: eed3c8e7294d03a62bc71e0a8d9a50044d1edbaa: Switch enwiki back to standard logo (T272108; 1/2) (duration: 03m 12s) [17:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:13] T272108: Change EnWiki logo's back to the standard one, on or after 2021-02-04 - https://phabricator.wikimedia.org/T272108 [17:38:22] ottomata: my scap sync-file says `ssh: connect to host mw1309.eqiad.wmnet port 22: Connection timed out` [17:38:40] weird, you are syncing just the one file at a time? [17:38:44] mw1309 is being reimaged right now [17:38:44] yes [17:38:50] that's me [17:38:50] dunno how my rebase would have caused a problem for that [17:39:04] Urbanecm: just try again? [17:39:22] i can also revert first in mw-staging if you like [17:39:22] normally doesn't happen because reimaging also takes it out of scap dsh group [17:39:29] as it's reimaged, it probably will do the same thing as well [17:39:29] oh ok phew [17:39:30] but we had really bad timing then [17:40:08] syncing the second file, and once that completes, I'll be done [17:40:32] jbond42: about T273867 how urgent is it ? Can you tag discovery search and wait our triage session next Monday ?
[17:40:33] T273867: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 [17:41:01] !log urbanecm@deploy1001 Synchronized logos/config.yaml: eed3c8e7294d03a62bc71e0a8d9a50044d1edbaa: Switch enwiki back to standard logo (T272108; 2/2) (duration: 01m 07s) [17:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:15] I'll also resync logos.php just in case [17:41:47] ok [17:42:27] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: eed3c8e7294d03a62bc71e0a8d9a50044d1edbaa: Switch enwiki back to standard logo (T272108; resync) (duration: 01m 07s) [17:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:34] ottomata: all yours now :) [17:44:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1401.eqiad.wmnet with reason: REIMAGE [17:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:38] ok [17:46:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1401.eqiad.wmnet with reason: REIMAGE [17:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:23] (03PS7) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 [17:48:28] (03PS1) 10Jbond: puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) [17:48:30] (03PS1) 10Jbond: wmflib: drop shell_export function [puppet] - 10https://gerrit.wikimedia.org/r/661762 (https://phabricator.wikimedia.org/T273743) [17:49:39] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [17:51:16] (03PS2) 10Jbond: puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 
(https://phabricator.wikimedia.org/T273743) [17:51:19] (03CR) 10David Caro: "Added some tests." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [17:51:46] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate PrefUpdate schema to Event Platform on all wikis - T267348 (duration: 01m 01s) [17:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2278.codfw.wmnet with reason: REIMAGE [17:51:50] T267348: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 [17:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:06] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [17:52:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27865/console" [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [17:53:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2278.codfw.wmnet with reason: REIMAGE [17:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:33] (03PS3) 10Jbond: puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) [17:56:53] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.042 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:57:12] (03CR) 10Krinkle: "I don't recall how memc sharding is configured, are the strings 
meaningful/considered, or is it more of a list where this means all keys r" [puppet] - 10https://gerrit.wikimedia.org/r/661740 (https://phabricator.wikimedia.org/T272078) (owner: 10Effie Mouzeli) [17:57:35] (03CR) 10jerkins-bot: [V: 04-1] remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [17:59:00] (03CR) 10Krinkle: Remove a couple of useless DNS lookups from mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661732 (https://phabricator.wikimedia.org/T231025) (owner: 10Giuseppe Lavagetto) [18:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1800). Please do the needful. [18:01:13] (03PS4) 10Jbond: puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) [18:01:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1309.eqiad.wmnet with reason: REIMAGE [18:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:28] (03CR) 10Effie Mouzeli: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/661740 (https://phabricator.wikimedia.org/T272078) (owner: 10Effie Mouzeli) [18:01:40] (03PS1) 10Mforns: Rollback migration of PrefUpdate to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661766 (https://phabricator.wikimedia.org/T267348) [18:03:11] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) a:05herron→03RobH [18:03:13] (03CR) 10Ottomata: [C: 03+2] Rollback migration of PrefUpdate to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661766 (https://phabricator.wikimedia.org/T267348) (owner: 10Mforns) [18:03:15] !log dzahn@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1309.eqiad.wmnet with reason: REIMAGE [18:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27869/console" [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:04:38] (03CR) 10David Caro: "I seem to be unable to reproduce that error locally, will investigate, any pointers are appreciated." [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [18:05:01] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: revert - Migrate PrefUpdate schema to Event Platform on all wikis - leave on testwiki only, seeing validation errors. T267348 (duration: 01m 01s) [18:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:08] T267348: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 [18:06:12] (03PS5) 10Jbond: puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) [18:08:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27871/console" [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:09:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1401.eqiad.wmnet'] ` an... 
[18:10:27] (03PS6) 10Jbond: puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) [18:10:31] (03CR) 10Cwhite: WIP logstash: add ulogd ecs filter + tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi) [18:12:15] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10RobH) [18:12:28] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10RobH) [18:12:59] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install thumbor100[56] - https://phabricator.wikimedia.org/T273914 (10RobH) [18:13:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27874/console" [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:13:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster: drop shell_exports function [puppet] - 10https://gerrit.wikimedia.org/r/661761 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:13:47] (03PS2) 10Jbond: wmflib: drop shell_export function [puppet] - 10https://gerrit.wikimedia.org/r/661762 (https://phabricator.wikimedia.org/T273743) [18:16:40] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-55] - https://phabricator.wikimedia.org/T273915 (10RobH) [18:16:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2278.codfw.wmnet'] ` an... 
[18:17:00] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-55] - https://phabricator.wikimedia.org/T273915 (10RobH) [18:17:55] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.938 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:17:59] 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10RobH) [18:18:33] (03CR) 10Jbond: [C: 03+2] wmflib: drop shell_export function [puppet] - 10https://gerrit.wikimedia.org/r/661762 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:19:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1401.eqiad.wmnet [18:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2278.codfw.wmnet [18:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:22] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [18:27:46] (03CR) 10CDanis: [C: 03+2] VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP [puppet] - 10https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) (owner: 10CDanis) [18:28:27] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕜☕ sudo cumin A:cp 'disable-puppet "cdanis deploying I498a0c4af T263496"' [18:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:31] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [18:31:04] ^ going to let that soak on cp2027 for a bit [18:31:11] (manually ran puppet there) [18:32:06] PROBLEM - Logstash 
Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.3 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:39:13] hey folks, should i count on the deployment train being unblocked today? [18:43:15] 10SRE, 10serviceops, 10User-jijiki: ifup@eno1.service fails on mc* hosts after 4.19.171-2 upgrade - https://phabricator.wikimedia.org/T273918 (10jijiki) [18:43:53] ACKNOWLEDGEMENT - Check systemd state on mc2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Filed T273918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:53] ACKNOWLEDGEMENT - Check systemd state on mc2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Filed T273918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:53] ACKNOWLEDGEMENT - Check systemd state on mc2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Filed T273918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:53] ACKNOWLEDGEMENT - Check systemd state on mc2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Filed T273918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:53] ACKNOWLEDGEMENT - Check systemd state on mc2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Effie Mouzeli Filed T273918 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:10] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2019.codfw.wmnet [18:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:46] (03PS1) 10Jbond: varnish::instance: drop use of array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661769 (https://phabricator.wikimedia.org/T273743) [18:45:26] (03PS1) 10Wolfgang Kandek: Calculator Service second try [deployment-charts] - 10https://gerrit.wikimedia.org/r/661770 (https://phabricator.wikimedia.org/T273151) [18:45:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27875/console" [puppet] - 10https://gerrit.wikimedia.org/r/661769 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:45:53] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:45:55] !log T263496 deployed I498a0c4af on cp2027 at 18:29; now deploying on cp3060 [18:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:00] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [18:48:40] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1309.eqiad.wmnet'] ` an... 
[18:48:51] (03PS2) 10Jbond: varnish::instance: drop use of array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661769 (https://phabricator.wikimedia.org/T273743) [18:50:49] (03Abandoned) 10Wolfgang Kandek: Adding calculator-service to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/661491 (https://phabricator.wikimedia.org/T273151) (owner: 10Wolfgang Kandek) [18:52:40] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:16] (03PS2) 10CRusnov: burrow/check_kafka_consumer_lag.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) [18:53:18] (03CR) 10CRusnov: burrow/check_kafka_consumer_lag.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:54:16] (03CR) 10CRusnov: [C: 03+2] ldap/rewrite-group-for-memberof.py: Port for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658415 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:57:55] (03PS1) 10Jbond: P:ntp: drop use of array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661773 (https://phabricator.wikimedia.org/T273743) [18:58:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2019.codfw.wmnet [18:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27876/console" [puppet] - 10https://gerrit.wikimedia.org/r/661773 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [18:59:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1401.eqiad.wmnet [18:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: 
That opportune time is upon us again. Time for a Morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:00:42] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.271 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [19:00:47] (03PS2) 10Jbond: P:ntp: drop use of array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661773 (https://phabricator.wikimedia.org/T273743) [19:00:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1309.eqiad.wmnet [19:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... 
[19:02:04] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2278.codfw.wmnet [19:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:32] (03CR) 10Urbanecm: [C: 04-1] Allow sysop to add/remove transwiki for zhwikinews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/660795 (https://phabricator.wikimedia.org/T273405) (owner: 10Hamish) [19:02:38] (03PS1) 10Bstorm: pbuilder: create apt-cache directory before running pbuilder init [puppet] - 10https://gerrit.wikimedia.org/r/661777 [19:03:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27877/console" [puppet] - 10https://gerrit.wikimedia.org/r/661773 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [19:03:32] Daimona_: hey, you around? [19:03:47] Urbanecm: sort of [19:03:58] Viz, for around 5 minutes [19:04:07] Daimona_: wanna make wgAbuseFilterAflFilterMigrationStage READ_NEW on prod? 
[19:04:18] if you're here for 5 mins only I guess not, but asking anyway :D [19:04:29] I certainly want :D [19:04:34] I mean, now :D [19:04:50] I'm going to be back in like 20 minutes anyway, so if you can +2 it now I should be able to get back for when it's merged [19:05:09] it's a config change, so it'll be merged in a minute Daimona_ [19:05:12] not like backports :D [19:05:29] feel free to ping me in 20 mins, and then we can do it [19:05:34] Oh right, I'm stoopid [19:05:40] Nah, we can do it now [19:06:18] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕜☕ sudo cumin A:cp 'enable-puppet "cdanis deploying I498a0c4af T263496"' [19:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:23] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [19:07:03] Daimona_: okay, merging [19:07:12] (03PS2) 10Urbanecm: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657695 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [19:07:16] (03CR) 10Urbanecm: [C: 03+2] wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657695 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [19:08:00] (03CR) 10Mforns: [C: 03+1] "LGTM! Let me know when this can be deployed, and I'll ping an SRE to merge this." [puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight) [19:09:02] (03Merged) 10jenkins-bot: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657695 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [19:09:34] Daimona_: available at mwdebug1001 for testing [19:09:41] Going [19:09:48] Let's hope the world doesn't end. 
:-) [19:09:56] James_F: we can always rollback :) [19:10:16] (last famous words, I know) [19:10:21] James_F: if you say that, it will end [19:11:24] afl_id=1519726 on itwp [19:11:52] Daimona_: are you asking me to dump it? [19:12:08] (03PS1) 10Jbond: scap::target: drop array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661780 (https://phabricator.wikimedia.org/T273743) [19:12:26] Huh, actually, we're just checking the READ bit, so no need to [19:12:38] done anyway, https://phabricator.wikimedia.org/P14210 [19:12:38] Looks fine, checking elsewhere with a global filter now [19:12:52] okay [19:13:25] (03CR) 10Ebernhardson: [C: 03+1] [cirrus] rename ores_articletopics -> weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661383 (https://phabricator.wikimedia.org/T273508) (owner: 10DCausse) [19:15:09] It works! Guess this is how the world does not end. [19:16:54] Good! [19:16:57] syncing then :) [19:17:23] Perfect, thank you :) [19:17:23] enter pressed [19:18:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 74e7f70c7c8ae4c8ee9589262d088562c7274b98: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in production (T269712) (duration: 01m 11s) [19:18:36] and...we're live :) [19:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:37] T269712: Migrate afl_filter to afl_filter_id and afl_global - https://phabricator.wikimedia.org/T269712 [19:19:59] (03PS2) 10Urbanecm: Remove ruwiki's A/B test for WelcomeSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661744 (https://phabricator.wikimedia.org/T273900) [19:20:04] (03CR) 10Urbanecm: [C: 03+2] Remove ruwiki's A/B test for WelcomeSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661744 (https://phabricator.wikimedia.org/T273900) (owner: 10Urbanecm) [19:21:05] (03Merged) 10jenkins-bot: Remove ruwiki's A/B test for WelcomeSurvey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661744 
(https://phabricator.wikimedia.org/T273900) (owner: 10Urbanecm) [19:22:27] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661784 [19:23:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 40): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27878/console" [puppet] - 10https://gerrit.wikimedia.org/r/661780 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [19:23:28] what is this? https://usercontent.irccloud-cdn.com/file/1ZOZVPnE/image.png [19:23:34] never seen this from scap before [19:24:03] an error ;p [19:24:23] Reedy: thank you...but...what does it mean? [19:24:32] something timed out [19:24:53] I'll sync this again, and see if it happens again, I guess [19:25:11] https://github.com/wikimedia/scap/blob/master/scap/log.py#L134 [19:25:19] it's to do with the !log message [19:25:28] so theoretically it should be just unlogged, but synced [19:25:38] Yeah [19:25:48] I guess failing to log to IRC is not a major failure concern [19:25:51] 10SRE, 10ops-esams, 10DC-Ops: Esams: Delete rack OE10, OE11, OE12 and OE13 from Netbox - https://phabricator.wikimedia.org/T273841 (10Papaul) 05Open→03Resolved All 4 racks removed from Netbox. 
[19:26:01] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 35e6e4014eee7946979fbf6cd782ae90a3612b82: Remove ruwiki A/B test for WelcomeSurvey (T273900) (duration: 01m 07s) [19:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:06] T273900: Display Welcome Survey to 100% of newcomers in ruwiki - https://phabricator.wikimedia.org/T273900 [19:26:08] seems it logged correctly on the second try [19:26:09] (03PS1) 10Jbond: prometheus: update profiles to drop array_concat [puppet] - 10https://gerrit.wikimedia.org/r/661785 (https://phabricator.wikimedia.org/T273743) [19:26:15] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['logstash1033.eqiad.wmnet', 'logstash1034.eqiad.wmnet', 'logstash1035.eqiad.w... [19:26:45] hahaha [19:26:54] i cannot run the image script before putting in puppet changes... [19:27:20] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) >>! In T267666#6804856, @ops-monitoring-bot wrote: > Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: > ` > ['logstash1033.eqiad.wmnet', 'logst... 
[19:28:14] (03PS1) 10RobH: logstash103[345] puppet updates [puppet] - 10https://gerrit.wikimedia.org/r/661786 (https://phabricator.wikimedia.org/T267666) [19:28:18] (03PS1) 10Urbanecm: abusefilter: enwikibooks: Enable block action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661787 [19:28:38] (03CR) 10Urbanecm: [C: 03+2] abusefilter: enwikibooks: Enable block action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661787 (owner: 10Urbanecm) [19:28:48] (03CR) 10RobH: [C: 03+2] logstash103[345] puppet updates [puppet] - 10https://gerrit.wikimedia.org/r/661786 (https://phabricator.wikimedia.org/T267666) (owner: 10RobH) [19:29:03] (03PS2) 10Urbanecm: abusefilter: enwikibooks: Enable block action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661787 (https://phabricator.wikimedia.org/T273864) [19:29:08] (03CR) 10Urbanecm: [C: 03+2] abusefilter: enwikibooks: Enable block action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661787 (https://phabricator.wikimedia.org/T273864) (owner: 10Urbanecm) [19:29:45] (03PS2) 10RobH: logstash103[345] puppet updates [puppet] - 10https://gerrit.wikimedia.org/r/661786 (https://phabricator.wikimedia.org/T267666) [19:30:03] Reedy: if you have a second, could you review https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/655625 please? [19:30:04] (03Merged) 10jenkins-bot: abusefilter: enwikibooks: Enable block action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661787 (https://phabricator.wikimedia.org/T273864) (owner: 10Urbanecm) [19:31:33] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) @BBlack im looking at the `array_conact` function and trying to decide if its worth porting to the newer puppet API or dropping. I have looked at the... 
[19:31:36] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: a199b8384f4226b70fc00538f01e41a9a68b3ea3: abusefilter: enwikibooks: Enable block action (T273864) (duration: 01m 06s) [19:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:40] T273864: Add block for abusefilter (en.wb) - https://phabricator.wikimedia.org/T273864 [19:31:41] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2244.codfw.wmnet with reason: REIMAGE [19:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:49] (03PS1) 10Urbanecm: sysop_itwiki: Set wmgUsePopups to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661788 (https://phabricator.wikimedia.org/T259480) [19:32:32] (03PS2) 10Urbanecm: sysop_itwiki: Set wmgUsePopups to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661788 (https://phabricator.wikimedia.org/T259480) [19:32:37] (03CR) 10Urbanecm: [C: 03+2] sysop_itwiki: Set wmgUsePopups to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661788 (https://phabricator.wikimedia.org/T259480) (owner: 10Urbanecm) [19:33:08] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) [19:33:46] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2244.codfw.wmnet with reason: REIMAGE [19:33:47] (03Merged) 10jenkins-bot: sysop_itwiki: Set wmgUsePopups to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661788 (https://phabricator.wikimedia.org/T259480) (owner: 10Urbanecm) [19:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 968ae8b69d7f743f0e589ba3568de36bc462c7d6: sysop_itwiki: Set wmgUsePopups to false (T259480) (duration: 01m 06s) [19:35:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:46] T259480: Popups not working on sysop-it.wikipedia because REST endpoint is returning a 500 - https://phabricator.wikimedia.org/T259480 [19:36:16] * Urbanecm done [19:36:19] 10SRE, 10ops-eqiad, 10DC-Ops: update hostname labels on logstash103[345] - https://phabricator.wikimedia.org/T273922 (10RobH) [19:36:50] Urbanecm: if you wanted more stuff to deploy, I have some more logo patches up :) [19:37:06] legoktm: I don't mind deploying them :) [19:37:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27879/console" [puppet] - 10https://gerrit.wikimedia.org/r/661785 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [19:37:17] 10SRE, 10ops-eqiad, 10DC-Ops: update hostname labels on logstash103[345] - https://phabricator.wikimedia.org/T273922 (10RobH) p:05Triage→03Low [19:37:27] all of https://gerrit.wikimedia.org/r/q/topic:%2522logo-update-recompress%2522+status:open legoktm ? 
[19:37:31] yep [19:37:37] okay, let's do them too [19:38:04] (03PS2) 10Urbanecm: logos: Update rowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661545 (owner: 10Legoktm) [19:38:08] (03CR) 10Urbanecm: [C: 03+2] logos: Update rowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661545 (owner: 10Legoktm) [19:39:09] (03Merged) 10jenkins-bot: logos: Update rowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661545 (owner: 10Legoktm) [19:39:28] (03PS2) 10Urbanecm: logos: Update kowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661566 (owner: 10Legoktm) [19:39:33] (03CR) 10Urbanecm: [C: 03+2] logos: Update kowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661566 (owner: 10Legoktm) [19:40:40] (03PS2) 10Urbanecm: logos: Update eowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661567 (owner: 10Legoktm) [19:40:44] (03CR) 10Urbanecm: [C: 03+2] logos: Update eowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661567 (owner: 10Legoktm) [19:41:32] (03PS2) 10Urbanecm: logos: Update dawiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661568 (owner: 10Legoktm) [19:41:38] (03CR) 10Urbanecm: [C: 03+2] logos: Update dawiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661568 (owner: 10Legoktm) [19:43:36] (03PS3) 10Urbanecm: logos: Update eowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661567 (owner: 10Legoktm) [19:43:42] (03CR) 10Urbanecm: [C: 03+2] logos: Update eowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661567 (owner: 10Legoktm) [19:43:58] (03PS2) 10Urbanecm: logos: Update arwiki from Commons and recompress [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/661569 (owner: 10Legoktm) [19:44:02] (03CR) 10Urbanecm: [C: 03+2] logos: Update arwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661569 (owner: 10Legoktm) [19:44:51] (03CR) 10Volans: "> Patch Set 7:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (owner: 10David Caro) [19:45:16] (03PS3) 10Urbanecm: logos: Update arwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661569 (owner: 10Legoktm) [19:45:23] (03CR) 10Urbanecm: [C: 03+2] logos: Update arwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661569 (owner: 10Legoktm) [19:45:41] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['logstash1033.eqiad.wmnet', 'logstash1034.eqiad.wmnet', 'logstash1035.eqiad.w... 
[19:45:59] (03PS2) 10Urbanecm: logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:46:04] (03CR) 10Urbanecm: [C: 03+2] logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:46:21] (03Merged) 10jenkins-bot: logos: Update arwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661569 (owner: 10Legoktm) [19:46:58] (03PS2) 10Urbanecm: logos: Update vowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661571 (owner: 10Legoktm) [19:47:02] (03CR) 10Urbanecm: [C: 03+2] logos: Update vowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661571 (owner: 10Legoktm) [19:47:17] (03PS3) 10Urbanecm: logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:47:22] (03CR) 10Urbanecm: [C: 03+2] logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:47:54] (03Merged) 10jenkins-bot: logos: Update vowiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661571 (owner: 10Legoktm) [19:48:01] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [19:48:16] (03PS4) 10Urbanecm: logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:48:21] (03CR) 10Urbanecm: [C: 03+2] logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:48:50] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [19:49:29] 
(03Merged) 10jenkins-bot: logos: Update idwiki from Commons and recompress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661570 (owner: 10Legoktm) [19:49:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1309.eqiad.wmnet [19:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:02] (03PS1) 10Bstorm: sbuild: in buster, sbuild installs createchroot cmd in /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/661791 [19:50:09] sorry for all the pings legoktm :) [19:50:17] <3 I don't mind at all [19:51:08] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Recompress several Wikipedia logos (1/2) (duration: 01m 07s) [19:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:52:26] !log urbanecm@deploy1001 Synchronized logos/config.yaml: Recompress several Wikipedia logos (2/2) (duration: 01m 05s) [19:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:36] legoktm: should be done! 
[19:52:51] except HTCP purges, which...I'll do now [19:53:10] (03PS1) 10Jbond: wmflib: drop conflicts method [puppet] - 10https://gerrit.wikimedia.org/r/661794 (https://phabricator.wikimedia.org/T273743) [19:53:12] (03PS1) 10Jbond: wmflib: drop conftool funtion [puppet] - 10https://gerrit.wikimedia.org/r/661795 (https://phabricator.wikimedia.org/T273743) [19:53:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:54:01] (03PS2) 10Jbond: wmflib: drop conftool funtion [puppet] - 10https://gerrit.wikimedia.org/r/661795 (https://phabricator.wikimedia.org/T273743) [19:54:31] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Identify and upstream useful functions from wmflib - https://phabricator.wikimedia.org/T273743 (10jbond) [19:54:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:56:39] !log Purge several recompressed Wikipedia logos [19:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] hashar and dancy: May I have your attention please! Mediawiki train - European+American Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210204T2000) [20:09:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2267.codfw.wmnet with reason: REIMAGE [20:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1400.eqiad.wmnet with reason: REIMAGE [20:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2267.codfw.wmnet with reason: REIMAGE [20:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:07] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1400.eqiad.wmnet with reason: REIMAGE [20:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1308.eqiad.wmnet with reason: REIMAGE [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:03] (03PS1) 10RobH: correcting logstash103[345] macs [puppet] - 10https://gerrit.wikimedia.org/r/661800 (https://phabricator.wikimedia.org/T267666) [20:24:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2244.codfw.wmnet'] ` an... 
[20:25:16] (03CR) 10RobH: [C: 03+2] correcting logstash103[345] macs [puppet] - 10https://gerrit.wikimedia.org/r/661800 (https://phabricator.wikimedia.org/T267666) (owner: 10RobH) [20:25:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1308.eqiad.wmnet with reason: REIMAGE [20:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:45] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2267.codfw.wmnet'] ` an... [20:33:07] did someone recently edit the pagers file? [20:33:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['logstash1033.eqiad.wmnet', 'logstash1034.eqiad.wmnet',... [20:33:15] the icinga contacts file, I mean [20:33:19] one more with feeling [20:33:29] (not me, im just battling images) [20:35:24] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2267.wmnet [20:35:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1400.eqiad.wmnet'] ` an... 
[20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1400.eqiad.wmnet [20:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:41] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2244.codfw.wmnet [20:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:23] (03PS4) 10Cwhite: profile: send w3creportingapi logs to indexes with custom schema [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) [20:54:16] (03CR) 10Cwhite: [C: 03+2] profile: send w3creportingapi logs to indexes with custom schema [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite) [20:56:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2267.wmnet [20:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2267.codfw.wmnet [20:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1400.eqiad.wmnet [20:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:01] jouncebot: next [21:02:02] In 2 hour(s) and 57 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210205T0000) [21:02:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:04:08] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:05:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:06:21] (03CR) 10Ottomata: [C: 03+1] "Cool, we just need to make sure that before we enable this somewhere we've manually created the targets and copied files there." [puppet] - 10https://gerrit.wikimedia.org/r/661391 (https://phabricator.wikimedia.org/T265126) (owner: 10Razzi) [21:10:42] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1308.eqiad.wmnet'] ` an... 
[21:13:40] (03PS1) 10Cwhite: profile: filename should have underscore [puppet] - 10https://gerrit.wikimedia.org/r/661805 [21:13:59] (03CR) 10Cwhite: [V: 03+2 C: 03+2] profile: filename should have underscore [puppet] - 10https://gerrit.wikimedia.org/r/661805 (owner: 10Cwhite) [21:17:30] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1308.eqiad.wmnet [21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1308.eqiad.wmnet [21:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1399.eqiad.wmnet with reason: REIMAGE [21:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:22:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1398.eqiad.wmnet with reason: REIMAGE [21:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1399.eqiad.wmnet with reason: REIMAGE [21:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:04] Hello, can someone reload beta-mediawiki-config-update-eqiad. I'm not sure that 2 hours of waiting is good. 
;) [21:24:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1398.eqiad.wmnet with reason: REIMAGE [21:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:49] (03CR) 10Ottomata: [C: 03+1] presto: require partitions predicate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661209 (https://phabricator.wikimedia.org/T273004) (owner: 10Razzi) [21:28:10] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) Added some docs here: https://wikitech.wikimedia.org/wiki/Event_Platform... [21:28:28] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (10Ottomata) Yeehaw thank you Luca! [21:28:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10RobH) So these aren't getthing dhcp leases and moving past pxe boot, need to investigate why in further detail. Puppet repo has been updated with the 10g interface mac addresses and... 
[21:30:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1263.eqiad.wmnet with reason: REIMAGE [21:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1263.eqiad.wmnet with reason: REIMAGE [21:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:43] (03PS7) 10Effie Mouzeli: memcached: enable unix sockets in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) [21:36:24] (03CR) 10Bstorm: [C: 03+2] "This doesn't work on buster without this patch, so I think I'm going to merge it." [puppet] - 10https://gerrit.wikimedia.org/r/661791 (owner: 10Bstorm) [21:44:55] (03PS1) 10Legoktm: Initial commit of scripts [software/benchmw] - 10https://gerrit.wikimedia.org/r/661807 [21:44:57] (03PS1) 10Legoktm: Clean up and make more re-usable [software/benchmw] - 10https://gerrit.wikimedia.org/r/661808 [21:45:23] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Initial commit of scripts [software/benchmw] - 10https://gerrit.wikimedia.org/r/661807 (owner: 10Legoktm) [21:46:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1399.eqiad.wmnet'] ` an... [21:47:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1398.eqiad.wmnet'] ` an... 
[21:50:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1311.eqiad.wmnet with reason: REIMAGE [21:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:19] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1398.eqiad.wmnet [21:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1399.eqiad.wmnet [21:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1311.eqiad.wmnet with reason: REIMAGE [21:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:50] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2244.codfw.wmnet [21:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:40] (03CR) 10Effie Mouzeli: memcached: enable unix sockets in memcached (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [21:56:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1398.eqiad.wmnet [21:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1399.eqiad.wmnet [21:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:55] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:11:53] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:22:12] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1263.eqiad.wmnet'] ` an...
[22:27:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1397.eqiad.wmnet with reason: REIMAGE
[22:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:48] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1396.eqiad.wmnet with reason: REIMAGE
[22:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1397.eqiad.wmnet with reason: REIMAGE
[22:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1396.eqiad.wmnet with reason: REIMAGE
[22:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:10] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@700cd49]: partition ores staging tables by data source
[22:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:08] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1311.eqiad.wmnet'] ` an...
[22:38:30] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@700cd49]: partition ores staging tables by data source (duration: 01m 19s)
[22:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1311.eqiad.wmnet
[22:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:39:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1263.eqiad.wmnet
[22:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:47] SRE, Commons, MediaWiki-File-management, SRE-swift-storage, and 2 others: Unable to delete certain files due to "inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T141704 (daniel) Open→Declined No activity since 2019, nothing relevant in the logs...
[22:52:49] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1397.eqiad.wmnet'] ` an...
[22:53:56] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1396.eqiad.wmnet'] ` an...
[22:54:52] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq...
[22:55:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1263.eqiad.wmnet
[22:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1311.eqiad.wmnet
[23:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1397.eqiad.wmnet
[23:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:32] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1396.eqiad.wmnet
[23:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1396.eqiad.wmnet
[23:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1397.eqiad.wmnet
[23:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:00] SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (wiki_willy) Hi @Jclark-ctr - can you confirm all the firmware/bios/idrac is all updated? I have an email queued up to send to our technical Dell...
[23:24:17] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1278.eqiad.wmnet with reason: REIMAGE
[23:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:29] SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (Jclark-ctr) @RKemper ` dmidecode -t 20 ` would be very useful to trace physical address of memory we are unsure why it will not return any in...
[23:26:28] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1278.eqiad.wmnet with reason: REIMAGE
[23:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:55] (CR) Ahmon Dancy: [C: +1] docker_registry_ha: Allow docker push from releases hosts [puppet] - https://gerrit.wikimedia.org/r/661757 (https://phabricator.wikimedia.org/T271477) (owner: Dduvall)
[23:35:46] RECOVERY - SSH on mw2249.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:40:10] SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (EBernhardson) Regarding -t 20, dmidecode reports SMBIOS 3.2 present. Per the [[https://www.dmtf.org/sites/default/files/standards/documents/DSP01...
[23:43:34] SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (EBernhardson) Additionally, it seems this has happened again today and yet edac-utils claims no errors: ` ebernhardson@elastic1063:~$ sudo dmesg...
[23:58:36] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@93bf374]: correct hql in ores_predictions_init_v3
[23:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:32] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1001 is CRITICAL: 5.708e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[23:59:43] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@93bf374]: correct hql in ores_predictions_init_v3 (duration: 01m 06s)
[23:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log