[00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:49] bstorm: would you mind looking at https://gerrit.wikimedia.org/r/c/operations/puppet/+/636436 for me please? :-)
[00:02:09] (CR) Razzi: [C: +2] oozie: Add admin groups for authorization [puppet] - https://gerrit.wikimedia.org/r/637587 (https://phabricator.wikimedia.org/T262660) (owner: Razzi)
[00:02:58] Sure
[00:04:20] (PS3) RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753
[00:04:24] (CR) RLazarus: "Thanks for the quick review!" (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus)
[00:07:26] (CR) jerkins-bot: [V: -1] decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus)
[00:14:27] (CR) Cwhite: [C: +1] "LGTM" [debs/kthxbye] - https://gerrit.wikimedia.org/r/638044 (https://phabricator.wikimedia.org/T266535) (owner: Filippo Giunchedi)
[00:15:42] (CR) Cwhite: [C: +1] profile: add prometheus jobs for am acks [puppet] - https://gerrit.wikimedia.org/r/639000 (https://phabricator.wikimedia.org/T266535) (owner: Filippo Giunchedi)
[00:16:08] (CR) Cwhite: [C: +1] alertmanager: enable acks and silences on alerts.w.o [puppet] - https://gerrit.wikimedia.org/r/638999 (https://phabricator.wikimedia.org/T266535) (owner: Filippo Giunchedi)
[00:16:25] (CR) Volans: "Thanks for the fixes!" (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus)
[00:16:36] (PS4) RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753
[00:17:02] (CR) Cwhite: [C: +1] alertmanager: add ack daemon [puppet] - https://gerrit.wikimedia.org/r/638998 (https://phabricator.wikimedia.org/T266535) (owner: Filippo Giunchedi)
[00:17:43] volans: ^ haha drat, I was eleven seconds too late with the test fixes
[00:17:56] ahahah
[00:19:03] (CR) Cwhite: [C: +1] thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[00:19:29] (CR) jerkins-bot: [V: -1] decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus)
[00:19:42] ... or not >:( thanks jerkins
[00:19:52] (CR) Cwhite: [C: +1] prometheus: add thanos query-frontend jobs [puppet] - https://gerrit.wikimedia.org/r/638120 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[00:20:21] (CR) Cwhite: [C: +1] role: add query_frontend to thanos frontend [puppet] - https://gerrit.wikimedia.org/r/638121 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[00:25:22] (PS5) RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - https://gerrit.wikimedia.org/r/638753
[00:25:48] (CR) RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus)
[00:29:36] (PS16) ArielGlenn: per job batches file with locking and methods for claiming jobs etc [dumps] - https://gerrit.wikimedia.org/r/596504 (https://phabricator.wikimedia.org/T252396)
[00:33:57] Operations, DBA, Performance-Team, Platform Engineering, and 2 others: Document remaining database load groups - https://phabricator.wikimedia.org/T267077 (Krinkle) p:Triage→Medium
[00:53:21] (CR) Volans: [C: +1] "LGTM! Whishlist for a separate patch we could add those log messages in the other usage of @retry" [software/spicerack] - https://gerrit.wikimedia.org/r/638753 (owner: RLazarus)
[00:59:02] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
[01:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T0100).
[01:00:32] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:00:32] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[01:02:04] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[01:13:24] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[01:47:47] !log milimetric@deploy1001 Started deploy [analytics/refinery@6913407]: Regular analytics weekly train [analytics/refinery@6913407]
[01:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:21] !log milimetric@deploy1001 Finished deploy [analytics/refinery@6913407]: Regular analytics weekly train [analytics/refinery@6913407] (duration: 08m 34s)
[01:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:54] !log milimetric@deploy1001 Started deploy [analytics/refinery@6913407] (thin): Regular analytics weekly train THIN [analytics/refinery@6913407]
[01:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:03] !log milimetric@deploy1001 Finished deploy [analytics/refinery@6913407] (thin): Regular analytics weekly train THIN [analytics/refinery@6913407] (duration: 00m 08s)
[01:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:10] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:11:18] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:35:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:53] Operations, Commons, SRE-swift-storage: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (tstarling) @matthiasmullie might understand what's going on here.
[04:48:16] Operations, Commons, SRE-swift-storage: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (tstarling) Searching the DB for all files with size 5242880 clearly shows the beginning of a new issue on 2020-1...
[04:54:22] Operations, SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod access revoked - https://phabricator.wikimedia.org/T267292 (RobH) p:Triage→High
[04:55:09] Operations, SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod access revoked - https://phabricator.wikimedia.org/T267292 (RobH) This is particularly concerning since @dcaro was just granted global root as a member of the ops group via T267040.
[04:56:52] (PS1) RobH: revoke dcaro prod ssh key [puppet] - https://gerrit.wikimedia.org/r/639410 (https://phabricator.wikimedia.org/T267292)
[04:57:29] (CR) RobH: [C: +2] revoke dcaro prod ssh key [puppet] - https://gerrit.wikimedia.org/r/639410 (https://phabricator.wikimedia.org/T267292) (owner: RobH)
[04:59:01] Operations, SRE-Access-Requests, Patch-For-Review: dcaro has same ssh key in wmcs and prod, prod access revoked - https://phabricator.wikimedia.org/T267292 (RobH)
[05:01:30] Operations, SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod access revoked - https://phabricator.wikimedia.org/T267292 (RobH)
[05:03:05] Operations, SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod ssh key revoked - https://phabricator.wikimedia.org/T267292 (RobH)
[05:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1017 (re)pooling @ 25%: After cloning es1028 T261717', diff saved to https://phabricator.wikimedia.org/P13188 and previous config saved to /var/cache/conftool/dbconfig/20201105-062446-root.json
[06:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1013 (re)pooling @ 25%: After cloning es1027 T261717', diff saved to https://phabricator.wikimedia.org/P13189 and previous config saved to /var/cache/conftool/dbconfig/20201105-062454-root.json
[06:24:55] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1016 (re)pooling @ 25%: After cloning es1029 T261717', diff saved to https://phabricator.wikimedia.org/P13190 and previous config saved to /var/cache/conftool/dbconfig/20201105-062507-root.json
[06:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:32] Operations, Product-Infrastructure-Team-Backlog, Traffic, JavaScript, Maps (Kartotherian): Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (RolandUnger)
[06:27:34] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:27:44] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:46] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[06:29:47] (PS1) Marostegui: instances.yaml: Add es1029, es1030, es1031 to dbctl [puppet] - https://gerrit.wikimedia.org/r/639415 (https://phabricator.wikimedia.org/T261717)
[06:30:20] (CR) Marostegui: [C: +2] instances.yaml: Add es1029, es1030, es1031 to dbctl [puppet] - https://gerrit.wikimedia.org/r/639415 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:33:04] RECOVERY - Disk space on Hadoop worker on an-worker1113 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[06:33:34] (PS1) Marostegui: mariadb: Enable notifications on es1029-es1031 [puppet] - https://gerrit.wikimedia.org/r/639416 (https://phabricator.wikimedia.org/T261717)
[06:34:09] (CR) Marostegui: [C: +2] mariadb: Enable notifications on es1029-es1031 [puppet] - https://gerrit.wikimedia.org/r/639416 (https://phabricator.wikimedia.org/T261717) (owner: Marostegui)
[06:34:15] !log truncate application_1601916545561_129457's taskmanager.log (~600G) on an-worker1113 due to partition 'e' full
[06:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:23] (CR) Marostegui: [C: +1] "This requires a mysql restart" [puppet] - https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[06:36:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1017 (re)pooling @ 50%: After cloning es1031 T261717', diff saved to https://phabricator.wikimedia.org/P13191 and previous config saved to /var/cache/conftool/dbconfig/20201105-063603-root.json
[06:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1013 (re)pooling @ 50%: After cloning es1030 T261717', diff saved to https://phabricator.wikimedia.org/P13192 and previous config saved to /var/cache/conftool/dbconfig/20201105-063610-root.json
[06:36:10] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:06] Operations, Product-Infrastructure-Team-Backlog, Traffic, JavaScript, Maps (Kartotherian): Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (RolandUnger) Some tests on https://de.wikivoyage.org/wiki/Kairo/Ta%E1%B8%A5r%C4%ABr-Platz The map markers...
[06:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1016 (re)pooling @ 50%: After cloning es1029 T261717', diff saved to https://phabricator.wikimedia.org/P13193 and previous config saved to /var/cache/conftool/dbconfig/20201105-064010-root.json
[06:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:21] win 27
[06:40:23] nope
[06:46:27] Operations, Product-Infrastructure-Team-Backlog, Traffic, JavaScript, Maps (Kartotherian): Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (RolandUnger) The map server returns a 404 error.
[06:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1017 (re)pooling @ 75%: After cloning es1031 T261717', diff saved to https://phabricator.wikimedia.org/P13195 and previous config saved to /var/cache/conftool/dbconfig/20201105-065107-root.json
[06:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1013 (re)pooling @ 75%: After cloning es1030 T261717', diff saved to https://phabricator.wikimedia.org/P13196 and previous config saved to /var/cache/conftool/dbconfig/20201105-065113-root.json
[06:51:14] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[06:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1016 (re)pooling @ 75%: After cloning es1029 T261717', diff saved to https://phabricator.wikimedia.org/P13197 and previous config saved to /var/cache/conftool/dbconfig/20201105-065514-root.json
[06:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:03:19] (PS4) KartikMistry: WIP: Remove wgContentTranslationRESTBase config [mediawiki-config] - https://gerrit.wikimedia.org/r/634956 (https://phabricator.wikimedia.org/T266213)
[07:03:53] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: increase innodb buffer size [puppet] - https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412) (owner: Elukey)
[07:03:58] (PS3) Elukey: role::analytics_cluster::coordinator: increase innodb buffer size [puppet] - https://gerrit.wikimedia.org/r/638923 (https://phabricator.wikimedia.org/T257412)
[07:05:04] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:06:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1017 (re)pooling @ 100%: After cloning es1031 T261717', diff saved to https://phabricator.wikimedia.org/P13198 and previous config saved to /var/cache/conftool/dbconfig/20201105-070610-root.json
[07:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1013 (re)pooling @ 100%: After cloning es1030 T261717', diff saved to https://phabricator.wikimedia.org/P13199 and previous config saved to /var/cache/conftool/dbconfig/20201105-070616-root.json
[07:06:18] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1016 (re)pooling @ 100%: After cloning es1029 T261717', diff saved to https://phabricator.wikimedia.org/P13200 and previous config saved to /var/cache/conftool/dbconfig/20201105-071017-root.json
[07:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129 T267216', diff saved to https://phabricator.wikimedia.org/P13201 and previous config saved to /var/cache/conftool/dbconfig/20201105-072352-marostegui.json
[07:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:59] T267216: Slow load times for Special:Homepage on cswiki - https://phabricator.wikimedia.org/T267216
[07:41:28] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 570 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:46:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: After regenerating table stats', diff saved to https://phabricator.wikimedia.org/P13202 and previous config saved to /var/cache/conftool/dbconfig/20201105-074631-root.json
[07:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:40] (CR) Filippo Giunchedi: [V: +2 C: +2] Add Debian directory [debs/kthxbye] - https://gerrit.wikimedia.org/r/638044 (https://phabricator.wikimedia.org/T266535) (owner: Filippo Giunchedi)
[07:48:15] (CR) Filippo Giunchedi: [C: +2] thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[07:48:17] (PS5) Filippo Giunchedi: thanos: add query-frontend [puppet] - https://gerrit.wikimedia.org/r/638119 (https://phabricator.wikimedia.org/T261281)
[07:52:11] (CR) Filippo Giunchedi: [C: +2] pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi)
[07:52:17] (PS6) Filippo Giunchedi: pontoon: use frontends for query_frontend memcache [puppet] - https://gerrit.wikimedia.org/r/638122 (https://phabricator.wikimedia.org/T261281)
[07:52:30] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 65 probes of 570 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:53:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1029 on es1 with minimium weight after being cloned from es1016 T261717', diff saved to https://phabricator.wikimedia.org/P13203 and previous config saved to /var/cache/conftool/dbconfig/20201105-075358-marostegui.json
[07:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:04] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[07:55:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1030 on es2 with minimium weight after being cloned from es1013 T261717', diff saved to https://phabricator.wikimedia.org/P13204 and previous config saved to /var/cache/conftool/dbconfig/20201105-075507-marostegui.json
[07:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es1031 on es3 with minimium weight after being cloned from es1017 T261717', diff saved to https://phabricator.wikimedia.org/P13205 and previous config saved to /var/cache/conftool/dbconfig/20201105-075625-marostegui.json
[07:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: After regenerating table stats', diff saved to https://phabricator.wikimedia.org/P13206 and previous config saved to /var/cache/conftool/dbconfig/20201105-080135-root.json
[08:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:40] Operations, SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod ssh key revoked - https://phabricator.wikimedia.org/T267292 (dcaro) a:dcaro→RobH Here is a new one: ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAID46/gY7mfN96ylAdQb6ZBfrq9L3QwemMtN5ZjrJgEmK dcaro@magnum ` Would it be possib...
[08:12:03] Operations, SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod ssh key revoked - https://phabricator.wikimedia.org/T267292 (MoritzMuehlenhoff) >>! In T267292#6605218, @dcaro wrote: > Here is a new one: > > ` > ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAID46/gY7mfN96ylAdQb6ZBfrq9L3QwemMtN5ZjrJgE...
[08:16:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 75%: After regenerating table stats', diff saved to https://phabricator.wikimedia.org/P13207 and previous config saved to /var/cache/conftool/dbconfig/20201105-081638-root.json
[08:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: After regenerating table stats', diff saved to https://phabricator.wikimedia.org/P13208 and previous config saved to /var/cache/conftool/dbconfig/20201105-083142-root.json
[08:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312', diff saved to https://phabricator.wikimedia.org/P13209 and previous config saved to /var/cache/conftool/dbconfig/20201105-083304-marostegui.json
[08:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1090:3312', diff saved to https://phabricator.wikimedia.org/P13210 and previous config saved to /var/cache/conftool/dbconfig/20201105-084250-marostegui.json
[08:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: Slowly pool es1029 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13211 and previous config saved to /var/cache/conftool/dbconfig/20201105-084323-root.json
[08:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:29] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: Slowly pool es1030 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13212 and previous config saved to /var/cache/conftool/dbconfig/20201105-084334-root.json
[08:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 10%: Slowly pool es1031 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13213 and previous config saved to /var/cache/conftool/dbconfig/20201105-084343-root.json
[08:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:01] (CR) Ayounsi: Add switch interface support to decom script (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: Ayounsi)
[08:47:26] Operations, Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (ema) a:ema→None
[08:48:18] Operations, DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (Kormat) p:Triage→Medium
[08:49:37] Operations, ops-codfw, DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (Kormat) a:Papaul Hi @Papaul, Can you run a firmware upgrade on this host, please? Let me know a day that works for you, and i can have the host powered down safely.
[08:50:43] Operations, Performance-Team, Traffic, observability: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (ema) @Krinkle: anything left TBD here?
[08:58:03] (CR) Volans: Add switch interface support to decom script (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: Ayounsi)
[08:58:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: Slowly pool es1029 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13214 and previous config saved to /var/cache/conftool/dbconfig/20201105-085826-root.json
[08:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:34] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[08:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: Slowly pool es1030 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13215 and previous config saved to /var/cache/conftool/dbconfig/20201105-085838-root.json
[08:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 25%: Slowly pool es1031 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13216 and previous config saved to /var/cache/conftool/dbconfig/20201105-085846-root.json
[08:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:26] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[09:04:05] (PS1) Filippo Giunchedi: grafana: commit ldap-users-sync changes [puppet] - https://gerrit.wikimedia.org/r/639452 (https://phabricator.wikimedia.org/T265712)
[09:04:57] (CR) Volans: [C: -1] "See inline for the details, but basically I think it's better to make it a dedicated script." (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/635849 (owner: Ayounsi)
[09:05:39] (PS3) Muehlenhoff: Fold Grafana settings for CAS into the Hiera role data [puppet] - https://gerrit.wikimedia.org/r/636929 (https://phabricator.wikimedia.org/T265712)
[09:11:21] (CR) Muehlenhoff: [C: +2] Fold Grafana settings for CAS into the Hiera role data [puppet] - https://gerrit.wikimedia.org/r/636929 (https://phabricator.wikimedia.org/T265712) (owner: Muehlenhoff)
[09:12:27] (PS2) Muehlenhoff: Point grafana-rw to grafana1002 [puppet] - https://gerrit.wikimedia.org/r/636930 (https://phabricator.wikimedia.org/T265712)
[09:13:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: Slowly pool es1029 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13217 and previous config saved to /var/cache/conftool/dbconfig/20201105-091329-root.json
[09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:37] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[09:13:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: Slowly pool es1030 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13218 and previous config saved to /var/cache/conftool/dbconfig/20201105-091341-root.json
[09:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 50%: Slowly pool es1031 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13219 and previous config saved to /var/cache/conftool/dbconfig/20201105-091350-root.json
[09:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:52] (CR) Kormat: [C: +1] "LGTM from me. I can look at making it a little more general in the future so that we (data-persistence) can re-use it, but it's better to " [puppet] - https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm)
[09:18:58] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:19:42] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: connect to address 10.192.0.160 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[09:20:12] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:21:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:00] (PS1) Muehlenhoff: Remove stray enable_cas setting, which overrides the earlier setting [puppet] - https://gerrit.wikimedia.org/r/639455
[09:24:26] (CR) Filippo Giunchedi: [C: +1] Remove stray enable_cas setting, which overrides the earlier setting [puppet] - https://gerrit.wikimedia.org/r/639455 (owner: Muehlenhoff)
[09:24:36] (CR) Muehlenhoff: [C: +2] Remove stray enable_cas setting, which overrides the earlier setting [puppet] - https://gerrit.wikimedia.org/r/639455 (owner: Muehlenhoff)
[09:25:39] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:07] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 61524 bytes in 0.161 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org
[09:28:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: Slowly pool es1029 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13222 and previous config saved to /var/cache/conftool/dbconfig/20201105-092833-root.json
[09:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:40] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[09:28:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: Slowly pool es1030 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13223 and previous config saved to /var/cache/conftool/dbconfig/20201105-092845-root.json
[09:28:48] !log enabling CAS on grafana1002, editing dashboards will be interrupted for a bit
[09:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 75%: Slowly pool es1031 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13224 and previous config saved to /var/cache/conftool/dbconfig/20201105-092853-root.json
[09:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:00] (PS5) Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652
[09:36:29] (CR) jerkins-bot: [V: -1] pcc: update PCC cli so that it posts to the gerrit change [puppet] - https://gerrit.wikimedia.org/r/636652 (owner: Jbond)
[09:37:46] (CR) Muehlenhoff: [C: +2] Point grafana-rw to grafana1002 [puppet] - https://gerrit.wikimedia.org/r/636930 (https://phabricator.wikimedia.org/T265712) (owner: Muehlenhoff)
[09:43:23] Operations, Wikimedia-Logstash, Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (jcrespo) > @jcrespo This issue should be resolved at this point as I now see the `logstash-*` filter pattern on logstash-next. Please let us know if it is still broken. If y...
[09:43:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: Slowly pool es1029 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13225 and previous config saved to /var/cache/conftool/dbconfig/20201105-094336-root.json
[09:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:43] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717
[09:43:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: Slowly pool es1030 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13226 and previous config saved to /var/cache/conftool/dbconfig/20201105-094348-root.json
[09:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1031 (re)pooling @ 100%: Slowly pool es1031 after being recloned T261717', diff saved to https://phabricator.wikimedia.org/P13227 and previous config saved to /var/cache/conftool/dbconfig/20201105-094356-root.json
[09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:25] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:25] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:57:32] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:58:04] RECOVERY - Check systemd state on maps2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:17] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[09:58:18] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:32] RECOVERY - cassandra CQL 10.192.16.179:9042 on maps2002 is OK: TCP OK - 0.032 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886
[10:09:12] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (jcrespo) Creating a backup before shutting them down, in case data got lost after maintenance.
[10:14:55] (03CR) 10Jbond: "PCC Started: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26322" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [10:16:17] !log grafana-rw.wikimedia.org active and sso-enabled - T262512 [10:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:24] T262512: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 [10:21:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26324" [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [10:27:23] (03PS6) 10Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - 10https://gerrit.wikimedia.org/r/636652 [10:27:50] (03CR) 10jerkins-bot: [V: 04-1] pcc: update PCC cli so that it posts to the gerrit change [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [10:30:10] (03PS7) 10Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - 10https://gerrit.wikimedia.org/r/636652 [10:30:59] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [10:31:44] (03Abandoned) 10Jbond: change to test pcc [puppet] - 10https://gerrit.wikimedia.org/r/636641 (owner: 10Jbond) [10:31:51] (03PS8) 10Jbond: pcc: update PCC cli so that it posts to the gerrit change [puppet] - 10https://gerrit.wikimedia.org/r/636652 [10:35:25] (03PS1) 10Kormat: dbtools: Add host-to-instance [software] - 10https://gerrit.wikimedia.org/r/639470 [10:39:28] (03CR) 10Jbond: [C: 03+1] "LGTM, minor" [puppet] - 10https://gerrit.wikimedia.org/r/639014 (owner: 10Filippo Giunchedi) [10:55:04] 10Operations, 10Commons, 10SRE-swift-storage, 10Patch-For-Review: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (10matthiasmullie) >>! 
In T266903#6604911, @tstarling wrote: > @matthiasmullie might understa... [10:55:41] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [10:55:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T1100). [11:02:13] (03PS6) 10Jbond: scap: migrate to shared spec helper and convert to mocha [puppet] - 10https://gerrit.wikimedia.org/r/638562 [11:03:29] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10JavaScript, and 2 others: Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (10TheDJ) [11:03:35] (03CR) 10jerkins-bot: [V: 04-1] scap: migrate to shared spec helper and convert to mocha [puppet] - 10https://gerrit.wikimedia.org/r/638562 (owner: 10Jbond) [11:03:48] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10JavaScript, and 2 others: Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (10TheDJ) @MSantos this seems like a reordering of the deploy street broke the /geoline endpoints of maps.wikimedia.org... 
[11:05:37] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` in a tmux session updateVarDumps at mwmaint1002 (wiki=dewiki; T246539) [11:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:44] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [11:06:37] (03PS4) 10JMeybohm: Package binary kubernetes releases [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) [11:06:39] (03PS3) 10JMeybohm: Update to 1.16.15 [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/639192 (https://phabricator.wikimedia.org/T266766) [11:07:56] (03PS7) 10Jbond: scap: migrate to shared spec helper and convert to mocha [puppet] - 10https://gerrit.wikimedia.org/r/638562 [11:09:46] (03CR) 10Jbond: [C: 03+2] scap: migrate to shared spec helper and convert to mocha [puppet] - 10https://gerrit.wikimedia.org/r/638562 (owner: 10Jbond) [11:11:33] 10Operations, 10Commons, 10SRE-swift-storage, 10Patch-For-Review: Recently more broken files (premature end of file at 5MB size) that were cross-wiki uploaded to Commons - https://phabricator.wikimedia.org/T266903 (10Urbanecm) As this is a current UBN, can we get someone to either review and backport https:...
[11:12:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:16] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:21] (03PS1) 10Muehlenhoff: Also disable login page [puppet] - 10https://gerrit.wikimedia.org/r/639479 [11:22:15] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26327/" [puppet] - 10https://gerrit.wikimedia.org/r/639479 (owner: 10Muehlenhoff) [11:23:42] (03PS1) 10Filippo Giunchedi: grafana: sync multiple-groups users only once [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) [11:23:45] (03PS1) 10Filippo Giunchedi: grafana: enforce 'admin' user role [puppet] - 10https://gerrit.wikimedia.org/r/639483 (https://phabricator.wikimedia.org/T265712) [11:23:47] (03CR) 10Filippo Giunchedi: [C: 03+1] Also disable login page [puppet] - 10https://gerrit.wikimedia.org/r/639479 (owner: 10Muehlenhoff) [11:24:12] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10sguebo_WMF) [11:24:29] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [11:25:10] (03CR) 10Muehlenhoff: [C: 03+2] Also disable login page [puppet] - 10https://gerrit.wikimedia.org/r/639479 (owner: 10Muehlenhoff) [11:25:19] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10JanWMF) approved [11:27:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:27:04] !log kormat@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) [11:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/639483 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:37:44] (03CR) 10Muehlenhoff: grafana: sync multiple-groups users only once (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:40:46] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [11:41:04] (03PS2) 10Filippo Giunchedi: grafana: sync multiple-groups users only once [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) [11:41:06] (03PS2) 10Filippo Giunchedi: grafana: enforce 'admin' user role [puppet] - 10https://gerrit.wikimedia.org/r/639483 (https://phabricator.wikimedia.org/T265712) [11:41:10] (03CR) 10Filippo Giunchedi: grafana: sync multiple-groups users only once (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:41:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [11:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:37] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1012 to es1 master, es1011 to es2 master, es1014 to es3 (this is a noop) T261717', diff saved to https://phabricator.wikimedia.org/P13230 and previous config saved to /var/cache/conftool/dbconfig/20201105-114223-marostegui.json 
[11:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:30] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [11:49:25] (03PS2) 10JMeybohm: aptrepo: add component for future kubernetes packages [puppet] - 10https://gerrit.wikimedia.org/r/637463 (https://phabricator.wikimedia.org/T266766) [11:50:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:52:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/639452 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [11:54:50] (03CR) 10JMeybohm: [C: 03+2] aptrepo: add component for future kubernetes packages [puppet] - 10https://gerrit.wikimedia.org/r/637463 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [11:55:52] !log Upgrade mysql on db1077 [11:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] (03CR) 10JMeybohm: "I've added error checking to the script and updated the docs:" (031 comment) [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [11:58:32] !log shutting down db1139 in preparation of maintenance T261405 [11:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:39] T261405: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 [11:58:56] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) [11:59:47] 10Operations, 10Design-Research, 10Domains, 10Traffic: Register wikipersonas.org and redirect URL - https://phabricator.wikimedia.org/T241944 (10Aklapper) @Dendelele: Ping - any news to share? Otherwise this task might end up as declined. 
[12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T1200). Please do the needful. [12:00:05] No GERRIT patches in the queue for this window AFAICS. [12:00:06] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) [12:04:05] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) db1139 is down, backed up, and ready for maintenance - I have downtime'd until Friday. Let us know either if you will need more time or when it has been done to put it back into p... [12:09:04] PROBLEM - k8s API server requests latencies on acrab is CRITICAL: instance=10.192.16.26 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:09:55] !log Upgrade mysql on pc2010 [12:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:02] (03PS3) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678 [12:12:28] RECOVERY - k8s API server requests latencies on acrab is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:14:51] (03CR) 10jerkins-bot: [V: 04-1] profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678 (owner: 10Jbond) [12:17:46] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10Miriam) [12:19:18] (03PS4) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678 [12:20:06] (03PS5) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/638678 [12:34:41] !log root@cumin1001 START - Cookbook sre.hosts.downtime [12:34:42] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:12] (03PS1) 10Marostegui: install_server: Allow install db11[51-76] [puppet] - 10https://gerrit.wikimedia.org/r/639521 (https://phabricator.wikimedia.org/T267043) [12:37:46] (03CR) 10Marostegui: [C: 03+2] install_server: Allow install db11[51-76] [puppet] - 10https://gerrit.wikimedia.org/r/639521 (https://phabricator.wikimedia.org/T267043) (owner: 10Marostegui) [12:41:38] PROBLEM - MariaDB Replica Lag: s4 on db2119 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3592.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:41:40] PROBLEM - MariaDB Replica Lag: s4 on db2110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3594.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:41:54] PROBLEM - MariaDB Replica Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3606.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:41:58] 
PROBLEM - MariaDB Replica Lag: s4 on db2140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3610.59 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:42:16] PROBLEM - MariaDB Replica Lag: s4 on db2106 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3629.67 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:42:26] PROBLEM - MariaDB Replica Lag: s4 on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3640.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:42:46] PROBLEM - MariaDB Replica Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3660.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:42:48] PROBLEM - MariaDB Replica Lag: s4 on db2136 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3661.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:42:56] PROBLEM - MariaDB Replica Lag: s4 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3669.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:44:31] kormat: your alter? ^ [12:44:36] ah yes, it is the alter [12:44:39] I will downtime those [12:44:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:44:57] ah :) [12:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:58] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:45:01] sorry, yeah, dt expired. [12:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:05] i've put in another for 2h [12:45:15] i was, in fact, just about to check how long was left on the dt [12:45:19] sorry for the spam :) [12:45:22] :* [12:45:24] (03CR) 10Muehlenhoff: "Should be fine to remove, but let's hear from Ema, who originally wrote the profile." 
[puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [12:47:08] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [12:48:30] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38014744 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:49:57] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:49:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:10] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5760 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:12] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: enforce 'admin' user role [puppet] - 10https://gerrit.wikimedia.org/r/639483 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [12:51:17] (03PS3) 10Filippo Giunchedi: grafana: enforce 'admin' user role [puppet] - 10https://gerrit.wikimedia.org/r/639483 (https://phabricator.wikimedia.org/T265712) [12:52:37] !log upgrade freetype on jessie [12:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:28] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: sync multiple-groups users only once [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [12:53:33] (03PS3) 10Filippo Giunchedi: grafana: sync multiple-groups users only once [puppet] - 10https://gerrit.wikimedia.org/r/639482 (https://phabricator.wikimedia.org/T265712) [12:54:11] !log kormat@cumin1001 START - 
Cookbook sre.hosts.downtime [12:54:12] RECOVERY - MariaDB Replica Lag: s4 on db2137 is OK: OK slave_sql_lag Replication lag: 33.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:15] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:22] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:54:32] RECOVERY - MariaDB Replica Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:34] RECOVERY - MariaDB Replica Lag: s4 on db2136 is OK: OK slave_sql_lag Replication lag: 0.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:42] RECOVERY - MariaDB Replica Lag: s4 on db2139 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:58] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:55:06] RECOVERY - MariaDB Replica Lag: s4 on db2119 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:08] RECOVERY - MariaDB Replica Lag: s4 on db2110 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:20] RECOVERY - MariaDB Replica Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:24] 10Operations, 10Commons, 10DBA, 10Platform Engineering, and 2 others: Increase on database writes and deletes activity on Commonswiki leads to some replication lag - https://phabricator.wikimedia.org/T266432 (10Marostegui) 05Open→03Resolved a:03Marostegui This has ceased and we are back to normal val... [12:55:24] RECOVERY - MariaDB Replica Lag: s4 on db2140 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:42] RECOVERY - MariaDB Replica Lag: s4 on db2106 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:56:32] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:56:58] (03CR) 10Hnowlan: [C: 03+1] kartotherian: Don't page SREs on failure [puppet] - 10https://gerrit.wikimedia.org/r/639154 (owner: 10Alexandros Kosiaris) [12:57:38] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 3.762 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:00:05] (03CR) 10JMeybohm: [C: 03+2] Update to 1.16.15 [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/639192 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [13:00:10] (03CR) 10JMeybohm: [C: 03+2] Package binary kubernetes releases [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [13:00:43] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove kubernetes sources [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638114 (owner: 10JMeybohm) [13:03:41] (03PS2) 10Filippo Giunchedi: 
grafana: commit ldap-users-sync changes [puppet] - 10https://gerrit.wikimedia.org/r/639452 (https://phabricator.wikimedia.org/T265712) [13:03:43] (03PS1) 10Filippo Giunchedi: grafana: set 'admin' user as Org admin only [puppet] - 10https://gerrit.wikimedia.org/r/639525 (https://phabricator.wikimedia.org/T265712) [13:04:48] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10Marostegui) @colewhite I am still seeing lots of different results between: https://logstash-next.wikimedia.org/goto/690aac80b9993d9af88216d0e8103e74 and https://logstash.wiki... [13:05:59] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: commit ldap-users-sync changes [puppet] - 10https://gerrit.wikimedia.org/r/639452 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [13:06:02] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: set 'admin' user as Org admin only [puppet] - 10https://gerrit.wikimedia.org/r/639525 (https://phabricator.wikimedia.org/T265712) (owner: 10Filippo Giunchedi) [13:06:14] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [13:07:00] (03PS6) 10Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [13:08:22] (03CR) 10jerkins-bot: [V: 04-1] confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [13:08:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:08:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:18] 
(03PS7) 10Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [13:13:19] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: add param to manage kibana.index, and use for ECS instance [puppet] - 10https://gerrit.wikimedia.org/r/639227 (owner: 10Herron) [13:13:44] (03CR) 10Filippo Giunchedi: [C: 03+1] O:idp_test: Enable CORS on idp-test [puppet] - 10https://gerrit.wikimedia.org/r/639202 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [13:14:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "Can't really comment on the IDP configuration but LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/639201 (https://phabricator.wikimedia.org/T267186) (owner: 10Jbond) [13:15:05] (03PS2) 10Jbond: swift: split memcached servers and port [puppet] - 10https://gerrit.wikimedia.org/r/639014 (owner: 10Filippo Giunchedi) [13:16:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I fail to understand why would that default to true anyway. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/639297 (https://phabricator.wikimedia.org/T262350) (owner: 10Bstorm) [13:18:01] (03CR) 10Jbond: [C: 03+2] labstore::fileserver::exports: use sudo::safe_wildcard_cmd [puppet] - 10https://gerrit.wikimedia.org/r/616718 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [13:18:27] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Sync users and permissions from LDAP to Grafana - https://phabricator.wikimedia.org/T265712 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is live now, LDAP users are synced daily to Grafana. 
[13:18:30] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (10fgiunchedi) [13:25:13] (03PS8) 10Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [13:25:40] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [13:25:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:26:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:54] (03PS4) 10Jbond: role::puppetmaster::standalone: add type checking to autosign [puppet] - 10https://gerrit.wikimedia.org/r/566512 [13:34:49] (03PS5) 10Jbond: role::puppetmaster::standalone: add type checking to autosign [puppet] - 10https://gerrit.wikimedia.org/r/566512 [13:38:18] (03CR) 10Jbond: [C: 04-1] "First need to fix the following: values need to be updated to boolean, not string, in horizon" [puppet] - 10https://gerrit.wikimedia.org/r/566512 (owner: 10Jbond) [13:39:51] (03PS6) 10Jbond: role::puppetmaster::standalone: add type checking to autosign [puppet] - 10https://gerrit.wikimedia.org/r/566512 [13:43:25] (03CR) 10Ema: [C: 03+1] "One small detail to fix in the commit message, +1 otherwise!"
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [13:45:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:19] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10JavaScript, and 2 others: Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (10MSantos) @TheDJ actually, geoshapes is not being able to connect to the database, maybe due to the recent OSM resync.... [13:52:19] (03CR) 10Alexandros Kosiaris: "\o/" [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/638115 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [13:55:29] (03CR) 10Muehlenhoff: [C: 03+2] Don't write out Prometheus config if prometheus actuator is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/639169 (owner: 10Muehlenhoff) [13:57:07] !log disable puppet fleet wide to restart puppetdb [13:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:36] PROBLEM - Check systemd state on maps2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:52] PROBLEM - cassandra CQL 10.192.16.179:9042 on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:58:53] (03CR) 10Muehlenhoff: [C: 03+2] Disable prometheus actuator/JMX for now [puppet] - 10https://gerrit.wikimedia.org/r/639194 (owner: 10Muehlenhoff) [13:59:36] PROBLEM - cassandra service on maps2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:01:40] (03CR) 10Jbond: "PCC production: https://puppet-compiler.wmflabs.org/compiler1002/26329/puppetmaster2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [14:01:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [14:01:47] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:27] !log enable puppet fleet wide to post restart puppetdb [14:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:44] (03PS1) 10Jbond: admin: update dcaro ssh key [puppet] - 10https://gerrit.wikimedia.org/r/639532 (https://phabricator.wikimedia.org/T267292) [14:17:28] (03PS1) 10Filippo Giunchedi: profile: redirect grafana login to rw host with SSO enabled [puppet] - 10https://gerrit.wikimedia.org/r/639533 (https://phabricator.wikimedia.org/T262512) [14:18:29] (03CR) 10David Caro: [C: 03+2] admin: update dcaro ssh key [puppet] - 10https://gerrit.wikimedia.org/r/639532 (https://phabricator.wikimedia.org/T267292) (owner: 10Jbond) [14:19:24] (03CR) 10Jbond: [C: 03+2] "authorised via gtalk" [puppet] - 10https://gerrit.wikimedia.org/r/639532 
(https://phabricator.wikimedia.org/T267292) (owner: 10Jbond) [14:22:52] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/26336/" [puppet] - 10https://gerrit.wikimedia.org/r/639533 (https://phabricator.wikimedia.org/T262512) (owner: 10Filippo Giunchedi) [14:30:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/639533 (https://phabricator.wikimedia.org/T262512) (owner: 10Filippo Giunchedi) [14:31:29] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: redirect grafana login to rw host with SSO enabled [puppet] - 10https://gerrit.wikimedia.org/r/639533 (https://phabricator.wikimedia.org/T262512) (owner: 10Filippo Giunchedi) [14:31:34] (03PS1) 10C. Scott Ananian: language: Clean up $separatorTransformTable in km/la/my [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639491 (https://phabricator.wikimedia.org/T267091) [14:31:56] (03CR) 10C. Scott Ananian: [C: 03+2] language: Clean up $separatorTransformTable in km/la/my [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639491 (https://phabricator.wikimedia.org/T267091) (owner: 10C. Scott Ananian) [14:35:07] (03CR) 10Andrew Bogott: [C: 03+2] puppet_ca_server default to '' on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/639322 (owner: 10Andrew Bogott) [14:35:14] (03PS3) 10Andrew Bogott: puppet_ca_server default to '' on wmcs [puppet] - 10https://gerrit.wikimedia.org/r/639322 [14:42:19] (03PS1) 10Jbond: Revert "confd: only read from the master during the switchover" [puppet] - 10https://gerrit.wikimedia.org/r/639492 [14:43:52] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10BBlack) >>! In T258405#6589763, @TheDJ wrote: > We should probably also update https://wikitech.wikimedia.org/wiki/HTTPS with the new status quo Technically, an i... 
[14:44:19] 10Operations, 10ops-codfw, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Papaul) @Kormat you can do it now if you have time. Thanks. [14:44:54] 10Operations, 10ops-codfw, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Kormat) Perfect. I'll bring it down now, and update here when done. [14:45:38] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [14:45:39] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:26] 10Operations, 10ops-codfw, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Kormat) @Papaul: it's powering off now. Thanks! [14:50:20] (03CR) 10Volans: diffscan: pyhotnify (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond) [14:55:26] !log shutdown kafka-jumbo1001 to swap NICs (1g -> 10g) [14:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:10] RECOVERY - cassandra service on maps2002 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:58:57] (03Merged) 10jenkins-bot: language: Clean up $separatorTransformTable in km/la/my [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639491 (https://phabricator.wikimedia.org/T267091) (owner: 10C. 
Scott Ananian) [15:01:42] (03CR) 10RLazarus: [C: 03+1] Revert "confd: only read from the master during the switchover" [puppet] - 10https://gerrit.wikimedia.org/r/639492 (owner: 10Jbond) [15:06:52] (03PS1) 10Mforns: Migrate EventLogging NewcomerTask to EventGate on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639539 (https://phabricator.wikimedia.org/T267333) [15:07:44] PROBLEM - Host kafka-jumbo1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:26] PROBLEM - Host db2077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:36] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 84 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [15:11:52] expected --^ [15:12:24] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 114 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [15:12:26] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 73 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [15:12:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 94 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [15:12:50] RECOVERY - Host db2077.mgmt is UP: PING OK - Packet loss 
= 0%, RTA = 41.84 ms [15:13:28] RECOVERY - Host kafka-jumbo1001.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 1.48 ms [15:13:28] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1008 is CRITICAL: 12 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [15:15:27] ottomata: already added this to the morning backport window https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/639539 [15:17:49] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki/memcached: stop using role inheritance [puppet] - 10https://gerrit.wikimedia.org/r/637742 (owner: 10Dzahn) [15:18:03] 10Operations, 10ops-codfw, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Papaul) a:05Papaul→03Kormat Before ` BIOS Version 2.4.3 Firmware Version 2.40.40.40 Lifecycle Controller Firmware 2.40.40.40 `` After `` BIOS Version 2.11.0 Firmware Version 2.75.75.75 Lifecy... [15:19:10] (03CR) 10C. Scott Ananian: [C: 04-2] "wmf.16 train expected to complete on Monday Nov 9, would be safe to swat after that." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: 10Ppchelko) [15:23:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1008 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [15:24:10] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [15:24:21] (03PS10) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [15:24:23] (03PS10) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [15:24:24] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [15:24:49] (03CR) 10jerkins-bot: [V: 04-1] AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [15:24:51] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Kormat) [15:24:56] 10Operations, 10ops-codfw, 10DBA: db2077 hung on reboot - https://phabricator.wikimedia.org/T267220 (10Kormat) 05Open→03Resolved Great, 
thanks :) [15:24:58] (03CR) 10jerkins-bot: [V: 04-1] Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 (owner: 10Ayounsi) [15:25:02] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [15:25:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [15:26:19] (03PS6) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [15:27:15] (03PS11) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [15:27:17] (03PS11) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [15:27:40] 10Operations, 10SRE-Access-Requests: dcaro has same ssh key in wmcs and prod, prod ssh key revoked - https://phabricator.wikimedia.org/T267292 (10RobH) 05Open→03Resolved >>! In T267292#6606098, @gerritbot wrote: > Change 639532 **merged** by Jbond: > [operations/puppet@production] admin: update dcaro ssh k... 
[15:30:36] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: 10Ayounsi) [15:31:57] (03PS5) 10Ladsgroup: [WIP] varnish: Improve wording of the browser security error a bit [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) [15:33:01] (03PS1) 10Ottomata: Disallow ContentTranslationAbuseFilter from legacy eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/639548 (https://phabricator.wikimedia.org/T259163) [15:35:11] 10Operations, 10ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T267160 (10Cmjohnson) Dell reached out and needed more information and raid log. I sent over to them now. [15:36:24] (03CR) 10Ladsgroup: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [15:37:17] (03CR) 10Ladsgroup: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [15:37:42] (03PS2) 10Ottomata: Disallow ContentTranslationAbuseFilter from legacy eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/639548 (https://phabricator.wikimedia.org/T259163) [15:37:44] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) @wiki_willy the crossover cable needs to be made. We have cat5 cable... 
[15:40:11] (03CR) 10Herron: [C: 03+2] kibana: add param to manage kibana.index, and use for ECS instance [puppet] - 10https://gerrit.wikimedia.org/r/639227 (owner: 10Herron) [15:41:08] (03CR) 10Ottomata: [C: 03+2] Disallow ContentTranslationAbuseFilter from legacy eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/639548 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [15:41:35] !log installing junit4 security updates [15:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:03] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey There are 2 480GB SSDs and 12 4TB disks in each of the servers. They are all unpacked and I can rack some but not all of them. [15:49:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:50:01] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [15:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:53:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [15:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:06] (03CR) 10Jbond: "LGTM very minor comments inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:58:43] 10Operations, 
10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey to answer some of the earlier questions. @wiki_willy and I identified all the 1G servers in 10G racks that we could potentially... [15:59:41] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) When these arrive they will be sitting on the floor until we have space to rack them. At this time I may be able to get 4 or 5 racked in 10G racks. [16:06:09] (03PS6) 10RLazarus: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. [software/spicerack] - 10https://gerrit.wikimedia.org/r/638753 [16:06:47] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1032 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [16:07:12] (03CR) 10RLazarus: [C: 03+2] "> Patch Set 5: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/638753 (owner: 10RLazarus) [16:08:26] (03PS6) 10Ahmon Dancy: openstack: nova: set cpu_model_extra_flags = vmx,pcid [puppet] - 10https://gerrit.wikimedia.org/r/638146 [16:09:54] (03Merged) 10jenkins-bot: decorators: Add an optional custom failure message to @retry. Use it in dnsdisc. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/638753 (owner: 10RLazarus) [16:11:39] (03CR) 10Jbond: [C: 03+2] Revert "confd: only read from the master during the switchover" [puppet] - 10https://gerrit.wikimedia.org/r/639492 (owner: 10Jbond) [16:13:18] (03CR) 10Razzi: "> Patch Set 1: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [16:13:55] (03PS9) 10Jbond: confd: pass srv_dns directly instead of loading confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) [16:15:20] (03PS1) 10Krinkle: mediawiki.action.edit.preview: Add versionCallback to improve startup perf [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639495 (https://phabricator.wikimedia.org/T266311) [16:17:06] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [16:17:24] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Depool an entire swift cluster for a datacenter and do performance testing of batch downloads of wiki media (querying swift and/or MediaWiki) - https://phabricator.wikimedia.org/T267338 (10jcrespo) [16:17:37] (03CR) 10Jbond: "PCC (noop): https://puppet-compiler.wmflabs.org/compiler1003/26337/" [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [16:18:33] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [16:19:28] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Depool an entire swift cluster for a datacenter and do performance testing of batch downloads of wiki media (querying swift and/or MediaWiki) - https://phabricator.wikimedia.org/T267338 (10jcrespo) Most likely this would mean a depool of codfw. Ne... 
[16:19:32] (03PS1) 10Hnowlan: postgres: change max replication connections to nodes + 6 [puppet] - 10https://gerrit.wikimedia.org/r/639576 (https://phabricator.wikimedia.org/T266820) [16:23:53] (03PS1) 10Ottomata: Remove wgEventLoggingSchemas ContentTranslationAbuseFilter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639579 (https://phabricator.wikimedia.org/T259163) [16:24:21] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) @Cmjohnson we are going to decommission 9 2U hosts soon. I can prioritize to decommission at least 3 of them in the next 2 weeks (in different rows) so... [16:24:28] (03PS4) 10Jbond: profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) [16:25:50] (03CR) 10jerkins-bot: [V: 04-1] profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:26:05] (03PS5) 10Jbond: profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) [16:26:15] !log shutting down kafka-jumbo1002 to allow dcops to upgrade NIC [16:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:16] (03PS12) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [16:27:16] (03PS12) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [16:27:26] (03CR) 10jerkins-bot: [V: 04-1] profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) 
[16:27:44] (03CR) 10jerkins-bot: [V: 04-1] Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 (owner: 10Ayounsi) [16:27:51] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) @Marostegui yes, db1091 is already gone from the racks. I did a more detailed count and right now, not removing any 1G servers from 10G racks I can fit... [16:27:55] (03CR) 10jerkins-bot: [V: 04-1] AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [16:29:04] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] "> Patch Set 2:" [software/gerrit] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/639121 (owner: 10Hashar) [16:29:12] (03PS1) 10Bstorm: toolforge bastion: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/639581 [16:29:30] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) Cool, we can plan for that. 
Count also with at least those 6U I am going to free up before those arrive [16:30:14] (03CR) 10Bstorm: [C: 03+2] toolforge bastion: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/639581 (owner: 10Bstorm) [16:30:32] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/639576 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [16:31:07] PROBLEM - Host kafka-jumbo1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:31:52] (03PS13) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [16:31:54] (03PS13) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [16:32:22] (03CR) 10jerkins-bot: [V: 04-1] AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) (owner: 10Ayounsi) [16:32:25] (03CR) 10jerkins-bot: [V: 04-1] Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 (owner: 10Ayounsi) [16:33:15] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={4,5} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasourc [16:33:15] ter=logging-eqiad&var-topic=All&var-consumer_group=All [16:36:19] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:36:43] RECOVERY - Host kafka-jumbo1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [16:37:47] (03PS14) 10Ayounsi: Add CSV import to 
AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [16:37:47] (03PS14) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [16:37:59] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:39:08] (03PS7) 10Ayounsi: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) [16:40:14] cscott: around? wondering if i should a) deploy for T267091 (https://gerrit.wikimedia.org/r/639491), and b) unmark T267033 as a train blocker. [16:40:15] T267091: Undefined index: . at Language.php:3348 - https://phabricator.wikimedia.org/T267091 [16:40:15] T267033: CommonsMetadata bad wfTimestamp call - https://phabricator.wikimedia.org/T267033 [16:41:01] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 110 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [16:41:03] (03CR) 10Ayounsi: [C: 03+2] Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: 10Ayounsi) [16:41:19] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 15 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [16:42:30] (03Merged) 10jenkins-bot: Add switch interface support to decom script [cookbooks] - 10https://gerrit.wikimedia.org/r/633723 (https://phabricator.wikimedia.org/T265341) (owner: 10Ayounsi) [16:42:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [16:44:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [16:44:28] (03CR) 10Brennen Bearnes: [C: 03+2] mediawiki.action.edit.preview: Add versionCallback to improve startup perf [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639495 (https://phabricator.wikimedia.org/T266311) (owner: 10Krinkle) [16:46:11] hi operations folks! [16:46:27] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [16:46:35] fundraising tech is trying to get a file served up in the root dir on donatewiki and thankyouwiki [16:46:53] that will make iOS devices omit opening those URLs in the Wikipedia app [16:47:02] (see https://phabricator.wikimedia.org/T259312 ) [16:47:30] A similar file exists under docroot/wikipedia.org in the mediawiki-config repo [16:47:56] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage: Depool an entire swift cluster for a datacenter and do performance testing of batch downloads of wiki media (querying swift and/or MediaWiki) - https://phabricator.wikimedia.org/T267338 (10ayounsi) Ok. Let me know when you're doing it. Ideally b... [16:49:11] (03PS2) 10Razzi: nginx: Remove profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) [16:51:32] (03PS15) 10Ayounsi: Add CSV import to AssignIPs script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 [16:51:34] (03PS15) 10Ayounsi: AssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [16:51:36] (03CR) 10Hnowlan: [C: 03+2] postgres: change max replication connections to nodes + 6 [puppet] - 10https://gerrit.wikimedia.org/r/639576 (https://phabricator.wikimedia.org/T266820) (owner: 10Hnowlan) [16:53:20] (03CR) 10Volans: "Small things inline" (0310 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635849 (owner: 10Ayounsi) [16:53:40] (03CR) 10CDanis: [C: 03+1] pcc: update PCC cli so that it posts to the gerrit change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/636652 (owner: 10Jbond) [16:54:58] Any suggestions how to configure a new apple-app-site-association file just for donate.wikipedia.org and thankyou.wikipedia.org? [16:57:28] ejegg: is that something that usually lives under the mediawiki-config docroots? 
[16:57:33] !log shutting down kafka-jumbo1003 to allow dcops to upgrade NIC [16:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:31] cdanis there's one there for the wikipedias in general [16:58:33] https://phabricator.wikimedia.org/T259312 [16:58:52] but we'll need a different one specifically for those two subdomains [16:59:03] ah yeah I see [17:00:00] ejegg: let me do a little poking around [17:00:04] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T1700). [17:00:36] (03PS1) 10Bstorm: cloud vps: unattendedupgrades seems to have a type issue for cleaning [puppet] - 10https://gerrit.wikimedia.org/r/639586 [17:00:39] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10CDanis) [17:01:12] (03PS16) 10Ayounsi: ssingIPs, cleanup and standardize logs format [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/635853 (https://phabricator.wikimedia.org/T265339) [17:02:53] (03CR) 10Bstorm: "The errors this is fixing:" [puppet] - 10https://gerrit.wikimedia.org/r/639586 (owner: 10Bstorm) [17:04:05] thanks cdanis! 
[17:05:25] RECOVERY - Host mw1267.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.61 ms [17:05:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [17:05:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:46] !log restarting maps2004 postgres for config change [17:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) mw1267 issues have been fixed [17:06:52] (03Merged) 10jenkins-bot: mediawiki.action.edit.preview: Add versionCallback to improve startup perf [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639495 (https://phabricator.wikimedia.org/T266311) (owner: 10Krinkle) [17:09:17] jbond, cdanis - i don't see any puppet patches for this window, so i'm going to go ahead and sling out a couple of things for train blockers. 
[17:09:26] brennen: sgtm [17:11:46] ack [17:13:52] !imported kubernetes 1.16.15 to component/kubernetes-future stretch-wikimedia [17:14:12] !log imported kubernetes 1.16.15 to component/kubernetes-future stretch-wikimedia [17:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:53] !log rebuilding cassandra on maps2002 [17:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:04] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=maps,service=kartotherian,name=maps2002.codfw.wmnet [17:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:07] PROBLEM - Host kafka-jumbo1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:20:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 122 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [17:20:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 294 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [17:20:33] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1009 is CRITICAL: 13 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [17:20:43] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 186 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [17:21:21] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.16/resources/Resources.php: Backport: [[gerrit:639495|mediawiki.action.edit.preview: Add versionCallback to improve startup perf (T266311)]] (duration: 01m 10s) [17:21:23] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 241 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [17:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:27] T266311: Live preview shows unparsed, with raw wikitext - https://phabricator.wikimedia.org/T266311 [17:21:45] RECOVERY - Host kafka-jumbo1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [17:21:57] PROBLEM - Host kafka-jumbo1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:57] (03PS1) 10Herron: logstash: add logstash1032 to elk7 ES cluster [puppet] - 10https://gerrit.wikimedia.org/r/639590 [17:23:39] PROBLEM - Host db1139.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:24:38] 10Operations, 10ops-eqiad: (Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr I am assigning this to @Jclark-ctr. John, the new scs is in the flexspace, all of the cable ends may need to be snipped and re-done with a standard ti... 
[17:25:27] RECOVERY - IPMI Sensor Status on mw1267 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:26:34] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.16/languages: Backport: [[gerrit:639491|language: Clean up $separatorTransformTable in km/la/my (T267091)]] (duration: 01m 12s) [17:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:41] T267091: Undefined index: . at Language.php:3348 - https://phabricator.wikimedia.org/T267091 [17:28:25] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [17:30:03] PROBLEM - cassandra CQL 10.192.48.165:9042 on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:30:03] PROBLEM - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:30:03] PROBLEM - cassandra service on maps2008 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:30:09] PROBLEM - cassandra CQL 10.192.16.31:9042 on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:30:11] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:13] PROBLEM - cassandra CQL 10.192.16.107:9042 on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:30:19] PROBLEM - cassandra service on maps2006 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:30:21] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:21] PROBLEM - cassandra service on maps2007 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:30:23] PROBLEM - cassandra CQL 10.192.0.155:9042 on maps2005 is CRITICAL: connect to address 10.192.0.155 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:30:29] PROBLEM - cassandra CQL 10.192.48.166:9042 on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:30:47] PROBLEM - cassandra service on maps2010 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:30:57] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:11] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:15] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:19] PROBLEM - cassandra service on maps2005 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:31:25] PROBLEM - cassandra service on maps2009 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:07] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:27] PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:38] downtimes expiring, my bad [17:32:53] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [17:32:55] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [17:33:05] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:33:07] RECOVERY - Host kafka-jumbo1002 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [17:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:24] cscott, DannyS712: based on ticket, i'm inclined to remove T267033 as a blocker and proceed with the train. input welcome. [17:33:25] T267033: CommonsMetadata bad wfTimestamp call - https://phabricator.wikimedia.org/T267033 [17:33:35] PROBLEM - Check systemd state on maps1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:44] (seems like a longstanding issue.) 
[17:35:03] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1009 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [17:35:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [17:35:51] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [17:36:11] (03PS6) 10Jbond: profile::sre::check_mail: new script for checking user emails [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) [17:36:21] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [17:40:15] 10Operations, 10Android-app-Bugs, 10Fundraising-Backlog, 10Thank-You-Page, and 4 others: Deal with donatewiki Thank You page launching in apps - https://phabricator.wikimedia.org/T259312 (10CDanis) Sorry, there's a small mess here -- parts of the relevant behavior are specified in the Puppet repo (where Ap... 
[17:40:57] (03CR) 10Jbond: [C: 03+2] profile::sre::check_mail: new script for checking user emails (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [17:41:10] !log train is currently unblocked; rolling to group0 (T263182) [17:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:16] T263182: 1.36.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T263182 [17:41:28] ejegg: left a hopefully-helpful comment -- take a look and let me know :) [17:41:37] (03CR) 10Jbond: profile::sre::check_mail: new script for checking user emails (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605596 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [17:41:47] thanks, taking a look [17:42:17] (03PS1) 10Brennen Bearnes: group0 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639592 [17:42:19] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639592 (owner: 10Brennen Bearnes) [17:42:44] cdanis: thanks for the suggestions! I'd be happy to take a stab at making those two patches [17:42:56] great :D [17:43:09] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639592 (owner: 10Brennen Bearnes) [17:44:33] (03PS1) 10Jbond: cluster::managment: drop old profile [puppet] - 10https://gerrit.wikimedia.org/r/639593 [17:44:48] (03CR) 10Jbond: [V: 03+2 C: 03+2] cluster::managment: drop old profile [puppet] - 10https://gerrit.wikimedia.org/r/639593 (owner: 10Jbond) [17:45:55] (03CR) 10CDanis: [C: 03+1] "nice cleanup, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/617716 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [17:46:06] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.16 [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:16] hmm, going to revert this one. [17:48:31] !log shutting down kafka-jumbo1004 to allow dcops to upgrade NIC [17:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:15] (03PS1) 10Jbond: profile::sre::check_user: fix hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/639594 [17:50:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] profile::sre::check_user: fix hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/639594 (owner: 10Jbond) [17:51:35] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group0 wikis to 1.36.0-wmf.14 [17:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:38] !log restart uwsgi-ores in all ores1* nodes per complaint on IRC that max redis clients have been reached T263910 [17:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:44] T263910: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 [17:52:47] (03PS1) 10Brennen Bearnes: Revert "group0 wikis to 1.36.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639595 [17:52:49] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group0 wikis to 1.36.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639595 (owner: 10Brennen Bearnes) [17:53:19] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10wiki_willy) Consolidated all the info @Cmjohnson provided in a Google doc, so we can add the service owners of the hosts and track future rack location, etc. below: https://d... 
[17:53:36] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.36.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639595 (owner: 10Brennen Bearnes) [17:54:18] (03PS1) 10Jbond: p:sre::check_user: fix script path [puppet] - 10https://gerrit.wikimedia.org/r/639596 [17:55:41] (03CR) 10Herron: [C: 03+2] mailman: Set the charset utf-8 as charset of English [puppet] - 10https://gerrit.wikimedia.org/r/637852 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [17:56:30] (03CR) 10Jbond: [C: 03+2] p:sre::check_user: fix script path [puppet] - 10https://gerrit.wikimedia.org/r/639596 (owner: 10Jbond) [17:56:53] PROBLEM - Host kafka-jumbo1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T1800). Please do the needful. [18:01:39] 10Operations, 10Wikimedia-Mailing-lists, 10I18n, 10Patch-For-Review, 10User-Ladsgroup: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup It's fixed now \... 
[18:02:33] RECOVERY - Host kafka-jumbo1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:04:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 98 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [18:05:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 122 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [18:05:59] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 129 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [18:06:01] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 96 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [18:06:21] (03PS1) 10Jbond: P:sre:check_user: add required dependencies [puppet] - 10https://gerrit.wikimedia.org/r/639597 [18:10:01] RECOVERY - Host db1139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [18:10:08] (03CR) 10Jbond: [C: 03+2] P:sre:check_user: add required dependencies [puppet] - 10https://gerrit.wikimedia.org/r/639597 (owner: 10Jbond) [18:10:16] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) @jcrespo mainboard replaced configured 
settings [18:14:57] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10RobH) The reimage of this host is giving me trouble. I've verified in the idrac bios setttings that IPMI over lan is enabled, but the script errors out with... [18:16:15] (03PS7) 10Ahmon Dancy: openstack: Enable support for nested VMs [puppet] - 10https://gerrit.wikimedia.org/r/638146 [18:16:47] (03CR) 10jerkins-bot: [V: 04-1] openstack: Enable support for nested VMs [puppet] - 10https://gerrit.wikimedia.org/r/638146 (owner: 10Ahmon Dancy) [18:16:54] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) 05Open→03Resolved [18:17:58] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) [18:18:18] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) This has sat ignored while I was doing procurement, but I'll pick this back up next week and start updating these again. (On clinic duty this week so I do... 
[18:20:56] (03PS1) 10Ladsgroup: mailman: Set utf-8 charset for all languages [puppet] - 10https://gerrit.wikimedia.org/r/639598 (https://phabricator.wikimedia.org/T261031) [18:24:39] (03CR) 10Herron: "LGTM overall, please see one nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639598 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [18:25:54] (03PS2) 10Ladsgroup: mailman: Set utf-8 charset for all languages [puppet] - 10https://gerrit.wikimedia.org/r/639598 (https://phabricator.wikimedia.org/T261031) [18:25:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [18:26:06] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install new controller into frdb1001 OR add to spares - https://phabricator.wikimedia.org/T261348 (10Cmjohnson) 05Open→03Resolved Had a conversation with Jeff about this and we're going to just hold on to the controller for now. There isn't any immedia... 
[18:26:23] (03CR) 10Ladsgroup: mailman: Set utf-8 charset for all languages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639598 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [18:26:39] (03PS8) 10Ahmon Dancy: openstack: Enable support for nested VMs [puppet] - 10https://gerrit.wikimedia.org/r/638146 [18:26:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [18:27:11] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [18:27:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [18:27:38] (03CR) 10Herron: [C: 03+2] mailman: Set utf-8 charset for all languages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639598 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [18:28:26] (03PS1) 10Ottomata: Remove no longer needed EventLoggingSchemas override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639600 (https://phabricator.wikimedia.org/T254606) [18:29:06] (03PS1) 10Ejegg: Special docroot for thankyou.wp.org (and donate) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639601 (https://phabricator.wikimedia.org/T259312) [18:31:25] (03PS3) 10Hnowlan: maps: add maps100[5-8] 
and maps1010 [puppet] - 10https://gerrit.wikimedia.org/r/638125 [18:32:04] !log shutting down kafka-jumbo1005 to allow dcops to upgrade NIC [18:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:50] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10JavaScript, and 2 others: Javascript errors: Unable to add datalayers to map - https://phabricator.wikimedia.org/T267296 (10sdkim) a:03hnowlan [18:35:46] (03CR) 10Hnowlan: "pcc looks okay. I will add but not pool these nodes tomorrow and kick off the replication process https://puppet-compiler.wmflabs.org/comp" [puppet] - 10https://gerrit.wikimedia.org/r/638125 (owner: 10Hnowlan) [18:42:08] (03CR) 10Bstorm: [C: 03+2] "Since this is currently broken, I'll merge this now (can't break it much more). If there's any form or best practices objections to the ma" [puppet] - 10https://gerrit.wikimedia.org/r/639586 (owner: 10Bstorm) [18:42:28] (03CR) 10Herron: [C: 03+2] logstash: add logstash1032 to elk7 ES cluster [puppet] - 10https://gerrit.wikimedia.org/r/639590 (owner: 10Herron) [18:42:53] bstorm: you can multiple mine if you want [18:43:03] 👍🏻 [18:43:05] Will do [18:43:07] kk thanks! [18:43:23] done [18:44:26] great thx [18:44:44] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10nskaggs) Thanks for the explanations @ayounsi . > the cloudgw project we are currently evaluating (see T261724) might change this subnet again. Perhaps it would be wise... 
[18:46:25] (03PS1) 10JMeybohm: Make the build pdebuild compatible [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/639604 (https://phabricator.wikimedia.org/T266766) [18:47:09] (03CR) 10JMeybohm: [C: 03+2] Make the build pdebuild compatible [debs/kubernetes] (future) - 10https://gerrit.wikimedia.org/r/639604 (https://phabricator.wikimedia.org/T266766) (owner: 10JMeybohm) [18:48:13] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 133 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [18:48:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 97 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [18:48:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 113 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [18:48:49] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 82 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [18:50:04] (03PS1) 10Bstorm: cloud-vps: fix quotes in the last fix [puppet] - 10https://gerrit.wikimedia.org/r/639605 [18:50:52] (03PS2) 10Bstorm: cloud-vps: fix quotes in the last fix [puppet] - 10https://gerrit.wikimedia.org/r/639605 [18:53:00] (03CR) 
10Bstorm: [C: 03+2] cloud-vps: fix quotes in the last fix [puppet] - 10https://gerrit.wikimedia.org/r/639605 (owner: 10Bstorm) [18:55:11] (03PS1) 10Bartosz Dziewoński: Fix DiscussionTools wikis config for thwiki/tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639606 (https://phabricator.wikimedia.org/T266303) [18:55:17] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [18:56:13] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1032 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 13, number_of_data_nodes: 7, active_primary_shards: 462, active_shards: 875, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_f [18:56:13] ask_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:56:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [18:56:33] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [18:56:35] RECOVERY - Kafka Broker Under Replicated 
Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [18:56:47] RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:46] (03PS1) 10C. Scott Ananian: Don't double-format numeric edit count [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639500 (https://phabricator.wikimedia.org/T267362) [18:58:03] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10AMooney) [18:58:12] (03CR) 10C. Scott Ananian: [C: 03+2] Don't double-format numeric edit count [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639500 (https://phabricator.wikimedia.org/T267362) (owner: 10C. Scott Ananian) [18:58:37] (03CR) 10Razzi: [C: 03+2] nginx: Remove profile::tlsproxy::service [puppet] - 10https://gerrit.wikimedia.org/r/638185 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [18:59:58] ejegg: ah I didn't realize we had to serve this on thankyouwiki as well... but ofc that makes sense [19:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:00:05] I am not sure how much that will complicate things [19:00:13] RECOVERY - Check systemd state on maps1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:32] cdanis: yeah, actually, if we can just do it on thankyouwiki for starters, that might be enough [19:01:36] before we had thankyouwiki, we were using donate.wikiPedia.org for the thank you pages for a short time, but then someone convinced us it was a bad idea to try to dual-host donate at wikiMedia.org and wikiPedia.org [19:01:50] (03PS1) 10Hnowlan: tilerator: enable in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/639608 (https://phabricator.wikimedia.org/T254014) [19:01:53] right [19:01:55] so we're currently only sending donors to thankyou.wikiPedia.org [19:02:24] and the apps generally don't hijack links to wikiMedia subdomains [19:02:45] (03Merged) 10jenkins-bot: Don't double-format numeric edit count [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639500 (https://phabricator.wikimedia.org/T267362) (owner: 10C. Scott Ananian) [19:02:54] ah yeah, and I see we are serving the apple-app-site-association file on the thankyou domain [19:04:01] uhhh [19:04:07] i added a patch for the current slot [19:04:11] jouncebot: ? [19:04:17] hi MatmaRex :) [19:04:25] jouncebot: refresh [19:04:26] I will have a SWAT patch shortly, too. [19:04:26] I refreshed my knowledge about deployments. [19:04:32] cscott: are you deploying? :-) [19:04:41] hi Urbanecm [19:04:50] cdanis: hmm, what about simply adding an extra virtualhost before the wikipedia one at https://gerrit.wikimedia.org/g/operations/puppet/+/711af11b1b87c0cc7732ad93776aa8c8a6d4089b/modules/mediawiki/templates/apache/sites/wwwportals.conf.erb? 
[19:05:29] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10AntiCompositeNumber) I think it would be useful to have a page or section outlining the //de facto// requirements for connecting to the flagship Wikimedia sites. ht... [19:05:42] Urbanecm: I guess that would probably work, although I think you know better than I :) [19:06:45] (03PS1) 10Gergő Tisza: Suggested edits: Export task count from start editing dialog [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639501 (https://phabricator.wikimedia.org/T266868) [19:06:50] cdanis: afaik, apache should just take the first virtualhost that matched - but I'm not a sre and I'm not sure whether that's the best solution [19:07:02] MatmaRex: I'll deploy your patch :) [19:07:16] thanks [19:07:22] (03CR) 10Urbanecm: [C: 03+2] Fix DiscussionTools wikis config for thwiki/tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639606 (https://phabricator.wikimedia.org/T266303) (owner: 10Bartosz Dziewoński) [19:07:34] Urbanecm: I think you're right, but also I still haven't pieced out the interactions between this and prod_sites.pp [19:08:11] (03Merged) 10jenkins-bot: Fix DiscussionTools wikis config for thwiki/tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639606 (https://phabricator.wikimedia.org/T266303) (owner: 10Bartosz Dziewoński) [19:09:10] grr [19:09:12] undeployed code [19:09:46] brennen: deployment host says "Your branch is ahead of 'origin/master' by 1 commit." and you have the last commit [19:09:49] can you fix that please? [19:10:41] Urbanecm: one second. 
[19:10:44] sure [19:11:00] (note I merged a patch to mw-config already) [19:11:19] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:13:25] Urbanecm: after fetch, `git log -p HEAD..@{u}` on mediawiki-staging shows a single InitialiseSettings change; i assume that's yours? [19:13:33] yes :) [19:13:34] thanks [19:13:44] I'm continuing then :) [19:14:12] MatmaRex: your patch is available at mwdebug1002, can you test, please? [19:14:28] yeah [19:14:49] (03CR) 10Herron: [C: 03+1] alertmanager: add ack daemon [puppet] - 10https://gerrit.wikimedia.org/r/638998 (https://phabricator.wikimedia.org/T266535) (owner: 10Filippo Giunchedi) [19:14:59] Urbanecm: looks good [19:15:03] syncing [19:15:13] (03CR) 10Herron: [C: 03+1] alertmanager: enable acks and silences on alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/638999 (https://phabricator.wikimedia.org/T266535) (owner: 10Filippo Giunchedi) [19:15:25] tgr_: should i do your backport too? [19:16:06] (03CR) 10Herron: [C: 03+1] profile: add prometheus jobs for am acks [puppet] - 10https://gerrit.wikimedia.org/r/639000 (https://phabricator.wikimedia.org/T266535) (owner: 10Filippo Giunchedi) [19:16:44] Urbanecm: yes, thanks! [19:16:53] MatmaRex: and...it's live! 
:) [19:16:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 453b9c64c44a256eafdfafe7a0023484377bbbd2: Fix DiscussionTools wikis config for thwiki/tgwiki (T266303) (duration: 01m 08s) [19:16:57] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:01] T266303: Enable Reply Tool as Beta Feature on "Phase 3" wikis - https://phabricator.wikimedia.org/T266303 [19:17:05] Urbanecm: note that this just merged https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/639500/ [19:17:13] liable to show up on a fetch for wmf.16 [19:17:18] brennen: I'm aware, thanks for the notice :) [19:17:38] but cscott doesn't seem to be :( [19:17:50] thanks [19:17:56] I'm not touching centralauth, so it shouldn't matter [19:18:10] (03CR) 10Urbanecm: [C: 03+2] Suggested edits: Export task count from start editing dialog [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639501 (https://phabricator.wikimedia.org/T266868) (owner: 10Gergő Tisza) [19:18:18] tgr_: will ping you once ready [19:21:03] cdanis: actually...ignore that suggestion, prod_sites.pp is a better choice. that apache config seems to be for portals rather than for mediawiki itself [19:21:21] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:07] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:27:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=idp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:28:13] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [19:28:13] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [19:28:59] (03Merged) 10jenkins-bot: Suggested edits: Export task count from start editing dialog [extensions/GrowthExperiments] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639501 (https://phabricator.wikimedia.org/T266868) (owner: 10Gergő Tisza) [19:29:31] (03PS1) 10Urbanecm: Revert "Don't double-format numeric edit count" [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639502 (https://phabricator.wikimedia.org/T267362) [19:29:44] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "to unblock an ongoing window" [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639502 (https://phabricator.wikimedia.org/T267362) (owner: 10Urbanecm) [19:30:37] tgr_: can you test via mwdebug1002, please? 
[19:31:57] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [19:32:41] (03CR) 10Gehel: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/636811 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper) [19:33:34] tgr_: how does it go? :-) [19:35:50] (03CR) 10Gehel: [C: 03+1] "LGTM in principles, but please check with Erik before merging." [puppet] - 10https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266911) (owner: 10Ryan Kemper) [19:37:26] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [19:37:30] Urbanecm: please ping me when you're finished with backports. [19:37:35] sure brennen [19:37:39] thanks [19:37:48] I'm waiting on tgr_ now... [19:37:56] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:39:18] sorry, had to go afk for a few minutes [19:39:26] tgr_: no problem :) [19:39:29] let me know if it looks good [19:42:30] Urbanecm: looks good [19:42:37] thank you, syncing [19:42:41] thanks! [19:43:42] brennen: heh, seems I didn't have to force-revert it after all :-) [19:44:18] brennen: anyway, feel free to deploy once scap says I'm done [19:44:27] Urbanecm: ack, thanks. [19:44:37] cscott: about, by any chance? 
[19:44:38] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.16/extensions/GrowthExperiments/modules/homepage/: 81cb1c7b141d49d7fc931fdc13ffd1b48b3a25ab: Suggested edits: Export task count from start editing dialog (T266868; T263040) (duration: 01m 07s) [19:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:46] tgr_: live :) [19:44:46] T266868: Suggested edits activation: JavaScript error in updatePager - https://phabricator.wikimedia.org/T266868 [19:44:46] T263040: [leftovers] Newcomer tasks: show skeleton of filter and article card UI elements *on load* - https://phabricator.wikimedia.org/T263040 [19:44:50] !log Morning B&C window done [19:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:57] 10Operations, 10Technical-blog-posts, 10Traffic: 2nd part of blog post series: the evolution of Wikimedia's Content Delivery Network - https://phabricator.wikimedia.org/T266857 (10srodlund) @ema just checking in on this. Do you have a draft you are currently working on? I'll be on vacation next week and won... [19:54:31] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [19:55:44] 10Operations, 10ops-eqiad, 10decommission-hardware, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:56:09] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [19:56:13] 10Operations, 10ops-eqiad, 10decommission-hardware, 10netops: Decommission asw-b-eqiad - https://phabricator.wikimedia.org/T208788 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:56:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1030 with 10G interfaces - https://phabricator.wikimedia.org/T266623 (10wiki_willy) a:03Cmjohnson [19:57:04] (03PS1) 10Brennen Bearnes: Revert "Revert "Don't double-format numeric edit count"" [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639503 [19:57:17] 10Operations, 10ops-eqiad: Degraded RAID on an-presto1004 - https://phabricator.wikimedia.org/T267160 (10wiki_willy) a:03Cmjohnson [19:57:18] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Revert "Don't double-format numeric edit count"" [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639503 (owner: 10Brennen Bearnes) [19:59:03] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10RobH) [20:00:04] brennen and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201105T2000). [20:00:18] (03CR) 10MSantos: [C: 03+1] tilerator: enable in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/639608 (https://phabricator.wikimedia.org/T254014) (owner: 10Hnowlan) [20:01:51] current train status / plan: testing an ubn patch then will roll to group0, give that a bit to bake in, and move forward to group1. 
[20:02:43] (03Merged) 10jenkins-bot: Revert "Revert "Don't double-format numeric edit count"" [extensions/CentralAuth] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639503 (owner: 10Brennen Bearnes) [20:09:49] (03PS1) 10Bstorm: toolforge bastion: safelist shells and related procs for the killer [puppet] - 10https://gerrit.wikimedia.org/r/639617 (https://phabricator.wikimedia.org/T266300) [20:15:25] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.16/extensions/CentralAuth/includes/specials/SpecialCentralAuth.php: Backport: [[gerrit:639500|Dont double-format numeric edit count (T267362)]] (duration: 01m 06s) [20:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:33] T267362: Use of Language::formatNum with a non-numeric string was deprecated in MediaWiki 1.36. [Called from Language::formatNum] - https://phabricator.wikimedia.org/T267362 [20:16:40] (03PS1) 10Brennen Bearnes: group0 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639619 [20:16:41] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639619 (owner: 10Brennen Bearnes) [20:17:34] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639619 (owner: 10Brennen Bearnes) [20:19:20] (03CR) 10BryanDavis: toolforge bastion: safelist shells and related procs for the killer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639617 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [20:19:33] (03PS1) 10BryanDavis: toolforge bastion: tweak email wording for process killer [puppet] - 10https://gerrit.wikimedia.org/r/639620 (https://phabricator.wikimedia.org/T266300) [20:19:47] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.16 [20:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:15] 10Operations, 10ops-eqiad, 
10Analytics-Radar, 10User-Elukey: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10Cmjohnson) [20:21:57] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10Patch-For-Review, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Cmjohnson) 05Open→03Resolved this has been completed [20:22:37] !log train: waiting ~15 minutes before rolling forward to group1. [20:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:33] (03CR) 10Bstorm: toolforge bastion: safelist shells and related procs for the killer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/639617 (https://phabricator.wikimedia.org/T266300) (owner: 10Bstorm) [20:30:32] (03PS2) 10Bstorm: toolforge bastion: safelist shells and related procs for the killer [puppet] - 10https://gerrit.wikimedia.org/r/639617 (https://phabricator.wikimedia.org/T266300) [20:32:34] (03PS3) 10Bstorm: toolforge bastion: safelist shells and related procs for the killer [puppet] - 10https://gerrit.wikimedia.org/r/639617 (https://phabricator.wikimedia.org/T266300) [20:35:02] (03CR) 10Hnowlan: [C: 03+2] tilerator: enable in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/639608 (https://phabricator.wikimedia.org/T254014) (owner: 10Hnowlan) [20:37:05] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10RobH) >>! In T260448#6604137, @wiki_willy wrote: > Tracking #935433832396 Entered ticket 1-202596400888 to have this received and put in our... [20:37:11] hey razzi - I see you have an unmerged change to profile::tlsproxy::service. Is that safe to merge? 
[20:38:02] (03PS1) 10Brennen Bearnes: group1 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639623 [20:38:04] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639623 (owner: 10Brennen Bearnes) [20:38:49] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639623 (owner: 10Brennen Bearnes) [20:39:42] !log finished removenode of maps2002 cassandra [20:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:53] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.16 [20:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:33] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.16 (duration: 01m 39s) [20:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:40] \o/ [20:44:46] aaaaaand rolling back to group0. [20:44:51] what happened? [20:45:03] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 1.317e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:45:20] uhh... [20:46:08] wowza [20:46:18] usually scap catches something that drastic [20:46:45] PROBLEM - Disk space on maps2003 is CRITICAL: DISK CRITICAL - free space: /srv 57385 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2003&var-datasource=codfw+prometheus/ops [20:46:52] hnowlan: is that the removal of that file? i'm pretty sure that is safe [20:47:00] its just removing unused code iiuc [20:47:03] hrm, but error is only for commons, so maybe not, I guess. 
[20:47:16] still, it's one of the big wikis - it should've [20:47:16] i think it hit just after the canary check [20:47:21] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.36.0-wmf.14 [20:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:55] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.36.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639624 [20:47:57] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.36.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639624 (owner: 10Brennen Bearnes) [20:48:19] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:48:42] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.36.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/639624 (owner: 10Brennen Bearnes) [20:49:29] ottomata: ah, cool [20:51:41] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:53:19] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [20:54:29] RECOVERY - Check systemd state on maps1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:37] RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:47] RECOVERY - Check systemd state on maps1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:53] RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:58] !log reenabled tilerator in eqiad [20:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:31] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 59482112 and 16609 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:59:32] hnowlan: Yes it is; it's a no-op. Apologies for the late reply [21:05:19] brennen: is there a task for the SlotRoleRegistry bug yet? [21:07:14] cscott: T267146 [21:07:15] T267146: LogicException: Role mediainfo is already defined - https://phabricator.wikimedia.org/T267146 [21:07:23] RECOVERY - Disk space on maps2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2003&var-datasource=codfw+prometheus/ops [21:07:37] I tried to upload something cscott, but dunno if it's the right way forward [21:08:21] I'm making a task for the 'formatNum' error, which was the only other wmf.16 thing I saw in the logs [21:09:06] cscott: isn't that T266677? [21:09:07] T266677: Use of FormatMetadata::formatNum with non-numeric value was deprecated in MediaWiki 1.36. 
[Called from FormatMetadata::makeFormattedData] - https://phabricator.wikimedia.org/T266677 [21:10:22] Urbanecm: different stack trace [21:10:29] related though, yes [21:10:33] okay, didn't compared [21:10:36] *compare [21:12:09] thanks, was just in the process of trying to figure that out. [21:15:47] T267370 - at some point we should probably downgrade this to a debug log, rather than block the train for whack-a-mole [21:15:47] T267370: Use of FormatMetadata::formatNum with non-numeric value was deprecated in MediaWiki 1.36. [Called from FormatMetadata::makeFormattedData] - https://phabricator.wikimedia.org/T267370 [21:16:14] but i figure i've got until T267146 is fixed before it's me holding up the train [21:16:15] T267146: LogicException: Role mediainfo is already defined - https://phabricator.wikimedia.org/T267146 [21:17:45] i do favor not operating the train as whack-a-mole. [21:17:52] when at all feasible. [21:24:21] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10RobH) [21:25:38] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10RobH) a:03Swagoel I've updated the task description with the checklist for whoever is on SRE clinic duty to process this once the info is adde... [21:28:31] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) So there is an issue here, and unfortunately without @ZS commenting directly, this is blocked. The tas... 
[21:29:26] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) [21:31:45] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) @nuria: This request includes access to two analytics groups, which contain sudo rights. As such, we'r... [21:32:36] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) p:05Triage→03Medium [21:33:06] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10Nuria) This approvals are now handled by @Ottomata [21:33:08] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) [21:34:34] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) >>! In T267312#6607598, @Nuria wrote: > This approvals are now handled by @Ottomata Apologies, it tur... [21:36:19] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) This request includes the 'restricted' user group, which is a sudo enabled group for deployment access... 
[21:36:42] (03PS2) 10Bstorm: wmcs wikireplicas: add a dry_run option [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) [21:37:08] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) a:03ZS >>! In T267312#6607578, @RobH wrote: > So there is an issue here, and unfortunately without @Z... [21:38:06] 10Operations, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research volunteer Swagoel - https://phabricator.wikimedia.org/T267314 (10RobH) p:05Triage→03Medium [21:38:24] (03CR) 10Bstorm: "I think this might still be worth it to short circuit before the IRC logger at least." [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [21:39:32] 10Operations, 10DNS, 10Traffic, 10netbox, and 2 others: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation - https://phabricator.wikimedia.org/T266331 (10nskaggs) I was able to discuss this with @faidon briefly. I'll summarize the discussion below. //Note, w... [21:41:55] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [21:42:02] (03CR) 10Volans: "> Patch Set 2:" [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [21:49:24] (03CR) 10Bstorm: "Ahah! --dry-run is an argument to cookbook, not to *this* cookbook. I'll add that to the doc about this process. 
Should I still switch the" [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [21:50:17] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10Ottomata) > Apologies, it turns out analytics-privatedata-users doesn't appear to be a sudo role? (It has no... [21:52:09] (03CR) 10Volans: "> Patch Set 2:" [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [21:54:28] (03PS3) 10Bstorm: wmcs wikireplicas: Fix the logger [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) [21:55:58] (03CR) 10Volans: [C: 03+1] "LGTM! To merge you can just +2, CI will run and auto-merge. As for deployment the next puppet run on the cumin hosts will update it." [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [21:56:45] (03PS4) 10Bstorm: wmcs wikireplicas: Fix the logger [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) [21:58:14] (03CR) 10Bstorm: [C: 03+2] wmcs wikireplicas: Fix the logger [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [21:59:36] (03Merged) 10jenkins-bot: wmcs wikireplicas: Fix the logger [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [22:04:22] (03CR) 10Bstorm: "Docs updated and all that. Thanks for the help!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [22:05:05] (03CR) 10Volans: "> Patch Set 4:" [cookbooks] - 10https://gerrit.wikimedia.org/r/639320 (https://phabricator.wikimedia.org/T266266) (owner: 10Bstorm) [22:34:34] (03PS1) 10Subramanya Sastry: Bump wikimedia/parsoid to 0.13.0-a16 [vendor] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639504 (https://phabricator.wikimedia.org/T267146) [22:34:56] (03CR) 10Subramanya Sastry: [C: 03+2] "self-merging the cherry-pick" [vendor] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639504 (https://phabricator.wikimedia.org/T267146) (owner: 10Subramanya Sastry) [22:35:18] brennen, ^ should i not have +2ed that? [22:35:44] i can stop it if not.. because i know Urbanecm had flagged that for scott's patch earlier. [22:35:47] subbu: and don't forget to verify whether mediawiki-core picked up that git module update [22:35:52] subbu: no i think that's good [22:35:58] cscott, ok. [22:36:11] was just going to mention i +2'd the other one, which maybe i should not have done... [22:36:24] subbu, Urbanecm: did? I'd learned from James_F that a self-C+2 on a cherry-pick is ok, especially for UBN sorts of things [22:36:28] I +2ed the cherry-pick bcause you +2ed the other one ;-) [22:36:33] cscott: it is fine [22:36:43] but not if you disappear and a deployment window started :-) [22:37:08] (it's not self-merge that was my issue, it's that I suddently had undeployed code to deal with) [22:37:14] oh, i was actually wondering what the process was after the cherry-pick lands; i figured it would have to be deployed somehow [22:37:25] yeah, issue isn't self +2 on a cherry-pick, just it gets confusing to have an undeployed patch hanging out. 
[22:37:33] i checked ops and saw that brennen had manually sync'ed them [22:37:39] cscott: well whoever +2'es that is responsible for deploying it when it merges [22:37:46] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Zxane Soo - https://phabricator.wikimedia.org/T267312 (10RobH) [22:37:49] for clarity, i also probably should have held the earlier backport window for the blocked train [22:37:57] (if you're not sure how to deploy it, it's not a good idea to +2 it) [22:37:58] it has been a slightly ragged week all around for managing deploys [22:38:01] Urbanecm: ah! [22:38:21] that's the difference between cherry-picking to a train branch and cherry-picking to (say) 1.35. lesson learned! [22:38:25] yes! [22:38:49] read https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Problem:_undeployed_code if interested [22:39:30] i guess we wait on zuul at this point. [22:39:41] yes [22:39:45] so for clarity... actually, i need to look at state of the blockers. might be we can't go forward anyway this afternoon. [22:40:01] * brennen gets bearings [22:41:25] cscott: so, to summarize it, a .wmf backport can (and usually is) self-merged if a) you're going to deploy it b) it got merged to master (as in, reviewed in master) [22:41:31] both the known blockers have been +2ed afaik. [22:41:39] fixes for them, i mean. [22:41:50] yep, looks like it [22:42:09] it seems both you and subbu have deployment, but happy to answer any questions :) [22:43:04] * subbu has never actually deployed core / swat code before ... only lots of parsoid code back in the parsoid/js days. :) [22:43:32] hmm...so...who is going to actually deploy it? :D [22:43:53] won't it be part of the train? [22:43:57] no [22:44:04] you merged a .wmf branch [22:44:05] subbu: yeah, see, that was my mistake too! [22:44:11] *a patch to wmf branch [22:44:18] yeah, the code still needs to be synched. 
[22:44:24] it doesn't magically ride the train [22:44:24] that means you're taking responsibility for it to be deployed [22:44:37] (that's why only people with deploy access can +2 there) [22:44:43] for clarity, i will deploy both. [22:44:47] thanks brennen ! [22:45:01] cscott: what's the status of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/639635 - i was about to cherry pick then another patch came in there... [22:45:08] did we double-check that core's .gitmodules got updated appropriately? [22:45:13] i will need to read up and understand this better. [22:45:20] that vendor patch hasn't merged yet. [22:45:31] cscott: you mean, when updating an extension? [22:45:34] (ie. backporting) [22:45:42] brennen: as i was digging into it i discovered a bunch of other cameras that generated this same bogus(?) tag, so i updated the commit message and comments to better describe the situation [22:45:46] brendan_campbell: no code changes [22:46:05] cscott: ah, gotcha. i'll go ahead and cherry pick then. [22:46:06] and i haven't found anything in the logs from the group1 deploy due to anything *other* than this particular gpsaltituderef tag [22:46:47] Urbanecm: T259832 [22:46:47] T259832: mediawiki-vendor submodule doesn't get automatically bumped on release branches - https://phabricator.wikimedia.org/T259832 [22:46:55] got it [22:47:32] (03PS1) 10Brennen Bearnes: media: Support GPSAltitudeRef exif tag [core] (wmf/1.36.0-wmf.16) - 10https://gerrit.wikimedia.org/r/639505 (https://phabricator.wikimedia.org/T267370) [22:47:32] Urbanecm: we got bit by that in wmf.3, and legoktm wrote a patch to fix it... but then his patch was never merged and somehow magically the last few vendor cherry-picks *have* updated core appropriately [22:47:52] i guess we'll see once it merges [22:47:55] it's pretty easy to verify that [22:47:57] o.O [22:48:06] yeah. [22:48:33] Urbanecm, just so i understand this .. 
so the deploy of the cherry-picked code is required because group0 already has had the train deployed there, correct? [22:48:46] otherwise, this would have been part of normal train deploy. [22:48:48] subbu: more precisely, because the branch exists [22:49:02] subbu: yeah, there's no real sync w/ the train deploy as i understand it [22:49:08] ok [22:49:21] 'deploying to group0' is just changing a file in the http server config to change what directory it points to [22:49:25] brennen would probably understand that better, but it's safer to sync it than not to sync it [22:49:38] once you change the train code, you still need to do something to actually update that directory [22:49:39] cscott: wmf.16 _does_ have .gitmodules [22:49:40] Operations, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (BPirkle) [22:49:51] cscott, aah .. reg. pointer twiddling .. now that makes sense. [22:50:03] right, the code is already on all servers. to oversimplify somewhat, the train is rolled forward by updating wikiversions. code changes have to be synched out on their own. [22:50:04] so, it has nothing to do with group 0/1/2 .. [22:50:11] got it. [22:51:25] thanks for the patient explanations. [22:51:40] it is not an un-confusing subject. :) [22:52:02] subbu: no problem :) [22:53:31] (CR) Brennen Bearnes: [C: +2] media: Support GPSAltitudeRef exif tag [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639505 (https://phabricator.wikimedia.org/T267370) (owner: Brennen Bearnes) [22:53:51] Urbanecm, brennen: I don't see a patch on wmf/1.36.0-wmf.16 corresponding to the mediawiki-vendor update? [22:54:01] in mediawiki-core [22:54:19] my vendor patch hasn't merged yet .. presumably that is what would update it? 
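[editor's note] The 22:49-22:50 exchange above describes two separate steps: rolling the train only changes which branch a wiki points at (wikiversions), while a backport merged to a wmf branch reaches servers only once someone syncs it. A minimal toy sketch of that distinction follows, assuming nothing about scap's real implementation. `ToyCluster`, `roll_train`, `sync_patch`, `patches_for`, and the patch label `parsoid-bump` are hypothetical names invented for illustration, and the two wiki names are just example members of an early and a later group.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model only: per-branch code trees plus a wiki -> branch pointer map."""

    # Branch name -> patches that have actually been synced to servers.
    deployed: dict[str, set[str]] = field(default_factory=dict)
    # Wiki db name -> branch it is served from (the "wikiversions" idea).
    wikiversions: dict[str, str] = field(default_factory=dict)

    def roll_train(self, wikis: list[str], branch: str) -> None:
        """Repoint wikis at a branch. No code on disk changes."""
        if branch not in self.deployed:
            raise ValueError(f"branch {branch!r} is not on the servers")
        for wiki in wikis:
            self.wikiversions[wiki] = branch

    def sync_patch(self, branch: str, patch: str) -> None:
        """Push a merged backport into a branch's code tree."""
        if branch not in self.deployed:
            raise ValueError(f"branch {branch!r} is not on the servers")
        self.deployed[branch].add(patch)

    def patches_for(self, wiki: str) -> set[str]:
        """Patches a wiki actually runs: whatever its branch has synced."""
        return self.deployed[self.wikiversions[wiki]]


# Both branches already exist on the servers; every wiki starts on wmf.14.
cluster = ToyCluster(
    deployed={"1.36.0-wmf.14": set(), "1.36.0-wmf.16": set()},
    wikiversions={"testwiki": "1.36.0-wmf.14", "commonswiki": "1.36.0-wmf.14"},
)

# Rolling the early group forward is only a pointer change.
cluster.roll_train(["testwiki"], "1.36.0-wmf.16")
before_sync = set(cluster.patches_for("testwiki"))

# A backport merged to the wmf.16 branch is not live until it is synced.
cluster.sync_patch("1.36.0-wmf.16", "parsoid-bump")
after_sync = set(cluster.patches_for("testwiki"))

print(before_sync, after_sync, cluster.patches_for("commonswiki"))
```

In this model the sync affects every wiki already pointed at the branch, and a later `roll_train` for another group picks up the synced patch with no further code change, which matches the "it has nothing to do with group 0/1/2" conclusion in the chat.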
[22:54:27] oh, ok [22:54:29] yeah, the vendor update's still in gate-and-submit [22:54:31] i'll wait patiently [22:54:32] yes, it needs to be merged [22:54:43] 3 minutes remaining, allegedly. [22:54:57] (gate-and-submit-wmf, rather.) [22:54:58] cscott: you won't see that as a gerrit patchset, it's a direct commit in the wmf branch [22:54:59] jenkins is a habitual liar [22:55:14] presumably checkouting that branch and fetching would show you that [22:55:18] but brennen will see that when deploying [22:55:31] Urbanecm: yeah, i was `git pull`ing my branch of mediawiki-core to see if there was a commit there [22:56:33] if you see other backports in git log, you should see this one too (hopefully) [22:56:43] we're going to bump up against our theoretical deploy cutoff here. in the interests of (hopefully) getting things to group1 today, i'll sync both of these things once merged and try to roll forward one last time for the day. [22:56:56] if anything crops up, we'll leave it at group0 for the weekend. [22:57:09] i should probably dust off my deploy skills at some point. i used to swat back 5 yrs ago, but haven't done it recently enough to feel comfortable doing it. [22:57:34] pointers make deploys and reverts efficient and quick ... but, just like all pointers, they can be somewhat confusing if mistaken for their contents. :) [22:57:49] brennen: the formatnum stuff is just deprecation log spam, if that's all that shows up in group1 logs i wouldn't mind leaving it for the weekend just to collect a larger sample of the possible bad formatnum callsites [22:58:06] subbu: wise words [22:58:37] cscott: if it were a low volume i'd have no objection, but at the rates it was coming in, it can make more meaningful stuff harder to diagnose. [22:58:44] +1 [22:58:48] Urbanecm, :) [23:01:48] cscott / subbu: is the parsoid change liable to be testable on mwdebug1002? [23:02:07] no. but, on beta cluster if beta cluster updates quickly enough. 
[23:02:25] i guess we do know quite quickly if it's broken on commons. :\ [23:03:43] (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a16 [vendor] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639504 (https://phabricator.wikimedia.org/T267146) (owner: Subramanya Sastry) [23:05:16] \o/ [23:05:20] fingers crossed brennen [23:05:28] and cscott / subbu too :-) [23:05:55] :) [23:09:20] normally i would have seen t-267146 (which dannys filed tue) y'day and caught it then .. but was caught up in election result tracking (which i had no control over anyway .. lol ..). [23:09:53] i can relate. [23:09:58] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.16/vendor: Backport: [[gerrit:639504|Bump wikimedia/parsoid to 0.13.0-a16 (T267146)]] (duration: 01m 14s) [23:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:05] T267146: LogicException: Role mediainfo is already defined - https://phabricator.wikimedia.org/T267146 [23:10:37] cscott, do the necessary git modules exist? [23:10:53] ok, the parsoid change is synced but train is still at group0. [23:11:03] k [23:11:09] now just waiting on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/639505 [23:11:16] and yeah, did seem to update vendor submodule as expected. [23:12:00] what a good time for my ISP to suddenly start dropping packets. [23:13:12] Operations, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cloudcephmon200[12] - https://phabricator.wikimedia.org/T267378 (RobH) [23:15:15] (CR) Ebernhardson: [C: +1] cirrus: temporarily disable saneitizer [puppet] - https://gerrit.wikimedia.org/r/637809 (https://phabricator.wikimedia.org/T266911) (owner: Ryan Kemper) [23:16:17] safe to say we spent most of our time waiting on jenkins ... i wonder how much time swat deployers end up waiting on this in a regular swat window. [23:16:50] a nontrivial amount, i think. 
[23:19:43] cscott, we should get familiar with the deployment routines ourselves so we can do these syncs / swats ourselves in a pinch, if required. [23:20:31] looks like the vendor patch merged ... now to see if beta updates quicker or scott's cherrypicked patch merges faster. :) [23:21:06] (Merged) jenkins-bot: media: Support GPSAltitudeRef exif tag [core] (wmf/1.36.0-wmf.16) - https://gerrit.wikimedia.org/r/639505 (https://phabricator.wikimedia.org/T267370) (owner: Brennen Bearnes) [23:21:15] beta lost the race. [23:21:34] comfortable with rolling to group1 at this point? [23:21:43] yup. cscott you? [23:22:21] * brennen syncs the GPSAltitudeRef patch in the meanwhile [23:22:45] brennen, lets do it. [23:29:38] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.16/languages/i18n/exif: Backport: [[gerrit:639505|media: Support GPSAltitudeRef exif tag - i18n/exif files (T267370)]] (duration: 01m 08s) [23:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:45] T267370: Use of FormatMetadata::formatNum with non-numeric value was deprecated in MediaWiki 1.36. [Called from FormatMetadata::makeFormattedData] - https://phabricator.wikimedia.org/T267370 [23:30:45] i verified commons access on scandium (which already has parsoid code for rt testing) and no crashers. [23:30:55] so, i expect it to be fine on group1 as well. [23:32:46] argh. i just realized this probably needs a full sync, since it touches i18n? [23:33:06] what does? [23:33:29] oh scott's patch. [23:33:36] yeah [23:33:44] i am not familiar with how that works .. so, not sure. [23:34:05] thcipriani: check me on that? [23:34:31] hits languages/i18n/exif/{en,qqq}.json [23:34:32] what's the patchset? 
[23:34:33] brennen: sync the non-i18n stuff first, that suppressed the error, you just see the untranslated message in the output [23:34:48] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/639505 [23:34:49] thcipriani, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/639505/ [23:34:51] brennen: the languages can be synced afterward to prevent undeployed changes [23:35:07] i've synced all files in that patch [23:35:13] (but only using sync-file) [23:35:22] yep, that one will need a full sync-world to generate the new l10n [23:36:07] i think this needs a judgment call on whether it's best to go ahead now with the time that'll take or just hold this 'til monday a.m. [23:37:05] i defer to you two. [23:37:05] https://en.wikipedia.org/wiki/MediaWiki:Exif-gpslatitude-s vs https://en.wikipedia.org/wiki/MediaWiki:Exif-gpsaltituderef-0 [23:37:16] what's the user impact? [23:37:36] cscott can answer that better. [23:38:22] if it's just an untranslated message for those files and doesn't cause overt breakage, maybe not a big deal? [23:38:27] !log brennen@deploy1001 Synchronized php-1.36.0-wmf.16/includes/media/FormatMetadata.php: Backport: [[gerrit:639505|media: Support GPSAltitudeRef exif tag - FormatMetData.php (T267370)]] (duration: 07m 22s) [23:38:30] currently the output is just a U+FFFD character [23:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:35] T267370: Use of FormatMetadata::formatNum with non-numeric value was deprecated in MediaWiki 1.36. [Called from FormatMetadata::makeFormattedData] - https://phabricator.wikimedia.org/T267370 [23:38:46] so it would be hard to argue that would be a regression [23:39:09] i mean, that's already more useful than the U+FFFD tofu character [23:39:22] and it only shows up if you click the 'additional information' drop down on the commons File page [23:39:56] so tl;dr it's not hugely user-visible, and to the extent a user sees it, even the untranslated output is probably an improvement. 
[23:40:03] ok after discussion i'm going to sync-world but not roll forward to group1 until monday morning (US time). [23:40:57] we're getting into the time of day where moving the code forward is a little iffy, especially given the likelihood that something else necessitating a rollback will crop up. [23:40:59] sounds good. works for me. [23:41:39] thanks everybody for your assistance and patience. i know it's been a pretty shambolic deploy situation all week. [23:41:42] brennen: i'll prepare a patch tomorrow that silences the formatNum deprecation warning, so that if we do roll group1 forward and hit more logspam we can just shutup the warning instead of having to hold the train [23:43:55] brennen, thanks to you too. cscott that sounds like a good idea. [23:44:00] !log brennen@deploy1001 Started scap: Synchronizing to pick up i18n for [[gerrit:639505]]. Will resume moving train to group1 on Monday morning (US) (T263182) [23:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:06] T263182: 1.36.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T263182 [23:44:13] alright, have a good weekend all. [23:44:52] (oh i guess it is not weekend yet ...) [23:45:06] deploy-weekend :) [23:45:11] haha, yeah. [23:45:16] although the deploy-weekend usually includes monday... [23:45:25] it's an odd time of year. [23:45:48] also an odd year. [23:51:22] ^ [23:51:48] technically an even year [23:53:08] i knew as i was typing that that _someone_ would have commentary. [23:53:34] even years are odd [23:53:50] :D [23:53:52] leap years, special non-leap years, olympics, etc [23:54:01] definitely odder than odd years [23:54:37] I agree: give me an odd year any day [23:55:12] 2019, 2021: hopefully both better than 2020! [23:55:53] so far, 2020 made 2019 look like 2018