[00:00:00] Amir1: the metawiki hosted source (if that's still the case for non-wp portals)? [00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T0000). Please do the needful. [00:00:28] yeah, the portals are being updated separately [00:00:48] Amir1: do you have a link to the logo handy? [00:01:09] Urbanecm: https://commons.wikimedia.org/wiki/Category:MediaWiki_logo_(2020) [00:01:34] https://commons.wikimedia.org/wiki/File:MediaWiki-2020-logo.svg [00:01:43] Amir1: The portals one gets built each week in the build process. [00:01:50] Amir1: Bringing that forward is… messy. [00:02:02] sounds like https://commons.wikimedia.org/wiki/File:MediaWiki-2020-small-icon.svg would be the new one Amir1 ? [00:02:02] oh okay, I thought it's manual [00:02:34] Urbanecm: that's for really small sizes, how big is it [00:02:41] it's an SVG? [00:03:36] right now it has https://upload.wikimedia.org/wikipedia/meta/1/16/MediaWiki-logo_sister_1x.png, which is 47x36 [00:03:48] so the small size would be better [00:03:57] the one you linked [00:03:59] okay [00:04:00] (03PS1) 10Papaul: Add MAC address for me2379 mw238[0-9] and mw239[0-6] [puppet] - 10https://gerrit.wikimedia.org/r/676170 (https://phabricator.wikimedia.org/T274171) [00:05:13] (03PS5) 10Dzahn: site/conftool-data: add 24 new codfw appservers with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) [00:07:31] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: add 24 new codfw appservers with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/676153 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [00:07:58] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [00:08:03] !log uploaded mailman3 3.2.1-1+wmf1, postorius 1.2.4-1+wmf1 to apt.wikimedia.org [00:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:11] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:16] (03PS2) 10Dzahn: Add MAC address for mw2379 mw238[0-9] and mw239[0-6] [puppet] - 10https://gerrit.wikimedia.org/r/676170 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [00:08:38] Amir1: for whatever reason, it loads https://upload.wikimedia.org/wikipedia/meta/1/16/MediaWiki-logo_sister_1x.png directly. I updated the PNGs, and it should be updated on portals too [00:09:18] (03PS3) 10Dzahn: Add MAC address for mw2379 mw238[0-9] and mw239[0-6] [puppet] - 10https://gerrit.wikimedia.org/r/676170 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [00:09:46] Urbanecm: Awesome. Thanks! [00:10:15] you need to trigger a jenkins job to build a new version of portals deploy repo [00:10:22] then merge that and bump the submodule in ops/mw-config [00:10:25] (03CR) 10Dzahn: [C: 03+2] Add MAC address for mw2379 mw238[0-9] and mw239[0-6] [puppet] - 10https://gerrit.wikimedia.org/r/676170 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [00:10:43] legoktm: not when the HTML stays same [00:11:05] It also would update the numbers out of sequence. [00:11:13] Which people may be relying upon? [00:11:33] Amir1: also updated the image size, as it's malformed a bit now [00:11:43] should be fixed when someone updates the portals [00:11:48] i think that happens on mondays, but i might be wrong [00:11:51] cccccclktlenhnhiflvgnknlktveidvnujrrggtkgivt [00:12:30] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:21] it looks hilarious [00:14:44] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (10Papaul) [00:15:14] Somewhat oblate.
[00:15:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2379.codfw.wmnet ` The log can be found in `/... [00:19:56] 10SRE, 10Wikimedia-Mailing-lists: lists-next: “confirm” and “welcome” emails lack List-Id header - https://phabricator.wikimedia.org/T278431 (10Legoktm) I backported the upstream fix and installed it on the cloud server for verification. ` Message-ID: <161723627213.7009.8546115803080405666@mailman-mailman02.m... [00:20:06] 10SRE, 10Wikimedia-Mailing-lists: lists-next: “confirm” and “welcome” emails lack List-Id header - https://phabricator.wikimedia.org/T278431 (10Legoktm) a:03Legoktm [00:22:25] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: lists-next: bad name in “welcome” email - https://phabricator.wikimedia.org/T278433 (10Legoktm) >>! In T278433#6947514, @Legoktm wrote: > Looks like it's https://gitlab.com/mailman/postorius/-/merge_requests/526 / https://gitlab.com/mailman/postorius/-/issues/429... [00:30:17] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: REIMAGE [00:30:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2380.codfw.wmnet ` The log can be found in `/... 
[00:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:56] and the designer says we should use medium version :D [00:32:23] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2379.codfw.wmnet with reason: REIMAGE [00:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:16] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2379.codfw.wmnet'] ` and were **ALL** successful. [00:39:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2381.codfw.wmnet ` The log can be found in `/... [00:43:54] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Legoktm) >>! In T278609#6959195, @Legoktm wrote: > @ladsgroup found https://lists.mailman3.org/archives/list/mailman-users@mailman3.... 
[00:44:53] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: REIMAGE [00:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:59] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2380.codfw.wmnet with reason: REIMAGE [00:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2381.codfw.wmnet with reason: REIMAGE [00:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2380.codfw.wmnet'] ` and were **ALL** successful. [00:55:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2382.codfw.wmnet ` The log can be found in `/... [00:56:36] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2381.codfw.wmnet with reason: REIMAGE [00:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2381.codfw.wmnet'] ` and were **ALL** successful. 
[01:06:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2383.codfw.wmnet ` The log can be found in `/... [01:07:25] (03PS1) 10Andrew Bogott: OpenStack: add package manifests for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676172 (https://phabricator.wikimedia.org/T261136) [01:07:27] (03PS1) 10Andrew Bogott: Add Designate files and manifests for version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676173 (https://phabricator.wikimedia.org/T261136) [01:07:29] (03PS1) 10Andrew Bogott: Codfw1dev designate -> OpenStack version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676174 (https://phabricator.wikimedia.org/T261136) [01:08:13] 10SRE, 10Wikimedia-Mailing-lists: Expose mailman3 internal REST API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (10Legoktm) [01:10:20] 10SRE, 10Wikimedia-Mailing-lists, 10cloud-services-team (Kanban): auto-subscribe cloud-vps and/or toolforge users to cloud-announce - https://phabricator.wikimedia.org/T278361 (10Legoktm) >>! In T278361#6957067, @Andrew wrote: >>>! In T278361#6957054, @Ladsgroup wrote: >> getting the mailman3's API to be ex... 
[01:10:28] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2382.codfw.wmnet with reason: REIMAGE [01:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:34] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2382.codfw.wmnet with reason: REIMAGE [01:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2382.codfw.wmnet'] ` and were **ALL** successful. [01:21:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2384.codfw.wmnet ` The log can be found in `/... [01:22:32] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: REIMAGE [01:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:36] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2383.codfw.wmnet with reason: REIMAGE [01:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:42] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10McLeod919) New user trying to setup a small website to track development. Installed Kartographer but think I need a URL for ma... 
[01:34:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2383.codfw.wmnet'] ` and were **ALL** successful. [01:35:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE [01:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2385.codfw.wmnet ` The log can be found in `/... [01:37:52] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE [01:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2384.codfw.wmnet'] ` and were **ALL** successful. 
[01:51:28] !log `echo "https://www.mediawiki.org/favicon.ico" | mwscript purgeList.php --wiki=enwiki` T268230 [01:51:29] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2385.codfw.wmnet with reason: REIMAGE [01:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:36] T268230: Rolling out the new logo of MediaWiki - https://phabricator.wikimedia.org/T268230 [01:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:37] !log `echo "https://www.mediawiki.org/static/images/footer/poweredby_mediawiki_176x62.png" | mwscript purgeList.php --wiki=enwiki` T268230 [01:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2385.codfw.wmnet with reason: REIMAGE [01:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:12] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2384.codfw.wmnet ` The log can be found in `/... [02:00:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2385.codfw.wmnet'] ` and were **ALL** successful. [02:04:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2387.codfw.wmnet ` The log can be found in `/... 
[02:13:59] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE [02:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:03] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2384.codfw.wmnet with reason: REIMAGE [02:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:57] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2387.codfw.wmnet with reason: REIMAGE [02:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:20:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:21:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2387.codfw.wmnet with reason: REIMAGE [02:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2384.codfw.wmnet'] ` and were **ALL** successful. [02:27:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2387.codfw.wmnet'] ` and were **ALL** successful. 
[02:32:46] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2386.codfw.wmnet ` The log can be found in `/... [02:48:00] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2386.codfw.wmnet with reason: REIMAGE [02:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2386.codfw.wmnet with reason: REIMAGE [02:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2386.codfw.wmnet'] ` and were **ALL** successful. 
[03:06:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Papaul) [03:19:23] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:37] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:37:25] 10SRE, 10OTRS, 10Security, 10User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Shizhao) [05:29:59] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:53:50] 10SRE, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10Aklapper) a:05eprodromou→03None [05:58:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline (and let's do an updated PCC on the latest revision?)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [06:11:03] 10SRE, 10Analytics: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10MoritzMuehlenhoff) >>! In T278371#6948029, @elukey wrote: > Yep most of the times it works fine, but when the fuse process gets into its weird state then everything trying to access /m... 
[06:31:11] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:35:19] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet [06:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:07] !log powercycle cp1087 (no ssh, no tty via serial console) - T278729 [06:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:15] T278729: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 [06:38:01] 10SRE, 10ops-eqiad, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10elukey) 05Resolved→03Open Happened again, just depooled and powercycled, going to add the ops-eqiad tag! [06:39:04] 10SRE, 10ops-eqiad, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10elukey) a:05BBlack→03None [06:41:23] RECOVERY - Host cp1087 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [06:41:59] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:59] PROBLEM - purged service on cp1087 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:43:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10elukey) Just acked in icinga a DOWN alert for cloudgw1001, the host seems not be reachable since hours ago. 
[06:44:17] RECOVERY - purged service on cp1087 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:07:43] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=frontend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [07:14:29] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [07:20:36] (03CR) 10Hashar: R:pbuilder_base: add extra packages to updates as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond) [07:22:49] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=frontend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [07:26:31] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [07:28:16] (03PS2) 10Hashar: R:pbuilder_base: add extra packages to updates as well [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond) [07:29:38] (03CR) 10Hashar: "I have removed the @("COMMAND"/L) Puppet heredoc syntax since that sends ruby to a death loop and breaks the rspec issue where as the goo" [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond) [07:49:49] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) Sounds good @papaul ! So in Icinga we're monitoring each phase to see if it hits 80%/85% of the 30A breaker, and in Prometheus we're collecting most of what we can via s... [07:51:11] (03CR) 10Effie Mouzeli: "PCC for scandium, testreduce1001 and parse2001" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) (owner: 10Effie Mouzeli) [07:52:05] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version and set higher timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/676280 [07:52:45] (03PS2) 10Kosta Harlan: linkrecommendation: Bump version and set higher timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/676280 [07:53:07] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version and set higher timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/676280 (owner: 10Kosta Harlan) [07:54:32] (03Merged) 10jenkins-bot: linkrecommendation: Bump version and set higher timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/676280 (owner: 10Kosta Harlan) [07:55:42] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[07:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:36] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [07:58:36] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [07:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:37] (03PS2) 10Kosta Harlan: [WIP] linkrecommendation: Use rest.php endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/675755 [07:59:48] (03PS3) 10Kosta Harlan: linkrecommendation: Use rest.php endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/675755 [08:00:02] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Use rest.php endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/675755 (owner: 10Kosta Harlan) [08:01:28] (03Merged) 10jenkins-bot: linkrecommendation: Use rest.php endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/675755 (owner: 10Kosta Harlan) [08:05:00] (03PS1) 10Kosta Harlan: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676282 [08:05:11] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676282 (owner: 10Kosta Harlan) [08:06:03] (03PS2) 10Effie Mouzeli: modules: remove parsoidJS puppet module [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T268524) [08:06:34] (03Merged) 10jenkins-bot: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676282 (owner: 10Kosta Harlan) [08:07:58] (03PS1) 10Filippo Giunchedi: hieradata: allow monitoring.wmflabs.org to use Cloud IDP [puppet] - 10https://gerrit.wikimedia.org/r/676284 [08:09:13] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command 
on namespace 'linkrecommendation' for release 'staging' . [08:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:28] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [08:10:28] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [08:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:42] (03CR) 10Effie Mouzeli: "PCC for testreduce1001, scandium, and parse2001 https://puppet-compiler.wmflabs.org/compiler1002/28856/parse2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T268524) (owner: 10Effie Mouzeli) [08:12:43] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [08:12:43] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[08:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:48] (03CR) 10Elukey: Let hive use the default logging config path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [08:20:58] (03PS1) 10Effie Mouzeli: hieradata: enable onhost memcached socket on mw canaries [puppet] - 10https://gerrit.wikimedia.org/r/676285 (https://phabricator.wikimedia.org/T273115) [08:22:14] (03PS2) 10Elukey: hive: use the default logging config path for parquet logs [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [08:22:41] (03CR) 10jerkins-bot: [V: 04-1] hive: use the default logging config path for parquet logs [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [08:23:54] (03PS3) 10Elukey: hive: use the default logging config path for parquet logs [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [08:25:12] !log installing ldb security updates [08:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28858/console" [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [08:29:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,pdu_sentry4} site={eqiad,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:30:17] (03CR) 10Elukey: [V: 03+1 C: 03+2] hive: use the default logging config path for parquet logs [puppet] - 
10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [08:31:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:35:22] !log failover Ganeti master in eqiad to ganeti1009 [08:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:43] PROBLEM - ganeti-wconfd running on ganeti1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:45:04] ^ expected [08:51:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [08:52:05] !log drain ganeti1011 [08:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/675722 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro) [08:57:19] (03CR) 10David Caro: [C: 03+2] ceph: Add octopus repo entry [puppet] - 10https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [08:57:40] (03CR) 10David Caro: [C: 03+2] ceph: Add octopus repo entry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675812 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [08:57:55] (03CR) 10David Caro: [C: 03+2] Revert "wmcs.ceph.codfw: Upgrade to latest 5.X kernel" [puppet] - 10https://gerrit.wikimedia.org/r/675722 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro) [09:01:56] !log contint2001: compressing all fresnel trace--trace.json files: sudo -u jenkins find /srv/jenkins/builds/mediawiki-fresnel-patch-docker -name "*trace.json" -exec gzip {} \+ # T249268 [09:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:03] T249268: Reduce size of artifacts stored on the CI Jenkins master - https://phabricator.wikimedia.org/T249268 [09:07:30] !log contint2001: compressing files with 4 parallel executions: sudo -u jenkins find /srv/jenkins/builds/mediawiki-fresnel-patch-docker -name "*trace.json" -print0|xargs -0 -P4 gzip [09:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:00] (03CR) 10Awight: Suppress user notice on mobile (031 comment) [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/675883 (https://phabricator.wikimedia.org/T274927) (owner: 10Matthias Mullie) [09:26:15] 10SRE, 10observability, 10CAS-SSO, 10Patch-For-Review, and 2 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10Volans) Current situation for me with what I think is a valid SSO session (if I go to `idp.wikimedia.org` it redirect... 
[09:37:48] 10SRE, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10hnowlan) I don't think this is a necessary task, I am closing it - we can reopen if this becomes a major issue. [09:38:14] 10SRE, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Story: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10hnowlan) 05Open→03Resolved [09:39:16] (03PS1) 10David Caro: reprepro: force usage of expired google key [puppet] - 10https://gerrit.wikimedia.org/r/676292 (https://phabricator.wikimedia.org/T279042) [09:42:17] (03Abandoned) 10Awight: Stop logging parquet to the console [puppet] - 10https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [09:44:10] (03CR) 10Awight: hive: use the default logging config path for parquet logs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [09:47:34] (03CR) 10Elukey: hive: use the default logging config path for parquet logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675907 (https://phabricator.wikimedia.org/T275757) (owner: 10Awight) [09:49:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/675127 (owner: 10Jbond) [09:57:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/675128 (owner: 10Jbond) [10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1000) [10:00:18] (03CR) 10Jbond: [C: 03+1] "LGTM and you reminded me i hit this issue once before. 
It is caused when the systemd::timer::job command parameter has a line ending in i" [puppet] - 10https://gerrit.wikimedia.org/r/676133 (owner: 10Jbond) [10:26:07] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2001-dev.codfw.wmnet [10:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:32] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2001-dev.codfw.wmnet [10:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:41] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2002-dev.codfw.wmnet [10:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:21] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2002-dev.codfw.wmnet [10:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:06] (03CR) 10Muehlenhoff: [C: 03+1] reprepro: force usage of expired google key [puppet] - 10https://gerrit.wikimedia.org/r/676292 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [10:44:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/675131 (owner: 10Jbond) [10:47:02] (03PS2) 10Matthias Mullie: Enable MediaSearch by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674820 [10:47:17] (03CR) 10Matthias Mullie: [C: 03+1] "Ready for deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674820 (owner: 10Matthias Mullie) [10:49:58] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:08] (03PS1) 10Elukey: Revert "Remove the dns_canonicalize=false option for Kerberos in Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/676197 [10:53:30] (03CR) 
10Elukey: [C: 03+2] Revert "Remove the dns_canonicalize=false option for Kerberos in Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/676197 (owner: 10Elukey) [10:53:46] (03PS2) 10Elukey: Revert "Remove the dns_canonicalize=false option for Kerberos in Hadoop test" [puppet] - 10https://gerrit.wikimedia.org/r/676197 [10:56:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [10:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, apergos, and duesen: Time to snap out of that daydream and deploy EU Backport and Config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1100). [11:00:04] matthiasmullie: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:15] o/ [11:00:22] !log drain ganeti1017 [11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:19] oh, I’ve been kicked out of the window? :D [11:01:29] o/ [11:01:44] I am here, and so are you. maybe it only lists the first three? 
[11:02:12] where by "you" I mean "Lucas_WMDE" :-P [11:02:15] Lucas_WMDE: I don't see you on the wikitech calendar event https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1100 [11:02:26] yeah, it’s not jouncebot’s fault [11:02:37] I’m actually not in the calendar for whatever reason [11:02:41] this is a "training" one and not normal [11:02:44] I note I'm the only one in the google meet for this session, which seems odd [11:03:01] Majavah: yes, and I helped out with the last training [11:03:15] if no one else shows up in that session, maybe this won't be a "training" one after all [11:03:18] ah [11:04:02] (03PS1) 10Majavah: beta: use deployment-parsoid12 [puppet] - 10https://gerrit.wikimedia.org/r/676304 [11:05:58] apergos: yeah, I think we should have the window if someone wants to learn [11:06:15] if there's no one, it can turn into a normal backport window maybe? [11:06:18] I'm in the meet but if no one shows up then I guess no one shows. in the meantime [11:06:31] might as well go ahead with the backports, right? [11:09:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/675135 (owner: 10Jbond) [11:10:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [11:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:11] (03PS8) 10DharmrajRathod98: Improved: regex-validation in cli/recover-dump [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) [11:12:45] Amir1 or Lucas_WMDE, are one of you going to kick things off? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/674820 is first in the list I believe [11:14:39] I can do it I guess [11:15:30] (03CR) 10Ladsgroup: [C: 03+2] Enable MediaSearch by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674820 (owner: 10Matthias Mullie) [11:15:33] I’m having lunch now ^^ [11:15:39] unless you really need me [11:15:50] it should be fine, enjoy your meal! [11:16:09] (still no one in the google meet but me, I'll wait another 15 mins and then close it) [11:16:33] (03Merged) 10jenkins-bot: Enable MediaSearch by default for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674820 (owner: 10Matthias Mullie) [11:17:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:17:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [11:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:06] (03CR) 10DharmrajRathod98: "https://leetcode.com/playground/PVZPmKrw" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [11:18:48] matthiasmullie: live on mwdebug1002 now [11:19:31] Amir1: seems to work! LGTM [11:19:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:20:14] okie dokie, syncing [11:20:34] !log drain ganeti1018 [11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:55] ugh, it didn't copy properly [11:23:05] the message is malformed, sorry [11:23:46] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:674820|Enable MediaSearch by default for anonymous users (duration: 01m 10s) [11:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:03] matthiasmullie: synced [11:25:08] can you take a look [11:26:27] Amir1: seems to work just fine, thanks! [11:27:26] cool [11:29:06] (03CR) 10Ladsgroup: [C: 04-1] Disable RelatedArticles on Timeless skin on German Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) (owner: 10Zabe) [11:30:02] the config is wrong, not deploying [11:30:08] The window is closed [11:30:34] Zabe: ^^ [11:30:52] (03CR) 10Ladsgroup: [C: 04-1] "Also, I can't find community consensus for it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) (owner: 10Zabe) [11:31:07] there's 30 minutes left I suppose if the patch were to be corrected [11:31:11] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet [11:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:10] Amir1: the patch is disabling it for timeless, so leaving overriding that config var with minerva only instead of the default of timeless and minerva is correct, right? 
[11:32:40] hmm, let me double check then [11:32:45] Amir1: yeah what Majavah says, it looks like the related skins setting is now whitelisted for minerva but NOT for timeless, so disabling it for timeless [11:33:12] (03CR) 10Ladsgroup: Disable RelatedArticles on Timeless skin on German Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) (owner: 10Zabe) [11:33:16] (03PS3) 10Ladsgroup: Disable RelatedArticles on Timeless skin on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) (owner: 10Zabe) [11:33:22] okay then, let's merge it [11:33:27] (03CR) 10Ladsgroup: [C: 03+2] Disable RelatedArticles on Timeless skin on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) (owner: 10Zabe) [11:34:11] (03Merged) 10jenkins-bot: Disable RelatedArticles on Timeless skin on German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675993 (https://phabricator.wikimedia.org/T278611) (owner: 10Zabe) [11:34:28] (03PS4) 10Hnowlan: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) [11:34:51] The community just needs to approve the configuration change and make it happen. <--- did this happen? [11:35:07] (03CR) 10Hnowlan: postgres: use remote script on replica to resync (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [11:36:27] (03CR) 10H.krishna123: "Thank you for your contribution. Check my comments when possible. 
I haven't looked at actual code functionality, but I think you need to r" (034 comments) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [11:38:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet [11:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:48] apergos: I don't think this is needed for this one [11:39:50] https://phabricator.wikimedia.org/T278611#6959213 [11:39:55] It seems it's a bug [11:40:42] I can't tell from the report [11:40:53] I suppose if it turns out to be controversial we'll hear about it [11:41:15] !log drain ganeti1019 [11:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:53] yeah, it seems straightforward [11:43:54] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:675993|Disable RelatedArticles on Timeless skin on German Wikipedia]] (T278611) (duration: 01m 08s) [11:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:02] T278611: Request to disable RelatedArticles on Timeless skin on German Wikipedia - https://phabricator.wikimedia.org/T278611 [11:44:04] (03PS9) 10DharmrajRathod98: Improved: regex-validation in cli/recover-dump [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) [11:45:42] (03CR) 10DharmrajRathod98: "let me know if any further changes required" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [11:45:47] Amir1: thx for your help [11:45:58] ^^ [11:46:31] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [11:46:42] (03CR) 10Volans: [C: 04-1] "Sorry for the delayed reply. 
Due to the time since last review and the original implementation without classes I've reviewed it as if it w" (0337 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [11:48:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet [11:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:16] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:52:34] ^ expected [11:55:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1019.eqiad.wmnet [11:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:31] !log drain ganeti1020 [11:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:48] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [11:55:49] (03PS1) 10Alexandros Kosiaris: conftool: Reorder services alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/676326 [11:55:51] (03PS1) 10Alexandros Kosiaris: services_proxy: Reorder on port number ascending [puppet] - 10https://gerrit.wikimedia.org/r/676327 [11:55:53] (03PS1) 10Alexandros Kosiaris: services_proxy: Switch apertium to 6019 [puppet] - 10https://gerrit.wikimedia.org/r/676328 [11:55:55] (03PS1) 10Alexandros Kosiaris: services_proxy: Add thanos-{query,swift} [puppet] - 10https://gerrit.wikimedia.org/r/676329 (https://phabricator.wikimedia.org/T278385) [11:56:24] (03CR) 10Volans: [C: 04-1] "found another small thing" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [11:59:23] (03CR) 10Volans: "reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [11:59:55] !log Start server upload of two video files (~4 GB in total) # T278856 [12:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] 
T278856: Server side upload for Lusccasdeutsch (master task) - https://phabricator.wikimedia.org/T278856 [12:00:27] 10SRE, 10DC-Ops, 10SRE-tools, 10Sustainability (Incident Followup): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (10LSobanski) https://phabricator.wikimedia.org/T277007 was a recent case where reimagi... [12:00:50] (03CR) 10Volans: [C: 04-1] "replying to my last comment" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [12:10:11] 10SRE, 10Wikimedia-Mailing-lists: Expose mailman3 internal REST API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (10Ladsgroup) There are two options in IMO: - Expose this to internal network upon request, per node (or wide range) - Have all scripts live in lists100x and a c... [12:12:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:32] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2003-dev.codfw.wmnet [12:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [12:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:14] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2003-dev.codfw.wmnet [12:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:59] !log drain ganeti1021 [12:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:34:05] 10SRE, 10Wikimedia-Mailing-lists: Import several public mailing lists archives from mailman2 to lists-next to measure database size - https://phabricator.wikimedia.org/T278609 (10Ladsgroup) >>! In T278609#6962942, @Legoktm wrote: >>>! In T278609#6959195, @Legoktm wrote: >> @ladsgroup found https://lists.mailma... [12:34:55] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcephosd2003-dev.codfw.wmnet [12:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet [12:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:45] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd2003-dev.codfw.wmnet [12:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:10] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [12:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:02] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:31] !log drain ganeti1022 [12:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:49] !log dcaro@cumin1001 END (PASS) - Cookbook 
sre.hosts.upgrade-and-reboot (exit_code=0) [12:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:11] !log dcaro@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [12:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:37] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [12:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] twentyafterfour and hashar: #bothumor I � Unicode. All rise for Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1300). [13:00:43] (03CR) 10David Caro: [C: 03+2] reprepro: force usage of expired google key [puppet] - 10https://gerrit.wikimedia.org/r/676292 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [13:07:38] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet [13:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:21] (03PS1) 10David Caro: reprepro: the bang goes at the end [puppet] - 10https://gerrit.wikimedia.org/r/676342 [13:12:30] (03PS2) 10David Caro: reprepro: use a different key for the k8s repo [puppet] - 10https://gerrit.wikimedia.org/r/676342 (https://phabricator.wikimedia.org/T279042) [13:13:08] (03PS1) 10Elukey: Remove the krb dns_canonicalize setting from some Analytics roles [puppet] - 10https://gerrit.wikimedia.org/r/676344 (https://phabricator.wikimedia.org/T278353) [13:13:36] (03CR) 10Elukey: [C: 03+2] Remove the krb dns_canonicalize setting from some Analytics roles [puppet] - 10https://gerrit.wikimedia.org/r/676344 (https://phabricator.wikimedia.org/T278353) (owner: 10Elukey) [13:14:57] (03PS3) 10David Caro: reprepro: use a different key for the k8s repo [puppet] - 10https://gerrit.wikimedia.org/r/676342 (https://phabricator.wikimedia.org/T279042) [13:15:58] RECOVERY - Work 
requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:16:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet [13:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:13] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): [ceph] Test and upgrade to kernel ~15 - https://phabricator.wikimedia.org/T274565 (10dcaro) Reverted all the changes on codfw, closing. [13:19:18] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): [ceph] Test and upgrade to kernel ~15 - https://phabricator.wikimedia.org/T274565 (10dcaro) 05Open→03Resolved [13:21:14] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28860/console" [puppet] - 10https://gerrit.wikimedia.org/r/676326 (owner: 10Alexandros Kosiaris) [13:22:19] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28861/console" [puppet] - 10https://gerrit.wikimedia.org/r/676327 (owner: 10Alexandros Kosiaris) [13:23:12] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28862/console" [puppet] - 10https://gerrit.wikimedia.org/r/676328 (owner: 10Alexandros Kosiaris) [13:24:12] !log cp3054: reboot with Linux 4.19.181+1 -- the kernel was not upgraded earlier during T273278 reboots due to broken dpkg status [13:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:58] PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:04] (03CR) 10Muehlenhoff: "> Patch Set 7:" [software/netbox] - 
10https://gerrit.wikimedia.org/r/672831 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov) [13:29:14] RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 106.96 ms [13:35:40] (03PS1) 10David Caro: toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) [13:37:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2388.codfw.wmnet ` The log can be found in `/... [13:41:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/676284 (owner: 10Filippo Giunchedi) [13:44:01] (03CR) 10David Caro: "If it helps, the key I'm adding is:" [puppet] - 10https://gerrit.wikimedia.org/r/676342 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [13:44:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2389.codfw.wmnet ` The log can be found in `/... 
[13:46:48] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - 10https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: 10David Caro) [13:47:33] (03PS1) 10David Caro: wmcs.toolforge.etcd: make sure the etcdctl node is not the new one [cookbooks] - 10https://gerrit.wikimedia.org/r/676373 (https://phabricator.wikimedia.org/T267082) [13:47:52] 10SRE, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10jijiki) [13:51:46] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2388.codfw.wmnet with reason: REIMAGE [13:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:49] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2388.codfw.wmnet with reason: REIMAGE [13:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is CRITICAL: 400.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37 [13:59:54] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2389.codfw.wmnet with reason: REIMAGE [14:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2388.codfw.wmnet'] ` and were **ALL** successful. 
[14:01:59] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2389.codfw.wmnet with reason: REIMAGE [14:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2390.codfw.wmnet ` The log can be found in `/... [14:04:35] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [14:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [14:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "While this is in the right direction I would go even further. There is no reason for the clusters to share the same data structure at all." [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: 10Elukey) [14:09:22] (03CR) 10Elukey: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: 10Elukey) [14:09:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2389.codfw.wmnet'] ` and were **ALL** successful. 
[14:11:35] (PS3) Effie Mouzeli: modules: remove parsoidJS puppet module [puppet] - https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T268524)
[14:12:04] (PS4) Effie Mouzeli: modules: remove parsoidJS puppet module [puppet] - https://gerrit.wikimedia.org/r/676071 (https://phabricator.wikimedia.org/T279059)
[14:13:02] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2391.codfw.wmnet ` The log can be found in `/...
[14:13:54] (PS1) David Caro: wmcs.toolforge: don't pass --fix-alt-names when depooling node [cookbooks] - https://gerrit.wikimedia.org/r/676376
[14:13:57] (PS1) David Caro: wmcs.toolforge: add upper level cookbook to remove etcd node [cookbooks] - https://gerrit.wikimedia.org/r/676377
[14:16:51] (PS2) David Caro: toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338)
[14:17:05] (PS2) Effie Mouzeli: hieradata: enable onhost memcached socket on mw canaries [puppet] - https://gerrit.wikimedia.org/r/676285 (https://phabricator.wikimedia.org/T273115)
[14:17:49] (CR) Alexandros Kosiaris: [C: +1] hieradata: enable onhost memcached socket on mw canaries [puppet] - https://gerrit.wikimedia.org/r/676285 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[14:17:55] (PS2) Elukey: kubernetes: move infrastructure_users to the k8s master role [puppet] - https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224)
[14:19:26] (PS3) Elukey: kubernetes: move infrastructure_users to the k8s master role [puppet] - https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224)
[14:19:33] (CR) Effie Mouzeli: [C: +2] profile::parsoid: remove parsoid class from parsoid profile [puppet] - https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) (owner: Effie Mouzeli)
[14:19:50] (CR) Effie Mouzeli: profile::parsoid: remove parsoid class from parsoid profile [puppet] - https://gerrit.wikimedia.org/r/676068 (https://phabricator.wikimedia.org/T268524) (owner: Effie Mouzeli)
[14:20:11] (CR) Effie Mouzeli: [C: +2] hieradata: enable onhost memcached socket on mw canaries [puppet] - https://gerrit.wikimedia.org/r/676285 (https://phabricator.wikimedia.org/T273115) (owner: Effie Mouzeli)
[14:20:26] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-psi-eqiad on cloudelastic1006 is CRITICAL: 200.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-psi-eqiad&var-instance=cloudelastic1006&panelId=37
[14:21:34] (PS4) Elukey: kubernetes: move infrastructure_users to the k8s master role [puppet] - https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224)
[14:22:22] (PS5) Elukey: kubernetes: move infrastructure_users to the k8s master role [puppet] - https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224)
[14:22:49] !log disable puppet on mw* canaries, rolling depool and pooling of canaries
[14:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:43] (PS1) Papaul: DHCP Fix MAC address for mw2390 mw2391 and mw2392 [puppet] - https://gerrit.wikimedia.org/r/676378 (https://phabricator.wikimedia.org/T274171)
[14:24:01] (CR) jerkins-bot: [V: -1] toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: David Caro)
[14:24:46] (CR) Elukey: "All right should be ready to review, I hope I got it right. If so I'll update the fake private repo with what it is needed for pcc, and th" [puppet] - https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: Elukey)
[14:25:09] (CR) Papaul: [C: +2] DHCP Fix MAC address for mw2390 mw2391 and mw2392 [puppet] - https://gerrit.wikimedia.org/r/676378 (https://phabricator.wikimedia.org/T274171) (owner: Papaul)
[14:28:15] (CR) Filippo Giunchedi: [C: +2] hieradata: allow monitoring.wmflabs.org to use Cloud IDP [puppet] - https://gerrit.wikimedia.org/r/676284 (owner: Filippo Giunchedi)
[14:30:13] PROBLEM - Memcached on mw2376 is CRITICAL: connect to address 10.192.48.148 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:30:29] PROBLEM - Memcached on mw2252 is CRITICAL: connect to address 10.192.0.81 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:30:31] PROBLEM - Memcached on mw2374 is CRITICAL: connect to address 10.192.48.146 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:30:43] PROBLEM - Memcached on mw2272 is CRITICAL: connect to address 10.192.48.94 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:31:17] PROBLEM - Memcached on mw2251 is CRITICAL: connect to address 10.192.0.80 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:31:17] PROBLEM - Memcached on mw2271 is CRITICAL: connect to address 10.192.48.93 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:33:05] PROBLEM - Memcached on mw1263 is CRITICAL: connect to address 10.64.0.58 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:33:29] PROBLEM - Memcached on mw1261 is CRITICAL: connect to address 10.64.0.56 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:34:57] (PS1) Ottomata: Set up refine_sanitize jobs in analytics test cluster. [puppet] - https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789)
[14:35:17] PROBLEM - Memcached on mw1264 is CRITICAL: connect to address 10.64.0.59 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:35:35] PROBLEM - Memcached on mw1265 is CRITICAL: connect to address 10.64.0.60 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:36:00] effie: this is you right? --^
[14:36:04] (CR) jerkins-bot: [V: -1] Set up refine_sanitize jobs in analytics test cluster. [puppet] - https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:36:15] PROBLEM - Memcached on mw1279 is CRITICAL: connect to address 10.64.0.74 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:36:19] (I mean all expected only noise, just double checking)
[14:36:41] PROBLEM - Memcached on mw1277 is CRITICAL: connect to address 10.64.0.72 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:37:23] PROBLEM - Memcached on mw1278 is CRITICAL: connect to address 10.64.0.73 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:37:37] PROBLEM - Memcached on mwdebug1002 is CRITICAL: connect to address 10.64.0.46 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[14:37:43] (PS2) Ottomata: Set up refine_sanitize jobs in analytics test cluster. [puppet] - https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789)
[14:38:45] SRE, Maps, Product-Infrastructure-Team-Backlog, Traffic, Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (MSantos) @McLeod919 I subscribed you to 2 tasks that has information about how to setup Kartographer configuration and its cur...
[14:38:52] (CR) jerkins-bot: [V: -1] Set up refine_sanitize jobs in analytics test cluster. [puppet] - https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:39:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2390.codfw.wmnet with reason: REIMAGE
[14:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:50] (PS3) David Caro: toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338)
[14:41:27] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2390.codfw.wmnet with reason: REIMAGE
[14:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:35] (PS3) Ottomata: Set up refine_sanitize jobs in analytics test cluster. [puppet] - https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789)
[14:44:41] (CR) jerkins-bot: [V: -1] Set up refine_sanitize jobs in analytics test cluster. [puppet] - https://gerrit.wikimedia.org/r/676380 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:47:28] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (Cmjohnson)
[14:48:19] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2390.codfw.wmnet'] ` and were **ALL** successful.
[14:48:33] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (Cmjohnson) @Jgreen The disks have been installed, feel free to do the install.
@Jclark-ctr I left the packing slip in the box, can you receive and d...
[14:48:46] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (Cmjohnson) a:Jclark-ctr
[14:49:11] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2392.codfw.wmnet ` The log can be found in `/...
[14:49:28] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (Cmjohnson)
[14:49:55] (PS1) Elukey: Add a more meaningful base dir for Hadoop in test [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422)
[14:50:28] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (Cmjohnson) a:Cmjohnson→RobH @robh this is ready for install when you have time.
[14:52:36] !log uploaded python3-wmflib_0.0.7 to bullseye-wikimedia
[14:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:51] volans: the future :D
[14:53:14] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28864/console" [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[14:53:21] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-psi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-psi-eqiad&var-instance=cloudelastic1006&panelId=37
[14:53:47] elukey: lol
[14:55:22] elukey: yes it is me, this was picked up pretty quickly
[14:55:36] thank you
[14:55:38] and sigh
[14:57:08] (PS1) Volans: remote: fix use_sudo on split() [software/spicerack] - https://gerrit.wikimedia.org/r/676394
[14:57:16] (PS1) Muehlenhoff: Add postgresql-server-dev-all to package builder packages [puppet] - https://gerrit.wikimedia.org/r/676395 (https://phabricator.wikimedia.org/T277064)
[15:03:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2392.codfw.wmnet with reason: REIMAGE
[15:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:01] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37
[15:04:12] (PS2) Elukey: Add a more meaningful base dir for Hadoop in test [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422)
[15:05:42] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2392.codfw.wmnet with reason: REIMAGE
[15:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:53] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28865/console" [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[15:10:42] no takers for EU backport training it seems :(
[15:11:42] jouncebot: now
[15:11:42] No deployments scheduled for the next 0 hour(s) and 48 minute(s)
[15:11:47] jouncebot: next
[15:11:47] In 0 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1600)
[15:13:09] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2392.codfw.wmnet'] ` and were **ALL** successful.
[15:13:40] (CR) Ottomata: "Hm, ok." [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[15:13:46] (CR) Ottomata: [C: +1] Add a more meaningful base dir for Hadoop in test [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[15:14:02] btw thcipriani the new calendar looks good
[15:14:18] the design of the new calendar*
[15:15:04] tabbycat: all credit to Krink.le for that one. He's been asking me to change to the new format for...a bit :)
[15:15:22] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2391.codfw.wmnet ` The log can be found in `/...
[15:16:37] (CR) David Caro: [C: +1] remote: fix use_sudo on split() [software/spicerack] - https://gerrit.wikimedia.org/r/676394 (owner: Volans)
[15:17:26] (CR) Elukey: [V: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[15:21:37] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2393.codfw.wmnet ` The log can be found in `/...
[15:30:50] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2391.codfw.wmnet with reason: REIMAGE
[15:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:51] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2391.codfw.wmnet with reason: REIMAGE
[15:32:52] (CR) Ottomata: [C: +1] "No it makes sense, to be more consistent with what we do elsewhere."
[puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[15:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:49] SRE, Product-Infrastructure-Team-Backlog, Services, serviceops, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (Aklapper)
[15:37:06] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2393.codfw.wmnet with reason: REIMAGE
[15:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:07] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2393.codfw.wmnet with reason: REIMAGE
[15:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:14] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2391.codfw.wmnet'] ` and were **ALL** successful.
[15:41:57] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2394.codfw.wmnet ` The log can be found in `/...
[15:45:17] (CR) Bstorm: [C: +1] toolforge.etcdctl: Allow getting the cluster health [software/spicerack] - https://gerrit.wikimedia.org/r/676367 (https://phabricator.wikimedia.org/T276338) (owner: David Caro)
[15:45:55] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2393.codfw.wmnet'] ` and were **ALL** successful.
[15:46:15] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2395.codfw.wmnet ` The log can be found in `/...
[15:46:34] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Papaul)
[15:46:37] I'll be little late to the puppet request window, sorry in advance
[15:46:55] (CR) Volans: [C: +2] remote: fix use_sudo on split() [software/spicerack] - https://gerrit.wikimedia.org/r/676394 (owner: Volans)
[15:54:10] (Merged) jenkins-bot: remote: fix use_sudo on split() [software/spicerack] - https://gerrit.wikimedia.org/r/676394 (owner: Volans)
[15:55:03] (CR) Bstorm: [C: +1] "I almost thought to suggest that there's better ways to get an existing node and be sure of it, but any notions like that are all way more" [cookbooks] - https://gerrit.wikimedia.org/r/676373 (https://phabricator.wikimedia.org/T267082) (owner: David Caro)
[15:55:14] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (Cmjohnson)
[15:55:53] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (Cmjohnson) a:Cmjohnson→RobH @RobH these are ready for installs when you have the time.
[15:56:31] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2394.codfw.wmnet with reason: REIMAGE
[15:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:17] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (Cmjohnson) @elukey that was my fault, I left it in it's BIOS settings when I left yesterday. I rebooted and it's back.
[15:58:31] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2394.codfw.wmnet with reason: REIMAGE
[15:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:54] (CR) David Caro: [C: +2] wmcs.toolforge.etcd: make sure the etcdctl node is not the new one [cookbooks] - https://gerrit.wikimedia.org/r/676373 (https://phabricator.wikimedia.org/T267082) (owner: David Caro)
[15:59:46] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (Cmjohnson)
[15:59:52] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1084.eqiad.wmnet - https://phabricator.wikimedia.org/T276302 (Cmjohnson) Open→Resolved
[15:59:55] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Cmjohnson)
[16:00:05] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1600).
[16:00:05] Majavah: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:17] SRE, ops-eqiad, decommission-hardware: decommission frqueue1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T277171 (Cmjohnson)
[16:00:22] SRE, ops-eqiad, decommission-hardware: decommission frqueue1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T277171 (Cmjohnson) Open→Resolved
[16:00:42] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2395.codfw.wmnet with reason: REIMAGE
[16:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:58] (Merged) jenkins-bot: wmcs.toolforge.etcd: make sure the etcdctl node is not the new one [cookbooks] - https://gerrit.wikimedia.org/r/676373 (https://phabricator.wikimedia.org/T267082) (owner: David Caro)
[16:02:48] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2395.codfw.wmnet with reason: REIMAGE
[16:02:52] SRE, ops-eqiad, cloud-services-team (Kanban): Eqiad: Ports with no description on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T278726 (Cmjohnson) Open→Resolved There are new servers installed in this rack, most of these are the new cloudvirts. The servers will be racked and connect...
[16:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:16] (CR) Bstorm: [C: +1] "Simple UI seems good." [cookbooks] - https://gerrit.wikimedia.org/r/676377 (owner: David Caro)
[16:05:19] (CR) Bstorm: [C: +1] wmcs.toolforge: don't pass --fix-alt-names when depooling node [cookbooks] - https://gerrit.wikimedia.org/r/676376 (owner: David Caro)
[16:05:29] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2394.codfw.wmnet'] ` and were **ALL** successful.
[16:05:31] SRE, ops-eqiad, Discovery, Discovery-Search (Current work): elastic1060 reported errors in getsel - https://phabricator.wikimedia.org/T278630 (Cmjohnson) Open→Resolved a:Cmjohnson The DIMM only reported the error that one day and has not returned. I am clearing the system log and reso...
[16:05:36] Majavah: I don't know that I have enough Beta-awareness to review your patch competently :)
[16:05:47] (CR) David Caro: [C: +2] "> Patch Set 1: Code-Review+1" [cookbooks] - https://gerrit.wikimedia.org/r/676377 (owner: David Caro)
[16:05:53] (CR) David Caro: [C: +2] wmcs.toolforge: don't pass --fix-alt-names when depooling node [cookbooks] - https://gerrit.wikimedia.org/r/676376 (owner: David Caro)
[16:06:35] SRE, ops-eqiad, Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (Cmjohnson) Looks like a possible DIMM error, since the server is already depooled I will run a couple of tests to determine if it's a DIMM, CPU or motherboard issue.
[16:06:55] (CR) Elukey: [V: +1 C: +2] Add a more meaningful base dir for Hadoop in test [puppet] - https://gerrit.wikimedia.org/r/676382 (https://phabricator.wikimedia.org/T278422) (owner: Elukey)
[16:08:05] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Cmjohnson) @elukey I have not forgotten about this, A7 is a rack for the possible move but we are already maxing out our power utilization in that rack and addi...
[16:09:12] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2396.codfw.wmnet ` The log can be found in `/...
[16:09:34] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2395.codfw.wmnet'] ` and were **ALL** successful.
[16:10:03] (Merged) jenkins-bot: wmcs.toolforge: don't pass --fix-alt-names when depooling node [cookbooks] - https://gerrit.wikimedia.org/r/676376 (owner: David Caro)
[16:10:18] cdanis: hi, sorry I'm late, the hiera patches were already cherry picked to deployment-puppetmaster04 but I guess I can hunt someone else to review and merge
[16:10:33] (Merged) jenkins-bot: wmcs.toolforge: add upper level cookbook to remove etcd node [cookbooks] - https://gerrit.wikimedia.org/r/676377 (owner: David Caro)
[16:10:57] Majavah: ah, if they've been cherrypicked there I think that's good enough for me
[16:11:02] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is CRITICAL: 166.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37
[16:12:11] cdanis: everything has been, hiera ones are only deployment-prep specific, the other patch is also used by other projects, should be a noop for existing instances but not for new ones
[16:16:25] (CR) CDanis: [C: +2] scap::sources: fix 3d2png/deploy on beta [puppet] - https://gerrit.wikimedia.org/r/675807 (owner: Majavah)
[16:16:35] (CR) CDanis: [C: +2] beta: use deployment-parsoid12 [puppet] - https://gerrit.wikimedia.org/r/676304 (owner: Majavah)
[16:16:45] (PS2) CDanis: beta: use deployment-parsoid12 [puppet] - https://gerrit.wikimedia.org/r/676304 (owner: Majavah)
[16:18:26] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (wiki_willy) Hi @Cmjohnson - there should some power freed up, after some mw servers are decom'd for the T273915 refresh. There's going to 7x servers coming out...
[16:18:58] (CR) CDanis: [C: +2] scap::sources: beta: remove unused jobrunner and recommendationapi [puppet] - https://gerrit.wikimedia.org/r/675814 (owner: Majavah)
[16:19:12] (PS2) CDanis: beta: add deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/675815 (owner: Majavah)
[16:19:59] (CR) CDanis: [C: +2] beta: add deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/675815 (owner: Majavah)
[16:20:08] SRE, ops-eqiad, Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (Cmjohnson) @wiki_willy That will work! Thanks
[16:20:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:22:29] (PS2) CDanis: role::deployment_server: do not always use lvm on cloud [puppet] - https://gerrit.wikimedia.org/r/675802 (owner: Majavah)
[16:22:34] (CR) CDanis: [C: +2] role::deployment_server: do not always use lvm on cloud (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675802 (owner: Majavah)
[16:23:53] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2396.codfw.wmnet with reason: REIMAGE
[16:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:20] Majavah: merged, one (not blocking/not urgent) comment
[16:25:25] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:25:53] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2396.codfw.wmnet with reason: REIMAGE
[16:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:58] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (Cmjohnson) Dell Ticket Created You have successfully submitted request SR1055971142.
[16:28:51] (CR) Majavah: role::deployment_server: do not always use lvm on cloud (1 comment) [puppet] - https://gerrit.wikimedia.org/r/675802 (owner: Majavah)
[16:28:54] cdanis: thanks, replied
[16:29:35] Majavah: ah okay, great, that all totally answers my misinformed concerns :)
[16:30:06] (CR) Alexandros Kosiaris: "This is specific to WMCS, production doesn't use those components, fine by me." [puppet] - https://gerrit.wikimedia.org/r/676342 (https://phabricator.wikimedia.org/T279042) (owner: David Caro)
[16:30:42] (CR) BryanDavis: toolforge.checker: Update list of etcd nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) (owner: David Caro)
[16:35:19] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2396.codfw.wmnet'] ` and were **ALL** successful.
[16:36:56] SRE, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (Papaul)
[16:36:59] (PS1) Urbanecm: hrwiki: Fix help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/676414 (https://phabricator.wikimedia.org/T275684)
[16:37:13] jouncebot: now
[16:37:13] For the next 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1600)
[16:37:25] (CR) Urbanecm: [C: +2] hrwiki: Fix help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/676414 (https://phabricator.wikimedia.org/T275684) (owner: Urbanecm)
[16:37:53] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (Cmjohnson)
[16:38:12] (Merged) jenkins-bot: hrwiki: Fix help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/676414 (https://phabricator.wikimedia.org/T275684) (owner: Urbanecm)
[16:39:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a7acf3357d5d148bad11a2d2718b4da56e1a0cb8: hrwiki: Fix help panel links (T275684) (duration: 01m 10s)
[16:39:48] * Urbanecm done
[16:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:52] T275684: Deploy Growth features on Croatian Wikipedia - https://phabricator.wikimedia.org/T275684
[16:44:32] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1007.eqiad.wmnet
[16:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:12] !log Start server-side upload of two files (T279082, T279081)
[16:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:21] T279082: Server side upload for Sturm - https://phabricator.wikimedia.org/T279082
[16:59:21] T279081: Server side upload for Sturm - https://phabricator.wikimedia.org/T279081
[17:00:04] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1700).
[17:06:19] Urbanecm: why does v2commons do one task per upload?
[17:07:00] RhinosF1: not sure. I saw multiple files in one task some time ago, so maybe users just don't use the batch feature?
[17:07:06] Maybe
[17:07:09] (CR) Alexandros Kosiaris: [C: +1] "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/675566 (https://phabricator.wikimedia.org/T278224) (owner: Elukey)
[17:07:51] (PS1) Anne Tomasevich: Use appendChild() instead of append() [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/676421 (https://phabricator.wikimedia.org/T278448)
[17:08:54] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-omega-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 27.46 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-omega-eqiad&var-instance=cloudelastic1005&panelId=37
[17:09:08] (CR) Eric Gardner: [C: +1] "This will clear up a lot of log errors now that MediaSearch is default for anon users." [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - https://gerrit.wikimedia.org/r/676421 (https://phabricator.wikimedia.org/T278448) (owner: Anne Tomasevich)
[17:18:25] (PS1) RobH: bast1003 setup info [puppet] - https://gerrit.wikimedia.org/r/676424 (https://phabricator.wikimedia.org/T276396)
[17:19:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:06] (CR) RobH: [C: +2] bast1003 setup info [puppet] - https://gerrit.wikimedia.org/r/676424 (https://phabricator.wikimedia.org/T276396) (owner: RobH)
[17:21:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:16] PROBLEM - Host an-worker1080 is DOWN: PING CRITICAL - Packet loss = 100%
[17:23:45] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` bast1003.wikimedia.org ` The log can be found in `...
[17:23:49] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast1003.wikimedia.org'] ` Of which those **FAILED**: ` ['bast1003.wikimedia.org'] `
[17:24:34] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` bast1003.wikimedia.org ` The log can be found in `...
[17:24:36] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:18] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:54] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:12] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast1003.wikimedia.org'] ` Of which those **FAILED**: ` ['bast1003.wikimedia.org'] `
[17:34:37] (CR) Andrew Bogott: [C: +2] OpenStack: add package manifests for Ussuri [puppet] - https://gerrit.wikimedia.org/r/676172 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[17:34:44] (CR) Andrew Bogott: [C: +2] Add Designate files and manifests for version Ussuri [puppet] - https://gerrit.wikimedia.org/r/676173 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[17:34:48] (CR) Andrew Bogott: [C: +2] Codfw1dev designate -> OpenStack version Ussuri [puppet] - https://gerrit.wikimedia.org/r/676174 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[17:35:15] SRE, ops-eqiad, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` bast1003.wikimedia.org
` The log can be found in `... [17:39:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10Cmjohnson) a:05Cmjohnson→03RobH @RobH assigning this to you, 1040-1045 are ready for installs. I set up both ports in netbox. S... [17:47:36] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast1003.wikimedia.org with reason: REIMAGE [17:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:43] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1003.wikimedia.org with reason: REIMAGE [17:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast1003.wikimedia.org'] ` and were **ALL** successful. [17:59:30] jouncebot: next [17:59:30] In 0 hour(s) and 0 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1800) [18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1800). [18:00:04] annet: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:10] i can deploy today [18:00:13] annet: around? [18:00:23] Urbanecm: I'm here! [18:00:26] great! 
[18:00:29] (03CR) 10Urbanecm: [C: 03+2] Use appendChild() instead of append() [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676421 (https://phabricator.wikimedia.org/T278448) (owner: 10Anne Tomasevich) [18:00:34] let's give CI a while then :) [18:00:51] 👍 [18:03:25] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10RobH) 05Open→03Resolved [18:03:34] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10RobH) [18:03:49] 10SRE: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10RobH) Please note bast1003 is now staged in netbox and is ready for new roles and migration as needed! [18:17:34] Urbanecm: Do you have time for a config deployment? [18:17:54] Zabe: sure [18:18:19] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/675319 [18:18:23] Zabe: can you update the calendar please? [18:18:27] (if not done already) [18:18:30] yes [18:19:20] (03PS2) 10Urbanecm: Enable SandboxLink extension in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675319 (https://phabricator.wikimedia.org/T278634) (owner: 10Zabe) [18:19:23] (03CR) 10Urbanecm: [C: 03+2] Enable SandboxLink extension in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675319 (https://phabricator.wikimedia.org/T278634) (owner: 10Zabe) [18:20:02] (03CR) 10Urbanecm: [C: 03+2] "CR-1 removed, because concerns got resolved."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/675319 (https://phabricator.wikimedia.org/T278634) (owner: 10Zabe) [18:20:06] (03Merged) 10jenkins-bot: Enable SandboxLink extension in ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/675319 (https://phabricator.wikimedia.org/T278634) (owner: 10Zabe) [18:21:31] Zabe: please test mwdebug1001 [18:22:04] *on mwdebug1001 [18:22:46] it works the supposed way [18:26:39] (03Merged) 10jenkins-bot: Use appendChild() instead of append() [extensions/WikibaseMediaInfo] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676421 (https://phabricator.wikimedia.org/T278448) (owner: 10Anne Tomasevich) [18:28:33] thanks Zabe [18:28:35] syncing [18:28:46] (03CR) 10Ori.livneh: "No rush here, as this will only be effective once 1.36.0-wmf.38 goes out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676435 (https://phabricator.wikimedia.org/T277817) (owner: 10Ori.livneh) [18:29:08] (03PS2) 10Ori.livneh: Wikibase: sample function call counters at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676435 (https://phabricator.wikimedia.org/T277817) [18:29:55] annet: your patch is on mwdebug1001, can you test it there please? [18:30:02] on it... 
[18:31:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b485d1ca6779a03912345a094fa1101cef5f091a: Enable SandboxLink extension in ptwikinews (T278634) (duration: 01m 12s) [18:31:34] Zabe: live [18:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:40] Urbanecm: looks fine [18:31:41] T278634: Add SandboxLink extension in ptwikinews - https://phabricator.wikimedia.org/T278634 [18:31:45] thanks annet, syncing [18:32:41] thx [18:33:10] np Zabe [18:33:24] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.37/extensions/WikibaseMediaInfo/resources/mediasearch-vue/components/base/Dialog.vue: e77f2b98a4fcb7d9cf74c45caeb7cfbc68a063d0: Use appendChild() instead of append() (T278448) (duration: 01m 09s) [18:33:31] annet: it's live now [18:33:33] anything else? [18:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:35] T278448: document.body.append is not a function / undefined is not a function when mounting in Special:MediaSearch in ES5 browsers - https://phabricator.wikimedia.org/T278448 [18:34:28] 10SRE, 10serviceops: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) [18:34:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) [18:35:13] !log Morning B&C window done [18:35:15] Urbanecm: all looks good, thanks! 
[18:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:20] np :) [18:36:00] (03PS1) 10Ebernhardson: Revert "Turn on glent m1 AB test" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676350 (https://phabricator.wikimedia.org/T262612) [18:37:18] (03PS1) 10Ebernhardson: Revert "Turn on glent m1 AB test" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/676351 (https://phabricator.wikimedia.org/T262612) [18:39:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:40:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:42:36] (03PS1) 10Dzahn: site/conftool-data: turn 4 more new servers into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/676437 (https://phabricator.wikimedia.org/T278396) [18:49:45] (03PS2) 10Dzahn: site/conftool-data: turn 4 more new servers into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/676437 (https://phabricator.wikimedia.org/T278396) [18:55:20] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: turn 4 more new servers into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/676437 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [18:57:08] when I go to https://www.mediawiki.org/wiki/ the image in the top left corner is still the old icon, but in incognito mode its the new logo - how to I purge it from my client? 
Normal action=purge on a page view does nothing [18:57:46] DannyS712: try hitting reload while holding the shift key [18:57:51] that just changed it for me [18:59:28] !log creating mcrouter certs for mw2379 through mw2382 [18:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour and hashar: (Dis)respected human, time to deploy Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T1900). Please do the needful. [19:00:18] yay, that worked - thanks [19:01:51] cool, yw [19:03:38] 10SRE, 10ops-eqiad, 10Traffic: cp1087 powercycled - https://phabricator.wikimedia.org/T278729 (10wiki_willy) a:03Cmjohnson [19:05:48] Everything clear for the train? [19:08:40] no issues from my side, I am just setting up new servers in codfw that are not in scap group just yet [19:17:59] cool, rolling the train then; everything appears to be stable [19:19:21] (03PS1) 1020after4: group2 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676441 [19:19:24] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676441 (owner: 1020after4) [19:19:44] (03PS1) 10Dzahn: add fake mcrouter certs for mw2379 through mw2402 [labs/private] - 10https://gerrit.wikimedia.org/r/676442 (https://phabricator.wikimedia.org/T278396) [19:20:06] (03Merged) 10jenkins-bot: group2 wikis to 1.36.0-wmf.37 refs T278343 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676441 (owner: 1020after4) [19:21:48] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.36.0-wmf.37 refs T278343 [19:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:57] T278343: 1.36.0-wmf.37 deployment blockers - https://phabricator.wikimedia.org/T278343 [19:24:00] ok I see a bunch of weird errors now [19:24:24] MWException: No localisation cache
found for English on parse2001 [19:26:06] twentyafterfour: I know it has been reimaged with buster as a test before reimaging all other parse, but that's about it [19:26:32] I could set it to inactive though if it causes issues fo deployers [19:26:33] PROBLEM - Apache HTTP on parse2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 924 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:27:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) When I went to go install these, it seems that cloudvirt1040 is the one iwth the Intel nics, which means it was the seed serve... [19:27:23] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet [19:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:34] depooling because of that Apache alert [19:28:02] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=parse2001.codfw.wmnet [19:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:15] setting to inactive, which will remove it from "dsh groups" [19:28:29] mutante: it's causing a lot of errors in the logs because it doesn't have current translation cache [19:28:59] mutante: don't remove from dsh groups! 
then I won't be able to fix it [19:29:06] I need to scap it [19:29:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=parse2001.codfw.wmnet [19:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:03] !log depooled parse2001 because on train deployment it caused "MWException: No localisation cache found for English" and then "HTTP CRITICAL: HTTP/1.1 500 Internal Server Error" (T268524) [19:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:11] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [19:30:11] twentyafterfour: back to pooled=no [19:30:50] no traffic, but scap [19:31:20] ok I'm going to rebuild the l10n for wmf.37 that should fix it [19:34:00] thanks [19:34:10] !log twentyafterfour@deploy1002 scap sync-l10n completed (1.36.0-wmf.37) (duration: 02m 38s) [19:34:11] !log mw2379, mw2380, mw2381, mw2382 - rebooting [19:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:17] mutante: can you repool parse2001 now to be sure that it's fixed? 
[19:35:25] RECOVERY - Apache HTTP on parse2001 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:36:28] twentyafterfour: ^ that is a good sign, pooling [19:37:51] !log pooled parse2001 again after twentyafterfour rebuilt the l10n cache for wmf.37 which fixed it and made Apache alert recover (T268524) [19:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:01] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [19:38:14] looks like it's good, thanks mutante [19:39:23] thanks as well, made sure the log lines are on the ticket so that E.ffie sees it when upgrading more parse* [19:39:46] seems like this needs to run again in the future [19:40:01] but so far it was only 2001 [19:41:02] in general, when any server gets provisioned on a wednesday it will wind up not getting the localization for the newest branch [19:41:22] this is essentially a very slow race condition between the scap operations performed on tuesday and thursday [19:42:00] if a server gets provisioned before the new branch is active, but after the new branch is initialized, it will wind up without localization for that branch [19:42:33] we should probably just do a full scap on thursdays for good measure but I always like to do the sync-wikiversions deployment because it is faster and also quicker to revert [19:44:12] when we run a local "scap pull" on the server after reimaging, it gets the current MW code but that does not fix the l10n cache issue, is that correct? [19:44:52] scap pull should do the same as scap world would, from the appserver's perspective [19:45:09] (assuming deployment server didn't change since the last time scap ran) [19:45:21] ok, so if this was just "forget to scap pull" then it's easy enough [19:45:31] for us to just do that [19:45:32] keyword being 'should'. I don't know if I'm right.
[19:45:49] but yeah, if pull didn't suffice, that's something we should fix :) [19:46:05] we should confirm on parse2002 once we get to that [19:46:41] we could e.g. have a systemd startup thing that basically cripples the server if it hasn't had a scap pull since (uptime) ago. [19:46:50] to make it unpoolable [19:47:05] I feel like we've had this missed or suspected a dozen times or so over the past 1.5 years [19:47:16] if you do scap pull before the branch is active it doesn't sync the l10n for the new branch [19:47:51] twentyafterfour: does scap world push directories that scap pull doesn't know to ask for? [19:48:06] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676450 [19:48:09] or we could let scap pull run @reboot [19:48:19] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676450 (owner: 10Kosta Harlan) [19:48:22] which happens as part of the reimage cookbook [19:48:23] Krinkle: the thing is, it only syncs for the active wiki version [19:48:29] (03PS1) 10Ladsgroup: Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) [19:48:43] so if the newest branch isn't active yet when scap pull happens [19:48:45] oh, I assume it pulled /srv/mediawiki and only excluded stuff an app server never needs [19:48:53] well then.. it sounds more like the "full scap every Thursday" [19:49:21] mutante: yeah, that would work too, I wasn't sure whether SRE wants scap pull to happen automatically or considers it better or safer to do intentionally [19:49:38] maybe there's use in rebooting and doing something else before pooling.
[19:49:42] (03CR) 10jerkins-bot: [V: 04-1] Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [19:49:48] we full scap on tuesday but if a server comes online wednesday then it doesn't get the l10n for the not-yet-active branch afaik [19:49:55] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/676450 (owner: 10Kosta Harlan) [19:50:24] twentyafterfour: so if a server is online the whole time, at what point would it have received the l10n? [19:50:26] Krinkle: yea, there is. it should not pool automatically, but it never hurts to scap pull. afaik [19:50:37] Krinkle: tuesday [19:51:18] mutante: right, but we know we've forgotten pull sometimes before pooling, right? as long as we can make reasonably sure this happens unless intentionally skipped (e.g. easy by default, hard but possible to avoid; instead of forget by default, and hard to remember) [19:51:21] only a full scap will sync the l10n files afaik, no other process does [19:51:32] (03PS1) 10Andrew Bogott: OpenStack: add manifests, files and templates for version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676453 (https://phabricator.wikimedia.org/T261136) [19:51:41] !log kharlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [19:51:45] twentyafterfour: hm. didn't scap pull use to check l10n and rebuild if needed? [19:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:49] I mean, it needs to not just sync them but also run rebuildLocalizationCache afaik [19:51:54] I recall waiting for l10n rebuilds on mwdebug in years past [19:51:57] but not for a while indeed. [19:52:06] Krinkle: maybe? I may be totally off base here [19:52:50] twentyafterfour: if it wasn't built yet on the deploy server, then there's no problem indeed.
but once the l10n directory exists, I see no reason not to pull it by default, assuming it's quick for the case where it hasn't changed (based on the json/md5 I assume) [19:52:51] I *think* that `scap pull` would get the most recent l10n cache build [19:52:59] I don't think that l10n rebuilds are automatic anymore, it's either `scap sync-l10n` or `scap sync-world` otherwise no l10n [19:53:19] we will have plenty of servers to confirm whether scap pull does it or not soon [19:53:21] right so we do already have a way to pull l10n and quick-skip if it's redundant, we just aren't considering all l10ns yet [19:53:38] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: add manifests, files and templates for version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676453 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [19:54:09] twentyafterfour: it's okay not to rebuild, afaik we don't rebuild on ap servers anyway, only on deploy server (by rebuild I mean calling MW to build it, I don't mean the cdb->json->cdb, which is just a funny way of transporting it) [19:54:10] my understanding is that it skips l10n for versions that aren't active in wikiversions.json [19:54:28] ok, I guess we can remove that filter? [19:54:30] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [19:54:30] !log kharlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[19:54:32] so it might only be if the server is provisioned on tuesday between the branch cut and the deploy to testwikis [19:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:47] and have pull consider all versions, as world does when it pushes [19:55:51] Krinkle: yeah removing that filter would make sense, though it may not be a filter so much as it only does l10n for versions listed in wikiversions.json and if they aren't listed it doesn't even know they exist [19:56:01] yeah, I just realised that [19:56:28] and the reason it can't just ask for all of /srv/mw is because of the json>cdb syncing seems to need to ask for specific stuff [19:56:38] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for mw2379 through mw2402 [labs/private] - 10https://gerrit.wikimedia.org/r/676442 (https://phabricator.wikimedia.org/T278396) (owner: 10Dzahn) [19:56:46] !log razzi@deploy1002 Started deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 [19:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:58] !log razzi@deploy1002 Finished deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 (duration: 00m 12s) [19:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:11] but that seems fixable. I don't know where we do the md5 checks currently we could probably do that on the client side instead if we aren't already, so it just asks for all php* --exclude cdb and then look at the l10ns it has and re-convert the ones that changed. [19:57:26] (also: insert periodic reminder about opcache and moving away from cdb for l10n) [19:57:34] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[19:57:34] !log kharlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [19:57:38] PROBLEM - mediawiki-installation DSH group on mw2380 is CRITICAL: Host mw2380 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:40] @seen razzy [19:57:41] mutante: I have never seen razzy [19:57:45] @seen razzi [19:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:45] mutante: Last time I saw razzi they were talking in the channel, they are still in the channel #wikimedia-analytics at 4/1/2021 7:54:04 PM (3m41s ago) [19:58:14] Krinkle: I thought we experimented with ditching cdb and it didn't turn out so well? [19:58:25] (03PS2) 10Ladsgroup: Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) [19:58:50] twentyafterfour: it's blocked on disablng opcache revalidations after which we'll increase opcache memory to accomodate for it (and possibly decrease apcu a little) [19:58:55] !log razzi@deploy1002 Started deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 [19:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2379.codfw.wmnet [19:59:15] !log razzi@deploy1002 Finished deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 (duration: 00m 21s) [19:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:21] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2380.codfw.wmnet [19:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:27] !log dzahn@cumin1001 
conftool action : set/pooled=no; selector: name=mw2381.codfw.wmnet [19:59:29] and opcache revalidation is blocked on what? [19:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:35] nothing really ? [19:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2382.codfw.wmnet [20:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:49] blocked on people's time [20:01:16] !log mw2379, mw2380, mw2381, mw2382 - scap pull [20:01:20] right [20:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:34] !log razzi@deploy1002 deploy aborted: Deployment of superset fd7c9eb71e193, released after 1.0.1hv (duration: 00m 00s) [20:01:38] !log razzi@deploy1002 Started deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 [20:01:41] !log razzi@deploy1002 Finished deploy [analytics/superset/deploy@5b8de4c]: Deployment of superset fd7c9eb71e193, released after 1.0.1 (duration: 00m 04s) [20:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:01] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10dr0ptp4kt) Hi all - I think I can help. What's the ordered set of nameservers that should be used in MarkMonitor? [20:02:36] PROBLEM - mediawiki-installation DSH group on mw2381 is CRITICAL: Host mw2381 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:05:19] ^ well yea, it's not supposed to be yet. 
it should be in downtime though [20:05:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2381.codfw.wmnet with reason: new_install [20:05:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2381.codfw.wmnet with reason: new_install [20:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:40] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2382.codfw.wmnet with reason: new_install [20:05:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2382.codfw.wmnet with reason: new_install [20:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: new_install [20:05:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2380.codfw.wmnet with reason: new_install [20:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: new_install [20:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2379.codfw.wmnet with reason: new_install [20:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:09] (03PS3) 10Krinkle: Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 
10Ladsgroup) [20:10:36] (03CR) 10Krinkle: "Shaved off about 5% with Zopfli, PNGOut and AdvPNG (using ImageOptim.app for macOS)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [20:11:03] (03CR) 10Krinkle: Add hi-res version of mediawiki.org logos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [20:11:49] (03CR) 10Krinkle: [C: 04-1] Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [20:14:01] (03PS1) 10RobH: cloudvirt104[0-6] setup info [puppet] - 10https://gerrit.wikimedia.org/r/676455 (https://phabricator.wikimedia.org/T275081) [20:15:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) [20:16:36] (03CR) 10RobH: [C: 03+2] cloudvirt104[0-6] setup info [puppet] - 10https://gerrit.wikimedia.org/r/676455 (https://phabricator.wikimedia.org/T275081) (owner: 10RobH) [20:18:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2379.codfw.wmnet [20:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['cloud... [20:21:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2380.codfw.wmnet [20:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:38] RECOVERY - mediawiki-installation DSH group on mw2380 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:23:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2381.codfw.wmnet [20:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2382.codfw.wmnet [20:23:46] PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1060, Errmsg: Error Duplicate column name changed_by_fk on query. Default database: superset_production. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:13] I am assuming the MariaDB alert is related to razzi's work [20:27:54] because on the -analytics channel there was a comment that superset deploy had an issue and he is restoring from a dump [20:29:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) So the entire lot failed to hit the dhcp server, the network has NOT been set up on cloudsw1-d5-eqiad for... [20:36:46] ACKNOWLEDGEMENT - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1060, Errmsg: Error Duplicate column name changed_by_fk on query. Default database: superset_production. [Query snipped] daniel_zahn ongoing work by Razzi, restoring from dump https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:41:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2243.codfw.wmnet [20:41:33] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 05Stalled→03Open 4 new jobrunners have been created. This can now continue. 
[20:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2246.codfw.wmnet [20:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2247.codfw.wmnet [20:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:59] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2248.codfw.wmnet [20:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:16] (03PS1) 10Ppchelko: Make Title::isWatchable more strict [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676355 (https://phabricator.wikimedia.org/T278735) [20:42:39] !log mw2243, mw2246, mw2247, mw2248 - depooled - replaced by mw2379, mw2380, mw2381, mw2382 ( T277780) [20:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:46] T277780: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 [20:42:49] (03Abandoned) 10Ppchelko: Make Title::isWatchable more strict [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676355 (https://phabricator.wikimedia.org/T278735) (owner: 10Ppchelko) [20:43:29] (03PS1) 10Andrew Bogott: OpenStack Designate: forward the domain deletion fix to Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676461 (https://phabricator.wikimedia.org/T261136) [20:43:41] (03PS2) 10Dzahn: site/conftool-data: decom jobrunners mw2243,mw2246,mw2247,mw2248 [puppet] - 10https://gerrit.wikimedia.org/r/674736 (https://phabricator.wikimedia.org/T277780) [20:46:16] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Designate: forward the domain deletion fix to Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676461 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [20:50:00] 10SRE, 10WMF-JobQueue, 10serviceops, 
10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Legoktm) [20:50:58] 10SRE, 10observability, 10serviceops, 10Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (10Legoktm) [20:53:16] (03PS4) 10Ladsgroup: Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) [20:53:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) [20:53:54] (03PS2) 10Andrew Bogott: OpenStack: add manifests, files and templates for version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676453 (https://phabricator.wikimedia.org/T261136) [20:53:57] (03PS1) 10Andrew Bogott: Openstack Nova: update config for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676464 (https://phabricator.wikimedia.org/T261136) [20:54:01] 10SRE, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10Pchelolo) On the change-prop side we already route all video scaling jobs to videoscaler.discovery.wmnet and all other job... 
[20:54:02] (03PS1) 10Andrew Bogott: OpenStack nova: refresh our servers.py hack from Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676465 [20:54:03] (03PS1) 10Andrew Bogott: Openstack Neutron: update our l3 agent hacks for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676466 (https://phabricator.wikimedia.org/T261136) [20:54:05] (03PS1) 10Andrew Bogott: Neutron: Remove ip_lib.py hack for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676467 (https://phabricator.wikimedia.org/T261136) [20:54:08] (03PS1) 10Andrew Bogott: OpenStack Glance: update config for Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676468 (https://phabricator.wikimedia.org/T261136) [20:54:10] (03CR) 10Ladsgroup: Add hi-res version of mediawiki.org logos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [20:55:32] (03CR) 10Krinkle: [C: 03+1] "We should probably have CI run some version of that script so that it fails builds, the same way we do for db lists check etc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [20:56:11] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: add manifests, files and templates for version Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676453 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [21:03:58] RECOVERY - mediawiki-installation DSH group on mw2381 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:10:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) I've manually added port descriptions and the primary interface to vlan-cloud-hosts1-eqiad. The secondary interface is not ye... 
[21:17:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:08] (03CR) 10Ladsgroup: [C: 03+1] "It has my (virtual) blessing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676435 (https://phabricator.wikimedia.org/T277817) (owner: 10Ori.livneh) [21:20:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:35:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (10RobH) [21:36:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2243.codfw.wmnet [21:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2243.codfw.wmnet [21:58:14] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2243.codfw.wmnet` - mw2243.codfw.wmnet (**PASS**) - Downtime... 
[21:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:54] (03PS1) 10Cwhite: logstash: remove logstash output on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/676477 (https://phabricator.wikimedia.org/T234854) [22:00:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2246.codfw.wmnet [22:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:30] (03CR) 10Bstorm: [C: 03+1] "If that works, it's better than sitting around waiting, I guess." [puppet] - 10https://gerrit.wikimedia.org/r/676342 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [22:10:14] (03PS1) 10Legoktm: mailman3: Avoid duplication in rsync definitions [puppet] - 10https://gerrit.wikimedia.org/r/676479 (https://phabricator.wikimedia.org/T278609) [22:12:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2246.codfw.wmnet [22:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:29] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2246.codfw.wmnet` - mw2246.codfw.wmnet (**PASS**) - Downtime... 
[22:17:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2247.codfw.wmnet [22:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:57] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28867/console" [puppet] - 10https://gerrit.wikimedia.org/r/676479 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [22:22:27] (03CR) 10Ladsgroup: [C: 03+1] mailman3: Avoid duplication in rsync definitions [puppet] - 10https://gerrit.wikimedia.org/r/676479 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [22:29:05] (03CR) 10Jforrester: "> Patch Set 4: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [22:29:17] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/28868/" [puppet] - 10https://gerrit.wikimedia.org/r/676479 (https://phabricator.wikimedia.org/T278609) (owner: 10Legoktm) [22:33:07] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2247.codfw.wmnet [22:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:14] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2247.codfw.wmnet` - mw2247.codfw.wmnet (**FAIL**) - Downtime... [22:34:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2248.codfw.wmnet [22:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:36] (03CR) 10Bstorm: "Ok, so if I merge this one, after building the image, it will only be useful from homemade pod definitions and jobs at first. 
The `webserv" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/673378 (https://phabricator.wikimedia.org/T277749) (owner: 10Bstorm) [22:47:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:12] !log deploying phatality [22:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:50:03] !log twentyafterfour@deploy1002 Started deploy [releng/phatality@27ddd0b]: deploy phatality [22:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:16] !log twentyafterfour@deploy1002 Finished deploy [releng/phatality@27ddd0b]: deploy phatality (duration: 00m 13s) [22:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2248.codfw.wmnet [22:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:24] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2248.codfw.wmnet` - mw2248.codfw.wmnet (**PASS**) - Downtime... [23:00:04] brennen: That opportune time is upon us again. Time for a US Backport and Config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210401T2300). 
[23:00:04] ebernhardson and Amir1: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:33] o/ [23:00:36] howdy! [23:00:37] here [23:00:53] I can deploy :) [23:01:03] ebernhardson: let me know when you're around [23:01:31] (03CR) 10Thcipriani: [C: 03+2] "Backport!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [23:02:20] (03Merged) 10jenkins-bot: Add hi-res version of mediawiki.org logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/676451 (https://phabricator.wikimedia.org/T268230) (owner: 10Ladsgroup) [23:02:42] Amir1: I'm going to do static, logos, then wmf-config, sound reasonable? [23:02:53] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: decom jobrunners mw2243,mw2246,mw2247,mw2248 [puppet] - 10https://gerrit.wikimedia.org/r/674736 (https://phabricator.wikimedia.org/T277780) (owner: 10Dzahn) [23:03:03] thcipriani: yup [23:04:28] Amir1: on mwdebug1001, check please [23:05:14] thcipriani: done, looks majestic [23:05:36] cool :) [23:05:37] syncing [23:08:44] !log thcipriani@deploy1002 Synchronized static: Backport: Part I [[gerrit:676451|Add hi-res version of mediawiki.org logos]] T268230 (duration: 00m 59s) [23:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:53] T268230: Roll out the new logo of MediaWiki - https://phabricator.wikimedia.org/T268230 [23:10:13] !log thcipriani@deploy1002 Synchronized logos: Backport: Part II [[gerrit:676451|Add hi-res version of mediawiki.org logos]] T268230 (duration: 00m 57s) [23:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:41] thcipriani: here, sorry a few mins late :) [23:11:18] ebernhardson: no worries [23:11:35] (03CR) 10Thcipriani: [C: 03+2] Revert "Turn on glent m1 AB test" [extensions/WikimediaEvents] 
(wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/676351 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:11:39] (03CR) 10Thcipriani: [C: 03+2] Revert "Turn on glent m1 AB test" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676350 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:12:05] !log thcipriani@deploy1002 Synchronized wmf-config/logos.php: Backport: Part III [[gerrit:676451|Add hi-res version of mediawiki.org logos]] T268230 (duration: 00m 57s) [23:12:12] ^ Amir1 done! [23:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:22] \o/ Thanks [23:14:35] ebernhardson: this is the most commonly backported file, I realized the other day while pulling some stats about backports [23:14:49] I made an exception for the commit message: https://github.com/thcipriani/releng-junk/blob/master/train-backports/parse-backports.py#L29 [23:15:42] thcipriani: it's probably all me turning tests on and off, sampling is (probably) moving to backend though because we somehow need to assign buckets to requests coming from outside (browser bars, wiki portal, web links, etc.) [23:16:25] that is what I determined :) [23:17:25] I thought I determined a smart way to find out which files cause the most breakage during train, but it turned out to have some edge cases :P [23:17:26] (03Merged) 10jenkins-bot: Revert "Turn on glent m1 AB test" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/676351 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:17:28] (03Merged) 10jenkins-bot: Revert "Turn on glent m1 AB test" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676350 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [23:18:05] thcipriani: always happens. 
People want to run pagerank against wiki, but it turns out the most important page per that is something linked from a geo location template. Always edge cases :) [23:19:13] of course I *just* realized the wmf.36 one isn't necessary since it's Thursday and we rolled forward to all wikis shortly before this window: https://versions.toolforge.org/ [23:19:24] so I'll sync wmf.37 and fetch down wmf.36 in case of rollback [23:19:40] alrighty [23:26:12] ebernhardson: sorry, some weirdness with timedmediahandler: looked like master was checked out rather than wmf.37... [23:27:06] ooh, fun :) [23:28:58] !log reset /srv/mediawiki-staging/php-1.36.0-wmf.37/extensions/TimedMediaHandler to 1be781d (HEAD of wmf/1.36.0-wmf.37 -- from HEAD of master 49f417) [23:29:01] ooook [23:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:19] ebernhardson: live on mwdebug1001 if there's anything to check [23:29:48] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 05Open→03Resolved [23:30:38] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 05Resolved→03Open @Papaul These were old servers in rack A4. They are also ready to go now. [23:31:12] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) p:05High→03Medium a:05Dzahn→03Papaul [23:31:16] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) @Papaul This is about these: https://netbox.wikimedia.org/dcim/devices/?q=mw2&mac_address=&has_primary_ip=&local_context_data=&virtual_chassis_mem... 
[23:31:16] thcipriani: should all be good [23:31:21] cool going live [23:32:29] !log thcipriani@deploy1002 Synchronized php-1.36.0-wmf.37/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: Backport: [[gerrit:676350|Revert "Turn on glent m1 AB test"]] T262612 (duration: 00m 58s) [23:32:35] ^ ebernhardson live now! [23:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:37] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [23:34:06] 10SRE, 10serviceops, 10Patch-For-Review: bring 26 new mediawiki appserver in codfw into production, rack A3 (mw2377 - mw2402) - https://phabricator.wikimedia.org/T278396 (10Dzahn) [23:35:43] thcipriani: thanks! all looks reasonable [23:41:14] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Dzahn) @dr0ptp4kt If the domain should be treated just like any other project domain, that would be: Name Server: NS0.WIKIMEDIA.ORG Name Server: NS1.WIKIMEDIA.ORG Name S... [23:44:06] 10SRE, 10DNS, 10Traffic, 10Abstract Wikipedia team (Phase δ): Establish wikifunctions.org - https://phabricator.wikimedia.org/T275904 (10Dzahn) @dr0ptp4kt That being said, I noticed there is an existing Google IP behind it, 216.239.38.21, and a redirect to https://meta.wikimedia.org/wiki/Abstract_Wikipedia... [23:56:45] (03PS3) 10Ladsgroup: varnish: Replace "Expires" in Set-Cookie with "Max-Age" [puppet] - 10https://gerrit.wikimedia.org/r/637851 (https://phabricator.wikimedia.org/T147967)