[00:00:05] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:00:32] !log T266492 T268779 T265699 Rolling restart of `cloudelastic` was successful
[00:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:38] T268779: Support cloudelastic in spicerack elasticsearch - https://phabricator.wikimedia.org/T268779
[00:00:38] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492
[00:00:38] T265699: 40-elasticsearch-readahead udev rule failing for cloudelastic100[5,6] - https://phabricator.wikimedia.org/T265699
[00:01:22] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[00:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:36] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single
[00:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2227 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2228 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2229 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2230 is
CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2232 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2233 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:28] ACKNOWLEDGEMENT - PHP opcache health on mw2234 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:29] ACKNOWLEDGEMENT - PHP opcache health on mw2235 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:29] ACKNOWLEDGEMENT - PHP opcache health on mw2237 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:30] ACKNOWLEDGEMENT - PHP opcache health on mw2238 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:03:30] ACKNOWLEDGEMENT - PHP opcache health on mw2240 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn normal after reimaging for some time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:04:18]
!log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2237.codfw.wmnet
[00:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:25] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2238.codfw.wmnet
[00:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2239.codfw.wmnet
[00:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2240.codfw.wmnet
[00:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:55] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27460/puppetmaster2001.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/655515 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:08:46] (CR) Dzahn: [C: +1] scap: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655795 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[00:09:29] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart
[00:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:03] !log T266492 Beginning rolling restart of `relforge`
[00:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:06] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492
[00:10:56] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99)
[00:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:08] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart
[00:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:10] !log (Forgot to tell it `relforge` isn't lvs-managed)
[00:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:38] !log `sudo -i cookbook sre.elasticsearch.rolling-restart relforge "relforge cluster restart" --task-id T266492 --nodes-per-run 1 --without-lvs`
[00:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:29] !log crusnov@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[00:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:16] !log completed rebooting Netbox hosts, failure was due to report errors that would not have recovered.
[00:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:12] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0)
[00:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:40] (PS4) Ladsgroup: query_service: Migrate hiera() to lookup() [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953)
[00:22:23] !log T266492 Restart of `relforge` successful
[00:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:28] T266492: Restart elasticsearch clusters to apply readahead changes - https://phabricator.wikimedia.org/T266492
[00:22:54] (CR) Dzahn: [C: +2] docker: require_package->ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655520 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:26:41] (CR) Ladsgroup: "Do you know why PCC fails for wdqs nodes completely?
https://puppet-compiler.wmflabs.org/compiler1002/27462/ https://puppet-compiler.wmfla" [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[00:31:48] SRE, Traffic, Beta-Cluster-reproducible, HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (Krenair) >>! In T271808#6739578, @Vgutierrez wrote: > `root@deployment-cache-upload06:/etc/acmecerts/unified/live# openssl x509 -da...
[00:31:56] (CR) Dzahn: query_service: Migrate hiera() to lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[00:32:57] (CR) Dzahn: "I think the inline comment explains https://puppet-compiler.wmflabs.org/compiler1002/27462/wdqs1013.eqiad.wmnet/change.wdqs1013.eqiad.wmne" [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[00:36:25] (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype in eventlogging [puppet] - https://gerrit.wikimedia.org/r/656015 (https://phabricator.wikimedia.org/T209953)
[00:36:48] (PS2) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype in eventlogging [puppet] - https://gerrit.wikimedia.org/r/656015 (https://phabricator.wikimedia.org/T209953)
[00:38:07] (CR) Ladsgroup: query_service: Migrate hiera() to lookup() (1 comment) [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[00:39:11] (PS5) Ladsgroup: query_service: Migrate hiera() to lookup() [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953)
[00:42:02] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27463/" [puppet] - https://gerrit.wikimedia.org/r/656015 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[00:44:42] (CR) Ladsgroup: "Now it works: https://puppet-compiler.wmflabs.org/compiler1003/27465/" [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[00:57:56] (PS3) Ladsgroup: query_service: Remove gui files from wdqs [puppet] - https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851)
[01:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T0100).
[01:13:09] (PS2) Mstyles: update flink config with swift and other values [deployment-charts] - https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876)
[01:16:01] (CR) Mstyles: update flink config with swift and other values (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: Mstyles)
[01:24:02] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:10] (CR) Alex Monk: [C: +1] tlsproxy::localssl: Remove support for the acme_subjects param [puppet] - https://gerrit.wikimedia.org/r/655761 (https://phabricator.wikimedia.org/T252199) (owner: Andrew Bogott)
[01:31:43] (CR) Alex Monk: [C: +1] "Unused, unmaintained, and not particularly good use cases remaining."
[puppet] - https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: Andrew Bogott)
[01:53:30] PROBLEM - PHP opcache health on mw2239 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[01:58:47] SRE, Wikimedia-Logstash, Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (herron)
[02:04:24] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:54] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:58] (CR) Dzahn: [C: +1] "16:40:37 wmf-style: total violations delta -10" [puppet] - https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[03:53:57] (CR) Dzahn: deployment::rsync: replace cron with systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655172 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[04:42:04] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:37] (CR) Legoktm: [C: +2] Add ability to separate the apt and the general http proxy [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655412 (https://phabricator.wikimedia.org/T183545) (owner: Giuseppe Lavagetto)
[05:05:23] (CR) Legoktm: [C: +1] Always refresh the base images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: Giuseppe Lavagetto)
[07:06:47] (PS3) Giuseppe Lavagetto: Fix duplicate detection when running in parallel [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655904 (https://phabricator.wikimedia.org/T271901)
[07:11:06] (CR) Giuseppe Lavagetto: [C: +2] Always refresh the base images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: Giuseppe Lavagetto)
[07:11:28] (CR) Giuseppe Lavagetto: Add ability to separate the apt and the general http proxy (4 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655412 (https://phabricator.wikimedia.org/T183545) (owner: Giuseppe Lavagetto)
[07:12:46] (Merged) jenkins-bot: Always refresh the base images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: Giuseppe Lavagetto)
[07:12:50] (Merged) jenkins-bot: Add ability to separate the apt and the general http proxy [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655412 (https://phabricator.wikimedia.org/T183545) (owner: Giuseppe Lavagetto)
[07:18:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:20:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:04] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:27] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[08:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:40] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[08:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:04] (CR) Elukey: [C: +1] "From my point of view the change seems good, the Location block shouldn't be necessary anymore in the httpd config!" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[08:25:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:28:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:26] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: Jbond)
[08:39:27] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:02] (PS1) Marostegui: db2140: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/656104 (https://phabricator.wikimedia.org/T271084)
[08:42:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2140 T271084', diff saved to https://phabricator.wikimedia.org/P13764 and previous config saved to /var/cache/conftool/dbconfig/20210114-084243-marostegui.json
[08:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:47] T271084: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084
[08:42:49] (CR) Marostegui: [C: +2] db2140: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/656104 (https://phabricator.wikimedia.org/T271084) (owner: Marostegui)
[08:43:11] SRE, Phabricator, Traffic: Excessive queries from vscode-phabricator - https://phabricator.wikimedia.org/T271528 (Aklapper) Patch has been merged. Is there more to do or can this task be resolved?
[08:43:47] !log standardize cloudsw interfaces to prepare for switches homerisation
[08:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:59] SRE, ops-codfw, DBA, Patch-For-Review: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (Marostegui) Open→Resolved a:Marostegui→Papaul Data check was ok. Notifications enabled and host repooled. Thanks Papaul for replacing its memory.
[08:44:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:47] SRE, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (akosiaris) >>! In T261369#6746098, @sbassett wrote: >>>! In T261369#6695002, @akosiaris wrote: >>> As I und...
[08:50:30] arturo: fyi, I'm shuffling some interface arounds on the cloudsw to make them ready for homer, changes are NOOP, only aesthetic
[08:51:27] !log rolling restart of ncredir servers to catch up on kernel upgrades
[08:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:53] XioNoX: ack
[08:52:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[08:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13765 and previous config saved to /var/cache/conftool/dbconfig/20210114-085252-root.json
[08:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:05] (PS1) Volans: netbox: create the state file for the Icinga check [puppet] - https://gerrit.wikimedia.org/r/656105
[09:01:32] (CR)
jerkins-bot: [V: -1] netbox: create the state file for the Icinga check [puppet] - https://gerrit.wikimedia.org/r/656105 (owner: Volans)
[09:01:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[09:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/656105 (owner: Volans)
[09:02:48] (PS2) Volans: netbox: create the state file for the Icinga check [puppet] - https://gerrit.wikimedia.org/r/656105
[09:05:27] (CR) Volans: [C: +2] netbox: create the state file for the Icinga check [puppet] - https://gerrit.wikimedia.org/r/656105 (owner: Volans)
[09:07:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13766 and previous config saved to /var/cache/conftool/dbconfig/20210114-090756-root.json
[09:07:58] (PS4) Giuseppe Lavagetto: Switch the base image to buster from stretch.
[docker-images/production-images] - https://gerrit.wikimedia.org/r/615683
[09:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:05] (CR) David Caro: wmcs.ceph.osd: disable write caches when possible (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: David Caro)
[09:10:28] SRE, ops-eqiad, Analytics: an-test-worker1002 may need a DAC replace - https://phabricator.wikimedia.org/T272009 (elukey)
[09:11:02] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435
[09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:07] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[09:13:34] (CR) Filippo Giunchedi: [C: +1] dns: remove logstash-next.wikimedia.org record [dns] - https://gerrit.wikimedia.org/r/655959 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[09:13:45] (CR) Filippo Giunchedi: [C: +1] elk7: change kibana7 monitoring to critical [puppet] - https://gerrit.wikimedia.org/r/655957 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[09:14:01] (CR) Filippo Giunchedi: [C: +1] elk7: remove logstash-next cache setting [puppet] - https://gerrit.wikimedia.org/r/655958 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[09:14:18] (CR) Filippo Giunchedi: [C: +1] Remove bast3004/bast4002/bast5001 from Prometheus Hiera settings [puppet] - https://gerrit.wikimedia.org/r/655916 (owner: Muehlenhoff)
[09:14:56] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[09:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:29] (CR) Filippo Giunchedi: [C: +2] role: add interface::rps to swift::storage [puppet] - https://gerrit.wikimedia.org/r/655902 (https://phabricator.wikimedia.org/T271415) (owner: Filippo Giunchedi)
[09:17:34] (PS2) Filippo Giunchedi: role: add
interface::rps to swift::storage [puppet] - https://gerrit.wikimedia.org/r/655902 (https://phabricator.wikimedia.org/T271415)
[09:18:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:29] arturo: (done)
[09:19:04] ack
[09:19:11] (CR) Ryan Kemper: [C: +2] Fix /sparql rewrite and alias rules [puppet] - https://gerrit.wikimedia.org/r/655639 (https://phabricator.wikimedia.org/T267825) (owner: ZPapierski)
[09:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13767 and previous config saved to /var/cache/conftool/dbconfig/20210114-092300-root.json
[09:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:37] (PS1) Volans: netbox: actually create the empty file [puppet] - https://gerrit.wikimedia.org/r/656109
[09:29:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[09:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:24] (CR) Volans: [C: +1] "Looks sane, but I didn't check all the nitty gritty details. Arzhel tested it and is a noop."
[homer/public] - https://gerrit.wikimedia.org/r/655446 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[09:34:06] (CR) Ayounsi: [C: +2] Add new cloudsw switches [homer/public] - https://gerrit.wikimedia.org/r/655446 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[09:34:39] (Merged) jenkins-bot: Add new cloudsw switches [homer/public] - https://gerrit.wikimedia.org/r/655446 (https://phabricator.wikimedia.org/T251632) (owner: Ayounsi)
[09:35:32] (CR) Arturo Borrero Gonzalez: "LGTM, but some comments inline before my +1" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: David Caro)
[09:36:10] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 10.78 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[09:36:15] (CR) Giuseppe Lavagetto: [C: +2] Fix duplicate detection when running in parallel [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655904 (https://phabricator.wikimedia.org/T271901) (owner: Giuseppe Lavagetto)
[09:37:04] (PS2) Volans: netbox: actually create the empty file [puppet] - https://gerrit.wikimedia.org/r/656109
[09:37:49] (Merged) jenkins-bot: Fix duplicate detection when running in parallel [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655904 (https://phabricator.wikimedia.org/T271901) (owner: Giuseppe Lavagetto)
[09:38:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: After restarting mysql', diff saved to https://phabricator.wikimedia.org/P13768 and previous config saved to /var/cache/conftool/dbconfig/20210114-093803-root.json
[09:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:31] (CR) jerkins-bot: [V: -1] netbox: actually create the
empty file [puppet] - https://gerrit.wikimedia.org/r/656109 (owner: Volans)
[09:39:19] (PS3) Volans: netbox: actually create the empty file [puppet] - https://gerrit.wikimedia.org/r/656109
[09:39:40] (CR) David Caro: [C: +1] "I always get lost doing these calculations, trying to figure myself out :S" [puppet] - https://gerrit.wikimedia.org/r/655952 (owner: Bstorm)
[09:40:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[09:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[09:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:40] (CR) Volans: [C: +2] netbox: actually create the empty file [puppet] - https://gerrit.wikimedia.org/r/656109 (owner: Volans)
[09:45:08] (PS7) Elukey: Add new service eventstreams-internal [deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: Ottomata)
[09:47:10] (CR) David Caro: wmcs.ceph.osd: disable write caches when possible (4 comments) [puppet] - https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: David Caro)
[09:49:01] (PS2) David Caro: wmcs.ceph.osd: disable write caches when possible [puppet] - https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527)
[09:49:46] SRE, Wikimedia-Logstash: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (fgiunchedi)
[09:50:11] SRE, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (Joe) >>! In T261369#6746098, @sbassett wrote: >>>! In T261369#6695002, @akosiaris wrote: >>> As I understan...
[09:50:18] PROBLEM - Check systemd state on ms-be1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:14] SRE, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (Joe) I also want to stress this is the wrong task to have the above discussions, please open a new one to k...
[09:51:58] (PS1) Filippo Giunchedi: logstash: update short link to indexing errors dashboard [puppet] - https://gerrit.wikimedia.org/r/656112 (https://phabricator.wikimedia.org/T272016)
[09:52:16] SRE, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (Joe) @jeena we already have a .pipeline directory for Shellbox, and it works as intended in creating multip...
[09:55:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[09:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:03] (PS1) Filippo Giunchedi: monitoring: update varnish 5xx logstash short link [puppet] - https://gerrit.wikimedia.org/r/656113 (https://phabricator.wikimedia.org/T272016)
[09:59:24] (CR) Arturo Borrero Gonzalez: wmcs.ceph.osd: disable write caches when possible (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: David Caro)
[10:01:46] (CR) JMeybohm: [C: +1] "Looks good. Tokens are in private repo as well, go ahead!"
[deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: Ottomata)
[10:01:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[10:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:16] (PS1) Elukey: sre.druid.roll-restart-workers: adjust import [cookbooks] - https://gerrit.wikimedia.org/r/656115
[10:06:32] (PS1) Awight: Remove unused WMDE TeWü QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/656116 (https://phabricator.wikimedia.org/T253112)
[10:08:19] SRE, Performance-Team, Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (ema) Open→Resolved a:ema Erratic CPU usage [[ https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=1&orgId=1&from=1610249262871&to=1610618827835&var...
[10:12:01] (CR) WMDE-Fisch: [C: +1] Remove unused WMDE TeWü QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/656116 (https://phabricator.wikimedia.org/T253112) (owner: Awight)
[10:12:20] !log reboot apt2001
[10:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:48] (PS1) Jbond: apt: failover to apt2001 to reboot apt1001 [dns] - https://gerrit.wikimedia.org/r/656117 (https://phabricator.wikimedia.org/T269596)
[10:14:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single
[10:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:18] !log failover apt.wikimedia.org to apt2001
[10:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:21] (CR) Jbond: [C: +2] apt: failover to apt2001 to reboot apt1001 [dns] - https://gerrit.wikimedia.org/r/656117 (https://phabricator.wikimedia.org/T269596) (owner: Jbond)
[10:17:23] (CR) Volans: [C: +1] "LGTM" [cookbooks] -
10https://gerrit.wikimedia.org/r/656115 (owner: 10Elukey) [10:19:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:43] (03PS1) 10Elukey: Add cookbook to reboot Analytics Presto nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/656120 [10:21:55] (03CR) 10Elukey: [C: 03+2] sre.druid.roll-restart-workers: adjust import [cookbooks] - 10https://gerrit.wikimedia.org/r/656115 (owner: 10Elukey) [10:22:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single [10:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:31] (03Merged) 10jenkins-bot: sre.druid.roll-restart-workers: adjust import [cookbooks] - 10https://gerrit.wikimedia.org/r/656115 (owner: 10Elukey) [10:25:08] (03PS1) 10Arturo Borrero Gonzalez: labtestvirt2003: drop host from production [puppet] - 10https://gerrit.wikimedia.org/r/656121 (https://phabricator.wikimedia.org/T271519) [10:25:29] !log reboot apt1001 [10:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:50] (03PS1) 10Ayounsi: Add cloudsw to Homer [puppet] - 10https://gerrit.wikimedia.org/r/656122 [10:26:50] (03CR) 10Ayounsi: [C: 03+2] Add cloudsw to Homer [puppet] - 10https://gerrit.wikimedia.org/r/656122 (owner: 10Ayounsi) [10:27:14] (03PS1) 10Jbond: Revert "apt: failover to apt2001 to reboot apt1001" [dns] - 10https://gerrit.wikimedia.org/r/655935 [10:27:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestvirt2003: drop host from production [puppet] - 10https://gerrit.wikimedia.org/r/656121 (https://phabricator.wikimedia.org/T271519) (owner: 10Arturo Borrero Gonzalez) [10:28:24] !log aborrero@cumin2001 START - Cookbook sre.hosts.decommission [10:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:29] (03CR) 10Jbond: [C: 03+2] Revert "apt: failover to apt2001 to reboot apt1001" [dns] - 
10https://gerrit.wikimedia.org/r/655935 (owner: 10Jbond) [10:28:58] !log failover apt.wikimedia.org back to apt1001 [10:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:21] (03CR) 10Volans: [C: 03+1] "LGTM, couple of nits inside" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656120 (owner: 10Elukey) [10:30:35] I'm going to do some restarts of backup hosts, this may momentarily generate some global alerts/metrics lost that are difficult to downtime individually [10:32:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/655795 (https://phabricator.wikimedia.org/T266479) (owner: 10Legoktm) [10:34:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655513 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:35:41] !log aborrero@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [10:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:10] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single [10:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:04] 10SRE, 10Phabricator, 10Traffic: Excessive queries from vscode-phabricator - https://phabricator.wikimedia.org/T271528 (10jbond) >>! In T271528#6746757, @Aklapper wrote: > Patch has been merged. Is there more to do or can this task be resolved? @Aklapper This task is here for users who may hit the rate limi... 
[10:41:42] RECOVERY - Check systemd state on ms-be1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:51] (03PS1) 10Giuseppe Lavagetto: Release 2.1.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/656125 [10:53:19] (03PS2) 10Elukey: Add cookbook to reboot Analytics Presto nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/656120 [10:54:09] (03CR) 10Elukey: Add cookbook to reboot Analytics Presto nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656120 (owner: 10Elukey) [10:57:42] (03PS6) 10Gehel: query_service: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [10:59:18] (03CR) 10Gehel: [C: 03+2] query_service: Migrate hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/642470 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [11:00:04] mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1100). 
[11:00:32] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release 2.1.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/656125 (owner: 10Giuseppe Lavagetto) [11:01:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single [11:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:14] (03CR) 10JMeybohm: [C: 04-1] Add new service eventstreams-internal (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [11:03:30] (03CR) 10Elukey: [C: 03+2] Add cookbook to reboot Analytics Presto nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/656120 (owner: 10Elukey) [11:03:47] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@4164318]: (no justification provided) [11:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:01] !log oblivian@deploy1001 deploy aborted: (no justification provided) (duration: 00m 14s) [11:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:11] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@4164318]: (no justification provided) [11:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:30] (03CR) 10Elukey: Add new service eventstreams-internal (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [11:11:48] (03PS1) 10Elukey: sre.presto.reboot-workers: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/656126 [11:14:32] (03CR) 10JMeybohm: [C: 04-1] Add new service eventstreams-internal (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [11:16:59] (03PS1) 10Elukey: Add deployment config for a new k8s service - eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/656129 
(https://phabricator.wikimedia.org/T256973) [11:17:33] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [11:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:54] PROBLEM - Check systemd state on ncredir5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:39] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/656129 (https://phabricator.wikimedia.org/T256973) (owner: 10Elukey) [11:18:54] (03CR) 10Elukey: [C: 03+2] sre.presto.reboot-workers: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/656126 (owner: 10Elukey) [11:19:52] (03CR) 10JMeybohm: [C: 04-1] Add deployment config for a new k8s service - eventstreams-internal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656129 (https://phabricator.wikimedia.org/T256973) (owner: 10Elukey) [11:20:15] (03PS1) 10Volans: Upstream release v5.3.0 [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/656131 (https://phabricator.wikimedia.org/T266487) [11:20:40] ...I should get some coffee [11:20:53] (03Merged) 10jenkins-bot: sre.presto.reboot-workers: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/656126 (owner: 10Elukey) [11:22:07] jayme: me too apparently, thanks :D [11:22:17] (03PS1) 10Filippo Giunchedi: swift: apply interface::rps to bnx2x as well [puppet] - 10https://gerrit.wikimedia.org/r/656132 (https://phabricator.wikimedia.org/T271415) [11:22:40] !log elukey@cumin1001 START - Cookbook sre.presto.reboot-workers for Presto analytics cluster: Reboot Presto nodes - elukey@cumin1001 [11:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:49] (03CR) 10Ayounsi: [C: 03+1] Upstream release v5.3.0 [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/656131 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [11:23:29] (03PS2) 
10Elukey: Add deployment config for a new k8s service - eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/656129 (https://phabricator.wikimedia.org/T269160) [11:24:16] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27467/console" [puppet] - 10https://gerrit.wikimedia.org/r/656132 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [11:24:36] (03CR) 10JMeybohm: [C: 03+1] Add deployment config for a new k8s service - eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/656129 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [11:24:52] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] swift: apply interface::rps to bnx2x as well [puppet] - 10https://gerrit.wikimedia.org/r/656132 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [11:24:58] (03PS2) 10Filippo Giunchedi: swift: apply interface::rps to bnx2x as well [puppet] - 10https://gerrit.wikimedia.org/r/656132 (https://phabricator.wikimedia.org/T271415) [11:25:09] (03PS1) 10Giuseppe Lavagetto: docker-pkg: add ca_bundle configuration [puppet] - 10https://gerrit.wikimedia.org/r/656133 [11:27:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27468/console" [puppet] - 10https://gerrit.wikimedia.org/r/656133 (owner: 10Giuseppe Lavagetto) [11:28:31] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] docker-pkg: add ca_bundle configuration [puppet] - 10https://gerrit.wikimedia.org/r/656133 (owner: 10Giuseppe Lavagetto) [11:28:38] (03CR) 10Elukey: [C: 03+2] Add deployment config for a new k8s service - eventstreams-internal [puppet] - 10https://gerrit.wikimedia.org/r/656129 (https://phabricator.wikimedia.org/T269160) (owner: 10Elukey) [11:29:29] <_joe_> elukey: can I merge your changes too? [11:31:54] <_joe_> godog: yours too :) [11:33:06] _joe_: yep! 
good to merge, thanks [11:33:25] yep! [11:33:31] <_joe_> ack :) [11:33:45] <_joe_> done [11:33:49] thanks! [11:34:30] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:45] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@4164318]: (no justification provided) (duration: 30m 34s) [11:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:49] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:43:49] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [11:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:52] (03CR) 10Elukey: Add new service eventstreams-internal (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [11:48:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nfs: change the default throttles for primary cluster read and egress [puppet] - 10https://gerrit.wikimedia.org/r/655952 (owner: 10Bstorm) [11:53:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:53:57] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast3004/bast4002/bast5001 from Prometheus Hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/655916 (owner: 10Muehlenhoff) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1200). [12:00:04] awight: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:36] I can self-deploy. [12:00:42] (03CR) 10Awight: [C: 03+2] "Config window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656116 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [12:01:29] (03Merged) 10jenkins-bot: Remove unused WMDE TeWü QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656116 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [12:02:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:26] !log rebooting miscweb1002 [12:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:26] (03PS2) 10Volans: Upstream release v5.3.0 [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/656131 (https://phabricator.wikimedia.org/T266487) [12:06:44] (03CR) 10Volans: "tested on deneb" [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/656131 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [12:08:12] (03CR) 10Volans: [C: 03+2] Upstream release v5.3.0 [debs/pynetbox] (debian) - 
10https://gerrit.wikimedia.org/r/656131 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [12:09:44] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:656116|Remove unused WMDE TeWü QuickSurveys (T253112, T272013)]] (duration: 01m 07s) [12:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:48] T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112 [12:09:48] T272013: Remove QuickSurveys config for WMDE template topic - https://phabricator.wikimedia.org/T272013 [12:10:00] !log EU config window finished. [12:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:30] (03PS1) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [12:11:39] I was wondering why an-presto1004 was taking so long to reboot, it was PXE booting :D [12:11:42] lovely [12:12:18] Ah, same problem I had with the ml-serves perpetually reinstalling? 
;) [12:14:04] !log built and uploaded python3-pynetbox 5.3.0-1 to apt.wikimedia.org - T266487 [12:14:04] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [12:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:09] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [12:14:47] !log elukey@cumin1001 END (ERROR) - Cookbook sre.presto.reboot-workers (exit_code=97) for Presto analytics cluster: Reboot Presto nodes - elukey@cumin1001 [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:11] klausman: realized it too late, now I have to reimage, luckily an-presto nodes don't have any real state to preserve [12:16:35] !log aborrero@cumin2001 START - Cookbook sre.hosts.decommission [12:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:39] the bios settings were wrong, just fixed them [12:16:49] !log upgraded python3-pynetbox to 5.3.0-1 on cumin2001 [12:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:24] PROBLEM - HP RAID on ms-be2022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:22:27] ACKNOWLEDGEMENT - HP RAID on ms-be2022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T272025 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Run [12:22:27] aid_Information_Gathering [12:22:31] 10SRE, 10ops-codfw: Degraded RAID on ms-be2022 - https://phabricator.wikimedia.org/T272025 (10ops-monitoring-bot) [12:24:06] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:phabricator: remove apache level blocking [puppet] - 10https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [12:24:33] !log push pfw3 firewall rules - T271935 [12:24:34] (03CR) 10Hashar: [C: 03+1] P:phabricator: remove apache level blocking [puppet] - 10https://gerrit.wikimedia.org/r/649882 (https://phabricator.wikimedia.org/T270285) (owner: 10Jbond) [12:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:43] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10ayounsi) [12:27:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1004.eqiad.wmnet with reason: REIMAGE [12:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:47] (03PS1) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [12:29:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1004.eqiad.wmnet with reason: REIMAGE [12:29:42] (03PS1) 10Volans: sre.hosts.decommission: fix for Netbox 2.9 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/656143 (https://phabricator.wikimedia.org/T266487) [12:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:58] !log aborrero@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [12:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:33] (03CR) 10jerkins-bot: [V: 04-1] dns: update DNS to support multiple 
namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [12:31:53] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix for Netbox 2.9 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/656143 (https://phabricator.wikimedia.org/T266487) (owner: 10Volans) [12:33:11] !log aborrero@cumin2001 START - Cookbook sre.hosts.decommission [12:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:04] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [12:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:09] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:31] (03PS1) 10Marostegui: check_private_data_report: Run the checks on db1155's sections [puppet] - 10https://gerrit.wikimedia.org/r/656145 (https://phabricator.wikimedia.org/T268742) [12:34:53] anyone around who could backport and deploy the fix to https://phabricator.wikimedia.org/T271978 ? 
[12:34:56] (03PS2) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [12:35:44] (03PS2) 10Marostegui: check_private_data_report: Run the checks on db1155's sections [puppet] - 10https://gerrit.wikimedia.org/r/656145 (https://phabricator.wikimedia.org/T268742) [12:36:54] Jdlrobson, Krinkle, ping [12:37:07] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Run the checks on db1155's sections [puppet] - 10https://gerrit.wikimedia.org/r/656145 (https://phabricator.wikimedia.org/T268742) (owner: 10Marostegui) [12:42:52] (03PS3) 10Hashar: releases: Provide docker to PipelineLib based jobs [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [12:43:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [12:47:01] !log aborrero@cumin2001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [12:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:49] (03PS2) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [12:49:49] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [12:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:03] !log upgraded python3-pynetbox to 5.3.0-1 on all affected hosts - T266487 [12:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:06] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [12:54:07] (03PS1) 10Ayounsi: Fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/656147 [12:54:37] PROBLEM - Check systemd state on ms-be2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:27] (03PS2) 10Ayounsi: Fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/656147 [12:56:48] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/656147 (owner: 10Ayounsi) [12:56:50] 10SRE, 10Phabricator, 10Traffic: Excessive queries from vscode-phabricator - https://phabricator.wikimedia.org/T271528 (10Aklapper) 05Open→03Resolved :-/ Well, Phab is not a support desk for general questions; it's a task tracking system to plan work. Closing. [12:56:56] 10SRE, 10SRE-Access-Requests: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10Bmueller) approved, thanks! [12:59:49] (03CR) 10Lars Wirzenius: "How does this differ from https://gerrit.wikimedia.org/r/c/mediawiki/skins/CologneBlue/+/655971/ ?" [skins/CologneBlue] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655932 (https://phabricator.wikimedia.org/T271978) (owner: 10Krinkle) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1300) [13:08:19] 10SRE, 10ops-eqiad, 10Traffic: lvs1016 interface down - https://phabricator.wikimedia.org/T271087 (10BBlack) @Cmjohnson - Please do it at your earliest convenience. It's not in the flow of live traffic and doesn't need any "depool" AFAIK (but it is problematic that we don't have it as a reliable backup opti... [13:08:28] (03CR) 10Hashar: "The release hosts use Buster and thus would get Docker from Debian which is good. 
The compiler fails due to a missing profile::docker::en" [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [13:08:36] (03CR) 10Hashar: [C: 04-1] releases: Provide docker to PipelineLib based jobs [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [13:11:42] !log installing xerces-c security updates on stretch [13:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:35] (03CR) 10Hashar: [C: 03+2] "Lars: this change is a cherry pick to the branch wmf/1.36.0-wmf.26 :]" [skins/CologneBlue] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655932 (https://phabricator.wikimedia.org/T271978) (owner: 10Krinkle) [13:17:10] (03CR) 10Hashar: [C: 03+2] "(talked with Lars about, I am handling the deployment of this change)" [skins/CologneBlue] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655932 (https://phabricator.wikimedia.org/T271978) (owner: 10Krinkle) [13:17:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:17:57] !log installing openssl1.0 security updates on stretch [13:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:42] 10SRE, 10Graphoid, 10Platform Engineering, 10serviceops: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495 (10Jseddon) p:05Triage→03Low [13:19:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:22:12] (03Merged) 10jenkins-bot: Edit link may not be present, avoid undefined index notice [skins/CologneBlue] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655932 (https://phabricator.wikimedia.org/T271978) (owner: 10Krinkle) [13:22:20] !log aborrero@cumin2001 START - Cookbook sre.dns.netbox [13:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:42] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [13:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:17] !log restarting mw canaries for openssl update [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] jouncebot: now [13:25:21] For the next 0 hour(s) and 34 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1300) [13:26:07] moritzm: once you are done with the mw app server updates, I will scap a hotfix for a mediawiki skin [13:26:37] (03PS1) 10Arturo Borrero Gonzalez: cloudgw2001-dev: introduce server [puppet] - 10https://gerrit.wikimedia.org/r/656148 (https://phabricator.wikimedia.org/T271519) [13:27:03] hashar: the cumin run will be complete in a minute [13:27:14] there is no rush don't worry :] [13:28:00] (03PS2) 10Arturo Borrero Gonzalez: cloudgw2001-dev: introduce server [puppet] - 10https://gerrit.wikimedia.org/r/656148 (https://phabricator.wikimedia.org/T271519) [13:29:45] (03PS3) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [13:36:32] (03CR) 10Hashar: [C: 04-1] "Will check with Dan, I guess it is just to build MediaWiki images and might not take too much disk space so it can fit in the current / pa" [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) 
(owner: 10Dduvall) [13:38:03] RECOVERY - Check systemd state on ms-be2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:06] (03PS3) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [13:42:20] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single [13:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:34] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [13:47:58] !log Restart mysql on db2094 for openssl upgrades test [13:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:55] (03PS4) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [13:50:01] 10SRE, 10ops-codfw: Degraded RAID on ms-be2022 - https://phabricator.wikimedia.org/T272025 (10fgiunchedi) @papaul looks like the BBU is busted on this OOW host. Even though the host is going to be decom in a few weeks I'd still like to have the BBU working. 
[13:51:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [13:51:51] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [13:52:07] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for puppetboard/apache [puppet] - 10https://gerrit.wikimedia.org/r/656152 (https://phabricator.wikimedia.org/T135991) [13:56:09] (03CR) 10DCausse: [C: 03+1] query_service: Remove gui files from wdqs [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: 10Ladsgroup) [13:56:46] !log aborrero@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:27] (03CR) 10DCausse: [C: 04-1] "modules/query_service/manifests/gui.pp must be updated I think" [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: 10Ladsgroup) [13:58:18] (03PS5) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [13:58:20] (03PS1) 10Jbond: sre.hosts.decommission: fix pylint error [cookbooks] - 10https://gerrit.wikimedia.org/r/656158 [13:58:49] hashar, how is it going? [13:59:04] oh yeah forgot [13:59:09] doing it [13:59:21] going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/skins/CologneBlue/+/655932/ [13:59:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2001-dev: introduce server [puppet] - 10https://gerrit.wikimedia.org/r/656148 (https://phabricator.wikimedia.org/T271519) (owner: 10Arturo Borrero Gonzalez) [14:00:05] liw and longma: #bothumor I � Unicode. All rise for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1400). 
[14:00:22] (03PS2) 10Jbond: sre.hosts.decommission: fix pylint error [cookbooks] - 10https://gerrit.wikimedia.org/r/656158 [14:00:46] hashar, cool, I can then promote train to group2 [14:01:01] hashar, when you're done [14:01:04] (03PS6) 10Jbond: cookbook sre.apt.reboot: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [14:01:19] (03CR) 10Ladsgroup: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: 10Ladsgroup) [14:02:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/656152 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:02:17] (03CR) 10Ladsgroup: "The best solution would be to move wcqs to microsites as well and then just clean everything from here." [puppet] - 10https://gerrit.wikimedia.org/r/655955 (https://phabricator.wikimedia.org/T271851) (owner: 10Ladsgroup) [14:02:33] PROBLEM - Host an-presto1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:02:42] damn the bug does not show up :] [14:02:47] presto1005 is me :) [14:02:54] (downtime expired, it was stuck in boot) [14:04:33] RECOVERY - Host an-presto1005 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:05:15] at least on mwdebug1001 that looks fine [14:06:04] !log hashar@deploy1001 Synchronized php-1.36.0-wmf.26/skins/CologneBlue/includes/CologneBlueHooks.php: Edit link may not be present, avoid undefined index notice T271978 (duration: 01m 07s) [14:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:07] T271978: PHP Notice: Undefined index: edit (CologneBlueHooks.php) - https://phabricator.wikimedia.org/T271978 [14:06:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:48] liw: done [14:06:58] hashar, thank you muchly [14:07:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) 
[14:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:01] (03PS1) 10Lars Wirzenius: all wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656161 [14:08:03] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656161 (owner: 10Lars Wirzenius) [14:08:26] (03PS1) 10Jbond: phabricator: drop absetned file [puppet] - 10https://gerrit.wikimedia.org/r/656162 [14:08:51] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656161 (owner: 10Lars Wirzenius) [14:10:22] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.26 [14:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] (03CR) 10Ottomata: "We should deploy this in codfw as well as eqiad, there's no real reason not to. That will allow the service to keep running if SRE needs " [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [14:24:19] !log running homer in asw-b-codfw* (T271519) [14:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:22] T271519: codfw1dev: repurpose/rename labtestvirt2003.codfw.wmnet as cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271519 [14:26:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single [14:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:45] liw: there is a new visual glitch introduced, but that is probably not a big deal [14:28:12] hashar, sounds like a backport opportunity? 
[14:28:23] well the backport is done [14:28:31] but introduce the visual glitch [14:28:38] it was not showing when I tested via mwdebug1001 [14:28:39] I mean, a new backport for someone else :) [14:28:42] yeah [14:28:47] they will fix it later today I guess [14:28:47] !log running homer in asw-b-codfw* (T271519) [14:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:09] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for puppetboard/apache [puppet] - 10https://gerrit.wikimedia.org/r/656152 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:30:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:55] (03PS1) 10Muehlenhoff: Add bast4003/bast5002 [puppet] - 10https://gerrit.wikimedia.org/r/656172 (https://phabricator.wikimedia.org/T257324) [14:53:29] (03CR) 10Herron: [C: 03+1] monitoring: update varnish 5xx logstash short link [puppet] - 10https://gerrit.wikimedia.org/r/656113 (https://phabricator.wikimedia.org/T272016) (owner: 10Filippo Giunchedi) [14:54:10] (03CR) 10Herron: [C: 03+1] logstash: update short link to indexing errors dashboard [puppet] - 10https://gerrit.wikimedia.org/r/656112 (https://phabricator.wikimedia.org/T272016) (owner: 10Filippo Giunchedi) [14:55:43] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1387.eqiad.wmnet, mw1270.eqiad.wmnet, wdqs1011.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:56:17] ah this is downtime expired --^ [14:56:31] lemme re-add it with more days [14:56:42] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single [14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:17] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Stein upgrade [puppet] - 10https://gerrit.wikimedia.org/r/655977 
(https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [14:58:10] 10SRE, 10ops-eqiad, 10Traffic: lvs1016 interface down - https://phabricator.wikimedia.org/T271087 (10elukey) I added a week of downtime, the alarm popped up again, remember it if the issue gets solved sooner! [14:58:11] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/656158 (owner: 10Jbond) [14:58:47] (03CR) 10Jbond: [C: 03+2] sre.hosts.decommission: fix pylint error [cookbooks] - 10https://gerrit.wikimedia.org/r/656158 (owner: 10Jbond) [14:59:17] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: upgrading openstack [14:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrading openstack [14:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:37] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 93 hosts with reason: upgrading openstack [14:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 93 hosts with reason: upgrading openstack [15:00:13] (03PS1) 10Jbond: pki: move the pki service so its avalible via dns discovery [puppet] - 10https://gerrit.wikimedia.org/r/656179 [15:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:49] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix pylint error [cookbooks] - 10https://gerrit.wikimedia.org/r/656158 (owner: 10Jbond) [15:02:12] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: move eqiad1 from openstack 'rocky' to 'stein' [puppet] - 10https://gerrit.wikimedia.org/r/655979 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [15:04:36] (03CR) 10Filippo Giunchedi: 
[C: 03+2] monitoring: update varnish 5xx logstash short link [puppet] - 10https://gerrit.wikimedia.org/r/656113 (https://phabricator.wikimedia.org/T272016) (owner: 10Filippo Giunchedi) [15:04:38] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: update short link to indexing errors dashboard [puppet] - 10https://gerrit.wikimedia.org/r/656112 (https://phabricator.wikimedia.org/T272016) (owner: 10Filippo Giunchedi) [15:06:02] (03CR) 10Alexandros Kosiaris: "> Patch Set 2:" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: 10Giuseppe Lavagetto) [15:07:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Fix duplicate detection when running in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655904 (https://phabricator.wikimedia.org/T271901) (owner: 10Giuseppe Lavagetto) [15:11:03] (03PS5) 10Alexandros Kosiaris: Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (https://phabricator.wikimedia.org/T271901) (owner: 10Giuseppe Lavagetto) [15:11:15] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [15:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:08] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch the base image to buster from stretch. 
(031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (https://phabricator.wikimedia.org/T271901) (owner: 10Giuseppe Lavagetto) [15:13:57] (03PS1) 10Herron: elk: update scap canary dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/656184 (https://phabricator.wikimedia.org/T272016) [15:16:55] !log otto@deploy1001 Started deploy [analytics/refinery@1117f45]: Explicitly set timeout in banner_activity-druid-monthly-coord - T264358 [15:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:59] T264358: Investigate oozie banner monthly job timeouts - https://phabricator.wikimedia.org/T264358 [15:17:48] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) [15:19:10] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for netbox/apache [puppet] - 10https://gerrit.wikimedia.org/r/656187 (https://phabricator.wikimedia.org/T135991) [15:19:12] !log otto@deploy1001 Finished deploy [analytics/refinery@1117f45]: Explicitly set timeout in banner_activity-druid-monthly-coord - T264358 (duration: 02m 16s) [15:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:18] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10jcrespo) Adding bblack (traffic) and ayounsi, as we have a tentative date for codfw, proposed by Filippo and Jaime: week of 15th. It may change as persiste... 
[15:20:31] SRE, Data-Persistence-Backup, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) a:jcrespo→None [15:22:03] SRE, Data-Persistence-Backup, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) [15:22:32] SRE, Data-Persistence-Backup, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) [15:22:40] (CR) Filippo Giunchedi: [C: +1] "LGTM, +Lars as heads up" [puppet] - https://gerrit.wikimedia.org/r/656184 (https://phabricator.wikimedia.org/T272016) (owner: Herron) [15:23:36] (CR) Esanders: [C: -1] Enable DiscussionTools' newtopictool as beta feature on Beta Cluster (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) (owner: Bartosz Dziewoński) [15:25:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:26:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:28:00] SRE, Wikimedia-Logstash, Patch-For-Review: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (Lucas_Werkmeister_WMDE) Is it possible to restore the /goto/ links?
[15:28:41] (03CR) 10Alexandros Kosiaris: update flink config with swift and other values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles) [15:29:19] (03PS3) 10David Caro: wmcs.ceph.osd: disable write caches when possible [puppet] - 10https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) [15:30:36] !log power down ms-be2022 for maintenance [15:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:24] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [15:31:27] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10jcrespo) 05Resolved→03Open p:05Medium→03High [15:32:22] PROBLEM - Host ms-be2022 is DOWN: PING CRITICAL - Packet loss = 100% [15:33:24] (03CR) 10David Caro: wmcs.ceph.osd: disable write caches when possible (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: 10David Caro) [15:34:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.ceph.osd: disable write caches when possible [puppet] - 10https://gerrit.wikimedia.org/r/655923 (https://phabricator.wikimedia.org/T271527) (owner: 10David Caro) [15:34:47] (03CR) 10Muehlenhoff: [C: 03+2] Add bast4003/bast5002 [puppet] - 10https://gerrit.wikimedia.org/r/656172 (https://phabricator.wikimedia.org/T257324) (owner: 10Muehlenhoff) [15:36:56] (03PS2) 10Urbanecm: Close lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655898 (https://phabricator.wikimedia.org/T272041) (owner: 10Ladsgroup) [15:40:29] !log installing sqlite3 security updates on Stretch [15:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:36] 
RECOVERY - Host ms-be2022 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [15:45:00] (03CR) 10Lars Wirzenius: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/656184 (https://phabricator.wikimedia.org/T272016) (owner: 10Herron) [15:45:33] (03CR) 10Herron: [C: 03+2] elk: update scap canary dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/656184 (https://phabricator.wikimedia.org/T272016) (owner: 10Herron) [15:49:55] (03PS1) 10Jgreen: add frqueue2002 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/656197 [15:50:23] (03CR) 10Dzahn: [C: 03+2] "yep, confirmed via cumin. file is gone on phab*" [puppet] - 10https://gerrit.wikimedia.org/r/656162 (owner: 10Jbond) [15:51:26] (03PS2) 10Jgreen: add frqueue2002 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/656197 [15:52:04] RECOVERY - HP RAID on ms-be2022 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:52:32] (03CR) 10jerkins-bot: [V: 04-1] add frqueue2002 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/656197 (owner: 10Jgreen) [15:53:11] (03CR) 10Dzahn: "re: ganeti and additional disks: creating a second virtual hard disk and mounting it is easy, so it's much preferred over trying to resiz" [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [15:53:37] (03PS3) 10Jgreen: add frqueue2002 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/656197 (https://phabricator.wikimedia.org/T269481) [15:54:40] (03CR) 10Jgreen: [C: 03+2] add frqueue2002 to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/656197 (https://phabricator.wikimedia.org/T269481) (owner: 10Jgreen) [15:57:22] (03CR) 10Dzahn: [V: 03+1 C: 
03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27469/" [puppet] - 10https://gerrit.wikimedia.org/r/655513 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [15:57:23] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) [15:58:37] (03CR) 10Dzahn: "noop confirmed on neon, chlorine, k8s masters" [puppet] - 10https://gerrit.wikimedia.org/r/655513 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [15:58:47] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10Jgreen) 05Open→03Resolved [16:03:44] !log installing tomcat8 security updates [16:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:21] (03PS8) 10Elukey: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [16:32:02] !log installing php-pear updates on stretch [16:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:06] (03PS4) 10Dduvall: releases: Provide docker to PipelineLib based jobs [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) [16:37:35] 10SRE, 10Analytics, 10Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (10mforns) [16:38:09] (03CR) 10Dduvall: releases: Provide docker to PipelineLib based jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [16:46:38] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 93 hosts with reason: upgrading openstack [16:46:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 93 hosts with reason: upgrading openstack [16:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 10 hosts with reason: upgrading openstack [16:47:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrading openstack [16:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:52] (03PS1) 10Elukey: sre.hadoop.reboot-workers: move to the new class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656212 (https://phabricator.wikimedia.org/T269925) [16:55:57] (03PS2) 10Elukey: sre.hadoop.reboot-workers: move to the new class API [cookbooks] - 10https://gerrit.wikimedia.org/r/656212 (https://phabricator.wikimedia.org/T269925) [16:59:04] (03CR) 10Elukey: "Amir can you run a pcc just to be sure? Then I'll merge! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/655790 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [16:59:59] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27470/console" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:00:04] jbond42 and cdanis: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1700). 
[17:01:16] (03PS10) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:02:27] 10SRE, 10Graphoid, 10Platform Engineering, 10serviceops: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495 (10phuedx) 👋 FYI this deployment caused a noticeable increase in the HTML and asset size for all enwiki pages with at least one graph. {F34001243,size=full} That 179.... [17:02:39] (03CR) 10jerkins-bot: [V: 04-1] start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:20:31] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [17:21:53] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [17:22:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:30:45] (03CR) 10Hnowlan: [C: 04-1] "There's enough overlap and similarity between planet_sync and imposm_planet_sync that some generic upper define might be worthwhile so as " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:37:49] 10SRE, 10SRE-Access-Requests: Access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 (10nnikkhoui) [17:40:46] 10SRE, 10ops-eqiad, 10Analytics-Radar, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10razzi) [17:46:58] 10SRE, 10SRE-Access-Requests: Analytics access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 (10nnikkhoui) [17:55:42] 10SRE, 10SRE-Access-Requests: Analytics access for dev: Nikki Nikkhoui - https://phabricator.wikimedia.org/T272057 
(AMooney) Approved [18:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1800). [18:00:30] (CR) JMeybohm: [C: +1] "> Patch Set 7:" [deployment-charts] - https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: Ottomata) [18:04:28] (PS2) Bstorm: maintain-dbusers: close the connections where they open [puppet] - https://gerrit.wikimedia.org/r/656215 (https://phabricator.wikimedia.org/T269620) [18:05:25] !log restarting backup1001, backup2001 T271913 [18:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:20] (PS1) Andrew Bogott: Nova monitoring: update the egrep string for the nova-api process [puppet] - https://gerrit.wikimedia.org/r/656225 (https://phabricator.wikimedia.org/T261134) [18:08:40] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [18:09:04] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[18:09:20] (CR) Andrew Bogott: [C: +2] Nova monitoring: update the egrep string for the nova-api process [puppet] - https://gerrit.wikimedia.org/r/656225 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott) [18:09:57] SRE, ops-eqiad, Analytics-Radar: an-test-worker1002 may need a DAC replace - https://phabricator.wikimedia.org/T272009 (razzi) [18:10:05] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [18:11:13] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host...
[18:11:39] SRE, Analytics, Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (razzi) p:Triage→Medium [18:13:53] SRE, Analytics, Analytics-Kanban, Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (razzi) p:Medium→High [18:14:00] (CR) Bstorm: [C: +2] nfs: change the default throttles for primary cluster read and egress [puppet] - https://gerrit.wikimedia.org/r/655952 (owner: Bstorm) [18:17:09] (PS2) Dzahn: redis::slave: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) [18:17:14] (CR) Dzahn: redis::slave: hiera->lookup, add data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [18:17:38] (PS1) Andrew Bogott: nova-api monitoring: simplify grep for detecting processes [puppet] - https://gerrit.wikimedia.org/r/656231 (https://phabricator.wikimedia.org/T261134) [18:17:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:19:27] ^this could be me due to the last log, if it is me it should return back to normal soon [18:20:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:48] (03CR) 10Andrew Bogott: [C: 03+2] nova-api monitoring: simplify grep for detecting processes [puppet] - 10https://gerrit.wikimedia.org/r/656231 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:21:23] (03PS3) 10Dzahn: monitoring::host: move hostgroup_default to params, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) [18:21:38] (03CR) 10Dzahn: monitoring::host: move hostgroup_default to params, hiera->lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:22:52] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/655790 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [18:23:32] 10SRE, 10ops-eqiad, 10Analytics-Radar: an-test-worker1002 may need a DAC replace - https://phabricator.wikimedia.org/T272009 (10Cmjohnson) an-test-worker1002 is in the wrong vlan. I see it's in private1-c not analytics. 
Making the change now [18:23:36] (03CR) 10jerkins-bot: [V: 04-1] monitoring::host: move hostgroup_default to params, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:23:42] (03CR) 10Dzahn: profile::base: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:25:48] (03PS4) 10Dzahn: monitoring::host: move hostgroup_default to params, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) [18:26:04] (03PS1) 10Andrew Bogott: Keystone/stein: disable the packaged keystone init script [puppet] - 10https://gerrit.wikimedia.org/r/656237 (https://phabricator.wikimedia.org/T261134) [18:26:44] (03CR) 10jerkins-bot: [V: 04-1] Keystone/stein: disable the packaged keystone init script [puppet] - 10https://gerrit.wikimedia.org/r/656237 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:27:20] there was some puppet/netbox failures, but the alert was for cloudgw2001-dev [18:27:41] which probably is just cloud upgrades/ongoing work [18:28:52] (03PS2) 10Andrew Bogott: Keystone/stein: disable the packaged keystone init script [puppet] - 10https://gerrit.wikimedia.org/r/656237 (https://phabricator.wikimedia.org/T261134) [18:29:54] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Not doing this right now, while we discuss what we want" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) (owner: 10Bartosz Dziewoński) [18:31:45] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm, one minor concern" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656215 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [18:31:53] (03CR) 10Andrew Bogott: [C: 03+2] Keystone/stein: disable the packaged keystone init script [puppet] - 10https://gerrit.wikimedia.org/r/656237 
(https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:32:35] 10SRE, 10ops-eqiad, 10Traffic: lvs1016 interface down - https://phabricator.wikimedia.org/T271087 (10Cmjohnson) 05Open→03Resolved @elukey @bblack swapped both the optics at the switch on a4 and on the server. It appears that the server side optic was the issue, the link is backup xe-4/0/7 up... [18:32:50] 10SRE, 10ops-eqiad, 10Analytics-Radar: an-test-worker1002 may need a DAC replace - https://phabricator.wikimedia.org/T272009 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson The issue should be resolved now [edit interfaces interface-range vlan-analytics1-c-eqiad] member ge-5/0/13 { ... } + memb... [18:34:35] (03CR) 10Bstorm: maintain-dbusers: close the connections where they open (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656215 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [18:34:50] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:36:02] 10SRE, 10ops-eqiad, 10Analytics-Radar: an-test-worker1002 may need a DAC replace - https://phabricator.wikimedia.org/T272009 (10elukey) Thanks a lot! It is weird since it worked fine up to yesterday, was it done by mistake? Anyway, I'll also check vlans next time! 
[18:36:04] (03PS1) 10Andrew Bogott: keystone: remove obsolete service definition [puppet] - 10https://gerrit.wikimedia.org/r/656240 (https://phabricator.wikimedia.org/T261134) [18:36:33] !log restarting backup1002, backup2002 T271913 [18:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:21] (03CR) 10Andrew Bogott: [C: 03+2] keystone: remove obsolete service definition [puppet] - 10https://gerrit.wikimedia.org/r/656240 (https://phabricator.wikimedia.org/T261134) (owner: 10Andrew Bogott) [18:37:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2241.codfw.wmnet with reason: REIMAGE [18:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:13] !log started mass deletion of lrcwiki (T272041) - https://w.wiki/uPV [18:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:16] T272041: Close lrc.wikipedia.org for editing - https://phabricator.wikimedia.org/T272041 [18:38:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2242.codfw.wmnet with reason: REIMAGE [18:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2241.codfw.wmnet with reason: REIMAGE [18:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2255.codfw.wmnet with reason: REIMAGE [18:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:09] 10SRE, 10ops-eqiad, 10Analytics-Radar: an-test-worker1002 may need a DAC replace - https://phabricator.wikimedia.org/T272009 (10elukey) Yep it was part of a commit that lines up with the drop in connectivity: ` elukey@asw2-c-eqiad> show system rollback compare 2 1 [edit interfaces interface-range disabled... 
[18:41:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2258.codfw.wmnet with reason: REIMAGE [18:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2242.codfw.wmnet with reason: REIMAGE [18:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:12] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics dev for clarakosi - https://phabricator.wikimedia.org/T271973 (10Clarakosi) [18:42:33] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:42:45] 10SRE, 10SRE-Access-Requests: Analytics access for dev: Bill Pirkle - https://phabricator.wikimedia.org/T272065 (10BPirkle) [18:43:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2255.codfw.wmnet with reason: REIMAGE [18:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2258.codfw.wmnet with reason: REIMAGE [18:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:12] 10SRE, 10SRE-Access-Requests: Analytics access for dev: Bill Pirkle - https://phabricator.wikimedia.org/T272065 (10AMooney) Approved [18:54:08] (03PS1) 10Dzahn: mcrouter_wancache.yaml: fix incorrect hostname not matching IP [puppet] - 10https://gerrit.wikimedia.org/r/656243 [18:58:54] PROBLEM - Host mc1024 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:48] ouch [18:59:52] effie: ^ is that you? [19:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:00:23] mutante: I think the reimages are done [19:00:53] elukey: trying the mgmt console [19:01:14] super [19:01:43] arg, HP [19:03:12] !log mc1024 - attempting to power on via mgmt, went down and power down [19:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:48] Server powering on ....... [19:04:09] is there any message why it went down? [19:04:15] *about why [19:04:17] Virtual Serial Port Active: COM2 The server is not powered on. The Virtual Serial Port is not available. [19:04:21] does not look good [19:04:38] no, it was just completely off [19:04:44] and nothing on console [19:04:59] looks like it actually died [19:06:01] cmjohnson1: sorry to ping you, mc1024 in B6 went down a couple of mins ago, if you have a min can you check if the host is dead/fried? [19:07:14] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1024_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [19:08:36] it's "shard06" in the config. but at what point would we have to change that? 
[19:08:55] still nothing at boot, btw [19:09:06] PROBLEM - Check health of redis instance on 6379 on mc2024 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 339767 keys, up 24 days 50 minutes - replication_delay is 653 https://wikitech.wikimedia.org/wiki/Redis [19:10:16] mutante: so in theory the memcached part is failed over to the mc-gp100[1-3] gutter pool, and the redis part should be taken care by nutcracker on all mw nodes [19:10:53] IIRC the nodes need to be refreshed in Q4 so if one node is fried we might need to depool it from config [19:11:12] re: mc2024 that makes sense, it is the counterpart of that server, same shard 06 in codfw afaict [19:11:19] elukey: *nod*, ack, good [19:12:12] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All [19:12:20] yep traffic flowing to the gutter pool [19:12:41] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) I ran into some issues along the way, these are taking a little longer to get the idrac's setup. I am sorry for the delay. [19:12:46] the gutter picking it up as it should it seems [19:12:49] cool [19:14:33] could still be just the actual power cable. all i see is "Server power off" [19:15:30] mutante: let's see if Chris can check, it might be a loose cable (let's hope so) [19:15:44] joining dcops channel [19:16:00] agreed, that would be the best case and it happens [19:18:14] mutante: elukey: Hey, it seems you're debugging some issue, may I sync a quick change now? 
[19:20:26] 19:19 < cmjohnson1> it's not good, I am seeing a flashing green light [19:20:34] he is trying to reboot it right now [19:21:47] (03PS1) 10Andrew Bogott: Enable cinder in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/656250 (https://phabricator.wikimedia.org/T269511) [19:25:21] elukey: it's not coming back but I don't know what the next step should be in that case [19:25:52] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2241.codfw.wmnet'] ` and were **ALL** s... [19:26:16] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2242.codfw.wmnet'] ` and were **ALL** s... [19:26:30] (03CR) 10RLazarus: [C: 03+1] Add IRC/SAL notifications via tcpircbot. [software/klaxon] - 10https://gerrit.wikimedia.org/r/655998 (owner: 10CDanis) [19:27:29] Urbanecm: I don't think we need to block you. thanks for asking. 
[19:27:53] ack, just don't want to make debugging harder :) [19:27:56] mutante: we need to remove the host from configs etc.., I'd open a task and add effie in Cc (she'll be sooo pleased that a newly reimaged buster node died :D) [19:28:06] Urbanecm: it's physical hw failure [19:28:10] ack [19:28:19] (03CR) 10Urbanecm: [C: 03+2] Close lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655898 (https://phabricator.wikimedia.org/T272041) (owner: 10Ladsgroup) [19:28:42] now removing it from mcrouter should be not impactful, but let's triple check before doing it [19:28:42] elukey: ack, i'll create a ticket [19:28:54] removing it from nutcracker config should be easy enough [19:29:05] (03Merged) 10jenkins-bot: Close lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655898 (https://phabricator.wikimedia.org/T272041) (owner: 10Ladsgroup) [19:29:34] (mcrouter automatically reloads the config via inotify on the json config file, and it should use consistent hashing, so I don't expect problems, buut better to brainbounce with others first) [19:29:39] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2255.codfw.wmnet'] ` and were **ALL** s... [19:31:26] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2258.codfw.wmnet'] ` and were **ALL** s... 
[19:31:39] mutante: going afk but if you need anything I'll check later [19:31:53] !log urbanecm@deploy1001 Synchronized dblists/closed.dblist: d3e274e9b953f5edda07fa3a016b7291a451ceb2: Close lrcwiki (T272041) (duration: 00m 58s) [19:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:57] T272041: Close lrc.wikipedia.org for editing - https://phabricator.wikimedia.org/T272041 [19:32:33] elukey: ACK, thank you [19:38:20] (03CR) 10RLazarus: [C: 03+1] "Confirmed by resolving on cumin1001. Good catch." [puppet] - 10https://gerrit.wikimedia.org/r/656243 (owner: 10Dzahn) [19:55:30] (03CR) 10Andrew Bogott: [C: 03+2] Enable cinder in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/656250 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:57:45] 10SRE, 10Analytics, 10Event-Platform, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) 05Stalled→03Declined Declining in favor of {T253058} [19:59:52] (03PS3) 10Bstorm: maintain-dbusers: close the connections where they open [puppet] - 10https://gerrit.wikimedia.org/r/656215 (https://phabricator.wikimedia.org/T269620) [20:00:04] liw and longma: Your horoscope predicts another unfortunate Mediawiki train - European+American Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210114T2000). [20:00:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) a:05Jclark-ctr→03RobH [20:01:03] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ml-serve1001.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/... 
[20:01:33] (03CR) 10Bstorm: maintain-dbusers: close the connections where they open (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656215 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [20:05:17] !log razzi@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes - razzi@cumin1001 [20:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:53] (03PS1) 10Andrew Bogott: cinder: enable haproxy frontend in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/656252 (https://phabricator.wikimedia.org/T269511) [20:13:25] ACKNOWLEDGEMENT - Host mc1024 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T272078 [20:14:00] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: close the connections where they open [puppet] - 10https://gerrit.wikimedia.org/r/656215 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [20:15:39] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: REIMAGE [20:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:32] !log ACKing all unhandled crit alerts about systemd on clouddb hosts - notifications are disabled but this cleans up Icinga web UI noise - T267090 [20:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [20:17:43] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: REIMAGE [20:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
[20:19:33] ACKNOWLEDGEMENT - Check health of redis instance on 6379 on mc2024 is CRITICAL: CRITICAL: replication_delay is 3066 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 339767 keys, up 24 days 1 hours - replication_delay is 3066 daniel_zahn partner of mc1024 which broke - https://phabricator.wikimedia.org/T272078 https://wikitech.wikimedia.org/wiki/Redis [20:19:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:08] ACKNOWLEDGEMENT - PHP opcache health on mw2239 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged recently https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:23:08] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) [20:24:30] (03CR) 10Mstyles: update flink config with swift and other values (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles) [20:24:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve1001.eqiad.wmnet'] ` and were **ALL** successful. 
[20:27:04] (PS1) Ottomata: Render kafka cluster connection info in helmfile-defaults/general-*.yaml [puppet] - https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) [20:28:34] (CR) jerkins-bot: [V: -1] Render kafka cluster connection info in helmfile-defaults/general-*.yaml [puppet] - https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: Ottomata) [20:30:22] (PS2) Ottomata: Render kafka cluster connection info in helmfile-defaults/general-*.yaml [puppet] - https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) [20:31:44] (PS2) Luke081515: Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) [20:31:53] (CR) jerkins-bot: [V: -1] Render kafka cluster connection info in helmfile-defaults/general-*.yaml [puppet] - https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) (owner: Ottomata) [20:32:39] (CR) Luke081515: "added the disclaimer, will prepare a separate wikimedia messages patch."
[mediawiki-config] - https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: Luke081515) [20:33:59] (CR) Andrew Bogott: [C: +2] cinder: enable haproxy frontend in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/656252 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott) [20:38:10] (PS3) Ottomata: Render kafka cluster connection info in helmfile-defaults/general-*.yaml [puppet] - https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) [20:41:17] (PS4) Ottomata: Render kafka cluster connection info in helmfile-defaults/general-*.yaml [puppet] - https://gerrit.wikimedia.org/r/656253 (https://phabricator.wikimedia.org/T253058) [20:42:41] PROBLEM - PHP opcache health on mw2242 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:43:40] (PS1) Andrew Bogott: Cinder: set default quotas to 0 so I can deploy in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/656255 (https://phabricator.wikimedia.org/T269511) [20:45:01] PROBLEM - PHP opcache health on mw2241 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:45:41] PROBLEM - PHP opcache health on mw2255 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:45:49] (CR) Andrew Bogott: [C: +2] Cinder: set default quotas to 0 so I can deploy in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/656255 (https://phabricator.wikimedia.org/T269511) (owner: Andrew Bogott) [20:46:05] PROBLEM - PHP opcache health on mw2258 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:49:09] Hi ops team, EnWikipedia is planning to update
our logo for Wikipedia 20 at midnight UTC. What preparation should be done to make the deployment as easy for the devs as possible? [20:49:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:53] PROBLEM - mediawiki-installation DSH group on mw2242 is CRITICAL: Host mw2242 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:52:20] wugapodes: someone should upload a patch to the mediawiki-config repo. then it should be added to a deployment slot on the Deployment calendar [20:53:18] wugapodes: it's not really ops team anymore like in the past, btw, it's mw-deployment [20:53:58] wugapodes: that doesn't necessarily mean you have to do it yourself, just pointing out the steps as you asked [20:54:08] (CR) Bartosz Dziewoński: [C: -1] Enable DiscussionTools' newtopictool as beta feature on Beta Cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) (owner: Bartosz Dziewoński) [20:54:47] SRE, SRE-Access-Requests: Requesting access to Analytics dev for clarakosi - https://phabricator.wikimedia.org/T271973 (WDoranWMF) p:Triage→High I hope it's ok that I'm increasing the priority of the task. I don't know if that is done by the receiver of the task so please reset as appropriate. Th... [20:54:55] (PS2) Bartosz Dziewoński: Enable DiscussionTools' newtopictool as beta feature on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) [20:55:11] (CR) Bartosz Dziewoński: "Updated per Peter on T272076."
[mediawiki-config] - https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) (owner: Bartosz Dziewoński) [20:56:19] PROBLEM - mediawiki-installation DSH group on mw2258 is CRITICAL: Host mw2258 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:56:31] PROBLEM - mediawiki-installation DSH group on mw2255 is CRITICAL: Host mw2255 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:57:06] is anyone around who'd like to deploy a beta-cluster-only config change for me? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/655966 [20:57:27] (if not, i'll just schedule it properly) [20:59:34] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['ml-serve1002.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet', 'ml-serve1004.eqiad.w...
[21:08:03] ACKNOWLEDGEMENT - PHP opcache health on mw2241 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged recently https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:08:03] ACKNOWLEDGEMENT - PHP opcache health on mw2242 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged recently https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:08:03] ACKNOWLEDGEMENT - PHP opcache health on mw2255 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged recently https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:08:03] ACKNOWLEDGEMENT - PHP opcache health on mw2258 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn reimaged recently https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:08:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:10:29] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2241.codfw.wmnet [21:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:42] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2242.codfw.wmnet [21:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2255.codfw.wmnet [21:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2258.codfw.wmnet [21:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:14] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: REIMAGE [21:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE [21:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:11] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: REIMAGE [21:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:15] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1002.eqiad.wmnet with reason: REIMAGE [21:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:38] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2241.codfw.wmnet [21:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:45] !log dzahn@cumin1001 conftool action : 
set/pooled=yes; selector: name=mw2242.codfw.wmnet [21:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:01] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE [21:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:18:36] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes - razzi@cumin1001 [21:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:52] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1003.eqiad.wmnet with reason: REIMAGE [21:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:24] (03CR) 10Dduvall: [C: 04-1] "I'll go ahead and open a subtask about the additional volume and come back to this." 
[puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [21:23:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2255.codfw.wmnet [21:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2258.codfw.wmnet [21:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Jclark-ctr) @RobH @Cmjohnson DYV8773 is the ST not in netbox right now [21:25:59] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:26:17] (03CR) 10Dzahn: [C: 03+2] mcrouter_wancache.yaml: fix incorrect hostname not matching IP [puppet] - 10https://gerrit.wikimedia.org/r/656243 (owner: 10Dzahn) [21:27:01] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve1002.eqiad.wmnet', 'ml-serve1004.eqiad.wmnet', 'ml-serve1003.eqiad.wmnet'] ` and were **ALL** successful. [21:27:59] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) [21:28:52] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) 05Open→03Resolved all four ml-serve hosts have been installed. They do not yet have GPUs, but our sync up showed they could be used without for now. Ready for your team... 
[21:30:25] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:32:05] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:33:15] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:38:31] (03PS1) 10Andrew Bogott: Horizon: install real Cinder policies [puppet] - 10https://gerrit.wikimedia.org/r/656260 (https://phabricator.wikimedia.org/T269511) [21:39:29] (03CR) 10Dzahn: [C: 04-1] "Ladsgroup is right. 
looking at https://puppet-compiler.wmflabs.org/compiler1001/27421/acmechief1001.eqiad.wmnet/change.acmechief1001.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:41:32] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: install real Cinder policies [puppet] - 10https://gerrit.wikimedia.org/r/656260 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [21:42:53] (03Abandoned) 10Alexandros Kosiaris: Release 2.1.0 version [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/640268 (owner: 10Alexandros Kosiaris) [21:47:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2268.codfw.wmnet with reason: REIMAGE [21:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2269.codfw.wmnet with reason: REIMAGE [21:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2268.codfw.wmnet with reason: REIMAGE [21:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2270.codfw.wmnet with reason: REIMAGE [21:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2269.codfw.wmnet with reason: REIMAGE [21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:02] RECOVERY - mediawiki-installation DSH group on mw2242 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:53:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2270.codfw.wmnet with reason: REIMAGE [21:53:16] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:20] (03PS3) 10Dzahn: profile::base: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) [21:54:59] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:55:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2236.codfw.wmnet with reason: REIMAGE [21:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:26] RECOVERY - mediawiki-installation DSH group on mw2258 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:56:38] RECOVERY - mediawiki-installation DSH group on mw2255 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:57:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2236.codfw.wmnet with reason: REIMAGE [21:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:28] anyone having trouble reaching gerrit suddenly? [22:00:20] hm, yes [22:00:30] Is it just me or gerrit is down [22:00:35] yup [22:00:37] yep, down forme [22:00:40] for me [22:01:00] hrm, dunno what it's doing...says it's running, no errors in the logs... [22:01:01] o.O [22:01:24] systemctl says it's running, cpu is pegged [22:01:42] cpu looks about normal for it...maybe a little low [22:01:48] one process at least [22:01:54] there are 32 core on that box iirc [22:01:58] ah yeah [22:02:12] major gc pause or something worse? [22:02:30] gc was fine a moment ago [22:02:39] https://grafana.wikimedia.org/d/Bw2mQ3iWz/javamelody?viewPanel=14&orgId=1&from=now-5m&to=now [22:03:18] got paged 👋 [22:03:20] Amir1: yes hello I'm here [22:03:24] restart it [22:03:37] * volans here on mobile [22:03:37] legoktm: sorry [22:03:48] :p [22:03:50] do I need to get my laptop? [22:03:56] yo? 
[22:04:16] volans: no [22:04:17] It said "wake up a SRE" I hope I didn't wake up the whole SRE [22:04:34] looks like gerrit is back up [22:04:35] I restarted apache [22:04:39] and it came back... [22:04:42] java.io.IOException: ..ACK [22:04:44] Amir1: for now it's like any paging icinga alert [22:04:48] !log restart apache on gerrit1001 [22:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:53] o/ [22:04:53] works for me, yep, thx [22:05:06] back up here [22:05:18] and usually doesn't wake up anyone as we have silent hours, so no prob [22:05:25] cdanis: already resolved it seems [22:05:31] (by restarting) [22:05:34] hey, all under control now? [22:05:39] are there any followups, data loss? [22:05:57] now that we're all here, we should have a party or something [22:06:02] it would be great if we have clearer criteria for pinging SRE with this [22:06:09] or just the usual java/concurrency? [22:06:23] haha [22:06:24] I'm still looking, but I think that apache just became unresponsive. I didn't touch any of the java [22:06:31] *jvm processes [22:06:31] ah [22:06:32] herron: yea, gerrit is back [22:06:42] weird then [22:07:08] weird indeed. [22:07:13] what woke everyone exactly? [22:07:24] klaxon [22:07:26] amir did :-D [22:07:27] uhh, i ignored that page [22:07:29] mutante: kk thanks [22:07:33] he he [22:07:36] sorry, i was deep in something and didn't even register it [22:07:39] ah [22:07:53] it kinda looks like it ran out of workers: https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?orgId=1&var-host=gerrit1001&var-port=9117&from=1610661081349&to=1610662022403 [22:07:54] I'm guessing it was a test for everyone and not just me?
;D [22:07:56] (PS1) Ladsgroup: query_service: Migrate hiera() to lookup() in common.pp [puppet] - https://gerrit.wikimedia.org/r/656266 (https://phabricator.wikimedia.org/T209953) [22:08:21] check if it needs followup, that sometimes is a signal of something else, like a crawler/bot [22:08:40] I'm extremely sorry, I thought it would page only one SRE [22:08:51] it says "Wake an SRE" [22:09:00] Amir1: some day we might have a proper rotation instead of the thing we have now :) [22:09:04] Amir1, not your fault, we still need to organize shifts properly [22:09:13] you did right IMHO [22:09:16] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2268.codfw.wmnet'] ` and were **ALL** s... [22:09:28] yeah, +1 Amir1 did the right thing here, thank you for letting us know! [22:10:31] cpu load is still unusual [22:10:49] https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?viewPanel=4&orgId=1&var-host=gerrit1001&var-port=9117&from=1610658644911&to=1610662244911 [22:11:05] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2269.codfw.wmnet'] ` and were **ALL** s... [22:11:23] I suggest checking logs for unusual activity [22:11:43] * thcipriani does [22:11:57] SRE, serviceops, Patch-For-Review, Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2270.codfw.wmnet'] ` and were **ALL** s... [22:12:35] thcipriani: do you want to pair on that?
[22:12:37] * cdanis back to cooking dinner [22:12:55] or are you having fun stepping away from your manager role? :) [22:13:02] marxarelli: :D [22:15:59] watching the logs for a bit: nothing out of the ordinary -- lots of folks fetching -- so normal [22:17:13] thcipriani, there is swapping happening since around 20:30 [22:17:21] so it *could* be java [22:17:40] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&refresh=5m&var-server=gerrit1001&var-datasource=thanos&var-cluster=misc&from=1610651856795&to=1610662656796 [22:18:30] maybe java ate the memory and OOM killed apache instead? [22:18:51] I have seen wrong process getting killed by oom before [22:18:55] apache was running...just not well :) [22:19:01] yeah java is maxing out cpu and memoery right now [22:19:09] can see it on htop [22:19:32] consider giving it the same treatment as apache is my blind suggestion [22:19:35] :-) [22:19:46] * jynus logging off [22:20:10] :) [22:20:12] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27481/" [puppet] - 10https://gerrit.wikimedia.org/r/656266 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [22:20:16] Amir1: no indication of OOM around that time in syslog afaict [22:20:35] oh okay then [22:21:17] earlier apache children were killed but now everything in apach error log is quiet [22:26:59] (03PS1) 10Wugapodes: Change EnWiki logo for Wikipedia 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) [22:28:25] (03PS1) 10Bstorm: nfs: set default monitors for 10Gb Ethernet [puppet] - 10https://gerrit.wikimedia.org/r/656269 [22:30:11] (03CR) 10Wugapodes: [C: 04-1] "RfC closed different than I guessed so need to swap the images" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) (owner: 10Wugapodes) [22:31:14] also could not find anything valuable in syslog around the time it happened [22:31:31] all i 
see is how users logged in to check .. and then it was restarted [22:32:18] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2268.codfw.wmnet [22:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:30] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2270.codfw.wmnet [22:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:39] @wugapodes: Do we want to do it via deployment or just a common.css change? [22:32:44] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2269.codfw.wmnet [22:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:53] (03CR) 10Jforrester: "Normally we ask for this to be two distinct patches, but I can manage with it as it is; looking good, though yeah the logo needs updating " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) (owner: 10Wugapodes) [22:33:57] @Seddon those I spoke with preferred deployment over CSS change [22:35:08] wugapodes: sounds good. 
Was just going to offer my services if needed for the css change [22:35:10] (03PS2) 10Bstorm: nfs: set default monitors for 10Gb Ethernet [puppet] - 10https://gerrit.wikimedia.org/r/656269 (https://phabricator.wikimedia.org/T218338) [22:38:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2269.codfw.wmnet [22:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2268.codfw.wmnet [22:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2270.codfw.wmnet [22:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:20] (03PS2) 10Bstorm: wikireplicas: add a multiinstance role for the dedicated analytics host [puppet] - 10https://gerrit.wikimedia.org/r/654558 (https://phabricator.wikimedia.org/T269211) [22:39:56] jouncebot: next [22:39:56] In 1 hour(s) and 20 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210115T0000) [22:40:07] (03PS1) 10Ladsgroup: miscweb: Add tests for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) [22:40:48] 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) @Dzahn Added tests ^ please take a look when you have time [22:41:05] (03CR) 10Dzahn: "started compile job #27482" [puppet] - 10https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:43:17] (03PS2) 10Wugapodes: Change EnWiki logo for Wikipedia 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) [22:43:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install 
db11[51-76] - https://phabricator.wikimedia.org/T267043 (10wiki_willy) Thanks @Jclark-ctr, I just sent an email to Dell to figure out what's going on. >>! In T267043#6749435, @Jclark-ctr wrote: > @RobH @Cmjohnson DYV8773 is the... [22:43:36] (03CR) 10Dzahn: "thank you for making these:) testing on deploy1001:" [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [22:44:05] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2236.codfw.wmnet'] ` and were **ALL** s... [22:46:03] (03CR) 10Ladsgroup: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [22:46:39] (03CR) 10Dzahn: "it needs underscores, but then it works:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [22:47:15] Amir1: Wikidata_Query_Service works [22:47:29] [deploy1001:~] $ curl -H "Host: query.wikidata.org" http://miscweb1002.eqiad.wmnet | less [22:47:38] (03PS2) 10Ladsgroup: miscweb: Add tests for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) [22:47:41] sure, amednign [22:47:43] *amending [22:47:45] thanks [22:48:26] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "tested :)" [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [22:48:28] (03CR) 10Ladsgroup: miscweb: Add tests for query.wikidata.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/656270 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [22:50:54] (03PS3) 10Jforrester: Change EnWiki logo for Wikipedia 20 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) (owner: 10Wugapodes) [22:52:25] Amir1: thanks a lot. it works also after deployment of the full test_miscweb.yaml [22:52:34] \o/ [22:52:37] except wikiworkshop.org tests are broken [22:52:43] unrelatedly [22:53:48] (03CR) 10Urbanecm: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) (owner: 10Wugapodes) [22:54:16] (lesson: do not use year strings like '2020' to test stuff and expect it to stay the same in 2021) [22:55:47] (03PS2) 10Bstorm: toolforge k8s: upgrade docker and containerd [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) [22:55:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2236.codfw.wmnet [22:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:03] (03PS3) 10Bstorm: toolforge k8s: upgrade docker and containerd [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) [22:57:59] Anyone upset if I do a logo change deploy right now? It's the 20th Birthday for enwiki for most of the world's population by now. :-) [22:58:37] not me, I am glad to see it happen after talking to wugapodes about it earlier today [22:58:41] James_F: fire away [22:58:55] mutante: ahaha oops [22:59:17] I'll stick around for a while in case you need me for anything [22:59:22] I am for "fire away" as well [22:59:30] (03CR) 10Jforrester: [C: 03+2] "Happy 20th, Wikipedia." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) (owner: 10Wugapodes) [22:59:55] Thu Jan 14 22:59:47 UTC 2021 [23:00:06] (03PS4) 10Bstorm: toolforge k8s: upgrade docker and containerd [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) [23:00:09] one hour early but close :) [23:00:22] * James_F grins. 
[23:00:42] if you have a minute afterwards, i have a beta cluster config patch i'd love to get deployed James_F [23:00:44] oh, look, guys [23:00:50] https://20.wikipedia.org [23:00:57] (03Merged) 10jenkins-bot: Change EnWiki logo for Wikipedia 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656268 (https://phabricator.wikimedia.org/T272094) (owner: 10Wugapodes) [23:01:03] (scheduled it for the backport window but i wouldn't mind getting it does an hour early) [23:01:03] comms also did the "go live" [23:01:12] MatmaRex: Sure. [23:01:23] https://gerrit.wikimedia.org/r/c/655966/ [23:01:35] (03CR) 10Bstorm: "If you think we should keep 1.16 around, we can. It just seems like something we don't really want to have installed anywhere anymore." [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [23:01:47] thanks James_F, was just going to +2 that myself :). Happy birthday, Wikipedia. [23:01:52] Neat. [23:02:05] !log Happy 20th Birthday Wikipedia - https://20.wikipedia.org - https://gerrit.wikimedia.org/r/656268 [23:02:05] (03CR) 10Jforrester: [C: 03+2] Enable DiscussionTools' newtopictool as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) (owner: 10Bartosz Dziewoński) [23:02:07] 🎂 🎉 [23:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:18] HBD! \o/ [23:02:20] mutante: Aww. Nice. [23:02:30] Thanks everyone for the help. It seems your friendly reputation is well earned :) [23:02:36] James_F: it's Twitter after all, heh [23:03:22] (03Merged) 10jenkins-bot: Enable DiscussionTools' newtopictool as beta feature on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655966 (https://phabricator.wikimedia.org/T267595) (owner: 10Bartosz Dziewoński) [23:03:36] 1 Billionth Edit [23:03:51] Oh oops. [23:04:04] wugapodes: You needed to point the 1.5x things too. I'll fix. 
:-) [23:05:05] 1 Billionth Edit within the same 24 hours of the Birthday. and people's guesses when that would happen reached from 2005 to the year 4001. lol [23:05:37] Oh rough I didn't see those; it may also be the same case for 2x in that case [23:05:47] https://en.wikipedia.org/wiki/Wikipedia%3ABillionth_edit_pool [23:05:51] Also, can I recommend adding margin-top: 1em; [23:05:58] https://en.wikipedia.org/wiki/Wikipedia:1.5-billionth_edit_pool [23:06:10] to #p-logo [23:07:06] @James_F ^^^ [23:07:23] Seddon: You can recommend it. But not to me. [23:07:25] !log jforrester@deploy1001 Synchronized static/images/project-logos/enwiki20.png: T272094 Sync out logo before going live, 1/3 (duration: 01m 02s) [23:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:29] T272094: Change EnWiki logo for Wikipedia's 20th anniversary - https://phabricator.wikimedia.org/T272094 [23:07:42] to wugapodes? [23:08:19] No I think that's a common.css change that an intadmin would need to make [23:08:27] James_F: (thanks for merging, beta cluster config gets deployed automatically, right?) [23:08:33] (PS1) Jforrester: [enwiki] Also point 2x and 1.5x logos to birthday celebration [mediawiki-config] - https://gerrit.wikimedia.org/r/656274 (https://phabricator.wikimedia.org/T272094) [23:08:34] MatmaRex: yup :) [23:08:37] MatmaRex: Yeah, should be all set later.
[23:08:39] wugapodes: I can do that via common.css [23:08:50] or rather a MW core change if it's not celebration-related Seddon [23:09:08] (03CR) 10Jforrester: [C: 03+2] [enwiki] Also point 2x and 1.5x logos to birthday celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656274 (https://phabricator.wikimedia.org/T272094) (owner: 10Jforrester) [23:09:32] Urbanecm: it's celebration related, with the new logo it seems like it needs the extra margin (based on what I'm seeing on mwdebug [23:09:34] Seddon: then you'd know better than me [23:09:51] !log jforrester@deploy1001 Synchronized static/images/project-logos/enwiki20-1.5x.png: T272094 Sync out logo before going live, 2/3 (duration: 00m 55s) [23:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:11] Urbanecm: https://usercontent.irccloud-cdn.com/file/ry309qxk/image.png [23:10:27] (03Merged) 10jenkins-bot: [enwiki] Also point 2x and 1.5x logos to birthday celebration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656274 (https://phabricator.wikimedia.org/T272094) (owner: 10Jforrester) [23:10:39] Seddon: then go with common.css change i think [23:10:51] Urbanecm: will do [23:10:57] !log jforrester@deploy1001 Synchronized static/images/project-logos/enwiki20-2x.png: T272094 Sync out logo before going live, 3/3 (duration: 00m 55s) [23:11:01] (03PS1) 10Dzahn: httpbb: update tests for wikiworkshop [puppet] - 10https://gerrit.wikimedia.org/r/656275 [23:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:24] I'll make the change when the new logo goes out [23:11:31] cool :) [23:11:43] Seddon: Live on mwdebug1002 if you want to change it now. [23:12:00] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "tested on deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/656275 (owner: 10Dzahn) [23:12:33] OK, syncing. 
[23:13:36] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T272094 Change enwiki logo to 20th Birthday Celebration one (duration: 00m 56s) [23:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:39] T272094: Change EnWiki logo for Wikipedia's 20th anniversary - https://phabricator.wikimedia.org/T272094 [23:14:41] And live. [23:14:53] \o/ [23:16:51] The 20.wp page mentions the "first edit" but does not link to it.. let's find it [23:17:15] unfortunatly can't upload that as fix in Gerrit though [23:18:50] oh there's a link at WP:OLDEST and its at HomePage [23:19:01] does anyone know how to sort RC by oldest first?:) [23:19:04] nowadays [23:19:16] ah, looking [23:19:18] mutante: Ho ho. [23:19:30] https://en.wikipedia.org/w/index.php?title=HomePage&oldid=908493298 [23:20:10] ooh.. right, "first edit" ---> "Earliest surviving edits" :p [23:20:45] wugapodes: thanks, and lol @ office.bomis.com [23:21:01] rev_id=908493298 is the lowest rev_timestamp (20010115192713) in the wiki replica dbs [23:21:35] ah, that matches this one. cool [23:21:41] Seddon: Did you want help editing enwiki's MediaWiki:Common.css ? [23:22:13] and why is it "WikiPedia"? because CamelCase made it a link. it was before [[ ]] [23:22:21] Yup. [23:22:35] Hence our oldest edit to [[W]] being to WwW or whatever. [23:22:50] https://en.wikipedia.org/wiki/User:Office.bomis.com has more, too [23:23:09] Seddon: Never mind, thanks. [23:23:18] https://en.wikipedia.org/wiki/Bomis [23:23:41] James_F: Yeah it's done and its showing up. I only specified vector, should I add monobook as well? [23:23:46] rzl: ^ "Bomis Babes" [23:24:00] Seddon: No-one cares about Monobook; do as you wish. ;-) [23:24:18] Wait I care about monobook [23:24:23] This is the fourth oldest edit in the rev table and a good one -- https://en.wikipedia.org/w/index.php?oldid=999508058 -- "Who knows where it will go?" 
[23:25:12] "this username would likely now be in violation of Wikipedia's modern username guidelines. [23:26:14] wugapodes: Are you seeing the WP20 logo on monobook? And do you want a touch of padding on the top of the image so that it's not flush with the browser [23:27:51] Seddon: I'm seeing it, though the navigation toolbar cuts off the bottom a tad [23:29:23] https://nostalgia.wikipedia.org/w/index.php?title=HomePage&action=history [23:30:23] that's the right skin for the birthday, hehe [23:31:19] https://nostalgia.wikipedia.org/wiki/Wikipedia_Announcements [23:32:16] December 1, 2001 - "why isn't our chief administrator working today" [23:34:59] (CR) Dzahn: "[deploy1001:~] $ httpbb --hosts miscweb1002.eqiad.wmnet,miscweb2002.codfw.wmnet /srv/deployment/httpbb-tests/miscweb/test_miscweb.yaml" [puppet] - https://gerrit.wikimedia.org/r/656275 (owner: Dzahn) [23:35:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2236.codfw.wmnet [23:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:35] (CR) Dzahn: "still running - https://puppet-compiler.wmflabs.org/compiler1002/27482/ much better but still an issue with "domain_search" ?"
[puppet] - https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [23:52:40] PROBLEM - PHP opcache health on mw2269 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:53:16] SRE, ops-codfw: Degraded RAID on ms-be2022 - https://phabricator.wikimedia.org/T272025 (Papaul) BBU replaced [23:53:38] SRE, ops-codfw: Degraded RAID on ms-be2022 - https://phabricator.wikimedia.org/T272025 (Papaul) Open→Resolved a:Papaul [23:55:20] PROBLEM - PHP opcache health on mw2270 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:59:16] PROBLEM - PHP opcache health on mw2236 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health