[00:05:28] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10AikoChou) Hi @CDanis, Could you double check that I have LDAP access? because I'm not able to access the notebooks. I'm able to access the server... [00:05:54] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10AikoChou) 05Resolved→03Open [01:37:16] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [01:58:06] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [03:22:16] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:24:50] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.552 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:33:34] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:48:40] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 4167 mails in exim queue. 
https://wikitech.wikimedia.org/wiki/Exim [06:11:30] (03PS1) 10Marostegui: instances.yaml: Remove db1094 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/662138 (https://phabricator.wikimedia.org/T273710) [06:12:02] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1094 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/662138 (https://phabricator.wikimedia.org/T273710) (owner: 10Marostegui) [06:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1094 from dbctl T273710', diff saved to https://phabricator.wikimedia.org/P14228 and previous config saved to /var/cache/conftool/dbconfig/20210208-061319-marostegui.json [06:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:29] T273710: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 [06:18:16] (03PS1) 10Marostegui: mariadb: Add new codfw databases to insetup [puppet] - 10https://gerrit.wikimedia.org/r/662139 (https://phabricator.wikimedia.org/T273568) [06:19:10] (03CR) 10Marostegui: [C: 03+2] mariadb: Add new codfw databases to insetup [puppet] - 10https://gerrit.wikimedia.org/r/662139 (https://phabricator.wikimedia.org/T273568) (owner: 10Marostegui) [06:19:48] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Marostegui) These hosts have been added to puppet with: `insetup` role and also assigned a partman recipe for the installation. The only puppet change need... 
[06:36:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: Remove mc1024 from config [puppet] - 10https://gerrit.wikimedia.org/r/661740 (https://phabricator.wikimedia.org/T272078) (owner: 10Effie Mouzeli) [06:45:07] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: Remove mc1024 from config [puppet] - 10https://gerrit.wikimedia.org/r/661740 (https://phabricator.wikimedia.org/T272078) (owner: 10Effie Mouzeli) [06:50:56] !log Removed mc1024 from mcrouter, some resharding is expected [06:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1111 T273982', diff saved to https://phabricator.wikimedia.org/P14229 and previous config saved to /var/cache/conftool/dbconfig/20210208-070858-marostegui.json [07:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:04] T273982: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 [07:09:56] (03PS1) 10Marostegui: db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/662140 (https://phabricator.wikimedia.org/T273982) [07:28:40] 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10ayounsi) 05Open→03Resolved Looks good now, thanks! 
[07:39:05] (03PS16) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [07:45:33] (03PS1) 10Ayounsi: Revert "Add prepending to esams/knams transits" [homer/public] - 10https://gerrit.wikimedia.org/r/662064 [07:47:58] (03CR) 10Ayounsi: [C: 03+2] Revert "Add prepending to esams/knams transits" [homer/public] - 10https://gerrit.wikimedia.org/r/662064 (owner: 10Ayounsi) [07:48:36] (03Merged) 10jenkins-bot: Revert "Add prepending to esams/knams transits" [homer/public] - 10https://gerrit.wikimedia.org/r/662064 (owner: 10Ayounsi) [08:08:38] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:11:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:16:17] (03CR) 10Effie Mouzeli: [C: 04-1] "Since we are moving forward with the reimaging, I do not think it is necessary to have any stretch servers, or the extra overhead of addin" [puppet] - 10https://gerrit.wikimedia.org/r/662037 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [08:29:15] (03CR) 10Marostegui: [C: 03+2] db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/662140 (https://phabricator.wikimedia.org/T273982) (owner: 10Marostegui) [08:33:27] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [08:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:33] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 
[08:35:02] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:35:08] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:35:26] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:36:22] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:36:28] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:38:50] (03PS1) 10Elukey: aqs: allow deployment via git [puppet] - 10https://gerrit.wikimedia.org/r/662625 [08:39:46] the above alarms seems to be related to the GTT links, maintenance? [08:40:24] seems so yes, scheduled maintenance in Dallas for GTT [08:40:28] (03CR) 10jerkins-bot: [V: 04-1] aqs: allow deployment via git [puppet] - 10https://gerrit.wikimedia.org/r/662625 (owner: 10Elukey) [08:41:23] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: update w3creportingapi to use 12 weekly indexes [puppet] - 10https://gerrit.wikimedia.org/r/661993 (https://phabricator.wikimedia.org/T274005) (owner: 10Cwhite) [08:41:44] (03PS2) 10Elukey: aqs: allow deployment via git [puppet] - 10https://gerrit.wikimedia.org/r/662625 [08:41:52] (03PS17) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [08:43:36] 10SRE, 10Beta-Cluster-Infrastructure, 10SRE-swift-storage: Beta cluster Swift backend instances are missing profile::swift::storage::rsync_limit_memory_percent (puppet fails) - https://phabricator.wikimedia.org/T274092 (10hashar) [08:44:54] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27894/console" [puppet] - 10https://gerrit.wikimedia.org/r/662625 (owner: 10Elukey) [08:46:14] (03PS3) 10Elukey: aqs: allow deployment via git [puppet] - 10https://gerrit.wikimedia.org/r/662625 [08:46:54] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:47:00] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:18] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:41] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27895/console" [puppet] - 10https://gerrit.wikimedia.org/r/662625 (owner: 10Elukey) [08:48:16] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:48:18] (03CR) 10Elukey: [V: 03+1 C: 03+2] aqs: allow deployment via git [puppet] - 10https://gerrit.wikimedia.org/r/662625 (owner: 10Elukey) [08:48:22] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:50:00] (03CR) 10Majavah: [C: 04-1] mediawiki: use a data structure to define all virtualhosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [08:51:03] (03PS18) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [08:51:35] (03CR) 10Volans: "Reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659295 (owner: 10David Caro) [08:51:41] (03CR) 
10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/662117 (owner: 10Elukey) [08:51:45] (03CR) 10Volans: [C: 04-1] "Small typo inline, LGTM otherwise." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/662119 (owner: 10Elukey) [08:51:49] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/662118 (owner: 10Elukey) [08:55:35] (03PS1) 10Marostegui: mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) [08:55:42] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, overall LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [08:56:37] !log push pfw policies T273989 [08:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:33] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10ayounsi) [08:59:14] (03CR) 10Volans: "Thanks for the patch! 
As we're in the process of introducing black to spicerack I'll take care of rebasing it and merging once that is don" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [08:59:44] (03CR) 10Marostegui: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [09:00:09] !log depool and restart blazegraph on wdqs1005 / wdqs1012 [09:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:40] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.069 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:04:38] (03CR) 10jerkins-bot: [V: 04-1] mysql_legacy.py: Add x2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [09:05:02] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Just cherry-picked in beta and it seems to work flawlessly." 
[puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [09:05:15] (03CR) 10Filippo Giunchedi: [C: 04-1] Alert if kafka max replica lag is steadily increasing (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662005 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [09:05:28] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:08:28] (03PS10) 10David Caro: remote: allow prepending every command with sudo [software/spicerack] - 10https://gerrit.wikimedia.org/r/659009 (https://phabricator.wikimedia.org/T267412) [09:11:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM \o/" [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [09:11:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "This will need the LVS-side of things deconfigured as well (at minimum to set the checks non-paging if not already)" [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [09:18:18] (03CR) 10David Caro: "I think that it's missing updating the tests though ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/662631 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [09:20:00] !log rolling restart of LVS instances to catch up on kernel upgrades [09:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:52] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4007.ulsfo.wmnet [09:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: some fixes to deploy-apache-change [puppet] - 10https://gerrit.wikimedia.org/r/660757 (owner: 10Giuseppe Lavagetto) [09:23:09] !log restart varnish-fe on cp1087 [09:23:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:44] RECOVERY - Varnish frontend child restarted on cp1087 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops [09:32:02] (03PS1) 10Elukey: cassandra::metrics: extend ensure to scap target [puppet] - 10https://gerrit.wikimedia.org/r/662634 (https://phabricator.wikimedia.org/T186567) [09:36:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4007.ulsfo.wmnet [09:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:06] PROBLEM - Check systemd state on lvs4007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:42:00] !log Stop MySQL on db1111 T273982 [09:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:07] T273982: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 [09:43:36] !log failover pfw3-eqiad RG1 to node 0 T263833 [09:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:37] (03PS2) 10Jcrespo: Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) [09:44:59] 10SRE: Integrate Buster 10.8s point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [09:45:05] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Marostegui) @Cmjohnson db1111 is now off. You can proceed whenever you like. 
[09:45:10] 10SRE: Integrate Buster 10.8s point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:46:28] hashar: o/ when you have a moment I'd need some help with https://integration.wikimedia.org/ci/job/generic-node10-browser-docker/1854/console :( [09:47:21] (03PS19) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [09:47:38] (03CR) 10Giuseppe Lavagetto: mediawiki: use a data structure to define all virtualhosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [09:47:56] (03CR) 10Jcrespo: [C: 03+2] Bacula: Create a new set of storage daemons dedicated to db ES backups [puppet] - 10https://gerrit.wikimedia.org/r/661396 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [09:48:06] elukey: hi ;) [09:48:29] elukey: npm ERR! 
/usr/bin/git ls-remote -h -t ssh://git@github.com/wikimedia/wikimedia-ui-base.git [09:48:33] who knows really :] [09:48:50] there is no ssh client in the Docker image [09:49:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: use a data structure to define all virtualhosts [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [09:49:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27896/console" [puppet] - 10https://gerrit.wikimedia.org/r/662634 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [09:49:48] 10SRE, 10Traffic: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (10Vgutierrez) [09:49:55] hashar: but it used to work iirc, I am not sure if anything changed, we really didn't change the docker image (not even sure how to modify it) [09:50:44] elukey: then it is due to one of the dependencies maybe [09:50:46] 10SRE, 10Traffic: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (10Vgutierrez) p:05Triage→03Unbreak! [09:51:08] <_joe_> jynus: can I merge your patch? [09:51:43] yes! [09:52:01] elukey: and the huge change to package-lock.json looks really suspicious given the package.json just has a version bump [09:52:41] <_joe_> jynus: done [09:52:45] thanks [09:53:29] hashar: makes sense yes, will keep checking, thanks for the hint :) [09:54:33] elukey: seems like whatever generated the package-lock.json promoted lockfileVersion from "1" to "2" which brings a lot and lot of new lines. Some of them do refer to git+ssh:// for the url of wikimedia/wikimedia-ui-base [09:54:48] so I guess you can lookup whether the lock file can be generated with version 1 if that is at all possible [09:55:05] and possibly the upstream dependency wikimedia/wikimedia-ui-base might have changed their canonical url? 
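Note on the package-lock.json exchange above: one way to check hashar's hunch is to read the lock file, report its `lockfileVersion`, and list every string value that starts with `git+ssh://`. The sketch below is illustrative only. The names `scan_lockfile` and `_walk` are hypothetical and nothing here is taken from the CI job itself. It walks the whole JSON tree, so it does not assume where a given lock file version stores its URLs.

```python
import json


def _walk(node, path=""):
    """Yield (path, value) for every string value in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from _walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from _walk(value, f"{path}/{index}")
    elif isinstance(node, str):
        yield path, node


def scan_lockfile(text):
    """Return (lockfileVersion, [(path, url), ...]) for git+ssh:// references.

    `text` is the raw content of a package-lock.json file.
    """
    data = json.loads(text)
    version = data.get("lockfileVersion")  # top-level key in npm lock files
    hits = [(p, v) for p, v in _walk(data) if v.startswith("git+ssh://")]
    return version, hits
```

Running `scan_lockfile` on the lock file content from before and after the version bump would show whether the `git+ssh://` entries only appear once `lockfileVersion` becomes 2, which is the case hashar describes.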
[09:55:41] 10SRE, 10Traffic: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (10Volans) It seems to me that this is related to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/interface/files/interface-rps.py#186 ` >>> a = 'foo %d' >>> b = re.s... [09:57:04] 10SRE, 10Traffic: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (10elukey) https://phabricator.wikimedia.org/T273918 was filed by Effie last week, it also happened for some MW servers. [09:57:36] (03CR) 10DCausse: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [09:59:51] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [09:59:58] (03PS2) 10ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - 10https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [10:00:30] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:03] (03PS1) 10Effie Mouzeli: interface: fix regex in interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) [10:04:03] (03PS2) 10Kormat: WIP: dbutil: Handle IP addresses in resolve() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661957 [10:04:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: fix escaping of % in hiera [puppet] - 10https://gerrit.wikimedia.org/r/662638 [10:04:40] 10SRE, 10Traffic, 10Patch-For-Review: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (10Volans) I did just a quick look but I think that if you use `\\d` it should do the right thing: ` >>> a = 'foo %d bar' >>> b = re.sub('%d', r'(\\d+)', a) >>> c = re.compile(r'^\s*([0-9]+):.*... [10:05:13] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2025.codfw.wmnet [10:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:25] !log updating netboot images to Buster 10.8 T274099 [10:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:29] T274099: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 [10:06:04] (03CR) 10Vgutierrez: [C: 03+1] interface: fix regex in interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:07:02] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27897/console" [puppet] - 10https://gerrit.wikimedia.org/r/662638 (owner: 10Giuseppe Lavagetto) [10:07:54] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:55] (03CR) 10Elukey: "Before merging: is this consistent between Stretch and Buster?" 
[puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:08:00] (03CR) 10Jcrespo: [C: 03+2] Bacula: Start using new storage/pools for es database content backups [puppet] - 10https://gerrit.wikimedia.org/r/659952 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [10:08:32] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 1.556e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:08:36] (03CR) 10Effie Mouzeli: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:12:29] (03PS3) 10ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - 10https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [10:13:09] (03CR) 10Elukey: "https://bugs.python.org/issue34304 give a nice explanation of the issue, it seems that python < 3.7 was more lenient with the need of back" [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:14:15] (03PS2) 10Jcrespo: backups: Enable External Store backups [puppet] - 10https://gerrit.wikimedia.org/r/661691 (https://phabricator.wikimedia.org/T79922) [10:14:17] (03PS1) 10Jcrespo: Bacula: Fix depencency cycle on es backup storage setup [puppet] - 10https://gerrit.wikimedia.org/r/662639 (https://phabricator.wikimedia.org/T79922) [10:15:00] (03PS2) 10Jcrespo: Bacula: Fix depencency cycle on es backup storage setup [puppet] - 10https://gerrit.wikimedia.org/r/662639 (https://phabricator.wikimedia.org/T79922) [10:15:04] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "The pcc output shows that is doing the right thing." 
[puppet] - 10https://gerrit.wikimedia.org/r/662638 (owner: 10Giuseppe Lavagetto) [10:16:22] (03CR) 10jerkins-bot: [V: 04-1] Bacula: Fix depencency cycle on es backup storage setup [puppet] - 10https://gerrit.wikimedia.org/r/662639 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [10:17:51] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::webserver: fix escaping of % in hiera [puppet] - 10https://gerrit.wikimedia.org/r/662638 (owner: 10Giuseppe Lavagetto) [10:18:00] (03PS3) 10Jcrespo: Bacula: Fix dependency cycle on es backup storage setup [puppet] - 10https://gerrit.wikimedia.org/r/662639 (https://phabricator.wikimedia.org/T79922) [10:18:10] (03PS4) 10Jcrespo: Bacula: Fix dependency cycle on es backup storage setup [puppet] - 10https://gerrit.wikimedia.org/r/662639 (https://phabricator.wikimedia.org/T79922) [10:19:57] (03CR) 10Elukey: [C: 03+2] cumin: make some hadoop aliases more granular [puppet] - 10https://gerrit.wikimedia.org/r/662117 (owner: 10Elukey) [10:20:53] (03CR) 10Volans: [C: 03+1] "LGTM! Same solution I mentioned in https://phabricator.wikimedia.org/T274103#6810100" [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:21:28] (03CR) 10Jcrespo: [C: 03+2] Bacula: Fix dependency cycle on es backup storage setup [puppet] - 10https://gerrit.wikimedia.org/r/662639 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [10:22:43] (03PS3) 10Elukey: sre.hadoop: add more hadoop cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/662119 [10:24:45] (03CR) 10Elukey: "Thanks!" 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/662119 (owner: 10Elukey) [10:25:10] (03CR) 10Elukey: [C: 03+2] sre.hadoop.init-hadoop-worker: fix wipe argument [cookbooks] - 10https://gerrit.wikimedia.org/r/662118 (owner: 10Elukey) [10:25:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2025.codfw.wmnet [10:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:56] (03PS1) 10Ayounsi: Add option-82 to prod vlans [homer/public] - 10https://gerrit.wikimedia.org/r/662641 (https://phabricator.wikimedia.org/T269855) [10:27:56] (03Merged) 10jenkins-bot: sre.hadoop.init-hadoop-worker: fix wipe argument [cookbooks] - 10https://gerrit.wikimedia.org/r/662118 (owner: 10Elukey) [10:28:51] (03CR) 10Elukey: [C: 03+2] sre.hadoop: add more hadoop cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/662119 (owner: 10Elukey) [10:31:15] (03PS1) 10Jcrespo: backups: Apply partial rename of jobdefaults without changing devicename [puppet] - 10https://gerrit.wikimedia.org/r/662644 (https://phabricator.wikimedia.org/T79922) [10:31:34] (03Merged) 10jenkins-bot: sre.hadoop: add more hadoop cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/662119 (owner: 10Elukey) [10:33:19] 10SRE, 10netops: Upgrade Fastnetmon to 1.2.0 - https://phabricator.wikimedia.org/T271228 (10ayounsi) [10:34:17] (03PS4) 10ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - 10https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [10:34:27] (03CR) 10Jcrespo: [C: 03+2] backups: Apply partial rename of jobdefaults without changing devicename [puppet] - 10https://gerrit.wikimedia.org/r/662644 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [10:34:40] (03PS2) 10Jcrespo: backups: Apply partial rename of jobdefaults without changing devicename [puppet] - 10https://gerrit.wikimedia.org/r/662644 (https://phabricator.wikimedia.org/T79922) [10:34:45] (03CR) 
10Jcrespo: [V: 03+2 C: 03+2] backups: Apply partial rename of jobdefaults without changing devicename [puppet] - 10https://gerrit.wikimedia.org/r/662644 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [10:41:40] (03CR) 10Effie Mouzeli: "It works on buster, I tested on mc2025, I have not tested on stretch though" [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:46:19] (03PS3) 10Jcrespo: backups: Enable External Store backups [puppet] - 10https://gerrit.wikimedia.org/r/661691 (https://phabricator.wikimedia.org/T79922) [10:46:27] (03CR) 10Jcrespo: [C: 03+2] backups: Enable External Store backups [puppet] - 10https://gerrit.wikimedia.org/r/661691 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [10:46:35] (03CR) 10Elukey: [V: 03+1] "The PCC looks a bit brutal, the scap user deploy-service seems removed, and it is clearly not great since it is used for other things on n" [puppet] - 10https://gerrit.wikimedia.org/r/662634 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [10:48:07] (03PS1) 10Kormat: dbutil: Add addr_split [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662649 [10:49:23] (03CR) 10Elukey: [C: 03+1] "I tested a quick example on an-worker1080 (stretch) and it looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:50:11] (03CR) 10Elukey: [C: 03+1] "Adding Filippo as well since IIRC the swift nodes also use interface-rps" [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:50:16] godog: --^ [10:51:02] (03CR) 10Muehlenhoff: [C: 03+2] Add a new profile to install OpenLDAP client tools in production [puppet] - 10https://gerrit.wikimedia.org/r/661900 (owner: 10Muehlenhoff) [10:51:05] (03PS2) 10Kormat: dbutil: Add addr_split [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662649 [10:52:57] (03CR) 10Effie Mouzeli: [C: 03+2] interface: fix regex in interface-rps.py [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [10:53:57] (03PS1) 10Volans: debmonitor: ensure installation order [puppet] - 10https://gerrit.wikimedia.org/r/662650 [10:54:37] (03CR) 10Volans: "It should fix the cronspam from debmonitor@mw1313" [puppet] - 10https://gerrit.wikimedia.org/r/662650 (owner: 10Volans) [10:54:53] (03PS1) 10Daniel Kinzler: objectcache: Log more info when WANObjectCache async refresh fails [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662065 (https://phabricator.wikimedia.org/T264391) [10:55:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4007.ulsfo.wmnet [10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:17] (03PS1) 10Jcrespo: Fix reference to still not yet renamed pool Databases [puppet] - 10https://gerrit.wikimedia.org/r/662653 (https://phabricator.wikimedia.org/T79922) [10:57:35] (03Abandoned) 10Elukey: cassandra::metrics: extend ensure to scap target [puppet] - 10https://gerrit.wikimedia.org/r/662634 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [10:58:36] (03CR) 10Muehlenhoff: "The patch looks good, but let's 
rebase it on https://gerrit.wikimedia.org/r/c/operations/puppet/+/661189?" [puppet] - 10https://gerrit.wikimedia.org/r/662650 (owner: 10Volans) [10:58:56] RECOVERY - Check systemd state on lvs4007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:07] (03PS2) 10Jcrespo: Bacula: Fix reference to still not yet renamed pool Databases [puppet] - 10https://gerrit.wikimedia.org/r/662653 (https://phabricator.wikimedia.org/T79922) [11:00:20] (03CR) 10Muehlenhoff: "The Swift hosts are still on Stretch and this should only be an issue with Py 3.7 (while Stretch has 3.5). I've rebooted most of the Swift" [puppet] - 10https://gerrit.wikimedia.org/r/662637 (https://phabricator.wikimedia.org/T274103) (owner: 10Effie Mouzeli) [11:00:27] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Bacula: Fix reference to still not yet renamed pool Databases [puppet] - 10https://gerrit.wikimedia.org/r/662653 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [11:00:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4007.ulsfo.wmnet [11:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:45] (03PS2) 10Volans: debmonitor: ensure installation order [puppet] - 10https://gerrit.wikimedia.org/r/662650 [11:02:47] (03PS2) 10Giuseppe Lavagetto: mediawiki: remove the unused site directive [puppet] - 10https://gerrit.wikimedia.org/r/659940 [11:03:00] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/662650 (owner: 10Volans) [11:03:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/662650 (owner: 10Volans) [11:04:39] jouncebot: now [11:04:40] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [11:04:42] jouncebot: next [11:04:42] In 0 hour(s) and 25 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1130) 
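Note on the interface-rps regex fix (T274103) discussed above: the failure comes from how `re.sub` treats backslash escapes in the replacement string. Since Python 3.7, an unknown escape made of a backslash and an ASCII letter, such as `\d`, raises `re.error` in the replacement, while older interpreters let it through. The sketch below reuses the `'foo %d bar'` sample from Volans's comment. The function names are hypothetical and this is not the code from interface-rps.py.

```python
import re


def placeholder_to_regex(template):
    """Turn each '%d' in template into the regex group '(\\d+)'.

    The replacement r"(\\d+)" contains an escaped backslash, which re.sub
    emits as a single backslash, so the result contains the text (\d+).
    """
    return re.sub("%d", r"(\\d+)", template)


def unescaped_replacement_fails(template):
    """Return True if the single-backslash replacement is rejected.

    On Python 3.7 and later, re.sub raises re.error ("bad escape \\d")
    because \\d is not a valid escape in a replacement string.
    """
    try:
        re.sub("%d", r"(\d+)", template)
    except re.error:
        return True
    return False
```

With this, `placeholder_to_regex("foo %d bar")` produces a pattern that `re.match` can apply to a string such as `"foo 42 bar"`. An alternative that sidesteps replacement escapes entirely is plain `str.replace("%d", r"(\d+)")`, since it does no escape processing.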
[11:05:49] Daimona: you around for T71617? [11:06:06] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10Aklapper) Hi @VeronicaThamaini, thanks and welcome! Please see https://phabricator.wikimedia.org/project/profile/1564/ for required info. (Also, could you please also link yo... [11:07:14] Urbanecm: Hey, sure! [11:07:24] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:26] great! [11:07:29] let me prep the stuff then [11:07:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4006.ulsfo.wmnet [11:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:45] 10SRE, 10Traffic, 10Patch-For-Review: interface-rps crashes on lsv4007 - https://phabricator.wikimedia.org/T274103 (10Vgutierrez) 05Open→03Resolved a:03jijiki Thanks @Volans and @jijiki [11:09:03] Sure [11:09:54] ^^ cr3-ulsfo is me [11:11:09] Daimona: pulled onto mwdebug1001. Let's move to -security for the testing part. [11:11:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4006.ulsfo.wmnet [11:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4005.ulsfo.wmnet [11:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:19:14] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10hnowlan) Sounds good, I will have everything ready at 1500. Thanks for the heads-up! 
[11:19:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4005.ulsfo.wmnet [11:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2010.codfw.wmnet [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:33] !log resyncing postgres on maps1001 [11:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:02] !log resyncing postgres on maps1005 [11:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:35] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet [11:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:31] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2019.codfw.wmnet [11:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:34] !log Deploy security patch for T71617 [11:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:43] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database [11:25:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database [11:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:33] 10SRE, 10serviceops, 10User-jijiki: ifup@eno1.service fails on mc* hosts after 4.19.171-2 upgrade - https://phabricator.wikimedia.org/T273918 (10jijiki) 05Open→03Resolved a:03jijiki [11:26:36] Daimona: should be live [11:26:37] 10SRE, 10serviceops, 10User-jijiki: ifup@eno1.service fails on mc* hosts after 4.19.171-2 upgrade - 
https://phabricator.wikimedia.org/T273918 (10jijiki) thanks @elukey ! [11:27:59] 10SRE: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (10elukey) [11:28:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2010.codfw.wmnet [11:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:58] RECOVERY - Check systemd state on mc2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:05] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1130). [11:30:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2019.codfw.wmnet [11:30:09] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2020.codfw.wmnet [11:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:54] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662657 (https://phabricator.wikimedia.org/T128546) [11:35:07] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662657 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:35:40] RECOVERY - Check systemd state on mc2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:56] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662657 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:36:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2009.codfw.wmnet 
[11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:09] (03PS1) 10Ladsgroup: Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662666 (https://phabricator.wikimedia.org/T274091) [11:37:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2020.codfw.wmnet [11:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:31] (03PS1) 10Ladsgroup: Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662667 (https://phabricator.wikimedia.org/T274091) [11:38:00] (03PS1) 10Hnowlan: role::maps: fix MOTD message [puppet] - 10https://gerrit.wikimedia.org/r/662659 [11:38:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:38:48] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:662657| Bumping portals to master (T128546)]] (duration: 01m 07s) [11:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:54] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:39:56] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:662657| Bumping portals to master (T128546)]] (duration: 01m 07s) [11:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:40] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:41:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2009.codfw.wmnet [11:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:06] !log 
vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2008.codfw.wmnet [11:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:58] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:51:20] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:53:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2008.codfw.wmnet [11:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:57] (03PS1) 10Awight: New 2FA device [puppet] - 10https://gerrit.wikimedia.org/r/662661 [11:56:19] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2007.codfw.wmnet [11:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:24] (03PS1) 10Arturo Borrero Gonzalez: toolforge: front proxy: add dhparams file [puppet] - 10https://gerrit.wikimedia.org/r/662662 (https://phabricator.wikimedia.org/T274123) [11:56:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:58:27] (03CR) 10Awight: "I've tested the new key locally so I believe it's safe to cut over immediately, without leaving the old key in place." 
[puppet] - 10https://gerrit.wikimedia.org/r/662661 (owner: 10Awight) [11:59:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: front proxy: add dhparams file [puppet] - 10https://gerrit.wikimedia.org/r/662662 (https://phabricator.wikimedia.org/T274123) (owner: 10Arturo Borrero Gonzalez) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1200). Please do the needful. [12:00:05] dcausse: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:20] dcausse: i guess you'll self-service? [12:00:24] duesen, Amir1: you also scheduled patches, pinging for completeness :) [12:00:29] (looks like jouncebot didn’t refresh in time) [12:00:42] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:00:52] mine can wait for a while [12:00:58] o/ [12:01:00] we also have some security stuff [12:01:20] oh my [12:01:49] (03PS2) 10DCausse: [cirrus] rename ores_articletopics -> weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661383 (https://phabricator.wikimedia.org/T273508) [12:01:49] security or dcausse first? [12:02:14] dcausse first, then duesen as it's UBN, then security? [12:02:21] sounds good [12:02:24] thanks, deploying [12:02:32] well, this has been "UBN" for a week or so [12:02:32] probably +2 duesen already, since it’s a backport and will take a while in CI? 
[12:02:38] (mine will take forever to merge, I +2 them know) [12:02:41] (03CR) 10jerkins-bot: [V: 04-1] Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662667 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [12:02:44] and it's just a logging improvement, so we see what's going on [12:02:48] duesen: which it shouldn't :D [12:02:55] (03CR) 10DCausse: [C: 03+2] [cirrus] rename ores_articletopics -> weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661383 (https://phabricator.wikimedia.org/T273508) (owner: 10DCausse) [12:02:56] we can wait, no rush [12:03:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2007.codfw.wmnet [12:03:28] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] objectcache: Log more info when WANObjectCache async refresh fails [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662065 (https://phabricator.wikimedia.org/T264391) (owner: 10Daniel Kinzler) [12:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:50] so...who runs the B&C? :D [12:03:53] Amir1: yea, well, we really need more levels of UBN. This is "a feature is broken in production". But it's an obscure feature. Nothing terrible happens if it's broken for a week. [12:03:55] (03Merged) 10jenkins-bot: [cirrus] rename ores_articletopics -> weighted_tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661383 (https://phabricator.wikimedia.org/T273508) (owner: 10DCausse) [12:04:13] to whoever that person is: duesen's patch should be probably +2'ed? [12:04:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5003.eqsin.wmnet [12:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:52] duesen: that's called "High" [12:04:57] heh [12:05:00] shall I +2 it then? 
[12:05:20] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:05:23] (03CR) 10Urbanecm: [C: 03+2] objectcache: Log more info when WANObjectCache async refresh fails [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662065 (https://phabricator.wikimedia.org/T264391) (owner: 10Daniel Kinzler) [12:05:26] I just did it [12:05:30] nm :-) [12:05:35] Majavah: "high" means it sits around until the next quaterly planning. "medium" means we never get to it anyway. [12:05:38] we can cancel it in next ~30 minutes anyway if needed [12:05:49] duesen is right...sad, but true [12:05:52] (03CR) 10Ladsgroup: "recheck" [extensions/Wikibase] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662667 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [12:06:14] Urbanecm: did you deploy it? Hey, I was pllannign to get some practice ;) [12:06:16] those are only true if said piece of code has a code steward [12:06:26] otherwise it will take longer [12:06:27] only +2, not even merged yet [12:06:34] Majavah: indeed. [12:06:37] duesen: no, we'll now need for CI to wait to merge it [12:07:01] duesen: if you want to get some practice, works for me - just tell me if you have anything i should help with [12:07:18] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] rename ores_articletopics -> weighted_tags (duration: 01m 07s) [12:07:20] Will do [12:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:26] I'm here as his deploy buddy too :-) [12:07:32] apergos: good good :) [12:07:33] I'm done [12:07:40] apergos and me were wondering if it makes any sense to test this on a debug host first. [12:07:49] i mean, it's doesn't fix any thing, it just adds logging. [12:07:57] Amir1: wanna do security patches now, as we're now blocked on CI? 
[12:08:02] just making sure it oesn't break things, that's worth it [12:08:07] yeah [12:08:13] I mean, yes we read it and are sure, but... [12:08:33] apergos: duesen: if you can test at least something, it's always worth it [12:08:34] if you can reproduce the condition that would send the log message, why not [12:08:44] it's always fun trying to test if a fix works when you don't even know how to reproduce the breakage [12:08:48] Lucas_WMDE nope, we can't, which is why we need more logging [12:08:52] Urbanecm: I wasn't sure whether it was OK to merge the backport before the deployment window. [12:08:52] hm [12:09:05] We don't want the branch to get out of sync with what's deployed, right? [12:09:16] duesen: depends on what "before" is [12:09:26] in this case, one hour. [12:09:41] do you have a rule of thumb? [12:09:44] you shouldn't hit +2 on friday if you're targeting a window on monday [12:09:44] if it's like 30 mins before the window, it's probably okay, since CI takes usually around 30 mins on core anyway [12:09:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5003.eqsin.wmnet [12:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:56] it's usually better to +2 it beforehand like a couple minutes before so jenkins can catch up [12:09:58] yea ok. 
[12:10:18] i need to do this more often, i feel like a noobn [12:10:18] (03CR) 10Ladsgroup: [C: 03+2] Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662667 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [12:10:24] but if you +2 it like hour or two before, it is possible that it'll confuse someone who would want to sync a quick patch out-of-window :) [12:10:43] so my rule of thumb: if jenkins with it average speed can merge it around the start of window, it's ok [12:10:53] (03CR) 10Ladsgroup: [C: 03+2] Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662666 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [12:11:55] duesen: yeah, i totally understand that. We have so many deployers who deploy only...once per year [12:12:15] honestly I don't like that, but...what can i do :) [12:12:18] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:37] (03PS1) 10JMeybohm: md: Run checkarray on a random weekday of each month [puppet] - 10https://gerrit.wikimedia.org/r/662665 (https://phabricator.wikimedia.org/T273953) [12:12:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5002.eqsin.wmnet [12:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:20] until these stuff are getting merged. Shall we do the security fun stuff Urbanecm? 
[12:13:29] you said there are more than one IIRC [12:13:29] Amir1: yup [12:13:37] I already deployed the abusefilter one [12:13:40] it looks like we have enough deployers so I’ll leave you to it and have lunch :) [12:13:49] there are some more AF stuff, but I'll need to CR it first [12:15:25] Amir1: let me know if you want me to help with testing in any way [12:15:28] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:15:42] (03PS2) 10JMeybohm: md: Run checkarray on a random weekday of each month [puppet] - 10https://gerrit.wikimedia.org/r/662665 (https://phabricator.wikimedia.org/T273953) [12:16:01] 10SRE, 10Patch-For-Review: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10JMeybohm) a:03JMeybohm [12:16:09] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) a:03SarthakKundra [12:17:07] Urbanecm: it's not testable. Let's just deploy it [12:17:16] okay, wfm :) [12:17:30] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 57, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:18:42] ok, let me try and do it :) [12:18:50] bear with me, I'm slow [12:19:41] Uh... [12:19:43] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) Hi @Krinkle . I will like to work on this. Can you provide me a bit of insight as to what should I change? As far as I can see the URL is opening [[ ht... 
[12:19:49] On wmf.29, git status gives me this: [12:19:50] modified: extensions/AbuseFilter (new commits) [12:19:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5002.eqsin.wmnet [12:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:38] those don't get deployed out with the scap sync-file you'll do so it's ok [12:20:48] (03PS1) 10Marostegui: mariadb: Decommission db1094 [puppet] - 10https://gerrit.wikimedia.org/r/662687 (https://phabricator.wikimedia.org/T273710) [12:21:46] duesen: it's still not merged? [12:21:53] duesen: that's expected, btw [12:22:00] it USUALLY means there are security patches applied [12:22:01] no security patches? [12:22:11] apergos: Amir1's doing some [12:22:26] no I mean nothing marked [SECURITY] for core [12:22:32] if not, that's simpler! [12:22:39] apergos: you mean, currently applied? [12:22:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [12:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:49] mine is not core [12:22:54] it's an extension [12:22:57] yep [12:23:27] apergos: there are two sec patches currently applied to wmf.29 core [12:23:28] I mean yes, it's an xtension and yes, I meant in core applied and would show up for git status [12:23:47] but it should be just `Your branch is ahead of 'origin/wmf/1.36.0-wmf.29' by 2 commits.`-like message [12:23:51] ah there they are in git log, yeah [12:23:55] above HEAD as expected [12:23:58] it shows in git status for extensions [12:24:09] that's the modified: extensions/AbuseFilter (new commits) thing duesen complained about earlier [12:24:11] duesen: see those in git log? [12:24:43] sorry, see what? 
[12:24:51] when you do git log [12:25:00] (03PS1) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [12:25:04] duesen: if you go to /srv/mediawiki-staging/php-..., and do git log there, you should see two sec patches there [12:25:06] you should see a couple patches on top of HEAD marked [SECURITY] [12:25:12] *marked SECURITY: [12:25:26] this means you'll need to rebase at a certain point. this is standard, nbd [12:25:30] (03CR) 10jerkins-bot: [V: 04-1] interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [12:25:42] apergos: won't he have to rebase either way? [12:25:42] I see one security patch [12:25:46] (03PS1) 10David Caro: ceph.osd: Allow setting the io scheduler of the osd disks [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) [12:25:56] duesen: there are two [12:26:02] duesen: what directory are you in? [12:26:22] /srv/mediawiki-staging/php-1.36.0-wmf.29 [12:26:49] git log for me in that dir shows two patches before HEAD [12:26:55] both tagged SECURITY [12:27:07] or rather [12:27:20] both before origin/wmf/1.36.0-wmf.29 sorry [12:27:33] both before the branch in any case [12:28:01] see 'em? [12:28:33] (03CR) 10Jbond: "PCC (running) https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27900/console" [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [12:28:36] (should be the very first two patches you see) [12:28:48] I still only see one. [12:29:11] for HEAD, anyway. 
[12:29:33] (03Merged) 10jenkins-bot: objectcache: Log more info when WANObjectCache async refresh fails [core] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662065 (https://phabricator.wikimedia.org/T264391) (owner: 10Daniel Kinzler) [12:29:38] ok, let's look at the first three patches [12:29:55] apergos: HEAD should be what'S deployed, right? and origin/wmf/1.36.0-wmf.29 is what is about to be deployed. [12:30:01] (03CR) 10jerkins-bot: [V: 04-1] Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662667 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [12:30:13] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1094 [puppet] - 10https://gerrit.wikimedia.org/r/662687 (https://phabricator.wikimedia.org/T273710) (owner: 10Marostegui) [12:30:34] apergos: sorry, my bad. I cd-ed into the extension dir earlier to see what's going on there [12:30:36] that git log shows you; the first two should have commit message starting with SECURITY and the third one should say "Update git submodules" and if your version of git is nice, it will also [12:30:38] ... [12:30:43] ah :-) [12:30:43] now i was looking at the extension's log. [12:30:50] that would be it! [12:30:56] indeed :) [12:31:12] ok, the patch is merged [12:31:14] fetching [12:31:23] ok so those are deploye but not in the branch, exactly [12:31:46] (03PS2) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [12:32:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [12:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:19] duesen: did amir1 say he is finished? :) [12:32:31] I haven't applied the patch yet [12:32:34] I thought you do it [12:32:34] ah, okay [12:32:38] no! 
[12:32:40] 🤦‍♂️ [12:32:41] i thought you're on it :D [12:32:44] sorry [12:32:48] (03CR) 10Effie Mouzeli: [C: 03+2] memcached: enable unix sockets in memcached [puppet] - 10https://gerrit.wikimedia.org/r/659085 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [12:32:54] I do it then [12:32:59] okay [12:33:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (backup1002, ...), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:33:19] ok, shall we wait for Amir to do his thing first? [12:33:33] although maybe now there is a git fetch running... [12:33:39] yea. Amir1 , go ahead [12:33:59] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (10Marostegui) a:05Marostegui→03wiki_willy Ready for #dc-ops [12:34:32] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [12:34:44] 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (10Marostegui) [12:35:12] apergos: so, how do i reconcile HEAD with the branch? reset, and then cherry-pick the security commits? [12:35:18] no [12:35:36] you'll git rebase to put that stuff on top of yours [12:35:42] after making sure it's only your change that got fetches [12:37:46] apergos: go i need --onto, or should i rely on git to figure out the common ancestor? [12:37:50] syncing [12:38:14] this might cause a little bit of jump in logstash errors as files might not arrive at the same time [12:38:18] that would be 5727b9db27500e611f61c6b146f9af80c2f3d484. That's before the security patches on head, and before my patch on the branch [12:38:26] hi folks, can i still add a patch to the backport window? 
[12:38:34] i usually do `git rebase` and let git figure it out [12:38:40] (i want to backport the fix for https://phabricator.wikimedia.org/T272853) [12:38:43] duesen: try this magic incantation [12:38:45] git log -p HEAD..@{u} [12:38:56] (sorry, `git fetch` to fetch it, then see what got fetched via apergos's command, and then `git rebase`) [12:39:00] Urbanecm: git rebase origin/wmf/1.36.0-wmf.29? [12:39:05] just git rebase [12:39:09] so the backport for chd_seen thingy is failing on wmf.27 [12:39:11] :((( [12:39:33] MatmaRex: create cherry-picks and add them to the calendar please, i think we can make it :) [12:39:41] thanks. doing [12:39:45] a plain git rebase will do the trick, assuming you only see your one patch in that git log I pasted [12:40:11] I have it in aliased my .gitconfig, useful if you don't remember it at all :) [12:40:48] apergos: yea, just the one. Uh, what does @{u} do? [12:40:51] I'm done with the deployment [12:40:58] ok. [12:41:05] i'll go ahead and rebase then [12:41:06] doing all sorts of clean up [12:41:21] once done, there's another set of patches need to go in [12:41:22] it's the upstream branch [12:41:35] (03PS1) 10Bartosz Dziewoński: Move position:relative to inner wrapper [extensions/SyntaxHighlight_GeSHi] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662668 (https://phabricator.wikimedia.org/T272853) [12:41:36] btw this is my handbook https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [12:41:37] so you are showing what's changed between your local head and the upstream [12:41:49] apergos: obviously, why would there be a more readable name for that ;) [12:42:06] ok, i'm seeing the security patches on top of our patch on top of the base patch in the log. [12:42:08] looking good [12:42:12] well {u} is pretty short, who wants to type upstream every time? it's like ls, cp, mv :-P [12:42:18] good1 [12:42:41] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers here's another handbook... 
[12:42:47] (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/SyntaxHighlight_GeSHi] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662668 (https://phabricator.wikimedia.org/T272853) (owner: 10Bartosz Dziewoński) [12:43:02] and another even longer one: https://wikitech.wikimedia.org/wiki/How_to_deploy_code :-D [12:43:10] so many handbooks :) [12:43:30] yep [12:43:38] ok, so scap sync-file is next. [12:43:41] so now duesn you should head on to mwdebug1002 (Lets say) [12:43:42] added https://gerrit.wikimedia.org/r/662668 [12:44:00] and scap pull there and then have te extension in your browser set up to mwdebug1002 to load a few testwiki pages [12:44:02] thanks MatmaRex, +2'ed, will ping you once it's time to test [12:44:18] 1003 is buster so don't use that I'd say for now [12:44:49] the patch need to have the branch dir as a prefix, right? like this? scap sync-file php-1.36.0-wmf.29/includes/libs/objectcache/wancache/WANObjectCache.php [12:44:49] load them as a logged in user so you're not getting varnishe's cache of them [12:45:05] duesen: you need to add a message at the end of the command [12:45:12] you're not sap sync-file yet, you need to just get it onto mwdebug100x first [12:45:41] (and I'll stop talking and let apergos do the talking) [12:45:41] and yes, you'll want a message when you do sync-file, there's a nice format for it but we'll come to that [12:46:08] Urbanecm: it's fine to chime in or especially stop me if i say something wrong please [12:46:40] okay :) [12:46:45] apergos: working on debug1001 [12:46:49] ok! [12:47:00] well, at least test.wikipedia.org didn't explode on debug1001 [12:47:08] that will do :-) [12:47:15] ok now we are on to te sync-file piece [12:47:48] apergos: is the path i posted correct? 
[12:47:50] again, bacula monitoring is me deploying, I will ack alert for 24 hours [12:48:26] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 2 (backup1002, ...), Fresh: 102 jobs Jcrespo backups running - The acknowledgement expires at: 2021-02-09 09:48:04. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:49:16] (03PS1) 10Hnowlan: api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) [12:49:59] anything special i should put into the log message? [12:50:13] gah cat [12:50:15] i was going to just use the patche's commit message [12:50:18] um [12:50:35] so the message (cat is stealing my mouse, my dice, everything, it is getting untenable) [12:51:06] 'Backport: [[gerrit:[GERRIT-NUMBER]|[COMMIT-MESSAGE] ([PHABRICATOR-TASK])]]' [12:51:14] that's the standard format and all the links will just work in SAL [12:51:22] and yeah you're in the right location and the path looks ok to me [12:51:26] (03Abandoned) 10Ladsgroup: Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662667 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [12:51:40] the gerrit number is the uh short change number, not the long I-hash [12:51:58] the pab task should be Txxxxx with the T [12:52:33] ok I have saved the earbuds from the cat for now, she was starting to chew on the wires (!) [12:53:15] apergos: it's just wikitext, yes? [12:53:20] yep! [12:53:20] yeah [12:53:29] scap sync-file php-1.36.0-wmf.29/includes/libs/objectcache/wancache/WANObjectCache.php "Backport: [[gerrit:662065|objectcache: Log more info when WANObjectCache async refresh fails]] ([[phab:T264391]])" [12:53:30] T264391: FeaturedFeedChannel must not contain a User object, since it cannot be serialized safely. - https://phabricator.wikimedia.org/T264391 [12:53:41] like that? 
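(Editor's sketch, not part of the channel traffic.) The log-message format quoted at 12:51:06 can be assembled mechanically. The helper below is a minimal illustration under stated assumptions: the function name `format_backport_message` and its validation rules are hypothetical, and it only builds the message string; it does not call scap or any Wikimedia tooling.

```python
import re


def format_backport_message(change_number: int, subject: str, task: str) -> str:
    """Build a message in the 'Backport: [[gerrit:N|SUBJECT (TASK)]]' format.

    Hypothetical helper: the name and the checks are illustrative only.
    """
    # The Gerrit number is the short numeric change number, not the long I-hash.
    if change_number <= 0:
        raise ValueError("expected the short numeric Gerrit change number")
    # The Phabricator task is written with its leading T, e.g. T264391.
    if not re.fullmatch(r"T\d+", task):
        raise ValueError("Phabricator task must look like T12345")
    return f"Backport: [[gerrit:{change_number}|{subject.strip()} ({task})]]"


if __name__ == "__main__":
    print(
        format_backport_message(
            662065,
            "objectcache: Log more info when WANObjectCache async refresh fails",
            "T264391",
        )
    )
```

The command pasted at 12:53:29 uses a slightly different layout, with the task as a separate [[phab:...]] link after the gerrit link, so treat the exact bracket placement here as one reading of the quoted format rather than a requirement.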
[12:53:52] (03PS2) 10Hnowlan: api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) [12:54:03] that looks ok to me, Urbanecm? [12:54:09] LGTM [12:54:16] syncing [12:55:27] !log daniel@deploy1001 Synchronized php-1.36.0-wmf.29/includes/libs/objectcache/wancache/WANObjectCache.php: Backport: [[gerrit:662065|objectcache: Log more info when WANObjectCache async refresh fails]] ([[phab:T264391]]) (duration: 01m 07s) [12:55:30] woo hoo! [12:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:09] congrats duesen ! [12:56:14] :P [12:56:15] https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:19] *phew* [12:56:22] look at those nice links in the log message! [12:56:33] high five! [12:56:57] duesen: i assume you're done? [12:57:09] apergos, Urbanecm: thanks for holding my hand through this ;) [12:57:15] any time :) [12:57:16] happy to assist! [13:00:25] Amir1: do you still plan to roll out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/662666? [13:00:54] Urbanecm: yes. 
This Satanic patch has not been merged yet [13:01:00] waiting for it [13:01:03] okay, good :) [13:01:18] selenium is taking so long to finish [13:03:10] (03Merged) 10jenkins-bot: Cast chd_seen as signed integer [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662666 (https://phabricator.wikimedia.org/T274091) (owner: 10Ladsgroup) [13:03:13] (03Merged) 10jenkins-bot: Move position:relative to inner wrapper [extensions/SyntaxHighlight_GeSHi] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/662668 (https://phabricator.wikimedia.org/T272853) (owner: 10Bartosz Dziewoński) [13:03:27] 10SRE, 10Packaging: Copy cassandra packages to buster-wikimedia - https://phabricator.wikimedia.org/T274119 (10Peachey88) [13:03:30] finally [13:04:05] finally [13:04:14] Amir1: ping me once done, so i can ship MatmaRex's patch [13:04:18] (or just do it yourself, as you please) [13:06:23] I can do it [13:06:29] (03PS3) 10Jcrespo: install_server: Reenable notifications and disable disk format for db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) [13:06:50] cool :) [13:06:53] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/Wikibase/repo/includes/Store/Sql/SqlChangeDispatchCoordinator.php: [[gerrit:662666|Cast chd_seen as signed integer (duration: 01m 10s) [13:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:13] ping me anyway, I'd like to do the rest of AF patches :) [13:07:32] (03PS4) 10Jcrespo: install_server: Disable disk format for db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) [13:08:44] (03CR) 10Jcrespo: "I believe I edited netboot.cnf, correctly removing db1171, but please double check." 
[puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [13:08:48] (03CR) 10Jcrespo: [C: 03+2] install_server: Disable disk format for db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [13:08:57] (03PS1) 10Kormat: mysql_root_clients: Allow orch access to clouddb [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) [13:09:06] Sure [13:09:09] It's now syncing [13:09:48] !log ladsgroup@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/SyntaxHighlight_GeSHi/modules/pygments.wrapper.less: [[gerrit:662668|Move position:relative to inner wrapper]] (T272853) (duration: 01m 08s) [13:09:49] MatmaRex: it's synced now. [13:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:52] T272853: Syntaxhighlight block can make floated items unclickable - https://phabricator.wikimedia.org/T272853 [13:09:55] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27903/console" [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) (owner: 10Kormat) [13:10:05] thanks Amir1 Urbanecm [13:10:20] Urbanecm: the floor is yours [13:10:27] thank you [13:10:28] (looks good on mw.org) [13:10:32] Daimona: still around? 
[13:10:41] SNOOWWWW [13:10:52] (03PS2) 10Kormat: mysql_root_clients: Allow orch access to clouddb [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) [13:11:56] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27904/console" [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) (owner: 10Kormat) [13:11:59] (03PS1) 10Jbond: interfaces: remove ethtool configueration [puppet] - 10https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) [13:12:04] (03PS1) 10Jbond: numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) [13:13:51] Yep [13:13:55] (03CR) 10Marostegui: [C: 03+1] install_server: Disable disk format for db1171 [puppet] - 10https://gerrit.wikimedia.org/r/661080 (https://phabricator.wikimedia.org/T258361) (owner: 10Jcrespo) [13:13:57] (03CR) 10jerkins-bot: [V: 04-1] numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [13:14:22] great :) [13:14:24] let me prepare it [13:15:34] Daimona: do the patches depend on each other? 
[13:15:49] No, they should be independent [13:16:08] okay, thanks [13:17:16] Daimona: both pulled to mwdebug1001 on both MW versions [13:17:24] please ping me if you need any special permissions to test 'em [13:17:47] Cool [13:18:08] First I need to sync my brain and understand what exactly the bug was about :D [13:18:14] :D [13:18:17] (03PS1) 10Filippo Giunchedi: hieradata: add defaults for profile::swift::storage::replication_limit_memory_percent [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) [13:18:44] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:18:49] (03CR) 10jerkins-bot: [V: 04-1] hieradata: add defaults for profile::swift::storage::replication_limit_memory_percent [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [13:19:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] pbuilder: create apt-cache directory before running pbuilder init (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [13:19:58] (03PS2) 10Jbond: numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) [13:20:00] (03CR) 10Marostegui: [C: 03+1] "dborch only has orchestrator access to the databases, which is limited to orchestrator db" [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) (owner: 10Kormat) [13:20:19] 10SRE, 10MediaWiki-Docker: Create and publish arm64 images of wikimedia-stretch and wikimedia-buster - https://phabricator.wikimedia.org/T274140 (10kostajh) [13:20:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5001.eqsin.wmnet [13:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:51] (03CR) 10jerkins-bot: [V: 04-1] 
numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [13:22:12] (03PS2) 10Filippo Giunchedi: hieradata: add defaults for profile::swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) [13:23:26] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:16] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27907/console" [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [13:25:43] (03CR) 10JMeybohm: ">pcc 662665 're:(ganeti|kube|conf).*wmnet'" [puppet] - 10https://gerrit.wikimedia.org/r/662665 (https://phabricator.wikimedia.org/T273953) (owner: 10JMeybohm) [13:25:46] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 57, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:26:00] 10SRE, 10Beta-Cluster-Infrastructure, 10SRE-swift-storage, 10Patch-For-Review: Beta cluster Swift backend instances are missing profile::swift::storage::rsync_limit_memory_percent (puppet fails) - https://phabricator.wikimedia.org/T274092 (10fgiunchedi) My bad, I introduced the new parameter in the context... 
[13:26:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5001.eqsin.wmnet [13:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:49] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27908/console" [puppet] - 10https://gerrit.wikimedia.org/r/662665 (https://phabricator.wikimedia.org/T273953) (owner: 10JMeybohm) [13:27:13] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) After deployment, the latest operational steps are being done to ensure backups will be generated correctly, after that (and its documentation), we cou... [13:27:18] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] md: Run checkarray on a random weekday of each month [puppet] - 10https://gerrit.wikimedia.org/r/662665 (https://phabricator.wikimedia.org/T273953) (owner: 10JMeybohm) [13:30:22] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) [13:30:29] 10SRE, 10Data-Persistence-Backup, 10Patch-For-Review: Implement logic to be able to perform incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) [13:30:47] 10SRE, 10Data-Persistence-Backup, 10Patch-For-Review: Implement logic to be able to perform incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) [13:30:49] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) [13:30:51] 10SRE, 10Data-Persistence-Backup, 10Patch-For-Review: Implement logic to be able to perform incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) 05Stalled→03Open p:05High→03Low [13:37:23] 10SRE, 
10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10hashar) I have removed the Jenkins jobs for tilerator and kartotherian. They used NodeJS 6 and we were no more able to maintain them after the removal of Je... [13:41:03] hashar: when you get a chance, https://gerrit.wikimedia.org/r/c/operations/puppet/+/662703 [13:54:53] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836 [13:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:59] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [13:59:22] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2021.codfw.wmnet [13:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:47] (03PS3) 10Jbond: numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) [14:02:22] (03PS1) 10Kormat: dbutil: Resolve IPs to hostnames [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662709 [14:02:46] RECOVERY - Check systemd state on mc2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27909/console" [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [14:05:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2021.codfw.wmnet [14:05:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2022.codfw.wmnet [14:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:07] (03Abandoned) 10Kormat: WIP: dbutil: Handle IP addresses in resolve() 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661957 (owner: 10Kormat) [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:31] (03PS2) 10Kormat: dbutil: Resolve IPs to hostnames [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662709 [14:07:19] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Ottomata) @Cdanis `nda` LDAP is also needed for Jupyter access. Pretty much all users should get LDAP access if they are getting any access at all. [14:07:58] !log Deploy security patch for T223654 [14:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:42] RECOVERY - Check systemd state on mc2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:53] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/662641 (https://phabricator.wikimedia.org/T269855) (owner: 10Ayounsi) [14:10:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2022.codfw.wmnet [14:10:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2023.codfw.wmnet [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:02] (03PS1) 10Lucas Werkmeister (WMDE): Fix Travis CI build on release branches [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662669 [14:12:25] (03CR) 10Volans: [C: 03+1] "Seems reasonable to me" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661925 (owner: 10Kormat) [14:13:21] (03CR) 10Kormat: [C: 03+2] tox/unit: Allow unit tests to be indepdenent of env vars [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661925 (owner: 10Kormat) [14:13:30] RECOVERY - 
Check systemd state on mc2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:38] (03Merged) 10jenkins-bot: tox/unit: Allow unit tests to be indepdenent of env vars [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/661925 (owner: 10Kormat) [14:17:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2023.codfw.wmnet [14:17:02] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [14:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:58] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:20:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Not really equipped to review envoy configs, but this seems clean enough to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [14:21:14] 10SRE, 10User-MoritzMuehlenhoff: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10MoritzMuehlenhoff) [14:21:50] (03CR) 10Ottomata: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [14:22:07] (03CR) 10Kormat: [C: 03+2] dbutil: Add addr_split [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662649 (owner: 10Kormat) [14:24:23] (03Merged) 10jenkins-bot: dbutil: Add addr_split [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662649 (owner: 10Kormat) [14:25:03] (03CR) 10Volans: [C: 04-1] "Thanks for the effort!" 
(037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [14:25:14] (03CR) 10Kormat: [C: 03+2] dbutil: Resolve IPs to hostnames [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662709 (owner: 10Kormat) [14:26:49] (03PS1) 10Jgreen: switch frdata.wm.o cname to point to frdata-eqiad.wm.o [dns] - 10https://gerrit.wikimedia.org/r/662713 (https://phabricator.wikimedia.org/T255435) [14:27:05] (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [14:27:32] (03Merged) 10jenkins-bot: dbutil: Resolve IPs to hostnames [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/662709 (owner: 10Kormat) [14:29:14] (03CR) 10BBlack: interface: update rps script to also set the number of queues via ethtool (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [14:30:44] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10CDanis) 05Open→03Resolved >>! In T273602#6811137, @Ottomata wrote: > @Cdanis `nda` LDAP is also needed for Jupyter access. Pretty much all users... [14:30:54] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1024_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:31:06] I’d like to do a harmless backport (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/662669), any objections? 
[14:31:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2024.codfw.wmnet [14:31:38] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2026.codfw.wmnet [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:28] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Ottomata) Yeha it is not clear. I just updated https://wikitech.wikimedia.org/wiki/Analytics/Data_access with recommendations to also ask for that. [14:34:19] 10SRE, 10Research, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 (10Miriam) @Ottomata @CDanis thanks both! [14:34:32] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database [14:34:32] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database [14:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Deploying this now." 
[extensions/Wikibase] (wmf/1.36.0-wmf.29) - 10https://gerrit.wikimedia.org/r/662669 (owner: 10Lucas Werkmeister (WMDE)) [14:35:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: remove the unused site directive [puppet] - 10https://gerrit.wikimedia.org/r/659940 (owner: 10Giuseppe Lavagetto) [14:37:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2026.codfw.wmnet [14:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:10] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10Iniquity) >>! In T242500#6810706, @SarthakKundra wrote: > Hi @Krinkle . I will like to work on this. Can you provide me a bit of insight as to what should I change? A... [14:41:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [14:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:16] (03PS1) 10Ottomata: Add an MW local envoy listener for eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/662716 (https://phabricator.wikimedia.org/T272998) [14:43:27] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10Aklapper) Hi, could you please [elaborate what is specifically unclear](https://www.mediawiki.org/wiki/New_Developers/Communication_tips) with the task description? T... 
[14:46:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [14:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:03] (03CR) 10Jgreen: [C: 03+2] switch frdata.wm.o cname to point to frdata-eqiad.wm.o [dns] - 10https://gerrit.wikimedia.org/r/662713 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [14:48:11] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10VeronicaThamaini) Hello, I just linked the MediaWiki account. Let me know what the next steps might be. Thanks! [14:48:47] (03PS1) 10Ottomata: Add eventgate-analytics-external to (Production|Labs)Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) [14:49:16] (03PS2) 10Ottomata: Add eventgate-analytics-external to (Production|Labs)Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) [14:49:56] (03PS2) 10DCausse: [wdqs] Add flink sideoutput stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) [14:50:30] (03PS3) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [14:50:30] !log stopped ES on logstash1020 in prep for re-rack T273984 [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:35] T273984: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 [14:50:55] (03CR) 10DCausse: [wdqs] Add flink sideoutput stream definitions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) (owner: 10DCausse) [14:51:06] (03CR) 10jerkins-bot: [V: 04-1] interface: update rps script to also set the number of queues via 
ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [14:52:27] herron: <3 [14:53:33] (03PS3) 10DCausse: [wdqs] Add flink sideoutput stream definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661727 (https://phabricator.wikimedia.org/T269619) [14:55:03] (03CR) 10Jbond: "Thanks updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [14:55:32] (03PS1) 10Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) [14:56:07] (03PS4) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [14:57:02] (03PS3) 10Giuseppe Lavagetto: kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) [15:00:02] (03CR) 10Alexandros Kosiaris: "Aside from a minor comment, I like this. 
I 'd like to see some test before +1ing as we haven't yet deployed such a resource in our cluster" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [15:01:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ldap-replica2003.wikimedia.org [15:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:06] (03PS1) 10Hashar: ci: remove docker-pkg seed_image [puppet] - 10https://gerrit.wikimedia.org/r/662722 [15:04:09] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on maps1001.eqiad.wmnet with reason: Server being relocated [15:04:10] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on maps1001.eqiad.wmnet with reason: Server being relocated [15:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1001.eqiad.wmnet [15:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2003.wikimedia.org [15:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:19] (03CR) 10Hashar: "The docker-pkg failure is due to https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-pkg/+/655411" [puppet] - 10https://gerrit.wikimedia.org/r/662722 (owner: 10Hashar) [15:09:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ldap-replica2004.wikimedia.org [15:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:39] (03Merged) 10jenkins-bot: Fix Travis CI build on release branches [extensions/Wikibase] (wmf/1.36.0-wmf.29) - 
10https://gerrit.wikimedia.org/r/662669 (owner: 10Lucas Werkmeister (WMDE)) [15:09:49] alright, deploying ^ (should be a no-op) [15:10:26] briefly testing on mwdebug1001 [15:11:47] !log set kafka topic retention to 31 days for (eqiad|codfw.rdf-streaming-updater.mutation) in kafka main-eqiad and main-codfw - T269619 [15:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:53] T269619: Create pipelines for late/spurious/failed events - https://phabricator.wikimedia.org/T269619 [15:12:22] syncing… [15:13:23] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.29/extensions/Wikibase/build/travis/install.sh: Backport: [[gerrit:662669|Fix Travis CI build on release branches]] (prod no-op, syncing only to avoid drift) (duration: 01m 08s) [15:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2004.wikimedia.org [15:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:50] 10SRE, 10UploadWizard: Uploading via UploadWizard gets stuck for a 11 MB JPG - https://phabricator.wikimedia.org/T274150 (10Urbanecm) [15:16:02] (03PS2) 10Alexandros Kosiaris: ores: Switch from oresrdb.svc to host names [puppet] - 10https://gerrit.wikimedia.org/r/655426 (https://phabricator.wikimedia.org/T270071) [15:16:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ldap-replica1001.wikimedia.org [15:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] ores: Switch from oresrdb.svc to host names [puppet] - 10https://gerrit.wikimedia.org/r/655426 (https://phabricator.wikimedia.org/T270071) (owner: 10Alexandros Kosiaris) [15:16:48] PROBLEM - Host db1111.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:17:31] (03CR) 10Volans: [C: 03+1] "LGTM for the needs of the related task. 
Thanks a lot for adding it!" [homer/public] - 10https://gerrit.wikimedia.org/r/662641 (https://phabricator.wikimedia.org/T269855) (owner: 10Ayounsi) [15:17:46] 10SRE, 10UploadWizard: Uploading via UploadWizard gets stuck for a 11 MB JPG - https://phabricator.wikimedia.org/T274150 (10Urbanecm) I discussed this briefly with @Joe, who tried to convince me this actually never worked, see T97539. I'm still pretty sure this used to work before. I uploaded https://commons.... [15:17:49] (03CR) 10David Caro: "> Patch Set 5: Code-Review-1" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [15:18:11] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1001.wikimedia.org [15:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:04] 10SRE, 10UploadWizard: Uploading via UploadWizard gets stuck for a 11 MB JPG - https://phabricator.wikimedia.org/T274150 (10Urbanecm) [15:19:58] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ldap-replica1002.wikimedia.org [15:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:13] 10SRE, 10Commons, 10UploadWizard: Uploading via UploadWizard gets stuck for a 11 MB JPG - https://phabricator.wikimedia.org/T274150 (10AntiCompositeNumber) [15:23:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1002.wikimedia.org [15:23:20] RECOVERY - Host db1111.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [15:23:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:14] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [15:26:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host krb2001.codfw.wmnet [15:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:38] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:49] hnowlan I am ready to move maps1001, let me know when it's okay [15:30:00] herron Is it okay to move logstash1020 now? [15:30:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2001.codfw.wmnet [15:30:22] cmjohnson1: yup! ready to go [15:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:29] thx [15:30:50] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 71780280 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:32:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:54] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [15:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:44] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 688616 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:36:40] PROBLEM - Host logstash1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:36:42] cmjohnson1: sorry, go for it! 
[15:37:02] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on maps1001.eqiad.wmnet with reason: Server being relocated [15:37:02] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on maps1001.eqiad.wmnet with reason: Server being relocated [15:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:28] jouncebot: now [15:37:28] No deployments scheduled for the next 2 hour(s) and 22 minute(s) [15:37:30] jouncebot: next [15:37:31] In 2 hour(s) and 22 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1800) [15:37:36] * Urbanecm does sec deploys now [15:37:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [15:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:07] (03CR) 10Bstorm: pbuilder: create apt-cache directory before running pbuilder init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [15:41:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci: remove docker-pkg seed_image [puppet] - 10https://gerrit.wikimedia.org/r/662722 (owner: 10Hashar) [15:49:31] !log repool wdqs1012 - catched up on lag [15:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:27] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Cmjohnson) 05Open→03Resolved @Marostegui Thanks, the server move was successful 
and you are able to ssh. [15:50:55] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Marostegui) Thank you Chris - I will take care of starting mysql and repooling the host. [15:51:09] 10SRE, 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10Cmjohnson) 05Open→03Resolved @herron Thanks! all finished and I was able to ssh to the server. [15:52:01] (03PS2) 10Alexandros Kosiaris: Remove oresrdb.svc RRs [dns] - 10https://gerrit.wikimedia.org/r/655443 (https://phabricator.wikimedia.org/T270071) [15:52:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove oresrdb.svc RRs [dns] - 10https://gerrit.wikimedia.org/r/655443 (https://phabricator.wikimedia.org/T270071) (owner: 10Alexandros Kosiaris) [15:52:26] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10Cmjohnson) 05Open→03Resolved Thanks @hnowlan all finished [15:55:37] (03PS1) 10Jgreen: adjust nsca_frack.cfg.erb remove frdata1001, add frdata1002,frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/662728 (https://phabricator.wikimedia.org/T255435) [15:59:45] 10SRE, 10SRE-tools, 10serviceops-radar, 10Patch-For-Review: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10akosiaris) >>! In T270071#6736069, @akosiaris wrote: >>>! In T270071#6689398, @akosiaris wrote: >>> * DNS Records with non-standard TTL. We have just one for ores... 
[16:01:47] (03CR) 10David Caro: style: this introduces black+isort as autoformatter (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [16:01:49] (03PS6) 10David Caro: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) [16:02:26] !log Deploy security patch (T71367) [16:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:25] (03CR) 10Jgreen: [C: 03+2] adjust nsca_frack.cfg.erb remove frdata1001, add frdata1002,frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/662728 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [16:08:54] (03PS7) 10David Caro: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) [16:19:42] (03PS1) 10Jgreen: nsca_frack.cfg.erb switch frdata* IPs to external ones b/c ping is flapping [puppet] - 10https://gerrit.wikimedia.org/r/662737 (https://phabricator.wikimedia.org/T255435) [16:20:11] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10MPhamWMF) [16:26:23] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb switch frdata* IPs to external ones b/c ping is flapping [puppet] - 10https://gerrit.wikimedia.org/r/662737 (https://phabricator.wikimedia.org/T255435) (owner: 10Jgreen) [16:27:55] (03CR) 10Ayounsi: [C: 03+2] Add option-82 to prod vlans [homer/public] - 10https://gerrit.wikimedia.org/r/662641 (https://phabricator.wikimedia.org/T269855) (owner: 10Ayounsi) [16:28:40] (03Merged) 10jenkins-bot: Add option-82 to prod vlans [homer/public] - 10https://gerrit.wikimedia.org/r/662641 (https://phabricator.wikimedia.org/T269855) (owner: 10Ayounsi) [16:30:10] !log adding option-82 to all prod vlans DHCP - 
T269855 [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:14] T269855: Manage DHCP from Netbox - https://phabricator.wikimedia.org/T269855 [16:36:40] (03CR) 10JMeybohm: [C: 03+2] md: Run checkarray on a random weekday of each month [puppet] - 10https://gerrit.wikimedia.org/r/662665 (https://phabricator.wikimedia.org/T273953) (owner: 10JMeybohm) [16:39:06] 10SRE, 10Patch-For-Review: Stagger software raid checks even more - https://phabricator.wikimedia.org/T273953 (10JMeybohm) 05Open→03Resolved [16:39:34] (03PS1) 10Jgiannelos: Release latest version of push-notifications [deployment-charts] - 10https://gerrit.wikimedia.org/r/662739 [16:44:59] (03PS1) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:46:42] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [16:47:33] 10SRE, 10CAS-SSO, 10Discovery-Search (Current work), 10Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (10MPhamWMF) [16:47:58] (03PS2) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:49:42] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [16:52:38] (03PS3) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:54:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move database backups-related puppet code to 
its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [16:56:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1094.eqiad.wmnet - https://phabricator.wikimedia.org/T273710 (10wiki_willy) a:05wiki_willy→03Cmjohnson Thanks @Marostegui >>! In T273710#6810772, @Marostegui wrote: > Ready for #dc-ops [16:56:36] (03PS4) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:58:42] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [17:00:32] (03PS5) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [17:02:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [17:03:54] (03PS3) 10Ottomata: Add eventgate-analytics-external to LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) [17:05:11] (03CR) 10jerkins-bot: [V: 04-1] Add eventgate-analytics-external to LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [17:05:27] (03PS4) 10Ottomata: Add eventgate-analytics-external to LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) [17:06:08] (03PS6) 10Jcrespo: [WIP] Move database backups-related puppet code to its own profile/role [puppet] - 
10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [17:06:40] (03CR) 10jerkins-bot: [V: 04-1] Add eventgate-analytics-external to LabsServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [17:06:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2026.codfw.wmnet [17:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:39] 10SRE, 10serviceops, 10Epic: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar) [17:10:13] (03PS5) 10Ottomata: Add eventgate-analytics-external to (Production|Labs)Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) [17:12:08] (03CR) 10Ottomata: [C: 03+2] Add eventgate-analytics-external to (Production|Labs)Services.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662719 (https://phabricator.wikimedia.org/T272998) (owner: 10Ottomata) [17:12:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2026.codfw.wmnet [17:12:22] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2027.codfw.wmnet [17:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:51] (03PS1) 10Ottomata: Add eventgate-analytics-external to wgEventServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662746 (https://phabricator.wikimedia.org/T272998) [17:17:32] (03PS1) 10Hashar: ci: properly disable docker-pkg seed_image [puppet] - 10https://gerrit.wikimedia.org/r/662748 [17:17:56] (03CR) 10Ottomata: [C: 03+2] Add eventgate-analytics-external to wgEventServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662746 (https://phabricator.wikimedia.org/T272998) (owner: 
10Ottomata) [17:18:04] (03CR) 10Hashar: "And that fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/662722 (owner: 10Hashar) [17:18:35] (03CR) 10Hashar: "That follows up https://gerrit.wikimedia.org/r/c/operations/puppet/+/662722 and I have confirmed it works on contint2001 by manually edit" [puppet] - 10https://gerrit.wikimedia.org/r/662748 (owner: 10Hashar) [17:19:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2027.codfw.wmnet [17:19:06] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2029.codfw.wmnet [17:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:07] !log otto@deploy1001 Synchronized wmf-config/LabsServices.php: LabsServices - Add eventgate-analytics-external - T272998 (duration: 01m 08s) [17:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:10] T272998: Wikibase CI test failure in FederatedProperties SetClaimTest: Argument 1 passed to PHPUnit\Framework\Assert::fail() must be of the type string, array given - https://phabricator.wikimedia.org/T272998 [17:20:26] !log otto@deploy1001 sync-file aborted: ProductionServices - Add eventgate-analytics-external - T272863 (no-op) (duration: 00m 02s) [17:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:30] T272863: EventLogging PHP EventServiceClient should use EventBus->send(). - https://phabricator.wikimedia.org/T272863 [17:20:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:38] !log otto@deploy1001 Synchronized wmf-config/ProductionServices.php: ProductionServices - Add eventgate-analytics-external - T272863 (no-op) (duration: 01m 06s) [17:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:58] !log otto@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings - Add eventgate-analytics-external - T272863 (no-op) (duration: 01m 06s) [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:49] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2029.codfw.wmnet [17:25:50] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2030.codfw.wmnet [17:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:51] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Milimetric) +1 to @Gilles's idea. Reverse image searches don't yield anything obvious. 
[17:31:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2030.codfw.wmnet [17:31:32] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2031.codfw.wmnet [17:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:40] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10RobH) [17:36:11] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-druid100[345] - https://phabricator.wikimedia.org/T274163 (10RobH) [17:38:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2031.codfw.wmnet [17:38:25] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2032.codfw.wmnet [17:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2032.codfw.wmnet [17:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:01] (03PS1) 10Jbond: (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) [17:48:09] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) @Aklapper That was exactly my doubt. That is something needed to be added to the file or a new file is to be created. Thanks for the clarification! 
[17:48:28] (03PS2) 10Jbond: (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) [17:49:20] (03CR) 10Jbond: "bit early for code review but comments on the commit message/direction would be welcome" [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [17:50:19] (03CR) 10jerkins-bot: [V: 04-1] (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [17:57:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database [17:57:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on maps1005.eqiad.wmnet with reason: Resyncing database [17:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:11] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) I am guessing Wikimedia wants the site to be accessible from all of the common user-agents. Any specific user-agents that you want to block @Iniquity... [17:58:17] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Hnowlan Host not pooled https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:17] ACKNOWLEDGEMENT - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused Hnowlan Host not pooled https://phabricator.wikimedia.org/T93886 [17:58:17] ACKNOWLEDGEMENT - cassandra service on maps2007 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan Host not pooled https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:00:04] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1800). [18:00:40] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10mforns) The crazy request volume starts on July 2020 https://pageviews.toolforge.org/mediaviews/?project=commons.wikimedia.org&platform=&referer=all-referers... 
[18:03:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] ci: properly disable docker-pkg seed_image [puppet] - 10https://gerrit.wikimedia.org/r/662748 (owner: 10Hashar) [18:04:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2033.codfw.wmnet [18:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2033.codfw.wmnet [18:10:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2034.codfw.wmnet [18:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:05] !log ppchelko@deploy1001 Started deploy [restbase/deploy@a458845]: Add trwikivoyage T271262 and restore restbase2009 [18:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:09] T271262: Add trwikivoyage to RESTBase - https://phabricator.wikimedia.org/T271262 [18:14:37] (03PS1) 10CDanis: Fix up some HTML validation errors. 
[software/klaxon] - 10https://gerrit.wikimedia.org/r/662754 [18:15:32] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10RobH) [18:15:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2034.codfw.wmnet [18:15:58] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2035.codfw.wmnet [18:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:09] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10RobH) [18:16:21] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10RobH) [18:16:34] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10RobH) [18:17:20] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10elukey) >>! In T273741#6812144, @mforns wrote: > The crazy request volume starts on July 2020 > https://pageviews.toolforge.org/mediaviews/?project=commons.w... 
[18:21:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2035.codfw.wmnet [18:21:39] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2036.codfw.wmnet [18:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2036.codfw.wmnet [18:27:20] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2037.codfw.wmnet [18:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:36] (03CR) 10Jcrespo: "This is just a start, and will need testing and more followup, but please provide early feedback if you see something terrible happening h" [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:29:18] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@a458845]: Add trwikivoyage T271262 and restore restbase2009 (duration: 17m 13s) [18:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:23] T271262: Add trwikivoyage to RESTBase - https://phabricator.wikimedia.org/T271262 [18:32:11] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) codfw -> eqiad completed rather quickly. ` 305897 Full 5,561 3.078 T OK 08-Feb-21 17:19 backup2002.codfw.wmnet-Monthly-1st-Wed-EsRwEqi... 
[18:32:51] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: / 1449 MB (2% inode=91%): /tmp 1449 MB (2% inode=91%): /var/tmp 1449 MB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [18:34:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2037.codfw.wmnet [18:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:58] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) @ayounsi Could you think of a reason for this discrepancy at network layer? I cannot think of one at hw or software level. The only thing I see differe... [18:44:28] 10SRE, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10Krinkle) This task is not asking for new robots rules to be created or modified. The existing content should stay exactly as-is. No changes or additions are needed in... [18:47:52] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [18:48:43] a png isn't working? 
[18:48:55] that's just what monitoring probes to test the health of the service [18:49:08] I cannot access maps.wikimedia.org images [18:49:08] it does look like that several maps servers in eqiad are pegged at 100% cpu [18:49:20] https://maps.wikimedia.org/v4/marker/pin-m-fuel+ffffff@2x.png loads for me [18:49:27] you're probably hitting codfw [18:49:32] https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=now-12h&to=now [18:49:42] I'm not sure what happened around ~15:00 [18:49:48] (ack'd the page) [18:50:07] cdanis, eqiad now loading for me, very slowly [18:50:12] ah, looks like maps1001 is being physically moved? [18:50:13] * akosiaris around [18:50:19] ditto [18:51:53] and maps1005 has downtime too [18:52:02] there was also something more recent, close to the timestamp of the page -- a sharp increase in "static snapshot requests" [18:52:07] https://grafana.wikimedia.org/d/000000305/maps-performances?viewPanel=26&orgId=1&from=now-12h&to=now [18:52:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:52:24] jouncebot: next [18:52:24] In 0 hour(s) and 7 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1900) [18:52:26] yeah, I was about to paste the same graph [18:52:27] hnowlan: are you still around? [18:52:39] I think it recovered? 
[18:52:53] I guess the server move was for https://phabricator.wikimedia.org/T273983 [18:52:54] !log mw1391 - reimaging [18:52:54] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [18:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:11] there it is [18:53:22] sure, the other servers are still showing 100% load though [18:53:31] no idea about when it is safe to add 1001 back to serving [18:53:39] https://config-master.wikimedia.org/pybal/eqiad/kartotherian nearly half the hosts are depooled [18:54:23] I think the other servers are not yet added to serving? not sure [18:54:36] IIRC a lot of that hardware is recently-added [18:55:00] * legoktm nods [18:55:08] I think latency is not great ATM but not terrible, maybe more prone to overloads during maintenance? [18:55:25] jynus: Maps has been long underprovisioned, yes [18:55:34] codfw is sufficiently loaded that I do not think depooling eqiad is the most wise thing to do [18:55:42] nah [18:55:51] if we had proportional weights in gdnsd I'd twiddle them, but, we don't :) [18:55:58] plus tiles are something that high rate of failures is more permissive than, say, text [18:56:25] why did it even page us ? [18:56:39] should we file a bug for this? [18:56:47] (03CR) 10Kosta Harlan: [C: 04-1] "Let's make sure T266913 is done before this is merged." [puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [18:56:54] akosiaris: that's one of the LVS service alerts [18:57:02] probably not disabled? 
[18:57:22] legoktm: I am just going to add a quick note on T273983 [18:57:22] T273983: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 [18:57:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/639154 should have done it I may have forgotten something [18:57:40] however, like with the recent lvs page of swift, the lvs check should be of the service, not its content [18:57:59] thanks, cdanis legoktm to jump [18:58:12] cheers [18:58:52] (03PS2) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661784 (owner: 10PipelineBot) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T1900). [19:00:05] AaronSchulz and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:26] hi [19:00:48] I can deploy today [19:00:59] AaronSchulz: around? 
[19:01:00] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661784 (owner: 10PipelineBot) [19:01:08] (03PS3) 10Urbanecm: Make DiscussionTools' newtopictool available on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 (owner: 10Esanders) [19:01:12] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools' newtopictool available on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 (owner: 10Esanders) [19:01:50] :( [19:02:18] (03Merged) 10jenkins-bot: Make DiscussionTools' newtopictool available on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661130 (owner: 10Esanders) [19:02:55] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/661784 (owner: 10PipelineBot) [19:03:20] sorry, i got disconnected [19:03:23] no problem [19:03:25] cdanis: still having issues [19:03:26] ? [19:03:31] MatmaRex: pulled onto mwdebug1001 [19:03:32] * volans here late, was out [19:04:33] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10mforns) If it's an app, it would need to be **very** popular. Maybe Aarogya Setu, the app for reducing Covid infections? IIUC it's mandatory in India. [19:04:48] Urbanecm: thanks, looks good [19:04:53] syncing [19:05:19] (03PS1) 10Alexandros Kosiaris: kartotherian: Followup for I7f77fd9a1c8d49f3e23dc [puppet] - 10https://gerrit.wikimedia.org/r/662757 [19:06:40] cdanis: yeah I had missed. https://gerrit.wikimedia.org/r/c/operations/puppet/+/662757 it was the only service overriding for some reason (I'll dig more tomorrow) the catalog value [19:06:51] Urbanecm: fyi, mw1391 is me if it shows up. but only that single one and hoping it was already out of scap [19:07:00] anyone mind if i do a blubberoid deployment? 
it shouldn't affect mediawiki CI builds, but i can also wait [19:07:05] mutante: thanks for the headsup [19:07:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3e94e2177b7f31bea1c6bc21b272a4529a38b4b3: Make DiscussionTools newtopictool available on testwiki (duration: 01m 07s) [19:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:42] marxarelli: go ahead i think, it's kubernetes, right? [19:07:45] I 'll merge and deploy. I 've also set a scheduled downtime of 2 hours, that should suffice [19:07:48] yes [19:07:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] kartotherian: Followup for I7f77fd9a1c8d49f3e23dc [puppet] - 10https://gerrit.wikimedia.org/r/662757 (owner: 10Alexandros Kosiaris) [19:08:16] marxarelli: feel free to do it then :) [19:08:24] Done [19:08:25] right on. ty :) [19:08:29] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Samwalton9) > Maybe Aarogya Setu, the app for reducing Covid infections? If it is, it isn't part of the initial app setup process, which I just tested out o... [19:08:45] !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[19:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:14] (03PS5) 10Urbanecm: Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) [19:09:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1391.eqiad.wmnet with reason: REIMAGE [19:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:23] (03CR) 10jerkins-bot: [V: 04-1] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm) [19:09:40] let me fix that merge conflict [19:10:13] (03PS6) 10Urbanecm: Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) [19:11:07] (03CR) 10Urbanecm: Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm) [19:11:09] (03CR) 10Urbanecm: [C: 03+2] Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm) [19:11:21] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1391.eqiad.wmnet with reason: REIMAGE [19:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:43] !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[19:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:38] (03Merged) 10jenkins-bot: Enable GrowthExperiments at dawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661419 (https://phabricator.wikimedia.org/T256126) (owner: 10Urbanecm) [19:13:29] !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [19:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:53] (03CR) 10Dduvall: "Deployed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/661784 (owner: 10PipelineBot) [19:14:05] (03PS2) 10Bstorm: pbuilder: create apt-cache directory before running pbuilder init [puppet] - 10https://gerrit.wikimedia.org/r/661777 [19:14:57] (03Abandoned) 10Dzahn: trafficserver: add new debug servers to debug routing [puppet] - 10https://gerrit.wikimedia.org/r/662038 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [19:15:45] (03Abandoned) 10Dzahn: site: add mwdebug servers on buster [puppet] - 10https://gerrit.wikimedia.org/r/662037 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [19:16:14] mutante: no more mwdebug servers? [19:17:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3f39eefaa4c0dabfbc5b03fdc1b12e48913089bd: Enable GrowthExperiments at dawiki (T256126) (duration: 01m 05s) [19:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:17] T256126: Deploy Growth features on Danish Wikipedia - https://phabricator.wikimedia.org/T256126 [19:17:51] 10SRE, 10Wikimedia-Portals, 10Patch-For-Review, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) Here's a Lighthouse CI score run on dev-server. 
Let me know if any changes are required :) {F34095486} [19:18:21] Urbanecm: well, people say just reimage the existing ones to buster, basically [19:18:29] (03CR) 10Bstorm: pbuilder: create apt-cache directory before running pbuilder init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [19:18:29] no need to keep both sets [19:18:34] !log urbanecm@deploy1001 Synchronized dblists/growthexperiments.dblist: 3f39eefaa4c0dabfbc5b03fdc1b12e48913089bd: Enable GrowthExperiments at dawiki (T256126; 2/3) (duration: 01m 03s) [19:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:50] mutante: we should probably keep one as stretch, if people need to compare for any reason [19:19:47] Urbanecm: survey says 33% what you say, 33% "if others think it's needed" and 33% "nah, don't bother" [19:20:06] I personally don't think we need two mwdebug servers per os [19:20:15] !log urbanecm@deploy1001 Synchronized wmf-config/config/dawiki.yaml: 3f39eefaa4c0dabfbc5b03fdc1b12e48913089bd: Enable GrowthExperiments at dawiki (T256126; 3/3) (duration: 01m 04s) [19:20:18] but i do think we should have at least one per os, so people can compare [19:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:19] (03CR) 10Bstorm: [C: 04-1] wikireplicas: alert via email for analytics wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [19:20:21] it's about the rows [19:20:24] they are in different rows [19:20:47] hmm, if some rows are malfunctioning, I probably shouldn't be deploying anyway? [19:20:50] or am i missing the point?
[19:21:30] there are also 2 designated hardware servers, not VMs, that are picked to "stay on stretch to the end" [19:21:32] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10nshahquinn-wmf) >>! In T273741#6812424, @mforns wrote: > If it's an app, it would need to be **very** popular. > Maybe Aarogya Setu, the app for reducing Cov... [19:21:36] one app and one api [19:22:12] but still, if people want to play with sth on stretch vs. buster, they should have a debug server rather than hack at a live mw server [19:22:18] anyway, just my opinion :) [19:23:14] ok, got a plan. we have currently 3 VMs, 1 and 2 are the normal debug servers and 3 was made as "buster preview", right [19:23:26] yup [19:23:31] I can just switch it around, make 1 and 2 buster as people say [19:23:40] +1 [19:23:44] then instead of deleting 1003 right away, it can be the special case on stretch [19:23:49] until we kill it [19:23:53] yup, that sounds good to me [19:24:14] ack [19:25:55] Urbanecm: except we said nothing about codfw being the same [19:26:17] i'll make it the same either way [19:26:41] Codfw doesn't matter much IMO, unless a switchover is going to happen [19:27:21] yea, true.. but IF that happens it should already have the same stuff and we dont want to start creating VMs [19:27:27] Agreed [19:28:26] 10SRE, 10Wikimedia-Portals, 10Patch-For-Review, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10SarthakKundra) Can you tell me where I can find the puppet Git repository? Is it the one on Github? [19:28:41] ok, guess the lowest hanging fruit is to reimage mwdebug in codfw now [19:29:56] mutante: sounds so. Please send a heads-up to ops list when you start the eqiad ones. 
[19:30:16] Urbanecm: ok, will do [19:31:13] i hope it doesnt need anything manual besides scap pull after reimage [19:31:56] and not the same reimaging (script) as for hardware [19:32:31] (03CR) 10CRusnov: [C: 03+1] "I have tested this for functionality on netbox-dev2001. It works as expected, fixing the interface names for any hosts in the cluster unde" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [19:32:33] (03CR) 10jerkins-bot: [V: 04-1] ganeti-netbox-sync: Run InterfaceAutomation when necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [19:33:01] if buster gets moved to other mwdebug hosts can folks let me know (or send an email)? I keep track of which are which for testing [19:33:06] thanks in advance! [19:33:47] apergos: mu.tante will send an email to ops@lists.wm.o [19:33:59] oh, that's perfect! [19:35:19] apergos: do you think there is a risk people have personal testing stuff on these VMs, does it need back up or warning before reimaging? [19:35:33] which vms in particular? [19:35:40] any mwdebug* [19:35:43] uh [19:35:58] it would not occur to me to have any personal crap there but maybe I should poke around [19:36:19] now I wonder if I should add home to bacula.. taking a look [19:36:21] a deployer's perspective: i normally only run scap pull there to fetch changes. 
Occasionally, when i debug stuff, i make temporary files in my home, but i don't mind them getting deleted [19:36:26] be back in 2 mins with cereal and a yubikey [19:37:00] i have all files that should be permanently in my home puppetized, so they should be created on reimage [19:38:11] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2001.codfw.wmnet [19:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:19] for the record, mwmaint hosts are in bacula, mwdebug are not [19:39:05] Urbanecm: that's how the admins like it [19:39:32] mutante: I'm going through find /home at mwdebug1001, and it does seem there are some personal files. I would probably give people a heads-up via ops, so they can back it up somewhere if they want to preserve it [19:39:37] (03PS3) 10CRusnov: ganeti-netbox-sync: Run InterfaceAutomation when necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/662762 (https://phabricator.wikimedia.org/T263768) [19:40:03] and i would also promote puppetized homes in the mail 🙂 [19:40:16] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1391.eqiad.wmnet'] ` an... [19:40:36] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10ayounsi) Overall there is more traffic in the eqiad->codfw direction, so that could be part of the reason. Is it possible to know a bit more about the source,...
[19:40:40] yes people have cruft in their home directories [19:40:59] my guess is people care about eqiad but not codfw [19:41:00] better send out an email and ask them to clean up, some of it's clearly temp junk, some of it not so temp (php scripts and whatnot) [19:41:12] that's a spot check of mwdebug1001 [19:41:21] ok, I will send that mail before just killing them [19:41:25] good good [19:41:25] thanks [19:41:28] sure [19:42:03] alternative is to add them to bacula with /home as fileset [19:42:12] wait until one is created tomorrow [19:42:19] you can but you oughta have people clean up first [19:42:28] and then just tell people "in case you need it. it can be restored" [19:42:36] I mean it's not that much space but e.g. there are copies of logs [19:42:37] true [19:42:45] and maybe that's not awesome, right [19:43:03] i personally don't think /home should be backed up at mwdebug [19:43:45] right, people should have stuff on the bastions or in puppet, or be able to stash it someplace if they are in the middle of work on a thing and the host needs to be re-imaged or whatever [19:44:29] I understand the urge to just back it up so folks can get on with the reimage [19:44:42] but we would have to make it easier for them to run something like "backupthisdirnow" [19:44:44] though.
[19:44:51] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2001.codfw.wmnet [19:44:52] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2002.codfw.wmnet [19:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:17] I'd rather not back up stuff wholesale from there because it could include logs and other stuff that they don't think about [19:45:26] and our policy of we don't keep logs more than 90 days etc [19:46:25] on a related note, can I copy files from mwdebug1001 to mwmaint1002 without copying them to my laptop first? [19:46:25] RECOVERY - exim queue on mx1001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [19:46:28] ok, forgetting about bacula, the next option is to rsync /home to another host, reimage and sync back.. [19:46:42] which needs a little puppet to open firewall but not a big deal [19:46:59] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ca9bba1]: cirrus_namespace_map: only overwrite on success [19:47:00] yeah that might be ok (as long as not cross dc in plaintext) [19:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:05] that gets it done very fast [19:47:35] and then a reminder can still be sent after "hey we copied your dirs but please clean up, especially logs or etc that's older than 90 days kthxbai" [19:48:19] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ca9bba1]: cirrus_namespace_map: only overwrite on success (duration: 01m 19s) [19:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:24] yea, this is how I have done it before for other servers with home dir data [19:48:35] apergos: or perhaps tell them in advance copy whatever is needed yourself, you can use this rsync command?
[19:49:15] if you tell them in advance you have to wait for response and if someone is ooo or whatever then it's suddenly a pain [19:49:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1391.eqiad.wmnet [19:49:28] the nice thing about jfdi is that it's done and then you can nag about cleanup and move on [19:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:00] in cases where the host name changes I would put a "home_oldserver.tar.gz" into people's home dirs [19:50:08] that was supposed to remind them [19:50:14] yeah, that's not a bad idea right there [19:50:14] and speak for itself [19:50:29] email gets sent, the tar file is there, people can do the work [19:50:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2002.codfw.wmnet [19:50:31] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc-gp2003.codfw.wmnet [19:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:34] that works [19:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:47] if...we also have a script to remind people with a non-deleted tar to finish cleaning up :) [19:50:57] 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [19:51:27] Urbanecm: it gets noticed in 2 or 3 years when we upgrade to bullseye and home dirs are compounding :) [19:51:28] there will always be someone who doesn't.
them's the rules, I didn't make them [19:51:34] 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install backup1003 - https://phabricator.wikimedia.org/T274184 (10RobH) [19:55:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:55:34] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/662764 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:55:59] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2003.codfw.wmnet [19:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:59:12] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1391.eqiad.wmnet [19:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:00:07] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1389.eqiad.wmnet'] ` Of... [20:00:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:01:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:03:23] (03CR) 10Jgiannelos: [C: 03+2] Release latest version of push-notifications [deployment-charts] - 10https://gerrit.wikimedia.org/r/662739 (owner: 10Jgiannelos) [20:05:10] (03Merged) 10jenkins-bot: Release latest version of push-notifications [deployment-charts] - 10https://gerrit.wikimedia.org/r/662739 (owner: 10Jgiannelos) [20:09:08] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10RobH) [20:09:16] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (10RobH) [20:10:08] (03PS1) 10Ottomata: Undo migration of SpecialMuteSubmit on all wikis except testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662766 (https://phabricator.wikimedia.org/T268517) [20:11:03] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [20:11:51] !log jgiannelos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [20:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:17] PROBLEM - Disk space on wdqs1009 is CRITICAL: DISK CRITICAL - free space: /srv 37733 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1009&var-datasource=eqiad+prometheus/ops [20:12:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1390.eqiad.wmnet with reason: REIMAGE [20:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:06] (03CR) 10Ottomata: "I should be able to backfill these events from the validation error logs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662766 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [20:13:10] (03CR) 10Ottomata: [C: 03+2] Undo migration of SpecialMuteSubmit on all wikis except testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662766 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [20:14:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1390.eqiad.wmnet with reason: REIMAGE [20:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:42] !log jgiannelos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[20:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1389.eqiad.wmnet with reason: REIMAGE [20:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:23] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Undo migration of SpecialMuteSubmit on all wikis except testwiki - T268517 (duration: 01m 06s) [20:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:28] T268517: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 [20:19:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1389.eqiad.wmnet with reason: REIMAGE [20:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:38] !log jgiannelos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[20:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1305.eqiad.wmnet with reason: REIMAGE [20:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:36] (03PS1) 10Mholloway: Sample mediawiki.client.session_tick at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) [20:25:24] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1305.eqiad.wmnet with reason: REIMAGE [20:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1304.eqiad.wmnet with reason: REIMAGE [20:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1304.eqiad.wmnet with reason: REIMAGE [20:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:48] (03PS2) 10Jbond: interfaces: remove ethtool configueration [puppet] - 10https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) [20:34:09] (03PS3) 10Jbond: interfaces: remove ethtool configueration [puppet] - 10https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) [20:34:30] (03PS4) 10Jbond: numa_networking: drop numa_networking global variable [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) [20:37:10] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1390.eqiad.wmnet'] ` an... 
[20:40:42] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10KFrancis) @CDanis Please provide an email address for Georgina and I'll put the agreement together. Thanks! [20:42:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1389.eqiad.wmnet'] ` an... [20:46:21] (03CR) 10Razzi: wikireplicas: alert via email for analytics wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [20:46:38] (03PS2) 10Razzi: wikireplicas: alert via email for analytics wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) [20:51:20] (03PS3) 10Jbond: (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) [20:52:39] RECOVERY - Disk space on wdqs1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1009&var-datasource=eqiad+prometheus/ops [20:53:17] (03CR) 10jerkins-bot: [V: 04-1] (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [20:56:12] (03PS1) 10Mholloway: Update sampling config syntax for test.instrumentation.sampled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662770 [20:56:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script 
wmf-auto-reimage was launched by legoktm on cumin1001.eq... [20:57:29] (03CR) 10Bstorm: [C: 03+1] "This also highlights that the new server role needs a hiera file 😊" [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [20:57:51] (03PS2) 10Mholloway: Update sampling config syntax for test.instrumentation.sampled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662770 [20:58:17] 10SRE, 10Wikimedia-Portals, 10Patch-For-Review, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10Aklapper) >>! In T242500#6812522, @SarthakKundra wrote: > Can you tell me where I can find the puppet Git repository? Is it the one on Github?... [21:00:05] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T2100). [21:05:27] (03CR) 10Jbond: numa_networking: drop numa_networking global variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662700 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [21:05:39] (03PS1) 10Dzahn: parsoid::testreduce: set mysql datadir to /var/lib/mysql [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) [21:06:07] (03CR) 10jerkins-bot: [V: 04-1] parsoid::testreduce: set mysql datadir to /var/lib/mysql [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) (owner: 10Dzahn) [21:06:32] (03CR) 10Bstorm: [C: 03+1] wikireplicas: alert via email for analytics wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [21:07:18] (03PS2) 10Dzahn: parsoid::testreduce: set mysql datadir to /var/lib/mysql [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) [21:07:37] !log 
dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1390.eqiad.wmnet [21:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:05] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1390.eqiad.wmnet [21:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:36] (03PS1) 10Bstorm: wikireplicas: Add hiera for dedicated analytics_multiinstance role [puppet] - 10https://gerrit.wikimedia.org/r/662772 (https://phabricator.wikimedia.org/T269211) [21:09:45] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:09:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1389.eqiad.wmnet [21:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:40] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1305.eqiad.wmnet'] ` an... [21:10:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1389.eqiad.wmnet [21:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:15] cdanis: o/ . re your comment here: T273602#6811246 . I'm not sure where the template resides. I'm happy to look into changing it if that needs me to work on it. [21:13:16] T273602: Access to analytics-privatedata-users for Research contractor AikoChou - https://phabricator.wikimedia.org/T273602 [21:13:32] cdanis: and thank you for helping us with that particular access. [21:13:49] leila: to be honest I don't know either! 
[21:14:34] updating the wikitech page about access (thanks Andrew) is a help for sure [21:15:04] cdanis: 'got it. then let me know if I'm a blocker at any point for updating the template. [21:15:37] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1304.eqiad.wmnet'] ` an... [21:15:42] (03CR) 10Razzi: wikireplicas: alert via email for analytics wikireplicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [21:16:13] (03PS9) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [21:16:33] (03CR) 10jerkins-bot: [V: 04-1] [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [21:16:48] (03PS3) 10Razzi: wikireplicas: alert via email for analytics wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) [21:17:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:17:15] (03PS10) 10Kosta Harlan: [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [21:17:34] (03CR) 10jerkins-bot: [V: 04-1] [WIP] linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [21:18:23] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1305.eqiad.wmnet [21:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1304.eqiad.wmnet [21:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:56] (03CR) 10Bstorm: [C: 03+1] wikireplicas: alert via email for analytics wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [21:19:11] (03CR) 10Razzi: [C: 03+2] wikireplicas: alert via email for analytics wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/661988 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [21:19:38] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10CDanis) @hnowlan just a heads up that it looks like the depool of maps1001 left maps@eqiad underprovisioned: https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=1612777514651&to=16128190573... 
[21:20:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1305.eqiad.wmnet [21:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:23:57] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1272.eqiad.wmnet with reason: REIMAGE [21:23:57] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1273.eqiad.wmnet with reason: REIMAGE [21:23:57] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1271.eqiad.wmnet with reason: REIMAGE [21:23:59] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1273.eqiad.wmnet with reason: REIMAGE [21:23:59] !log legoktm@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1271.eqiad.wmnet with reason: REIMAGE [21:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:19] great [21:24:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1304.eqiad.wmnet [21:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:04] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1274.eqiad.wmnet with reason: REIMAGE [21:25:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:56] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1272.eqiad.wmnet with reason: REIMAGE [21:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:03] (03CR) 10Krinkle: "This changes the favicon file but also adds an external reference. Do we need both?" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/661734 (owner: 10Jbond) [21:26:41] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1388.eqiad.wmnet with reason: REIMAGE [21:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:28:04] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1274.eqiad.wmnet with reason: REIMAGE [21:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:51] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1271.eqiad.wmnet with reason: reimaging [21:28:52] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1271.eqiad.wmnet with reason: reimaging [21:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:01] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mw1273.eqiad.wmnet with reason: reimaging [21:29:03] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw1273.eqiad.wmnet with reason: reimaging [21:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1388.eqiad.wmnet with reason: REIMAGE [21:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:48] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/27915/" [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) (owner: 10Dzahn) [21:30:59] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/27915/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) (owner: 10Dzahn) [21:31:51] (03CR) 10Subramanya Sastry: [C: 03+1] parsoid::testreduce: set mysql datadir to /var/lib/mysql [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) 
(owner: 10Dzahn) [21:32:38] (03CR) 10Bstorm: ceph.osd: Allow setting the io scheduler of the osd disks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662689 (https://phabricator.wikimedia.org/T273791) (owner: 10David Caro) [21:33:00] (03CR) 10Dzahn: [V: 03+1 C: 03+2] parsoid::testreduce: set mysql datadir to /var/lib/mysql [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) (owner: 10Dzahn) [21:34:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1397.eqiad.wmnet with reason: REIMAGE [21:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1397.eqiad.wmnet with reason: REIMAGE [21:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:38] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T274034#6813117" [puppet] - 10https://gerrit.wikimedia.org/r/662771 (https://phabricator.wikimedia.org/T274034) (owner: 10Dzahn) [21:42:14] (03PS1) 10Jbond: systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) [21:44:40] (03CR) 10jerkins-bot: [V: 04-1] systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [21:47:48] (03PS2) 10Jbond: systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) [21:50:09] (03CR) 10jerkins-bot: [V: 04-1] systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [21:51:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2245.codfw.wmnet with reason: REIMAGE [21:51:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1388.eqiad.wmnet'] ` an... [21:53:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2245.codfw.wmnet with reason: REIMAGE [21:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1303.eqiad.wmnet with reason: REIMAGE [21:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1303.eqiad.wmnet with reason: REIMAGE [21:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:41] (03PS3) 10Jbond: systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) [21:57:55] PROBLEM - Hadoop DataNode on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:58:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1397.eqiad.wmnet'] ` an... 
[21:59:16] (03CR) 10jerkins-bot: [V: 04-1] systemd: Add support for setting cpu affinity [puppet] - 10https://gerrit.wikimedia.org/r/662775 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond) [22:00:04] Reedy and sbassett: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210208T2200) [22:00:11] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10RobH) [22:02:34] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10RobH) [22:06:43] RECOVERY - Hadoop DataNode on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [22:08:50] this was me --^ [22:11:05] 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10RobH) [22:12:57] (03PS1) 10Ladsgroup: wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) [22:13:17] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH) [22:13:25] (03CR) 10jerkins-bot: [V: 04-1] wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [22:13:37] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH) [22:14:17] (03CR) 10Razzi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/662772 (https://phabricator.wikimedia.org/T269211) (owner: 10Bstorm) [22:14:53] 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH) [22:15:10] (03PS8) 10Volans: style: this introduces black+isort as autoformatter [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [22:15:12] (03PS1) 10Volans: documentation: add a development page [software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) [22:16:10] 10ops-eqiad, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] - https://phabricator.wikimedia.org/T274206 (10RobH) a:03jcrespo @jcrespo: Conversation between you and Arzhel on T272018 seems to indicate some kind of discussion is still pending for these to determine where they can be racke... [22:16:39] (03CR) 10Gergő Tisza: [WIP] linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [22:18:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1271.eqiad.wmnet', 'mw12... [22:18:29] RECOVERY - Disk space on an-worker1118 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops [22:18:34] 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10RobH) a:05Papaul→03jcrespo >>! 
In T274206#6813364, @RobH wrote: > @jcrespo: Conversation between you and Arzhel on T272018 seems to indicate some kind of discussion is still pend... [22:20:58] (03CR) 10Bstorm: [C: 03+2] wikireplicas: Add hiera for dedicated analytics_multiinstance role [puppet] - 10https://gerrit.wikimedia.org/r/662772 (https://phabricator.wikimedia.org/T269211) (owner: 10Bstorm) [22:21:31] (03PS1) 10Elukey: role::analytics_test_cluster::presto::server: fix hive settings [puppet] - 10https://gerrit.wikimedia.org/r/662784 [22:22:11] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::presto::server: fix hive settings [puppet] - 10https://gerrit.wikimedia.org/r/662784 (owner: 10Elukey) [22:22:37] bstorm: ok to merge? :) [22:22:42] Go for it! [22:22:45] I was about to ask the same [22:23:47] (03CR) 10Volans: "Comments related to the changes of my last PS" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/659785 (https://phabricator.wikimedia.org/T211750) (owner: 10David Caro) [22:25:49] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Ladsgroup) @jbond: Hey, you wrote this as a checkbox > [] migrate all cron types to systemd::timer::job (the cron type is no longer a native pup... [22:26:11] 10SRE, 10SRE-tools, 10tox-wikimedia, 10Patch-For-Review, 10User-Kormat: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Legoktm) >>! In T211750#6806180, @Volans wrote: > I had another pass at black and also a long chat with @dcaro about it (thanks for resurfacing thi... 
[22:29:36] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1271.eqiad.wmnet [22:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:41] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1272.eqiad.wmnet [22:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:46] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1273.eqiad.wmnet [22:29:47] (03PS2) 10Ladsgroup: wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) [22:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:52] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1274.eqiad.wmnet [22:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:14] (03CR) 10jerkins-bot: [V: 04-1] wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [22:31:20] (03PS3) 10Ladsgroup: wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) [22:38:33] (03CR) 10CDanis: [C: 03+1] hieradata: add defaults for profile::swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/662703 (https://phabricator.wikimedia.org/T221904) (owner: 10Filippo Giunchedi) [22:39:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1303.eqiad.wmnet'] ` an... [22:40:55] (03CR) 10CDanis: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/662124 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [22:41:31] (03CR) 10CDanis: [C: 03+1] profile: add prometheus job for udpmxircecho (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [22:42:01] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Michaelrhanson) I found several places where this URL is being used in sample code, which might have been picked up by somebody and built into an app: https... [22:42:58] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Daniel.gayo) Could it be this app? https://apps.apple.com/hk/app/iclass-corporate/id1439400748?l=en The picture appears in a screenshot... [22:43:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2245.codfw.wmnet'] ` an... 
[22:45:41] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1397.eqiad.wmnet [22:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1303.eqiad.wmnet [22:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1388.eqiad.wmnet [22:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2245.codfw.wmnet [22:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:55] (03PS1) 10Subramanya Sastry: parsoid-rt / parsoid-vd: Pass in database socket path in server config [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) [22:50:15] (03CR) 10Subramanya Sastry: "https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/testreduce/+/662788/ is the corresponding code in the testreduce codebase." 
[puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [22:51:10] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1271.eqiad.wmnet [22:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:15] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1272.eqiad.wmnet [22:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:20] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1273.eqiad.wmnet [22:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:27] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1274.eqiad.wmnet [22:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:33] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mwmaint2002 - https://phabricator.wikimedia.org/T274170 (10Papaul) [22:59:39] (03CR) 10Gergő Tisza: [WIP] linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [22:59:48] (03PS1) 10Anne Tomasevich: Add external entity search URI for new MediaSearch extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) [23:01:12] (03CR) 10jerkins-bot: [V: 04-1] Add external entity search URI for new MediaSearch extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: 10Anne Tomasevich) [23:08:06] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Legoktm) @spinda found that this image is used in quite a few different places: * https://github.com/triniwiz/nativescript-image-cache-it/issues/11 
* https:... [23:10:23] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Michaelrhanson) Hm! It is included in the imagenet URL list, I think. Could we be looking at some CV training pipeline that's not caching properly? http:/... [23:14:13] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10ssingh) >>! In T273741#6813531, @Michaelrhanson wrote: > I found several places where this URL is being used in sample code, which might have been picked up... [23:14:39] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10ssingh) >>! In T273741#6813536, @Daniel.gayo wrote: > Could it be this app? > > https://apps.apple.com/hk/app/iclass-corporate/id1439400748?l=en > > The pi... [23:14:53] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1397.eqiad.wmnet [23:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2245.codfw.wmnet [23:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:53] (03CR) 10CRusnov: "Looks good to me. Useful too!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [23:16:58] (03CR) 10CRusnov: [C: 03+1] documentation: add a development page [software/spicerack] - 10https://gerrit.wikimedia.org/r/662783 (https://phabricator.wikimedia.org/T211750) (owner: 10Volans) [23:17:51] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1303.eqiad.wmnet [23:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:46] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10fdans) >>! In T273741#6813616, @Michaelrhanson wrote: > Hm! It is included in the imagenet URL list, I think. Could we be looking at some CV training pipel... [23:19:32] (03CR) 10Dzahn: [C: 04-1] "the compiler says it would fail with "Testreduce::Server[parsoid-vd]: has no parameter named 'db_socket'"" [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:20:53] (03CR) 10Dzahn: [C: 04-1] "modules/testreduce itself which the profile classes use, also needs the new parameter" [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:21:46] (03PS8) 10Ryan Kemper: relforge: service impl of relforge100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/661229 (https://phabricator.wikimedia.org/T262211) [23:22:05] 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10Papaul) [23:22:23] 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10Papaul) p:05Triage→03Medium [23:22:43] 10ops-codfw: codfw: relocate logstash2035 - https://phabricator.wikimedia.org/T274214 (10Papaul) [23:27:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 
10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:28:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:29:28] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1388.eqiad.wmnet [23:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:40] 10SRE, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10fdans) As was suggested on Twitter, this surge coincides almost perfectly with the ban of TikTok, as well as other 223 Chinese apps, in India [23:30:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:32:23] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[23:39:05] (03PS2) 10Dzahn: parsoid-rt / parsoid-vd: Pass in database socket path in server config [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:40:56] (03CR) 10Dzahn: [V: 03+1] "amended to fix, compiles now: https://puppet-compiler.wmflabs.org/compiler1001/27917/" [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:42:10] (03CR) 10Dzahn: [V: 03+1] parsoid-rt / parsoid-vd: Pass in database socket path in server config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:42:48] (03PS3) 10Dzahn: parsoid-rt / parsoid-vd: Pass in database socket path in server config [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:44:03] (03CR) 10Dzahn: [C: 03+2] parsoid-rt / parsoid-vd: Pass in database socket path in server config [puppet] - 10https://gerrit.wikimedia.org/r/662789 (https://phabricator.wikimedia.org/T274034) (owner: 10Subramanya Sastry) [23:44:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1387.eqiad.wmnet with reason: REIMAGE [23:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1387.eqiad.wmnet with reason: REIMAGE [23:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:23] 10SRE, 10Commons, 10Traffic: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10AntiCompositeNumber) [23:46:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Papaul) @Dzahn if the server is still in service can it 
please be de-pool so i can work on it tomorrow while on site. Thanks [23:47:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1386.eqiad.wmnet with reason: REIMAGE [23:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1386.eqiad.wmnet with reason: REIMAGE [23:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2220.codfw.wmnet [23:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:15] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2220.codfw.wmnet [23:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803 [23:52:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2220.codfw.wmnet with reason: T273803 [23:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:51] T273803: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 [23:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Dzahn) @Papaul Server is set to pooled=inactive and downtime for 2 days. Go ahead and thank you! [23:54:46] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops-radar: mw2220 - broken IPMI / mgmt - https://phabricator.wikimedia.org/T273803 (10Papaul) Thank you will work on it tomorrow. 
[23:58:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1302.eqiad.wmnet with reason: REIMAGE [23:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1301.eqiad.wmnet with reason: REIMAGE [23:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:51] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1007 is CRITICAL: 2.235e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007