[00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:11:06] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [00:11:10] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [02:07:31] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.31 [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664407 [03:18:00] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:20:09] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10wiki_willy) Hi @Jclark-ctr - let's just move the hard drives over to the chassis of one of the decom'd hosts. (assuming the decom'd host doesn't have any hw issues) It'll probably save some time trying to figure out i... [03:53:50] (03PS2) 10Jforrester: Branch commit for wmf/1.36.0-wmf.31 [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664407 (https://phabricator.wikimedia.org/T271345) (owner: 10TrainBranchBot) [04:17:05] !log jforrester@deploy1001 Started deploy [integration/docroot@864afdb]: Update docroot with changes from this weekend. 
[04:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:22] !log jforrester@deploy1001 Finished deploy [integration/docroot@864afdb]: Update docroot with changes from this weekend. (duration: 00m 17s) [04:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:33] 10SRE, 10Traffic, 10Patch-For-Review, 10Services (watching), 10Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (10aaron) [06:04:29] (03PS1) 10Marostegui: production-m2.sql: Add SELECT to adminlinkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/664416 (https://phabricator.wikimedia.org/T267214) [06:06:50] (03CR) 10Marostegui: [C: 03+2] production-m2.sql: Add SELECT to adminlinkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/664416 (https://phabricator.wikimedia.org/T267214) (owner: 10Marostegui) [06:09:36] PROBLEM - Check systemd state on analytics1061 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:54] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [06:12:06] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 331347913608 and 364912 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:13:00] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) >>! In T258361#6829557, @Marostegui wrote: >>>! In T258361#6822070, @jcrespo wrote: >> I am taking db1163 to, at least temporarily,... 
[06:14:34] (03PS1) 10Marostegui: instances.yaml: Remove db1093 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/664417 (https://phabricator.wikimedia.org/T273955) [06:15:08] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1093 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/664417 (https://phabricator.wikimedia.org/T273955) (owner: 10Marostegui) [06:25:22] RECOVERY - Check systemd state on analytics1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:25:40] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [06:30:44] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:31:26] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:32:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1093 from dbctl T273955', diff saved to https://phabricator.wikimedia.org/P14364 and previous config saved to /var/cache/conftool/dbconfig/20210216-063250-marostegui.json [06:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:57] T273955: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955 [06:36:28] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:37:09] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.decommission [06:37:10] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [06:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:28] (03PS1) 10Marostegui: mariadb: Remove db1093 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/664418 (https://phabricator.wikimedia.org/T273955) [06:43:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:51] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955 (10Marostegui) [06:45:07] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:46:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092 to clone db1172 T258361', diff saved to https://phabricator.wikimedia.org/P14365 and previous config saved to /var/cache/conftool/dbconfig/20210216-064602-marostegui.json [06:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:08] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:49:27] !log Reboot pc2010 pc2009 pc2008 pc2007 for kernel upgrade [06:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:59] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10elukey) @razzi very interesting use case, I am going to add in here what I usually do and we can translate this into a procedure on wikitech if you want. In this case, if you execute `dmesg -T` on t... 
[06:57:55] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10elukey) Correction - in this case the umount command failed, telling me that the target was busy (so either yarn or hdfs daemons were reading from it). I had to stop both to umount :) [07:01:55] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [07:18:39] !log Reboot dbproxy2* for kernel upgrade [07:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:37] !log Reboot dbproxy1012, 1015, 1016, 1017 for kernel upgrade [07:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:27] RECOVERY - MegaRAID on an-worker1099 is OK: OK: optimal, 24 logical, 24 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:27:21] !log Reboot dbproxy1021 for kernel upgrade [07:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:19] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove db1093 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/664418 (https://phabricator.wikimedia.org/T273955) (owner: 10Marostegui) [07:35:09] 10SRE, 10ops-eqiad, 10Analytics-Radar: Degraded RAID on an-worker1099 - https://phabricator.wikimedia.org/T273034 (10elukey) @razzi today I remembered this task by chance, I had to follow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk to add the new d... 
[07:39:48] (03PS1) 10Marostegui: mariadb: Productionize db1172 [puppet] - 10https://gerrit.wikimedia.org/r/664483 (https://phabricator.wikimedia.org/T258361) [07:40:05] !log restarting blazegraph on wdqs1013 [07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:19] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:51:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see a small suggestion to improve the dockerfile." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [07:54:36] 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10R4356th) This was [[ https://meta.wikimedia.org/wiki/Tech/News/2021/07 | announced ]] in Tech News yesterday. 
[07:56:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "While the change is correct, it shows again why I prefer not to use "FROM scratch" in images (and why it's problematic within our ecosyste" (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [07:56:24] (03PS2) 10Marostegui: mariadb: Productionize db1172 [puppet] - 10https://gerrit.wikimedia.org/r/664483 (https://phabricator.wikimedia.org/T258361) [07:57:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1172 [puppet] - 10https://gerrit.wikimedia.org/r/664483 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [07:58:59] 10SRE, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Restart x1 database master (db1103) - https://phabricator.wikimedia.org/T273758 (10Marostegui) Excellent - thanks [07:59:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "change lgtm, but maybe we should move to run as user nobody instead." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [07:59:48] (03CR) 10Ladsgroup: "it would be great if this can be merged. It doesn't have any effect on production and I'm one of maintainer of that service in the cloud." [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:01:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikilabels: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/662781 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:02:37] <_joe_> Amir1: done [08:02:53] _joe_: Thanks! 
[08:02:54] Awesome [08:03:03] <_joe_> Amir1: also, I'd frankly recommend getting away from puppet for anything that will never go to production [08:03:06] <_joe_> just my 2c [08:03:16] it was supposed to go into production [08:03:23] <_joe_> but if I had to maintain something in cloud VPS I'd leave puppet to do the basics [08:03:25] <_joe_> oh I see [08:03:31] <_joe_> then I regret merging your change [08:03:39] <_joe_> if that is speeding up adoption :D [08:03:57] haha, nah, plans changed, it won't go to production [08:04:09] <_joe_> phew! [08:04:16] <_joe_> ehm I mean dang! [08:04:39] I actually like to have a simpler puppet system (I can +2) so I can easily spawn new VMs and throw away the old ones [08:04:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but again, we might get away with using nobody here." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [08:04:49] e.g. for the jitsi nodes, etc. [08:05:07] <_joe_> Amir1: I would suggest using ansible [08:05:10] <_joe_> for your own stuff [08:05:20] <_joe_> it's simpler and doesn't need a master node [08:05:49] <_joe_> so letting the base system be managed by wmcs with puppet, and installing your services on top with ansible or something similar, is how *I* would do it [08:06:42] hmm, or capistrano? never worked with it [08:07:38] <_joe_> I'd use ansible, but anything masterless would do [08:07:49] good point. Thanks.
[08:16:30] (03PS1) 10Ladsgroup: wikilabels: Add description to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664499 [08:16:48] _joe_: It got error :D [08:16:56] (03PS1) 10Marostegui: wmnet: Failover m2-master proxy [dns] - 10https://gerrit.wikimedia.org/r/664500 [08:16:58] (03CR) 10jerkins-bot: [V: 04-1] wikilabels: Add description to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664499 (owner: 10Ladsgroup) [08:18:16] (03PS2) 10Ladsgroup: wikilabels: Add description to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664499 [08:19:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikilabels: Add description to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664499 (owner: 10Ladsgroup) [08:20:22] (03PS1) 10Marostegui: production-m2.sql: Add DELETE grant to adminlinkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/664501 (https://phabricator.wikimedia.org/T267214) [08:20:49] (03CR) 10Marostegui: [C: 03+2] production-m2.sql: Add DELETE grant to adminlinkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/664501 (https://phabricator.wikimedia.org/T267214) (owner: 10Marostegui) [08:25:57] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:30:56] !log Deploy schema change on s6 codfw - T273359 [08:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:03] T273359: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359 [08:34:53] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete cloudera config from reprepro [puppet] - 10https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: 10Muehlenhoff) [08:35:28] 10SRE, 10Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 
(10MoritzMuehlenhoff) Is there anything which needs to be rectified in Netbox for bast3004? [08:37:25] !log swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836 [08:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:29] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [08:44:59] 10SRE, 10MediaWiki-Docker: Create and publish arm64 images of wikimedia-stretch and wikimedia-buster - https://phabricator.wikimedia.org/T274140 (10MoritzMuehlenhoff) This can't be easily done, our repository is currently only built for amd64 (64 bit x86, which is the only architecture we use to run the site)... [08:55:43] 10SRE, 10Commons, 10UploadWizard: Uploading via UploadWizard gets stuck for a 11 MB JPG - https://phabricator.wikimedia.org/T274150 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:57:31] (03CR) 10Kormat: [C: 03+1] wmnet: Failover m2-master proxy [dns] - 10https://gerrit.wikimedia.org/r/664500 (owner: 10Marostegui) [08:59:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Removing my -1 as converting to debian is already in the plans" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:02:48] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10MoritzMuehlenhoff) >>! In T274106#6814245, @VeronicaThamaini wrote: > Username: Veronica Thamaini > Shell access: I have a shell name. Should I share the name here? Kindly conf... 
[09:07:23] (03PS1) 10Muehlenhoff: Add access to Superset and cn=wmf for vthamaini [puppet] - 10https://gerrit.wikimedia.org/r/664506 (https://phabricator.wikimedia.org/T274106) [09:11:34] (03PS1) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator rate limiting on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664507 (https://phabricator.wikimedia.org/T272032) [09:15:35] 10SRE, 10Wikimedia-Mailing-lists: All WMF mailing lists should be publicly listed - https://phabricator.wikimedia.org/T124324 (10MoritzMuehlenhoff) p:05Triage→03Low [09:19:56] (03CR) 10JMeybohm: [C: 03+2] tiller: Run tiller as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:20:01] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] tiller: Run tiller as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:20:13] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] eventrouter: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:20:22] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] fluent-bit: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:20:31] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] ratelimit: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:21:43] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1172 is now replicating on s8, will start pooling tomorrow [09:22:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 10%: Slowly 
repool db1092', diff saved to https://phabricator.wikimedia.org/P14367 and previous config saved to /var/cache/conftool/dbconfig/20210216-092213-root.json [09:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:25] (03PS1) 10Jcrespo: dbbackups: Reenable read-write backups, disable ro, document job ids [puppet] - 10https://gerrit.wikimedia.org/r/664508 (https://phabricator.wikimedia.org/T79922) [09:24:28] (03PS2) 10JMeybohm: tiller: Run tiller as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254) [09:24:30] (03PS2) 10JMeybohm: eventrouter: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664096 (https://phabricator.wikimedia.org/T274254) [09:24:32] (03PS2) 10JMeybohm: fluent-bit: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254) [09:24:34] (03PS2) 10JMeybohm: ratelimit: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254) [09:25:28] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] tiller: Run tiller as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664095 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:26:37] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] ratelimit: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664098 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:26:40] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] fluent-bit: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664097 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:26:43] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] eventrouter: Use numeric UID [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664096 
(https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [09:28:34] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master proxy [dns] - 10https://gerrit.wikimedia.org/r/664500 (owner: 10Marostegui) [09:28:59] !log Failover m2-master from dbproxy1013 to dbproxy1015 [09:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:54] <_joe_> jayme: maybe we should make a fresh release of the base images before rebuilding all those images [09:37:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 20%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14368 and previous config saved to /var/cache/conftool/dbconfig/20210216-093716-root.json [09:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:39] 10SRE, 10MediaWiki-Docker: Create and publish arm64 images of wikimedia-stretch and wikimedia-buster - https://phabricator.wikimedia.org/T274140 (10kostajh) 05Open→03Declined >>! In T274140#6832335, @MoritzMuehlenhoff wrote: > This can't be easily done, our repository is currently only built for amd64 (64... [09:38:01] _joe_: well, that would have been smart indeed. 
Unfortunately the build of the first set already finished [09:38:39] but I can just bump them again, then [09:38:39] <_joe_> heh :P [09:38:42] <_joe_> nah [09:38:47] <_joe_> don't worry for now [09:38:55] or we wait for the runuser refactor [09:39:16] <_joe_> yeah [09:39:30] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete cloudera config from reprepro [puppet] - 10https://gerrit.wikimedia.org/r/664304 (https://phabricator.wikimedia.org/T274797) (owner: 10Muehlenhoff) [09:39:37] (03PS3) 10Giuseppe Lavagetto: Move scaffold functions to ruby [deployment-charts] - 10https://gerrit.wikimedia.org/r/663807 [09:39:39] (03PS11) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [09:40:03] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'staging' . [09:40:03] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:22] !log deploy new certs for apertium [09:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:26] (03CR) 10Giuseppe Lavagetto: "> Patch Set 10: Code-Review-1" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [09:44:47] (03PS1) 10Muehlenhoff: Readd pulls file, but empty for now [puppet] - 10https://gerrit.wikimedia.org/r/664510 [09:47:02] (03CR) 10Muehlenhoff: [C: 03+2] Readd pulls file, but empty for now [puppet] - 10https://gerrit.wikimedia.org/r/664510 (owner: 10Muehlenhoff) [09:47:49] 10SRE, 10LDAP-Access-Requests: Access to Product Superset for Rmurthy - https://phabricator.wikimedia.org/T273813 (10jrobell) Hi @CDanis . I approve. Thank you! 
[09:48:06] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:17] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Set backoffLimit to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [09:51:14] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) I've documented the new architecture to support ES backups, with the additional pools: https://wikitech.wikimedia.org/wiki/Bacula#Retention Once the l... [09:51:41] (03CR) 10Jbond: [C: 03+1] "+1 assuming my assumption in the comments is correct" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [09:51:54] PROBLEM - Prometheus k8s-staging cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [09:52:17] (03Merged) 10jenkins-bot: linkrecommendation: Set backoffLimit to 1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [09:52:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 40%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14369 and previous config saved to /var/cache/conftool/dbconfig/20210216-095220-root.json [09:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:30] <_joe_> akosiaris / jayme it looks like there is a problem on kubestagemaster2001 [09:52:42] _joe_: it's me [09:52:51] <_joe_> ah ok [09:52:56] 
<_joe_> the apiserver is down [09:53:04] testing something out so we can avoid customizing the GlobalNetworkPolicies just for tiller [09:53:08] (03CR) 10Jbond: [C: 03+1] Add access to Superset and cn=wmf for vthamaini [puppet] - 10https://gerrit.wikimedia.org/r/664506 (https://phabricator.wikimedia.org/T274106) (owner: 10Muehlenhoff) [09:53:17] and our sslcert puppetization is pissing me off a bit right now [09:53:18] PROBLEM - Prometheus k8s-staging cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [09:53:33] <_joe_> lmk if I can help with that [09:55:20] (03PS5) 10Giuseppe Lavagetto: kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) [09:57:00] (03CR) 10jerkins-bot: [V: 04-1] kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [09:57:52] (03CR) 10Jcrespo: "Correct, see answer." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [09:58:47] (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [09:59:19] (03CR) 10Jbond: [C: 03+1] "sgtm thx 😊" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [09:59:25] (03PS1) 10Alexandros Kosiaris: Use the new kubestagemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/664515 [10:02:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] Use the new kubestagemaster.svc.codfw.wmnet cert [puppet] - 10https://gerrit.wikimedia.org/r/664515 (owner: 10Alexandros Kosiaris) [10:05:13] 10SRE, 10Patch-For-Review: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10MoritzMuehlenhoff) p:05Triage→03Low [10:05:49] 10SRE, 10Patch-For-Review: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10MoritzMuehlenhoff) The underlying bug was fixed in apt 2.1.16 via https://salsa.debian.org/apt-team/apt/-/merge_requests/140 and will be part of bullseye. In the mean time e... 
[10:06:39] (03PS6) 10Giuseppe Lavagetto: kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) [10:07:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 60%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14370 and previous config saved to /var/cache/conftool/dbconfig/20210216-100723-root.json [10:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28077/console" [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [10:11:14] (03CR) 10Jcrespo: "Sadly this cannot be deployed until we setup some kind of backups on conf* hosts, like the parent commit attempts." [puppet] - 10https://gerrit.wikimedia.org/r/664314 (https://phabricator.wikimedia.org/T273182) (owner: 10Jcrespo) [10:12:36] (03CR) 10Jcrespo: "We could do this, or if Buster upgrade is imminent, we could do it for conf2* host on buster, whatever is more likely to happen." 
[puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [10:13:43] (03PS1) 10Marostegui: instances.yaml: Remove db1075 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/664517 (https://phabricator.wikimedia.org/T274235) [10:14:24] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:04] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1075 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/664517 (https://phabricator.wikimedia.org/T274235) (owner: 10Marostegui) [10:17:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1075 from dbctl T274235', diff saved to https://phabricator.wikimedia.org/P14371 and previous config saved to /var/cache/conftool/dbconfig/20210216-101710-marostegui.json [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] T274235: decommission db1075.eqiad.wmnet - https://phabricator.wikimedia.org/T274235 [10:17:57] (03PS1) 10Elukey: hadoop: deploy kernel 4.19 to master/worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664519 (https://phabricator.wikimedia.org/T274860) [10:18:27] !log Reboot pc1010 for kernel upgrade [10:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' . [10:18:58] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . 
[10:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:06] (03PS1) 10Arturo Borrero Gonzalez: conntrackd: also install the conntrack tool [puppet] - 10https://gerrit.wikimedia.org/r/664521 (https://phabricator.wikimedia.org/T272963) [10:20:52] (03PS1) 10JMeybohm: eventrouter: Bump image version to 0.3.0-5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664522 (https://phabricator.wikimedia.org/T274254) [10:20:54] (03PS1) 10JMeybohm: api-gateway: Update fluent-bit and ratelimit images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) [10:20:56] (03PS1) 10JMeybohm: Drop unused logging stanza from blubberoid, mathoid and zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/664524 [10:20:58] (03PS1) 10JMeybohm: admin_ng: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664525 (https://phabricator.wikimedia.org/T274254) [10:21:00] (03PS1) 10JMeybohm: admin: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664526 (https://phabricator.wikimedia.org/T274254) [10:21:36] (03CR) 10Elukey: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28078/console" [puppet] - 10https://gerrit.wikimedia.org/r/664099 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [10:22:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 80%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14372 and previous config saved to /var/cache/conftool/dbconfig/20210216-102227-root.json [10:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28079/" [puppet] - 
10https://gerrit.wikimedia.org/r/664521 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [10:24:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28080/console" [puppet] - 10https://gerrit.wikimedia.org/r/664519 (https://phabricator.wikimedia.org/T274860) (owner: 10Elukey) [10:24:48] RECOVERY - Prometheus k8s-staging cache not updating on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [10:25:08] RECOVERY - Prometheus k8s-staging cache not updating on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [10:25:18] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "The latest pcc result does what we want. I'd declare this a success for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [10:26:04] PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:26:18] (03PS1) 10Muehlenhoff: Add rmurthy to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/664527 (https://phabricator.wikimedia.org/T273813) [10:27:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/664519 (https://phabricator.wikimedia.org/T274860) (owner: 10Elukey) [10:27:30] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] kubernetes::deployment_server: add yaml to configure MediaWiki sites [puppet] - 10https://gerrit.wikimedia.org/r/659941 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [10:27:33] (03PS2) 10Elukey: hadoop: deploy kernel 4.19 to master/worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664519 (https://phabricator.wikimedia.org/T274860) [10:27:48] RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:28:14] (03CR) 10Volans: [C: 03+2] "Tested on netbox-next, merging." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664332 (https://phabricator.wikimedia.org/T274802) (owner: 10CRusnov) [10:28:30] (03PS1) 10Filippo Giunchedi: aptrepo: fix elastic curator url [puppet] - 10https://gerrit.wikimedia.org/r/664528 (https://phabricator.wikimedia.org/T274797) [10:29:08] (03CR) 10Muehlenhoff: [C: 03+2] Add rmurthy to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/664527 (https://phabricator.wikimedia.org/T273813) (owner: 10Muehlenhoff) [10:29:32] _joe_: shall I merge your patch along? [10:29:44] <_joe_> yes, thanks [10:29:47] doing [10:30:07] done [10:30:20] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10elukey) @wkandek Hi! Do you think that we could find somebody in your team to work with me on this task? It seems very important and potentially blocking others (also the hosts are still... [10:30:31] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28081/console" [puppet] - 10https://gerrit.wikimedia.org/r/664519 (https://phabricator.wikimedia.org/T274860) (owner: 10Elukey) [10:32:53] (03CR) 10Elukey: [V: 03+1 C: 03+2] hadoop: deploy kernel 4.19 to master/worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/664519 (https://phabricator.wikimedia.org/T274860) (owner: 10Elukey) [10:33:02] 10SRE, 10Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10Volans) @MoritzMuehlenhoff I've tested and merged the above change and then I manually run the Netbox script https://netbox.wikime... 
[10:33:37] (03PS1) 10Kosta Harlan: linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) [10:34:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Access to Product Superset for Rmurthy - https://phabricator.wikimedia.org/T273813 (10MoritzMuehlenhoff) 05Open→03Resolved a:05jrobell→03MoritzMuehlenhoff @RMurthy : I enabled your access, you should be able to login now. You can find initial documen... [10:34:21] (03PS2) 10Kosta Harlan: linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) [10:35:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/664528 (https://phabricator.wikimedia.org/T274797) (owner: 10Filippo Giunchedi) [10:36:47] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:37:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1092 (re)pooling @ 100%: Slowly repool db1092', diff saved to https://phabricator.wikimedia.org/P14373 and previous config saved to /var/cache/conftool/dbconfig/20210216-103730-root.json [10:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:06] 10SRE, 10Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10MoritzMuehlenhoff) Great, thanks. 
[10:38:09] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: fix elastic curator url [puppet] - 10https://gerrit.wikimedia.org/r/664528 (https://phabricator.wikimedia.org/T274797) (owner: 10Filippo Giunchedi) [10:40:07] (03PS1) 10Alexandros Kosiaris: staging-codfw: Switch tillers to internal DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/664530 [10:40:46] 10SRE, 10Patch-For-Review: reprepro unable to run checkupdate and import upgraded packages - https://phabricator.wikimedia.org/T274797 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi We're back: ` root@apt1001:~# reprepro --noskipold checkupdate Calculating packages to get... Updates needed for 'buster-... [10:40:55] (03PS3) 10Kosta Harlan: linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) [10:41:35] (03PS4) 10Kosta Harlan: linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) [10:42:38] (03PS5) 10Kosta Harlan: linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) [10:46:39] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [10:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:04] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Apart from the commit message fix, I think we got to a level of complexity where the mcrouter_wancache profile could use some spec tests, " (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [10:53:02] !log Reboot es2023, es2024 and es2025 for kernel upgrade [10:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:15] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please 
separate the fix of the memcached module to a separate patch." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [10:53:22] (03PS1) 10Phuedx: vector: Enable WVUI search on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664531 (https://phabricator.wikimedia.org/T259798) [11:04:50] (03PS2) 10Phuedx: vector: Enable WVUI search on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664531 (https://phabricator.wikimedia.org/T259798) [11:05:36] (03PS2) 10Giuseppe Lavagetto: varnish: fix escaping of variables in test run script [puppet] - 10https://gerrit.wikimedia.org/r/663017 [11:09:10] 10SRE, 10Patch-For-Review: netbox update (triggered from reimage script) failed: 'ImportPuppetDB' object has no attribute 'log_error' - https://phabricator.wikimedia.org/T274802 (10Volans) 05Open→03Resolved a:05crusnov→03None [11:12:25] (03CR) 10Effie Mouzeli: [C: 03+2] memcached::instance: Add support for memcached 1.6.x [puppet] - 10https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [11:14:46] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudgw: allow incoming conntrackd TCP connection [puppet] - 10https://gerrit.wikimedia.org/r/664538 (https://phabricator.wikimedia.org/T272963) [11:16:02] (03PS2) 10Arturo Borrero Gonzalez: openstack: cloudgw: allow incoming conntrackd TCP connection [puppet] - 10https://gerrit.wikimedia.org/r/664538 (https://phabricator.wikimedia.org/T272963) [11:18:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/28082/" [puppet] - 10https://gerrit.wikimedia.org/r/664538 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [11:18:24] (03CR) 10Hnowlan: mtail: create separate metrics histogram based on endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) 
(owner: 10Hnowlan) [11:19:38] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:25:39] (03CR) 10JMeybohm: [C: 03+2] eventrouter: Bump image version to 0.3.0-5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664522 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [11:26:06] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664525 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [11:26:25] (03PS2) 10JMeybohm: admin_ng: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664525 (https://phabricator.wikimedia.org/T274254) [11:26:49] (03PS2) 10JMeybohm: Drop unused logging stanza from blubberoid, mathoid and zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/664524 [11:27:03] (03Merged) 10jenkins-bot: eventrouter: Bump image version to 0.3.0-5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664522 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [11:27:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster [11:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:46] !log elukey@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [11:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:23] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . 
[11:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:48] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:33:12] (03Merged) 10jenkins-bot: linkrecommendation: Disable cronjob for external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/664529 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:35:58] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: install memcached 1.6 on mc1037 [puppet] - 10https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [11:36:14] (03PS2) 10Effie Mouzeli: hiera: install memcached 1.6 on mc2037 [puppet] - 10https://gerrit.wikimedia.org/r/664271 (https://phabricator.wikimedia.org/T270315) [11:37:31] (03CR) 10Noa wmde: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664507 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE)) [11:38:28] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:40:01] !log Reboot dbproxy1013 for kernel upgrade [11:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:39] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1023.eqiad.wmnet [11:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:00] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:42:06] 10SRE, 10SRE-Access-Requests, 10Machine Learning Team (Active Tasks): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (10MoritzMuehlenhoff) 05Open→03Resolved I'm closing this task since the group has been created. @klausman will reopen the task (... [11:42:09] !log upgrade mc2037 to memcached 1.6 - T270315 [11:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:13] T270315: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 [11:42:17] jouncebot: now [11:42:17] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [11:42:21] jouncebot: next [11:42:21] In 0 hour(s) and 17 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1200) [11:43:28] (03PS1) 10Marostegui: Revert "wmnet: Failover m2-master proxy" [dns] - 10https://gerrit.wikimedia.org/r/664258 [11:43:40] (03CR) 10Urbanecm: [C: 03+2] "train blocker" [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664254 (https://phabricator.wikimedia.org/T274709) (owner: 10Bartosz Dziewoński) [11:44:20] (03CR) 10Marostegui: [C: 03+2] Revert "wmnet: Failover m2-master proxy" [dns] - 10https://gerrit.wikimedia.org/r/664258 (owner: 10Marostegui) [11:44:40] jouncebot: refr [11:44:43] jouncebot: refresh [11:44:44] I refreshed my knowledge about deployments. 
[11:44:49] no tab completion for commands ^^ [11:44:54] hehe [11:45:03] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [11:45:12] (03PS1) 10Giuseppe Lavagetto: Add proxy settings to running tests [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664543 [11:45:14] (03PS1) 10Giuseppe Lavagetto: Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 [11:45:20] !log Failover m2-master back from dbproxy1015 to dbproxy1013 [11:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1023.eqiad.wmnet [11:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:46] !log kharlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [11:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:56] (03Merged) 10jenkins-bot: CommentFormatter: Fix problems with editsection and quotes [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664254 (https://phabricator.wikimedia.org/T274709) (owner: 10Bartosz Dziewoński) [11:50:46] PROBLEM - k8s API server requests latencies on acrab is CRITICAL: instance=10.192.16.26 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:52:11] Urbanecm: thanks for the backport on 664254 [11:52:18] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [11:52:18] !log kharlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[11:52:19] twentyafterfour: no problem [11:52:23] it works, so I'm going to sync it [11:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:32] RECOVERY - k8s API server requests latencies on acrab is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:53:17] twentyafterfour: considering that the other blocker happens even in wmf.27, i guess we can roll out? [11:54:05] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [11:54:05] !log kharlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [11:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:13] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.30/extensions/DiscussionTools/includes/CommentFormatter.php: 5f4f516177a355b42b896ee142d66c0c969e20f1: CommentFormatter: Fix problems with editsection and quotes (T274709) (duration: 01m 12s) [11:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:18] should be fixed [11:54:19] T274709: Section titles duplicated (and no section edit links) on pages with quotes " in the title when DiscussionTools is enabled - https://phabricator.wikimedia.org/T274709 [11:55:03] Urbanecm: I suppose so [11:55:08] \o/ [11:55:24] hope we won't get 10 new blockers the moment we roll out to group2 [11:56:39] 🤔 [11:57:15] 🤞 [11:57:26] (03CR) 10Volans: [C: 03+1] "Much nicer! Looks good to me, just some final nits inline, no blockers."
(036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [11:59:33] (03CR) 10David Caro: [C: 03+2] toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1200). [12:00:04] phuedx and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:19] o/ still in a meeting for a few minutes [12:00:26] i can deploy today [12:01:07] or also not, i don't see phuedx [12:01:50] hi phuedx [12:02:01] Hullo Urbanecm. Sorry I'm alte [12:02:03] *late [12:02:06] no problem [12:02:19] (03CR) 10Urbanecm: [C: 03+2] vector: Enable WVUI search on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664531 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:03:07] (03Merged) 10jenkins-bot: vector: Enable WVUI search on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664531 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:03:30] phuedx: can you test your patch at mwdebug1001, please? 
[12:03:40] Urbanecm: On it [12:06:06] !log Deploy schema change on s5 codfw - T273359 [12:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:12] T273359: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359 [12:08:02] (03PS2) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator rate limiting on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664507 (https://phabricator.wikimedia.org/T272032) [12:08:34] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [12:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:49] Urbanecm: There seems to be a problem with the patch. testwiki and test2wiki don't inherit from the desktop-improvement config [12:10:15] phuedx: that wasn't configured at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/664531 :) [12:10:25] or am i missing sth? [12:10:44] I missed it [12:10:46] ah [12:10:54] phuedx: upload a follow-up please :) [12:12:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [12:13:16] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, and 3 others: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) 05Open→03Resolved I think this is done; we can open new tasks as needed. Thank you for your... 
[12:13:32] (03PS1) 10Phuedx: testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) [12:13:42] Urbanecm: ^ [12:14:27] (03CR) 10jerkins-bot: [V: 04-1] testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:14:37] (03CR) 10Urbanecm: [C: 04-1] "run composer buildDBLists as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:14:54] phuedx: ^ [12:18:01] (03PS11) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [12:18:55] (03CR) 10Alexandros Kosiaris: "TIL, thanks for this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:20:56] (03CR) 10Kosta Harlan: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664310 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:22:34] phuedx: ping? do you need any help? [12:23:04] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: use address per interface in the cloud-instance-transport subnet [puppet] - 10https://gerrit.wikimedia.org/r/664549 (https://phabricator.wikimedia.org/T272963) [12:23:30] Urbanecm: Sorry. 
Just updating Composer, which, for some reason, requires an update to other software [12:23:45] ah [12:24:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] configcluster: Enable etcd v3 backups for stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [12:24:17] you can also add it manually to dblists/xxx.dblist if that's faster, CI will complain if you did it incorrectly (the composer script doesn't do anything else) [12:25:39] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: use address per interface in the cloud-instance-transport subnet [puppet] - 10https://gerrit.wikimedia.org/r/664549 (https://phabricator.wikimedia.org/T272963) [12:25:50] (03PS2) 10Phuedx: testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) [12:25:58] PROBLEM - k8s API server requests latencies on acrab is CRITICAL: instance=10.192.16.26 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:26:48] (03CR) 10jerkins-bot: [V: 04-1] testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:28:01] CI wants testwiki+test2wiki in the opposite order 🙄 [12:28:19] (03PS1) 10Kosta Harlan: linkrecommendation: Disable cron job on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) [12:28:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add the add_user filter (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [12:28:46] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: use address per interface in the cloud-instance-transport subnet [puppet] - 10https://gerrit.wikimedia.org/r/664549 (https://phabricator.wikimedia.org/T272963) [12:29:19] (03PS3) 10Phuedx: 
testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) [12:29:26] RECOVERY - k8s API server requests latencies on acrab is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:29:29] Lucas_WMDE: CI always wants you didn't put [12:29:35] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'eventrouter' . [12:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:38] (03PS2) 10Kosta Harlan: linkrecommendation: Disable cron job on codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) [12:29:44] RhinosF1: +1 [12:30:22] (03CR) 10jerkins-bot: [V: 04-1] testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:30:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/28085/" [puppet] - 10https://gerrit.wikimedia.org/r/664549 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [12:31:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add proxy settings to running tests [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664543 (owner: 10Giuseppe Lavagetto) [12:31:38] phuedx: you need to re-add wikipedia to testwiki.yaml :) [12:31:56] Urbanecm: I think I'm not going to continue down this route ;) [12:32:06] wdym? [12:32:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Drop unused logging stanza from blubberoid, mathoid and zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/664524 (owner: 10JMeybohm) [12:32:14] should i rollback it phuedx? 
[12:32:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664525 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [12:32:40] Urbanecm: Yes. Thank you. I'll have a good look over testwiki's configuration [12:32:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin: Update tiller image to 2.16.7-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/664526 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [12:32:50] On the plus side, I can now run composer buildDBLists ;) [12:33:04] (03PS1) 10Urbanecm: Revert "vector: Enable WVUI search on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664259 (https://phabricator.wikimedia.org/T259798) [12:33:09] (03CR) 10Urbanecm: [C: 03+2] Revert "vector: Enable WVUI search on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664259 (https://phabricator.wikimedia.org/T259798) (owner: 10Urbanecm) [12:33:11] (03Abandoned) 10Phuedx: testwikis: Inherit from desktop-improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664548 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:33:16] (03CR) 10Alexandros Kosiaris: Add support for php deployments (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [12:33:22] (03Merged) 10jenkins-bot: Add proxy settings to running tests [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664543 (owner: 10Giuseppe Lavagetto) [12:33:45] (03Merged) 10jenkins-bot: Drop unused logging stanza from blubberoid, mathoid and zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/664524 (owner: 10JMeybohm) [12:34:02] (03Merged) 10jenkins-bot: Revert "vector: Enable WVUI search on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664259 (https://phabricator.wikimedia.org/T259798) (owner: 10Urbanecm) [12:34:16] Lucas_WMDE: floor is yours [12:34:20] ok thanks 
[12:34:26] (03PS3) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator rate limiting on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664507 (https://phabricator.wikimedia.org/T272032) [12:34:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Wikibase Repo ID generator rate limiting on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664507 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE)) [12:35:26] (03Merged) 10jenkins-bot: Enable Wikibase Repo ID generator rate limiting on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664507 (https://phabricator.wikimedia.org/T272032) (owner: 10Lucas Werkmeister (WMDE)) [12:36:03] I’ll quickly test on mwdebug1001 that it doesn’t blow up [12:36:08] but I can only properly test it after sync [12:36:26] Urbanecm: Haha! *facepalm* I'd removed the wikipedia line... [12:36:44] phuedx: I noted that in the cr :) [12:36:45] I see what you meant now [12:36:53] mwdebug looks fine, syncing [12:36:55] sorry if it was a confusing note [12:37:32] (03CR) 10JMeybohm: [C: 04-1] "I would recommend not to do that." [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:38:05] Urbanecm: Not at all. I was distracted by having to do updates and I didn't follow what you meant. 
Better to have rolled back to a known-good state [12:38:35] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:664507|Enable Wikibase Repo ID generator rate limiting on Test Wikidata (T272032)]] 1/2 (duration: 01m 12s) [12:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:40] T272032: Add rate limit for creating Item IDs - https://phabricator.wikimedia.org/T272032 [12:38:58] sure [12:39:54] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/Wikibase.php: Config: [[gerrit:664507|Enable Wikibase Repo ID generator rate limiting on Test Wikidata (T272032)]] 2/2 (duration: 01m 06s) [12:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:26] !log jayme@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [12:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:50] (03PS1) 10Phuedx: vector: Enable WVUI search on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664552 (https://phabricator.wikimedia.org/T259798) [12:44:09] phuedx: looks good now! [12:44:35] RhinosF1: Yeah! It took me longer than it should've :D [12:44:55] (03PS1) 10Filippo Giunchedi: grafana: update home dashboard to Grafana 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/664555 (https://phabricator.wikimedia.org/T263747) [12:46:03] phuedx: it's fine [12:46:12] happens to even the most experienced devs, phuedx :) [12:46:16] Lucas_WMDE: you done? [12:46:19] just about [12:46:22] quickly looking at logstash [12:46:37] okay [12:46:44] looks fine, feel free to take over again :) [12:46:47] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [12:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:51] thanks [12:46:55] phuedx: should we try again? [12:47:38] Urbanecm: Sure. 
If time allows [12:47:47] sure [12:47:51] (03CR) 10Urbanecm: [C: 03+2] vector: Enable WVUI search on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664552 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:48:44] (03Merged) 10jenkins-bot: vector: Enable WVUI search on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664552 (https://phabricator.wikimedia.org/T259798) (owner: 10Phuedx) [12:49:12] phuedx: pulled to mwdebug1001 again :) [12:49:49] (03CR) 10Kosta Harlan: "> Patch Set 2: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:52:01] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add gerrit log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [12:53:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:05] (03Abandoned) 10Filippo Giunchedi: interfaces: allow setting queues on i40e NICs [puppet] - 10https://gerrit.wikimedia.org/r/661053 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [12:59:09] phuedx: ping? [13:02:18] Urbanecm: OK. Roll it back. I'll speak with a colleague about an error that I'm seeing [13:02:27] !log jayme@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:31] okay [13:02:58] (03PS1) 10Urbanecm: Revert "vector: Enable WVUI search on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664260 (https://phabricator.wikimedia.org/T259798) [13:03:06] (03CR) 10Urbanecm: [C: 03+2] Revert "vector: Enable WVUI search on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664260 (https://phabricator.wikimedia.org/T259798) (owner: 10Urbanecm) [13:03:56] (03Merged) 10jenkins-bot: Revert "vector: Enable WVUI search on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664260 (https://phabricator.wikimedia.org/T259798) (owner: 10Urbanecm) [13:04:10] anyway, done [13:04:32] Thanks, Urbanecm. I'll write a comment on the associated task [13:04:41] cool [13:05:27] (03PS1) 10Kormat: integration: Tidy up use of fixtures [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664561 [13:06:42] so backports all done? should we roll the train forward now? [13:07:15] jouncebot: next [13:07:15] In 3 hour(s) and 52 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1700) [13:07:26] jouncebot: now [13:07:26] No deployments scheduled for the next 3 hour(s) and 52 minute(s) [13:07:50] twentyafterfour: mind waiting a second? 
[13:09:06] (03PS1) 10Urbanecm: Temporarily add cswiki-black-ribbon.png as a static resource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664562 [13:09:22] (03PS2) 10Urbanecm: Temporarily add cswiki-black-ribbon.png as a static resource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664562 [13:09:34] (03CR) 10Urbanecm: [C: 03+2] Temporarily add cswiki-black-ribbon.png as a static resource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664562 (owner: 10Urbanecm) [13:10:42] (03PS3) 10Urbanecm: Temporarily add cswiki-black-ribbon.png as a static resource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664562 [13:10:50] (03CR) 10Urbanecm: [C: 03+2] Temporarily add cswiki-black-ribbon.png as a static resource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664562 (owner: 10Urbanecm) [13:11:28] (03PS1) 10Hnowlan: Add simple blubber image [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 [13:11:37] (03Merged) 10jenkins-bot: Temporarily add cswiki-black-ribbon.png as a static resource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664562 (owner: 10Urbanecm) [13:13:16] !log urbanecm@deploy1001 Synchronized static/images/cswiki-black-ribbon.png: 5d5b5c41d889f6f30566f23bd9f71d16337b9d6d: Temporarily add cswiki-black-ribbon.png as a static resource (duration: 01m 07s) [13:13:17] twentyafterfour: all yours now, thanks! 
[13:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:41] (03PS1) 10Jbond: P:puppet_compiler: refactor a bit [puppet] - 10https://gerrit.wikimedia.org/r/664565 [13:17:10] (03PS1) 10Hnowlan: tegola: remove image in favour of blubber-built image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664566 (https://phabricator.wikimedia.org/T270170) [13:19:08] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: refactor a bit [puppet] - 10https://gerrit.wikimedia.org/r/664565 (owner: 10Jbond) [13:27:11] (03CR) 10Kormat: [C: 03+2] integration: Tidy up use of fixtures [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664561 (owner: 10Kormat) [13:30:36] (03Merged) 10jenkins-bot: integration: Tidy up use of fixtures [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664561 (owner: 10Kormat) [13:40:31] !log Deploy schema change on s2 codfw - T273359 [13:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:36] T273359: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359 [13:46:10] (03CR) 10Giuseppe Lavagetto: Add the add_user filter (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [13:52:23] (03PS1) 10Klausman: secrets: Add dummy keys for ml_etcd clusters [labs/private] - 10https://gerrit.wikimedia.org/r/664568 [13:53:56] (03PS2) 10Klausman: secrets: Add dummy keys for ml_etcd clusters [labs/private] - 10https://gerrit.wikimedia.org/r/664568 (https://phabricator.wikimedia.org/T273071) [13:54:25] (03CR) 10Klausman: [C: 03+2] secrets: Add dummy keys for ml_etcd clusters [labs/private] - 10https://gerrit.wikimedia.org/r/664568 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [13:57:47] (03CR) 10Klausman: [V: 03+2 C: 03+2] secrets: Add dummy keys for ml_etcd clusters [labs/private] - 10https://gerrit.wikimedia.org/r/664568 
(https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [13:59:46] (03PS3) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [13:59:52] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:01:09] (03CR) 10Klausman: Add etcd role for ML Team's new clusters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [14:02:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Sadly, while this is desirable, etcd3 needs the backup script to be adapted a bit." [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [14:04:07] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [14:04:17] :( [14:04:20] * volans ere [14:04:24] * volans here even [14:04:55] eqsin again [14:05:25] yep [14:05:28] hey [14:06:14] cp5004 && cp5006 are struggling [14:07:14] !log rolling restart of cp500[1-6] [14:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:18] here [14:07:37] rzl: go have coffee first, we have it covered [14:07:44] akosiaris: <3 [14:07:58] (03PS4) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [14:08:35] (03PS1) 10Kormat: WMFReplication: Support reading different my.cnf in test env. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664569 [14:08:45] (03PS1) 10Jbond: P:puppet_compiler: update cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664570 (https://phabricator.wikimedia.org/T273673) [14:08:59] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: update cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664570 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [14:09:49] (03PS2) 10Jbond: P:puppet_compiler: update cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664570 (https://phabricator.wikimedia.org/T273673) [14:10:14] (03CR) 10Jcrespo: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/664313 (https://phabricator.wikimedia.org/T271573) (owner: 10Jcrespo) [14:11:47] (03PS1) 10Kormat: integration: Allow different scopes for fixture use. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664571 [14:12:57] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [14:13:06] RECOVERY - Device not healthy -SMART- on an-worker1097 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops [14:13:15] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: update cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/664570 (https://phabricator.wikimedia.org/T273673) (owner: 10Jbond) [14:13:19] (03PS1) 10Kormat: test: Add query_db util func [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664572 [14:14:19] (03PS1) 10Kormat: integration: Add tests for db-move-replica [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664573 [14:14:36] (03CR) 10Kormat: [C: 03+2] WMFReplication: Support reading different my.cnf in test env. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664569 (owner: 10Kormat) [14:14:40] (03CR) 10Kormat: [C: 03+2] integration: Allow different scopes for fixture use. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664571 (owner: 10Kormat) [14:16:15] (03CR) 10Kormat: [C: 03+2] test: Add query_db util func [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664572 (owner: 10Kormat) [14:17:20] (03Merged) 10jenkins-bot: WMFReplication: Support reading different my.cnf in test env. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664569 (owner: 10Kormat) [14:17:22] (03Merged) 10jenkins-bot: integration: Allow different scopes for fixture use. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664571 (owner: 10Kormat) [14:18:13] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [14:18:30] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:30] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5004 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.445 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:19:53] (03Merged) 10jenkins-bot: test: Add query_db util func [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664572 (owner: 10Kormat) [14:20:01] (03CR) 10Kormat: [C: 03+2] integration: Add tests for db-move-replica [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664573 (owner: 10Kormat) [14:23:22] (03Merged) 10jenkins-bot: integration: Add tests for db-move-replica [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664573 (owner: 10Kormat) [14:24:08] !log MediaWiki train: prepare to promote all wikis to 1.36.0-wmf.30 refs T271344 [14:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:14] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344 [14:30:13] RECOVERY - MegaRAID on an-worker1097 is OK: OK: optimal, 23 logical, 23 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:30:29] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Ottomata) Approved, it looks like this is a case of not needing direct Hadoop access, so no Kerberos principal is required. https://wikitech.wikimedia.org/wiki/Analytics/Data_access#ssh_log... 
[14:31:48] (03PS1) 1020after4: all wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664576 [14:32:40] (03CR) 1020after4: [C: 03+2] all wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664576 (owner: 1020after4) [14:34:06] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664576 (owner: 1020after4) [14:34:12] (03PS1) 10Jbond: puppet-diffs: update [puppet] - 10https://gerrit.wikimedia.org/r/664577 [14:35:08] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10VeronicaThamaini) Hello, Yes. I can confirm that the shell name is vthamaini. Thanks! [14:37:12] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.07 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [14:38:07] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.30 refs T271344 bfc73b6e8b33e49e916d9d93cf5cdb7624297d44 [14:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:16] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344 [14:41:00] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:41:08] (03CR) 10Jbond: [C: 03+2] puppet-diffs: update [puppet] - 10https://gerrit.wikimedia.org/r/664577 (owner: 10Jbond) [14:41:36] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift 
instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [14:41:54] uhm [14:42:11] related to train? [14:43:43] (03PS1) 10Jbond: P:puppet-compiler: add description to timer job [puppet] - 10https://gerrit.wikimedia.org/r/664579 [14:45:53] Should be [14:45:58] Just means someone is uploading a lot [14:46:30] what about the elasticsearch index errors? [14:46:45] seems like a transient spike... [14:48:55] (03PS2) 10JMeybohm: charts/calico: Add GlobalNetworkPolicy for tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/659863 [14:48:57] (03PS3) 10JMeybohm: admin_ng: Spectify IPv4 and IPv6 addresses to kubernetes API [deployment-charts] - 10https://gerrit.wikimedia.org/r/659864 [14:48:59] (03PS2) 10JMeybohm: calico: Typha needs to get endpoints to discover it's instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/660399 (https://phabricator.wikimedia.org/T267653) [14:49:01] (03CR) 10Jbond: [C: 03+2] P:puppet-compiler: add description to timer job [puppet] - 10https://gerrit.wikimedia.org/r/664579 (owner: 10Jbond) [14:51:02] (03PS3) 10JMeybohm: calico: Typha needs to get endpoints to discover it's instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/660399 (https://phabricator.wikimedia.org/T267653) [14:52:19] (03PS1) 10Jbond: P:puppet-compiler: use fully qualified cmd [puppet] - 10https://gerrit.wikimedia.org/r/664580 [14:52:59] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/664581 [14:53:50] (03PS2) 10Jcrespo: dbbackups: Reenable read-write backups, disable ro, document job ids [puppet] - 10https://gerrit.wikimedia.org/r/664508 (https://phabricator.wikimedia.org/T79922) [14:55:08] PROBLEM - Device not healthy -SMART- on an-worker1097 is CRITICAL: cluster=analytics device=sat+megaraid,13 instance=an-worker1097 job=node site=eqiad 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1097&var-datasource=eqiad+prometheus/ops [14:55:32] (03CR) 10Jbond: "LGTM see comment if pcc fails" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664581 (owner: 10CDanis) [14:56:14] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable read-write backups, disable ro, document job ids [puppet] - 10https://gerrit.wikimedia.org/r/664508 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [14:57:28] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/664581 [14:57:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28090/console" [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [14:59:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28092/console" [puppet] - 10https://gerrit.wikimedia.org/r/664581 (owner: 10CDanis) [14:59:45] (03CR) 10Jbond: [C: 03+1] "LGTM and PCC agrees" [puppet] - 10https://gerrit.wikimedia.org/r/664581 (owner: 10CDanis) [15:00:21] (03PS3) 10CDanis: Allow upload frontends to cache objects up to 4MiB [puppet] - 10https://gerrit.wikimedia.org/r/664581 (https://phabricator.wikimedia.org/T274888) [15:00:29] jbond42: thanks for the help, agree that PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/28091/ [15:01:02] cdanis: yes and as you have confined it to esqin i think its safeto go [15:03:19] <_joe_> you will probably have to deploy it one node at a time [15:03:34] (03CR) 10Elukey: [V: 03+1 C: 03+1] Add etcd role for ML Team's new clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:04:01] (03CR) 10Jbond: [C: 03+2] P:puppet-compiler: use fully qualified cmd [puppet] - 
10https://gerrit.wikimedia.org/r/664580 (owner: 10Jbond) [15:04:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Ottomata) Approved! [15:06:45] (03PS2) 10Muehlenhoff: Add access to Superset and cn=wmf for vthamaini [puppet] - 10https://gerrit.wikimedia.org/r/664506 (https://phabricator.wikimedia.org/T274106) [15:07:44] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28095/console" [puppet] - 10https://gerrit.wikimedia.org/r/664581 (https://phabricator.wikimedia.org/T274888) (owner: 10CDanis) [15:07:46] (03CR) 10CDanis: "PCC shows intended change in eqsin and no-op in eqiad https://puppet-compiler.wmflabs.org/compiler1001/28091/" [puppet] - 10https://gerrit.wikimedia.org/r/664581 (https://phabricator.wikimedia.org/T274888) (owner: 10CDanis) [15:08:01] (03CR) 10Muehlenhoff: [C: 03+2] Add access to Superset and cn=wmf for vthamaini [puppet] - 10https://gerrit.wikimedia.org/r/664506 (https://phabricator.wikimedia.org/T274106) (owner: 10Muehlenhoff) [15:08:57] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10jcrespo) 05Open→03Resolved This is now done, we have full-covered, regularly-scheduled ES cluster backups on bacula. 
[15:09:59] (03PS2) 10Mholloway: Sample mediawiki.client.session_tick at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) [15:11:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Allow upload frontends to cache objects up to 4MiB [puppet] - 10https://gerrit.wikimedia.org/r/664581 (https://phabricator.wikimedia.org/T274888) (owner: 10CDanis) [15:11:34] (03CR) 10Mholloway: [C: 03+2] Sample mediawiki.client.session_tick at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [15:11:38] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin 'A:cp-upload and A:eqsin' 'disable-puppet "cdanis deploying Iab4d211 T263496"' [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:42] (03CR) 10CDanis: [C: 03+2] Allow upload frontends to cache objects up to 4MiB [puppet] - 10https://gerrit.wikimedia.org/r/664581 (https://phabricator.wikimedia.org/T274888) (owner: 10CDanis) [15:11:43] T263496: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 [15:11:56] argh wrong bug [15:12:12] !log previous message was re: T274888 [15:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:17] T274888: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 [15:12:18] (03PS5) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [15:12:29] (03Merged) 10jenkins-bot: Sample mediawiki.client.session_tick at 1:100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [15:12:42] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:12:58] (03CR) 10Ottomata: [C: 03+1] Remove MaxMind archiving code [puppet] - 10https://gerrit.wikimedia.org/r/663687 (https://phabricator.wikimedia.org/T273891) (owner: 10Razzi) [15:13:00] (03CR) 10Klausman: Add etcd role for ML Team's new clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:13:03] (03PS6) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [15:13:23] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [15:13:23] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:03] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [15:14:03] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:00] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: WikimediaEvents: Sample mediawiki.client.session_tick at 1:100 (T274172) (duration: 01m 00s) [15:15:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Ottomata) Approved. 
[15:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:07] T274172: [Session Length] Complete sessionTick deployment to all wikis - https://phabricator.wikimedia.org/T274172 [15:15:07] (03PS1) 10Elukey: hadoop: add more precise notes_url for various daemons [puppet] - 10https://gerrit.wikimedia.org/r/664584 [15:15:10] (03CR) 1020after4: [C: 03+2] Branch commit for wmf/1.36.0-wmf.31 [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664407 (https://phabricator.wikimedia.org/T271345) (owner: 10TrainBranchBot) [15:16:33] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:16:33] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:11] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [15:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:55] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:21] (03CR) 10JMeybohm: [C: 03+2] calico: Typha needs to get endpoints to discover it's instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/660399 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:22:39] (03PS2) 10Elukey: hadoop: add more precise notes_url for various daemons [puppet] - 10https://gerrit.wikimedia.org/r/664584 [15:22:40] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T274106 (10MoritzMuehlenhoff) 05Open→03Resolved @VeronicaThamaini : I enabled your access, you should be able to login now. You can find initial documentation at https://wikitech.wiki... 
[15:22:50] (03Merged) 10jenkins-bot: calico: Typha needs to get endpoints to discover it's instances [deployment-charts] - 10https://gerrit.wikimedia.org/r/660399 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [15:23:22] (03PS2) 10Muehlenhoff: admin: Add Pablo Aragon (paragon) user [puppet] - 10https://gerrit.wikimedia.org/r/663849 (https://phabricator.wikimedia.org/T274631) (owner: 10Vgutierrez) [15:23:34] (03PS1) 10Jbond: P:puppet_compiler: add job to deletd large pcc reports after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/664585 (https://phabricator.wikimedia.org/T274782) [15:25:25] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [15:25:27] (03CR) 10Elukey: [C: 03+2] hadoop: add more precise notes_url for various daemons [puppet] - 10https://gerrit.wikimedia.org/r/664584 (owner: 10Elukey) [15:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:38] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [15:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:55] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [15:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:06] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:25] (03CR) 10Muehlenhoff: [C: 03+2] admin: Add Pablo Aragon (paragon) user [puppet] - 10https://gerrit.wikimedia.org/r/663849 (https://phabricator.wikimedia.org/T274631) (owner: 10Vgutierrez) [15:27:00] !log re-enabling Puppet on cp-upload@eqsin to deploy Iab4d211 T274888 [15:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:05] T274888: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 [15:29:31] (03PS2) 10Muehlenhoff: admin: Add paragon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/663850 (https://phabricator.wikimedia.org/T274631) (owner: 10Vgutierrez) [15:30:18] (03PS2) 10Jbond: P:puppet_compiler: add job to deletd large pcc reports after 7 days [puppet] - 10https://gerrit.wikimedia.org/r/664585 (https://phabricator.wikimedia.org/T274782) [15:31:37] (03CR) 10Muehlenhoff: [C: 03+2] admin: Add paragon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/663850 (https://phabricator.wikimedia.org/T274631) (owner: 10Vgutierrez) [15:34:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff @Pablo : I've enabled your access, but it will take up to 30 minutes until P... 
[15:41:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.31 [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664407 (https://phabricator.wikimedia.org/T271345) (owner: 10TrainBranchBot) [15:42:05] (03PS3) 10Muehlenhoff: admin: Add christinedk user [puppet] - 10https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez) [15:42:20] (03PS2) 10Giuseppe Lavagetto: Add the add_user filter [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 [15:43:17] (03PS7) 10Klausman: Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) [15:44:03] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071 [15:44:05] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071 [15:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:08] T273071: Create etcd VMs for use with ML platform - https://phabricator.wikimedia.org/T273071 [15:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:37] (03CR) 10Klausman: [C: 03+2] Add etcd role for ML Team's new clusters [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:46:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:36] (03CR) 10Muehlenhoff: [C: 03+2] admin: Add christinedk user [puppet] - 10https://gerrit.wikimedia.org/r/664226 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez) [15:47:18] (03CR) 10Klausman: [C: 03+2] Add etcd role for ML Team's new clusters (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:48:40] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (backup1002, ...), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:49:28] (03PS2) 10Muehlenhoff: admin: Add christinedk to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez) [15:50:54] (03CR) 10Muehlenhoff: [C: 03+2] admin: Add christinedk to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/664227 (https://phabricator.wikimedia.org/T274304) (owner: 10Vgutierrez) [15:51:26] (03PS1) 10Klausman: files/ssl: Fix broken name of ML etcd SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/664587 (https://phabricator.wikimedia.org/T273071) [15:52:06] (03CR) 10Klausman: [C: 03+2] files/ssl: Fix broken name of ML etcd SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/664587 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:52:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664566 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [15:53:05] (03Abandoned) 10JMeybohm: charts/calico: Add GlobalNetworkPolicy for tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/659863 (owner: 10JMeybohm) [15:53:12] (03Abandoned) 10JMeybohm: admin_ng: Spectify IPv4 and IPv6 addresses to kubernetes API [deployment-charts] - 10https://gerrit.wikimedia.org/r/659864 (owner: 10JMeybohm) [15:53:45] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922 (10Marostegui) <3 nice work [15:54:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10MoritzMuehlenhoff) 05Open→03Resolved a:05Vgutierrez→03MoritzMuehlenhoff @ChristineDeKock : I've enabled your access, but it wil... [15:54:24] (03CR) 10JMeybohm: [C: 03+1] "If this works out, we should switch eventrouter next" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664530 (owner: 10Alexandros Kosiaris) [15:55:57] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:58:52] !log power down ms-be2031 for firmware upgrade [15:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:40] (03CR) 10CDanis: [C: 03+1] varnish: fix escaping of variables in test run script [puppet] - 10https://gerrit.wikimedia.org/r/663017 (owner: 10Giuseppe Lavagetto) [16:01:08] PROBLEM - Host ms-be2031 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:54] RECOVERY - Host ms-be2031 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [16:04:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: fix escaping of variables in test run script [puppet] - 10https://gerrit.wikimedia.org/r/663017 (owner: 
10Giuseppe Lavagetto) [16:07:57] gehel: thanks [16:09:54] !log installing python-bottle security updates on buster [16:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:27] (03PS1) 10Dave Pifke: arclamp: add excimer-real pipeline [puppet] - 10https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) [16:13:08] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [16:16:03] (03PS6) 10Dave Pifke: wall-clock excimer profiling for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: 10Ori.livneh) [16:17:45] !log installing edk2 security updates [16:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:14] (03PS1) 10Lucas Werkmeister (WMDE): Enable Wikibase Repo ID generator rate limiting on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664593 (https://phabricator.wikimedia.org/T272032) [16:25:19] !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071 [16:25:21] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071 [16:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:24] T273071: Create etcd VMs for use with ML platform - https://phabricator.wikimedia.org/T273071 [16:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:00] (03PS1) 10Klausman: site: move mml-etcd in codfw from insetup to etcd role [puppet] - 10https://gerrit.wikimedia.org/r/664595 (https://phabricator.wikimedia.org/T273071) [16:27:44] (03CR) 10Klausman: [C: 03+2] site: move mml-etcd in codfw from insetup to etcd role [puppet] - 
10https://gerrit.wikimedia.org/r/664595 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:40:09] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:40:28] XioNoX: ^^ [16:48:33] (03PS1) 10Jbond: debug_host: calculate the correct realm [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/664598 [16:54:04] (03PS2) 10Jbond: debug_host: calculate the correct realm [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/664598 [16:57:42] (03CR) 10Razzi: [C: 03+2] Remove MaxMind archiving code [puppet] - 10https://gerrit.wikimedia.org/r/663687 (https://phabricator.wikimedia.org/T273891) (owner: 10Razzi) [16:59:29] PROBLEM - Host ms-be2031 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:04] jbond42 and cdanis: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1700). [17:03:15] RECOVERY - Host ms-be2031 is UP: PING OK - Packet loss = 0%, RTA = 33.04 ms [17:03:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1093.eqiad.wmnet - https://phabricator.wikimedia.org/T273955 (10wiki_willy) a:05wiki_willy→03Cmjohnson [17:05:01] (03CR) 10Alexandros Kosiaris: "That looks ok as a start, I am not so sure about the makefile issuing docker commands directly, I don't think CI allows for that. 
We 'll n" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [17:06:36] (03CR) 10JMeybohm: [C: 04-1] "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/664550 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [17:08:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add the add_user filter (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/664544 (owner: 10Giuseppe Lavagetto) [17:09:24] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:10:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:11:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[17:11:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] staging-codfw: Switch tillers to internal DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/664530 (owner: 10Alexandros Kosiaris) [17:13:43] (03Merged) 10jenkins-bot: staging-codfw: Switch tillers to internal DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/664530 (owner: 10Alexandros Kosiaris) [17:13:46] (03CR) 10Hnowlan: "> Patch Set 1:" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [17:15:43] PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:16:44] hello! I have a (possibly dumb) question while running a Hive query. when I run a query and move on to something else and in the meantime the query finishes, I see the backlog filled with messages like: "ExecutorAllocationManager: Removing executor 26 because it has been idle for 60 seconds (new desired total will be 1)" [17:16:54] is there a way to filter these out so that I can see the actual query result? thanks! [17:17:15] RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:17:22] oh wow, sorry [17:17:26] I thought this was analytics [17:17:30] sigh [17:18:13] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [17:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:02] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [17:21:02] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. 
[17:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:11] !log jforrester@deploy1001 Started deploy [integration/docroot@8ab9125]: Update docroot with Special:MyLanguage links. [17:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:23] !log jforrester@deploy1001 Finished deploy [integration/docroot@8ab9125]: Update docroot with Special:MyLanguage links. (duration: 00m 11s) [17:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:19] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [17:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:24] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [17:27:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1352.eqiad.wmnet with reason: REIMAGE [17:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:44] (03PS1) 10Ahmon Dancy: Add "minimum hits" support to logspam/logspam-watch [puppet] - 10https://gerrit.wikimedia.org/r/664602 [17:28:46] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: let keepalived track static routes [puppet] - 10https://gerrit.wikimedia.org/r/664603 (https://phabricator.wikimedia.org/T272963) [17:28:48] (03PS1) 10Alexandros Kosiaris: calico/coredns: Use the external kubernetes API [deployment-charts] - 10https://gerrit.wikimedia.org/r/664604 [17:29:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1352.eqiad.wmnet with reason: REIMAGE [17:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:22] (03CR) 10JMeybohm: [C: 03+1] calico/coredns: Use the external kubernetes API [deployment-charts] - 10https://gerrit.wikimedia.org/r/664604 (owner: 10Alexandros Kosiaris) [17:30:23] !log 
akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [17:30:24] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [17:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:35] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [17:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:18] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [17:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:46] !log akosiaris@deploy1001 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [17:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:28] !log akosiaris@deploy1001 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [17:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:50] !log 1.36.0-wmf.31 was branched at c49ac6d2448efa085bdd34fc415aeece05a98dde (T271345) [17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:55] T271345: 1.36.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T271345 [17:37:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1347.eqiad.wmnet with reason: REIMAGE [17:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1346.eqiad.wmnet with reason: REIMAGE [17:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1347.eqiad.wmnet with reason: REIMAGE [17:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:30] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1346.eqiad.wmnet with reason: REIMAGE [17:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:03] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:52] (03CR) 10Dzahn: [C: 03+2] "thanks Valentin and Amir. confirmed again, compiler says " No hosts found matching `C:tlsproxy::prometheus` "" [puppet] - 10https://gerrit.wikimedia.org/r/659377 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [17:54:30] (03CR) 10Ppchelko: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [17:54:38] (03CR) 10Dzahn: [C: 03+2] "tested query from phab1001 with mysql to -u phstats -h m3-slave.eqiad.wmnet ... works, 96 rows in set. not pasting anymore since you told" [puppet] - 10https://gerrit.wikimedia.org/r/664002 (https://phabricator.wikimedia.org/T274711) (owner: 10Aklapper) [17:54:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] phabricator weekly changes email: List cookie-licked Bugzilla tasks [puppet] - 10https://gerrit.wikimedia.org/r/664002 (https://phabricator.wikimedia.org/T274711) (owner: 10Aklapper) [17:55:19] RECOVERY - Keyholder SSH agent on alert1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [17:58:02] (03CR) 10Dzahn: [C: 03+1] "Thanks for the summary Hashar, I understand. I will just consider it stalled for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox) [17:58:44] (03PS1) 10Dduvall: testwikis wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664612 [17:58:46] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664612 (owner: 10Dduvall) [17:59:41] (03CR) 10Dzahn: "Unless paladox wants to amend to do the "add support for different auth types" without using this specific option, as Hashar suggested abo" [puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox) [17:59:54] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664612 (owner: 10Dduvall) [18:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1800). [18:03:17] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: let keepalived track static routes [puppet] - 10https://gerrit.wikimedia.org/r/664603 (https://phabricator.wikimedia.org/T272963) [18:04:44] !log dduvall@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.31 [18:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:20] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: let keepalived track static routes [puppet] - 10https://gerrit.wikimedia.org/r/664603 (https://phabricator.wikimedia.org/T272963) [18:14:20] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: let keepalived track additional static routes [puppet] - 10https://gerrit.wikimedia.org/r/664603 (https://phabricator.wikimedia.org/T272963) [18:15:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/28099/" [puppet] - 10https://gerrit.wikimedia.org/r/664603 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [18:17:33] (03PS1) 10Dzahn: DHCP: update MAC address for new 
mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/664625 (https://phabricator.wikimedia.org/T274023) [18:24:21] (03CR) 10Dzahn: [C: 03+2] "This VM has been recreated from scratch but using the same name." [puppet] - 10https://gerrit.wikimedia.org/r/664625 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [18:24:25] PROBLEM - Host ms-be2031 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:28:35] RECOVERY - Host ms-be2031 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [18:28:42] !log mw1352 - powercycle via mgmt [18:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... [18:31:37] the latency spike looks to be slowly dropping [18:31:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1347.eqiad.wmnet'] ` an... 
[18:32:19] legoktm: confirmed, it is down to WARN level, and a few appservers are coming back into pool..but still [18:32:30] rescheduled the check [18:32:36] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1347.eqiad.wmnet [18:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:16] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1352.eqiad.wmnet'] ` an... [18:33:49] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1352.eqiad.wmnet [18:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:45] scap pulling on 1347 and 1352 to then pool them again [18:34:54] jouncebot: next [18:34:54] In 0 hour(s) and 25 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1900) [18:35:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1346.eqiad.wmnet'] ` an... 
[18:35:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1346.eqiad.wmnet [18:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1347.eqiad.wmnet [18:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:05] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1352.eqiad.wmnet [18:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:36] legoktm: repooling what was in progress but giving it some time before starting the next ones [18:39:45] ack [18:39:49] I have 4 in progress right now [18:40:01] maybe it was a bit too much between both of us [18:40:21] i am taking a break and pick it up again later [18:40:50] well, or maybe 1 instead of 4 [18:41:33] doing mwdebug1002 now [18:41:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1346.eqiad.wmnet [18:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:49] 10SRE, 10ops-codfw: ms-be2031 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T273895 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi firmware upgrade Let me know if you are still seeing the same error Ilo: 2.50. to 2.77 Bios 2.30 to 2.64 [18:45:53] mutante: seeing some bad known_hosts entries during scap sync to testwikis. is that on account of the reimaging? [18:46:18] legoktm: ^ [18:46:27] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:46:27] marxarelli: which hosts? 
[18:46:42] marxarelli: probably yes, even though it should only happen if the timing is bad [18:46:56] mw1289.eqiad.wmnet, mw1290.eqiad.wmnet, mw1297.eqiad.wmnet [18:47:31] PROBLEM - Check systemd state on analytics1062 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:34] yeah, that's me [18:47:42] I'll make sure to run scap pull before repooling [18:48:05] right on. ty! [18:49:35] !log dduvall@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.31 (duration: 49m 37s) [18:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:08] hoping the second spike also goes away now and was due to deployment itself [18:50:20] or once those are back in pool as well [18:51:15] the /usr/local/sbin/check-and-restart-php hook is still running fwiw [18:52:13] ack,thx [18:52:27] !log re-creating mwdebug1002 [18:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:39] RECOVERY - Check systemd state on analytics1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:07] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1288.eqiad.wmnet with reason: REIMAGE [18:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:08] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1289.eqiad.wmnet with reason: REIMAGE [18:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:31] !log puppetmaster1002 - puppet cert clean mwdebug1002.eqiad.wmnet, sign new request, initial puppet run (T274023) [18:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:36] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023 [18:59:41] 10ops-codfw, 10DC-Ops: (Need By: 
TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [18:59:58] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T1900) [19:00:12] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1288.eqiad.wmnet with reason: REIMAGE [19:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:07] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1290.eqiad.wmnet with reason: REIMAGE [19:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:10] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [19:01:46] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) a:03Jclark-ctr [19:01:53] 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10RobH) [19:02:21] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1289.eqiad.wmnet with reason: REIMAGE [19:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:10] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1297.eqiad.wmnet with reason: REIMAGE [19:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:31] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1290.eqiad.wmnet with reason: REIMAGE [19:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:31] RECOVERY - High average GET 
latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:06:18] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1297.eqiad.wmnet with reason: REIMAGE [19:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:39] PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 1076 MB (2% inode=73%): /tmp 1076 MB (2% inode=73%): /var/tmp 1076 MB (2% inode=73%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [19:09:01] PROBLEM - PHP7 rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:09:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade [19:09:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade [19:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:38] mwdebug1002 is me, acked [19:10:31] any luck? [19:15:07] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10KFrancis) @CDanis The NDA has been signed. Please move forward with the access request. Thanks! 
[19:16:45] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10CDanis) a:05KFrancis→03MoritzMuehlenhoff [19:19:11] (03CR) 10Alex Paskulin: "My understanding of the original product design was that namespaces were meant to separate endpoints based on origin, so it would be /serv" [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [19:32:03] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:13] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:42:03] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:05] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:46:38] Have there been any recent (Debian) updates to font stack packages (FontConfig, Pango, or whatever there is), or the Platypus package? 
Asking because EasyTimeline shows no text anymore in PNG files: https://phabricator.wikimedia.org/T274822 [19:48:03] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [19:49:39] twentyafterfour: fyi https://phabricator.wikimedia.org/T274934 might be related to .30, on mobile do can't test atm [19:50:11] andre__: could be related to T245757 (Buster migration) [19:50:12] T245757: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 [19:50:12] andre__: most appservers were upgraded from Stretch to Buster in the past month or so [19:51:35] eh, true (forgot that one). Thanks! [19:54:18] andre__: also EasyTimeline is still looking for a product owner it seems (and no request on-record that I can see) [19:54:56] ideally someone a little bit closer to the code would be able to confirm a package issue before escalating to sre [19:55:32] right... in an ideal world, https://phabricator.wikimedia.org/T137291 would happen (migrate to Graph), but there are also some cons (like having to learn new things) [19:58:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1288.eqiad.wmnet', 'mw12... [19:58:39] !log [WDQS] De-pooled `wdqs100[4,7]` to catch up on lag, and pooled `wdqs100[5,6]` [19:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:57] andre__: Graphits successor but yes, that certainly seems reasonable. [20:00:04] marxarelli and twentyafterfour: That opportune time is upon us again. Time for a Mediawiki train - American Version deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210216T2000). [20:01:16] apergos: yes, but just treating it like a brandnew VM after deleting it and stuff being cleaned up in netbox [20:01:51] eh, if it gets the job done! [20:02:18] yea, users won't care if it's the same VM or not, hostkey changes either way [20:02:50] there were like 3 issues but none related ..tldr [20:03:58] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1288.eqiad.wmnet [20:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:03] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1289.eqiad.wmnet [20:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:08] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet [20:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:13] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1297.eqiad.wmnet [20:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:05] RECOVERY - PHP7 rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 302 Found - 659 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:06:40] ah, easytimeline [20:06:59] our favourite topic for when we run out of easy things to do or think about [20:09:04] "easy". 
Naming things is hard [20:09:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (10wiki_willy) a:03Jclark-ctr [20:11:35] I'm surprised it's not spitting out warning messages like crazy [20:14:23] hmm [20:14:32] the cgroup errors from the other day are still happening [20:14:33] (03CR) 10Cwhite: [C: 03+1] "Not verified, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/664555 (https://phabricator.wikimedia.org/T263747) (owner: 10Filippo Giunchedi) [20:15:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1002.eqiad.wmnet [20:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mwdebug1002.eqiad.wmnet [20:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:30] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1288.eqiad.wmnet [20:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:36] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1289.eqiad.wmnet [20:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:41] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1290.eqiad.wmnet [20:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:45] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1297.eqiad.wmnet [20:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:47] !log mwdebug1002 has been recreated on buster and has been repooled after scap pull - you can find a .tar.gz in your home with the contents of your home before reimaging, fingerprint at T274023#6835116 [20:20:47] (03PS4) 10Cwhite: profile: remove deprecated syslog input [puppet] - 
10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) [20:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:52] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023 [20:20:54] apergos: ^ it's back [20:21:09] woo hoo! [20:21:30] anything left on stretch or are all the mwdebugs buster now? [20:22:12] (03PS9) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [20:23:04] apergos: mwdebug1003 is stretch now [20:23:32] ok, I'll update my deploy-howto-notes with that, thanks! [20:24:09] legoktm: ok if i continue the train? [20:24:26] I think so [20:25:00] alrighty [20:25:56] apergos: mwdebug1003 is stretch on purpose for that use case.. until we delete it [20:26:05] it remains the special case [20:26:18] okey dokey [20:26:24] (03PS1) 10Jdlrobson: Silent deprecate ProtectionForm::buildForm [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664265 (https://phabricator.wikimedia.org/T274889) [20:26:25] wow the train rolling at lasy [20:26:26] t [20:26:29] crossing fingers! [20:26:32] 1001 i need to do but now i have to step away to dentist :p [20:29:17] er [20:29:20] "enjoy"?
[20:29:51] (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664642 [20:29:52] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664642 (owner: 10Dduvall) [20:30:37] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664642 (owner: 10Dduvall) [20:31:45] (03CR) 10Cwhite: [C: 03+2] profile: remove deprecated syslog input [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [20:32:23] bd808: https://sal.toolforge.org/ is "500 Internal Server Error" [20:32:39] bd808: works now, nvm [20:32:40] not for me though [20:32:47] must have been a momentary glitch [20:32:47] * bd808 peeks at logs [20:33:00] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.31 [20:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:36] legoktm: it looks like the lighttpd process and the php process were having a hard time talking to each other inside the tool. Nothing really to explain what that problem was.
[20:34:59] Just "response not received" for lighttpd when waiting on php [20:35:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-syslog-udp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-syslog-udp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:35:48] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-syslog-tcp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/logstash-syslog-tcp is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:46:31] (03PS1) 10Herron: kibana: set vega.enabled: false by default [puppet] - 10https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) [20:47:19] !log 1.36.0-wmf.31 rolled to group0. no new errors for wmf.31 (T271345) [20:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:30] T271345: 1.36.0-wmf.31 deployment blockers - https://phabricator.wikimedia.org/T271345 [20:47:34] \o/ [20:47:41] :) [20:49:02] confd alert from ^^ is resolved [20:50:38] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/28100/" [puppet] - 10https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) (owner: 10Herron) [20:50:43] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [20:51:39] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [20:53:00] I'm not sure why I even opened up Developers/Maintainers to identify EasyTimeline as unmaintained [20:58:51] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:29] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:02:02] (03PS1) 10Legoktm: Set $wgTimelineFontDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) [21:02:04] (03PS1) 10Legoktm: Remove putenv() for GDFONTPATH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) [21:03:41] (03CR) 10jerkins-bot: [V: 04-1] Remove putenv() for GDFONTPATH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm) [21:04:46] (03CR) 10Legoktm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm) [21:08:43] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [21:10:13] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:11:04] (03PS2) 10Legoktm: Set $wgTimelineFontDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) [21:11:06] (03PS2) 10Legoktm: Remove putenv() for GDFONTPATH [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) [21:17:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:18:29] (03PS2) 10DannyS712: Silent deprecate ProtectionForm::buildForm [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664265 (https://phabricator.wikimedia.org/T274889) (owner: 10Jdlrobson) [21:18:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:18:51] (03CR) 10DannyS712: "PS2 just adds the "cherry picked from commit ..."" [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664265 (https://phabricator.wikimedia.org/T274889) (owner: 10Jdlrobson) [21:19:55] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1885 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [21:19:57] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [21:20:05] uhoh [21:20:30] oh my [21:20:33] hmm [21:20:47] this explains the latency spikes [21:21:28] or it's an effect? 
[21:21:49] the worker saturation is likely an effect [21:21:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:22:26] * akosiaris around [21:22:53] nothing is glaring at me re: databases or memcached [21:23:11] can we isolate it by server? [21:23:13] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:23:14] hmm, there are corresponding spikes in 304s served by the appservers (not APIs) [21:23:35] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:24:23] <_joe_> do you need me to go to my desktop? [21:24:24] shout if you need help, I'm far from an expert on the topic and a bit later here [21:24:29] Anyone else seeing errors on ores right now? [21:24:33] *late [21:24:38] I'm getting Internal Server Errors in Huggle [21:24:55] <_joe_> phuzion: we have larger issues, likely causing those [21:25:06] phuzion: might be linked to the pages that just went off. [21:25:26] Noted. 
[21:25:29] what are those 1.5krps 304 spikes in https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=22&from=now-1h&orgId=1&to=now&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET ? [21:25:44] sorry, 1k rps [21:25:52] akosiaris: yeah, that's what I was just looking at [21:26:32] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6446 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [21:27:05] <_joe_> ok if someone is looking into the 304s, I suggest to do so from the apache access logs on one server [21:27:11] <_joe_> I will look at the php slowlog [21:27:24] yeah, digging on 1261 [21:27:34] there's even a rows written spike in https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=37&from=now-1h&orgId=1&to=now&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET for s1 [21:27:43] they coincide [21:27:57] suddenly 25k wr/s from 5kwr/s ? [21:28:15] <_joe_> es3 [21:28:29] nope, it's s1 [21:28:54] whatever it is, it's enwiki related [21:29:43] s6 has increased as well, but it's more stable and it's "only" 3x writes starting on 21:00 [21:30:01] a lot of the 304s have a "Wikidata Query Service Updater Bot" u-a [21:30:01] but it has had similar behavior recently (2days) [21:30:18] urls like www.wikidata.org/wiki/Special:EntityData/Q64572145.ttl?flavor=dump&revision=1362743153 [21:31:09] if there was a burst of a specific type of Wikidata edits it could trigger writes to s1/enwp [21:32:44] why would a wdqs bot (we own that, don't we?) hit the appservers though and not the api servers? [21:33:35] yes and I guess Special:EntityData is getting routed to appservers since it's not an api.php endpoint? 
[21:33:56] ah indeed [21:34:34] (why it's not a proper API I don't understand) [21:39:28] I have a bad hunch [21:41:10] (03PS1) 10Andrew Bogott: Openstack control node galera: send mariadb logs to central logging [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) [21:43:55] PROBLEM - Squid on install1003 is CRITICAL: connect to address 208.80.154.32 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [21:44:29] akosiaris: ^ expected, right? [21:44:56] rzl, yes, logging it now [21:45:08] 👍 [21:45:34] !log stop squid as a stopgap on install1003 and disable puppet so that it is not restarted while we figure out what wdqs updater is doing to cause issue to mediawiki [21:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:23] (03CR) 10Andrew Bogott: "I'd like to be able to check galera syncing status on Kibana. This will accomplish that, but it also contains a line that redirects all my" [puppet] - 10https://gerrit.wikimedia.org/r/664678 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [21:49:28] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/664644 (https://phabricator.wikimedia.org/T274777) (owner: 10Herron) [21:49:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:49] akosiaris: webproxy down? [22:00:54] FYI razzi ^ [22:01:01] ottomata: yes, logged at sal [22:01:05] oh just saw sorry [22:08:12] akosiaris: qq, how long do you expect it to be down? [22:08:54] ottomata: we are still investigating the outage. 
I guess until we can concretely rule out that enabling it will cause another outage [22:09:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster [22:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:33] ok, sounds good. it is going to cause some alerts in analytics after an hourish. I just filed a task T274951 to stop relying on webproxy. [22:10:33] T274951: WikimediaEventUtilities and produce_canary_events job should use api-ro.discovery.wmnet instead of meta.wikimedia.org to get stream config - https://phabricator.wikimedia.org/T274951 [22:10:45] nothing will break, but some monitoring things are going to be confused [22:12:50] ok, good to know. [22:13:28] tools shouldn't be anyway relying on webproxy functioning, it's a squid meant for systems to use to fetch updates and such [22:14:46] akosiaris: yeah, this is my fault, missed this. i think sometimes the proxy just gets set and then we forget they are relying on it [22:14:53] in this case it is because there is an analytics vlan firewall [22:15:03] that means we can't reach meta.wm.org without the webproxy.. [22:15:37] actually....i just realized that event data is not going to be imported while this is down. still no data loss, but this is def a problem [22:15:54] elukey: yt? [22:16:01] (03PS8) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [22:18:12] akosiaris: am looking for docs on how to make an analytics vlan acl change [22:18:15] i've never done it myself [22:18:20] there is a repo now, right? [22:18:34] ottomata: yes homer [22:20:23] ottomata: https://wikitech.wikimedia.org/wiki/Homer [22:20:36] looking there, where is config? in puppet?
[22:20:47] https://wikitech.wikimedia.org/wiki/Homer#Making_changes [22:20:59] AH on cumin [22:20:59] ok [22:21:09] the repos are on gerrit [22:21:22] ahah [22:21:23] except the private one [22:21:24] homer public [22:21:33] that is similar to puppet private [22:21:48] got it [22:22:23] RECOVERY - Squid on install1003 is OK: TCP OK - 0.000 second response time on 208.80.154.32 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [22:22:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10nettrom_WMF) 05Declined→03Open I'm reopening this task as we've now got visualizations/dashboards in Superset that @CBogen needs access to, and these require t... [22:22:56] !log re-enable puppet and squid on install1003. wdqs seems to be mildly related to the outage, restart it [22:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:48] !log bstorm@cumin1001 START - Cookbook wmcs.wikireplicas.add_wiki [22:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:20] ottomata: I 've re-enabled it, it might have been a red herring after all [22:24:28] ok phew [22:24:38] def exposed problems on our side that we should fix asap though [22:24:49] akosiaris: i'd rather not emergency fix it now if i don't have to [22:27:17] you don't have to, but you gave me a good idea about becoming chaos monkey with this every now and then [22:27:26] and not just that :-) [22:27:29] :) [22:29:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico/coredns: Use the external kubernetes API [deployment-charts] - 10https://gerrit.wikimedia.org/r/664604 (owner: 10Alexandros Kosiaris) [22:30:59] (03Merged) 10jenkins-bot: calico/coredns: Use the external kubernetes API [deployment-charts] - 10https://gerrit.wikimedia.org/r/664604 (owner: 10Alexandros Kosiaris) [22:31:08] (03PS1) 1020after4: Silent deprecate ProtectionForm::buildForm [core] 
(wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664648 (https://phabricator.wikimedia.org/T274889) [22:31:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:47] (03CR) 1020after4: [C: 03+2] Silent deprecate ProtectionForm::buildForm [core] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664648 (https://phabricator.wikimedia.org/T274889) (owner: 1020after4) [22:35:05] !log bstorm@cumin1001 END (FAIL) - Cookbook wmcs.wikireplicas.add_wiki (exit_code=99) [22:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:15] !log restarting wdqs-updater on wdqs2001 [22:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:50] (03CR) 10CRusnov: "> Patch Set 7:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [22:40:54] 10SRE, 10InternetArchiveBot, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Legoktm) We saw a bunch of these requests again today. The main problem is that making requests like https://www.wikidata.... [22:53:22] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10Jclark-ctr) Checked all three host no issue with DAC possibly port is turned off for two host https://netbox.wikimedia.org/... [22:58:05] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (10Jclark-ctr) @RobH Checked all three host no issue with DAC possibly port is turned off for two host https://netbox.wikimedia.org/dcim/devices/3023/ moved dac cable from port 39 to 4... 
[23:00:44] (03CR) 1020after4: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664670 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm) [23:00:51] (03CR) 1020after4: [C: 03+1] Set $wgTimelineFontDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664669 (https://phabricator.wikimedia.org/T274822) (owner: 10Legoktm) [23:03:24] twentyafterfour: thanks, I'll get that backported/deployed during the next window [23:03:36] (03Merged) 10jenkins-bot: Silent deprecate ProtectionForm::buildForm [core] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664648 (https://phabricator.wikimedia.org/T274889) (owner: 1020after4) [23:03:55] 👍 [23:04:17] jouncebot: next [23:04:17] In 0 hour(s) and 55 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T0000) [23:06:19] PROBLEM - Host ms-be1034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:09:53] !log twentyafterfour@deploy1001 Synchronized php-1.36.0-wmf.30/includes/HookContainer/DeprecatedHooks.php: silence deprecation refs T274889 (duration: 01m 14s) [23:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:59] T274889: Use of ProtectionForm::buildForm hook (used in FlaggedRevsUIHooks::onProtectionForm) was deprecated in MediaWiki 1.36 - https://phabricator.wikimedia.org/T274889 [23:13:35] RECOVERY - Host ms-be1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [23:14:29] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10Jclark-ctr) @fgiunchedi would you be ok with chassis swap using ms-be1018 recently decommissioned? [23:25:05] when that alert triggered earlier I was literally on the dentist chair. 
saw backlog [23:28:31] mutante: I'd take an alert every day instead of a dentist chair :) [23:29:01] heh yea :p [23:29:12] jouncebot: now [23:29:12] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [23:29:25] jouncebot: next [23:29:25] In 0 hour(s) and 30 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210217T0000) [23:29:52] the window looks empty in calendar [23:30:00] can I assume there won't be a deploy? [23:30:30] mutante: I was going to backport some stuff [23:30:53] legoktm: I am asking to decide if i should reimage mwdebug1001 now or tomorrow [23:31:09] fine with me, I can use 1002 or 1003 [23:31:24] ok then, please use 1002 and let me know if all is fine [23:31:29] that is a good test [23:31:55] (03PS1) 10Legoktm: Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664649 (https://phabricator.wikimedia.org/T274822) [23:32:07] (03PS1) 10Legoktm: Add $wgTimelineFontDirectory to be passed as GDFONTPATH [extensions/timeline] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/664650 (https://phabricator.wikimedia.org/T274822) [23:34:24] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:35:13] (added to the calendar) [23:35:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [23:35:23] ah, thank you!
[23:37:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug1001.eqiad.wmnet with reason: OS upgrade [23:37:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug1001.eqiad.wmnet with reason: OS upgrade [23:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:18] 10SRE, 10SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (10MattCleinman) [23:43:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mwdebug1001.eqiad.wmnet [23:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:07] !log reimaging mwdebug1001 with buster [23:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:37] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mwdebug1001.eqiad.wmnet [23:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1351.eqiad.wmnet with reason: REIMAGE [23:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1351.eqiad.wmnet with reason: REIMAGE [23:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:30] !log puppetmaster1001 - puppet cert clean mwdebug1001, sign new request, initial puppet run, now on buster (T274023) [23:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:35] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023