[06:14:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:32] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Marostegui) Thanks John - I have powered up the host and it looks good now. I will take it from here ` root@db1134:~# free -g total used free shared... [06:19:13] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [06:19:41] RECOVERY - Check systemd state on elastic2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:19] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2047-production-search-omega-codfw on elastic2047 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-omega-codfw&var-instance=elastic2047&panelId=37 [06:22:17] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Marostegui) 05Open→03Resolved I am going to close this - the task to track next steps is: T275343 Thanks everyone for helping out here [06:35:08] ACKNOWLEDGEMENT - SSH on db1162.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds ayounsi https://phabricator.wikimedia.org/T275309 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:35:08] ACKNOWLEDGEMENT - Host db1162.mgmt is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T275309 [06:46:11] thanks XioNoX [07:11:59] !log powercycle elastic2045 - com2 available, no ssh, no root 
login (hangs indefinitely), no prometheus metrics reported [07:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:45] 10SRE, 10ops-eqiad, 10DBA: db1162 crashed - https://phabricator.wikimedia.org/T275309 (10wiki_willy) a:05wiki_willy→03Cmjohnson [07:14:36] !log Restarting CI Jenkins for plugin upgrade # T271683 [07:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:42] T271683: Upgrade Jenkins Gearman plugin from a forked repo - https://phabricator.wikimedia.org/T271683 [07:14:51] PROBLEM - Host elastic2045 is DOWN: PING CRITICAL - Packet loss = 100% [07:15:33] RECOVERY - Host elastic2045 is UP: PING OK - Packet loss = 0%, RTA = 33.10 ms [07:16:29] RECOVERY - SSH on elastic2045 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:16:49] PROBLEM - Check systemd state on elastic2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:11] RECOVERY - Check systemd state on elastic2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:31] PROBLEM - MD RAID on elastic2045 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [07:22:32] ACKNOWLEDGEMENT - MD RAID on elastic2045 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T275344 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [07:22:36] 10SRE, 10ops-codfw: Degraded RAID on elastic2045 - https://phabricator.wikimedia.org/T275344 (10ops-monitoring-bot) [07:25:21] 10ops-codfw, 10Discovery: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10elukey) [07:28:53] pff [07:29:49] !log Restarting CI Jenkins to downgrade plugin # T271683 [07:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:56] T271683: Upgrade Jenkins Gearman plugin from a forked repo - https://phabricator.wikimedia.org/T271683 [07:38:19] !log installing openldap security updates on LDAP replicas [07:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:51] (03PS1) 10Muehlenhoff: Update production key for annet [puppet] - 10https://gerrit.wikimedia.org/r/665813 [07:49:05] (03PS1) 10Elukey: Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) [07:52:08] (03PS1) 10Marostegui: instances.yaml: Remove db1090* from dbctl. 
[puppet] - 10https://gerrit.wikimedia.org/r/665892 (https://phabricator.wikimedia.org/T274333) [07:52:42] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1090* from dbctl. [puppet] - 10https://gerrit.wikimedia.org/r/665892 (https://phabricator.wikimedia.org/T274333) (owner: 10Marostegui) [07:54:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1090* from dbctl T274333', diff saved to https://phabricator.wikimedia.org/P14426 and previous config saved to /var/cache/conftool/dbconfig/20210222-075437-marostegui.json [07:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:44] T274333: decommission db1090.eqiad.wmnet - https://phabricator.wikimedia.org/T274333 [07:56:41] (03CR) 10Muehlenhoff: [C: 03+2] Update production key for annet [puppet] - 10https://gerrit.wikimedia.org/r/665813 (owner: 10Muehlenhoff) [08:04:07] jouncebot: now [08:04:07] No deployments scheduled for the next 3 hour(s) and 25 minute(s) [08:04:09] jouncebot: next [08:04:10] In 3 hour(s) and 25 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1130) [08:04:17] (03CR) 10Urbanecm: [C: 03+2] fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665548 (https://phabricator.wikimedia.org/T275017) (owner: 10Urbanecm) [08:04:38] (03PS3) 10ArielGlenn: make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 [08:04:40] (03PS1) 10ArielGlenn: make commons json entity dumps use mediainfo name like rdf ones [puppet] - 10https://gerrit.wikimedia.org/r/665989 [08:05:10] (03CR) 10jerkins-bot: [V: 04-1] make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 (owner: 10ArielGlenn) [08:05:22] (03Merged) 10jenkins-bot: fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific 
file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665548 (https://phabricator.wikimedia.org/T275017) (owner: 10Urbanecm) [08:07:37] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 577267230832 and 890778 seconds Gehel tracked in https://phabricator.wikimedia.org/T275350 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:07:37] ACKNOWLEDGEMENT - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 721627853568 and 890181 seconds Gehel tracked in https://phabricator.wikimedia.org/T275350 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:08:00] (03PS1) 10Elukey: Apply ensure absent to the monitor refine flags in Analytics refine [puppet] - 10https://gerrit.wikimedia.org/r/665990 [08:11:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: cea41a2f7736aa29dee8f10de4c0c17353ece963: fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific file (T275017; 1/2) (duration: 01m 08s) [08:11:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28149/console" [puppet] - 10https://gerrit.wikimedia.org/r/665990 (owner: 10Elukey) [08:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:21] T275017: Reviewers can't stabilize pages on fiwiki - https://phabricator.wikimedia.org/T275017 [08:12:03] ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Gehel tracked in https://phabricator.wikimedia.org/T275353 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:27] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: cea41a2f7736aa29dee8f10de4c0c17353ece963: fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific file (T275017; 2/2) (duration: 00m 55s) [08:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:38] * Urbanecm done [08:12:42] Majavah: fyi ^^ [08:13:27] (03CR) 10Elukey: [V: 03+1 C: 03+2] Apply ensure absent to the monitor refine flags in Analytics refine [puppet] - 10https://gerrit.wikimedia.org/r/665990 (owner: 10Elukey) [08:13:36] !log Upgrade x2 codfw hosts kernel [08:17:25] 10SRE, 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10ayounsi) 05Resolved→03Open a:05Cmjohnson→03herron > The last Puppet run was at Mon Feb 8 14:16:19 UTC 2021 (19799 minutes ago). Puppet is disabled. disabled for re-racking T273984... [08:17:50] (03CR) 10Nikerabbit: [C: 03+1] Enable Section Translation on Bengali Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [08:19:00] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10ayounsi) Changed its Netbox status to `failed` so the Netbox report doesn't alert. 
[08:19:21] (03PS2) 10Elukey: Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) [08:19:24] (03PS2) 10Muehlenhoff: Remove members of gerrit-admin who are also members of gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/665246 [08:20:15] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10fgiunchedi) Links came back over the weekend, looks like we can proceed when ready [08:22:36] (03PS5) 10KartikMistry: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) [08:24:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove members of gerrit-admin who are also members of gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/665246 (owner: 10Muehlenhoff) [08:24:49] (03PS6) 10KartikMistry: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) [08:25:00] (03PS4) 10ArielGlenn: make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 [08:25:25] (03CR) 10Elukey: [C: 03+2] bigtop: require hadoop users before installing daemon packages [puppet] - 10https://gerrit.wikimedia.org/r/665360 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [08:25:34] (03CR) 10jerkins-bot: [V: 04-1] make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 (owner: 10ArielGlenn) [08:27:06] (03PS5) 10ArielGlenn: make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/660871 [08:27:21] (03CR) 10Ayounsi: "One nit, LGTM otherwise." 
(031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [08:28:42] (03CR) 10Elukey: Add a mediawiki-api term to the analytics-in4 filter (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [08:29:11] (03PS3) 10Elukey: Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) [08:29:45] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10fgiunchedi) Any updates? Thank you [08:30:21] 10SRE, 10ops-codfw: Degraded RAID on elastic2045 - https://phabricator.wikimedia.org/T275344 (10Gehel) [08:30:25] 10SRE, 10ops-codfw, 10Discovery: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Gehel) [08:30:55] (03CR) 10Ayounsi: [C: 03+1] Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [08:33:16] (03CR) 10Filippo Giunchedi: mw_rc_irc: add check_prometheus alert on no messages being relayed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [08:38:28] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: weekly rebuild [puppet] - 10https://gerrit.wikimedia.org/r/665991 [08:39:04] !log depool elastic2045 and ban from clsuters - T275345 [08:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:11] T275345: Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 [08:40:31] !log swift codfw-prod: more weight to ms-be20[58-61] - T269337 [08:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:38] T269337: Add ms-be20[58-61] to swift - 
https://phabricator.wikimedia.org/T269337 [08:44:21] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Gehel) @Papaul this server is depooled and banned from the cluster. Can you replace sda? This should still be under warranty. @RKemper: once the new disk is in p... [08:44:27] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Gehel) [08:45:37] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:45:47] 10SRE, 10ops-codfw: Degraded RAID on elastic2045 - https://phabricator.wikimedia.org/T275344 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:46:41] 10SRE, 10Language-Team, 10Performance-Team, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:00:50] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet [09:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:12] !log installing screen security updates on Buster [09:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:21] (03PS1) 10Filippo Giunchedi: prometheus: force prometheus.service to be disabled [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) [09:06:28] we should be able to hit gerrit 666000 today [09:07:41] (03CR) 10Muehlenhoff: prometheus: force prometheus.service to be disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [09:08:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host stat1005.eqiad.wmnet [09:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:25] this should be the commit of the beast which kills the scb cluster :-) [09:08:44] haha! looking forward to it [09:10:44] (03CR) 10Muehlenhoff: prometheus: force prometheus.service to be disabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [09:13:23] (03PS1) 10Urbanecm: Grant sysops review and unreviewed pages right by default [extensions/FlaggedRevs] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666013 (https://phabricator.wikimedia.org/T275293) [09:14:15] (03PS2) 10Filippo Giunchedi: prometheus: mask prometheus.service [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) [09:15:24] moritzm: thank you for the review [09:16:52] the other thing I'm currently trying to understand is why e.g. prometheus@ops.service isn't enabled on e.g. 
prometheus3001 but it should [09:17:37] ah now I get it, no [Install] section [09:17:37] having a look [09:17:43] I'll fix that shortly [09:19:53] (03PS1) 10Filippo Giunchedi: prometheus: add [Install] section to prometheus@ instances [puppet] - 10https://gerrit.wikimedia.org/r/665993 (https://phabricator.wikimedia.org/T273278) [09:20:26] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:40] (03PS2) 10Urbanecm: Grant sysops review and unreviewed pages right by default [extensions/FlaggedRevs] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666013 (https://phabricator.wikimedia.org/T275293) [09:22:34] (03CR) 10Muehlenhoff: prometheus: mask prometheus.service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [09:24:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/665993 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [09:28:31] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:32] (03CR) 10Filippo Giunchedi: prometheus: mask prometheus.service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [09:30:06] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/665089 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:30:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [09:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:53] (03PS3) 10Filippo Giunchedi: prometheus: mask prometheus.service [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) [09:30:55] (03PS2) 10Filippo Giunchedi: prometheus: add [Install] section to prometheus@ instances [puppet] - 10https://gerrit.wikimedia.org/r/665993 (https://phabricator.wikimedia.org/T273278) [09:34:15] (03CR) 10Muehlenhoff: prometheus: mask prometheus.service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [09:34:56] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] services: remove restbase http LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/665089 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:35:42] !log Deploy schema change on s3 codfw master, there will be lag on s3 codfw - T273359 [09:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:49] T273359: Schema change for renaming name_title_timestamp on archive table - https://phabricator.wikimedia.org/T273359 [09:36:35] (03PS1) 10David Caro: wmf-auto-restart: Added some help to the script [puppet] - 10https://gerrit.wikimedia.org/r/665995 [09:38:32] (03PS2) 10David Caro: wmf-auto-restart: Added some help to the script [puppet] - 
10https://gerrit.wikimedia.org/r/665995 (https://phabricator.wikimedia.org/T275354) [09:42:41] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:33] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:57] <_joe_> !log restarting low-traffic pybals in codfw to remove the restbase http endpoint [09:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:19] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.17:7231]) https://wikitech.wikimedia.org/wiki/PyBal [09:54:39] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.17:7231]) https://wikitech.wikimedia.org/wiki/PyBal [09:56:10] <_joe_> that's me ofc [10:00:09] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.17:7231]) https://wikitech.wikimedia.org/wiki/PyBal [10:00:37] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.17:7231]) https://wikitech.wikimedia.org/wiki/PyBal [10:02:13] interesting, there was a small spike of 502s on analytics-intake, which I wouldn't expect to be affected [10:03:11] must have a very high request rate [10:03:33] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:37] PROBLEM - Check systemd state on ms-be2046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:54] <_joe_> jynus: analytics-intake doesn't even go via low-traffic, at least I hope [10:04:05] * _joe_ checks [10:05:14] _joe_, what I saw https://logstash.wikimedia.org/goto/377bb72bd6c77b9c43a51ab597370ddf [10:05:37] <_joe_> yeah I'm trying to see if that's possibly related to the restarts at all [10:05:57] yeah, maybe is unrelated [10:06:49] jouncebot: refresh [10:06:50] I refreshed my knowledge about deployments. [10:06:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P14428 and previous config saved to /var/cache/conftool/dbconfig/20210222-100653-marostegui.json [10:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:13] PROBLEM - SSH on analytics1058.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:07:34] <_joe_> jynus: meh it points at eventgate-logging-external AFAICT that is class: low-traffic [10:07:43] <_joe_> so yes, probably related [10:08:13] <_joe_> let's see what happens once I've restarted eqiad [10:08:57] I am not worried, I just wanted to let you know [10:09:13] <_joe_> sure, thaank you :) [10:09:30] <_joe_> I need to clean my keyboard, it's starting to have ghosting issues [10:10:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 10%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14429 and previous config saved to /var/cache/conftool/dbconfig/20210222-101018-root.json [10:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:27] <_joe_> !log restarting pybal on lvs1016 to pick up restbase http removal [10:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:11] <_joe_> !log restarting pybal on lvs1015 to pick up restbase http removal 
[10:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:44] <_joe_> jynus: definitely related, and this is not good [10:17:55] <_joe_> also worrisome that we had two spikes [10:23:22] 10SRE, 10LDAP-Access-Requests: LDAP access to the nda group for Uzoma Ozurumba - https://phabricator.wikimedia.org/T275139 (10UOzurumba) >>! In T275139#6841740, @Aklapper wrote: >> I am tagging you because I am required to do so. Thank you. > Hmm, that surprises me. Could you elaborate why you think that you a... [10:25:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14430 and previous config saved to /var/cache/conftool/dbconfig/20210222-102521-root.json [10:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:03] !log depool sessionstore in codfw for sessionstore certificate refresh. T274564 [10:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:10] T274564: sessionstore certificates will expire soon - https://phabricator.wikimedia.org/T274564 [10:30:22] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=sessionstore [10:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10WMDE-leszek) Thanks @bd808 for mentioning the tag. "Staff" account or not, some permission adjustment on wikitech will likely be needed in either cases, so it is good to know how to make this... 
[10:36:43] <_joe_> !log manually removed the restbase-http ipvs entry from the load balancers [10:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:01] RECOVERY - Check systemd state on ms-be2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:59] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:38:47] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:38:51] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:39:13] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:40:06] (03PS1) 10Volans: CHANGELOG: fix typo [software/pywmflib] - 10https://gerrit.wikimedia.org/r/666002 [10:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14431 and previous config saved to /var/cache/conftool/dbconfig/20210222-104025-root.json [10:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:55] (03CR) 10Elukey: [C: 03+1] CHANGELOG: fix typo [software/pywmflib] - 10https://gerrit.wikimedia.org/r/666002 (owner: 10Volans) [10:42:46] (03PS4) 10Filippo Giunchedi: prometheus: mask prometheus.service [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) [10:42:48] (03PS3) 10Filippo Giunchedi: prometheus: add [Install] section to prometheus@ instances [puppet] - 10https://gerrit.wikimedia.org/r/665993 (https://phabricator.wikimedia.org/T273278) [10:43:11] <_joe_> the puppet compiler has a full disk again [10:44:18] <_joe_> I'm on it 
[10:44:24] (03CR) 10Filippo Giunchedi: prometheus: mask prometheus.service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [10:44:35] thank you [10:48:05] (03CR) 10Effie Mouzeli: [C: 03+2] Supply default value for profile::memcached::enable_16 for cloud [puppet] - 10https://gerrit.wikimedia.org/r/665417 (https://phabricator.wikimedia.org/T270315) (owner: 10Ahmon Dancy) [10:49:04] (03PS4) 10Filippo Giunchedi: prometheus: add [Install] section to prometheus@ instances [puppet] - 10https://gerrit.wikimedia.org/r/665993 (https://phabricator.wikimedia.org/T273278) [10:49:06] (03PS5) 10Filippo Giunchedi: prometheus: mask prometheus.service [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) [10:49:32] <_joe_> !log removing stray old builds from compiler1003 [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:38] (03PS1) 10Effie Mouzeli: hiera: install memcached 1.6 on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/666005 (https://phabricator.wikimedia.org/T270315) [10:53:47] (03CR) 10Volans: [C: 03+2] CHANGELOG: fix typo [software/pywmflib] - 10https://gerrit.wikimedia.org/r/666002 (owner: 10Volans) [10:54:39] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 75%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14432 and previous config saved to /var/cache/conftool/dbconfig/20210222-105528-root.json [10:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10Arrbee) This is an approved request for @AMuigai . Thanks. [10:56:02] (03Merged) 10jenkins-bot: CHANGELOG: fix typo [software/pywmflib] - 10https://gerrit.wikimedia.org/r/666002 (owner: 10Volans) [10:57:13] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28154/console" [puppet] - 10https://gerrit.wikimedia.org/r/665993 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [11:02:05] (03CR) 10Nik Gkountas: [C: 03+1] Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [11:03:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [11:07:47] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:36] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: mask prometheus.service [puppet] - 10https://gerrit.wikimedia.org/r/665992 (https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [11:08:39] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: add [Install] section to prometheus@ instances [puppet] - 10https://gerrit.wikimedia.org/r/665993 
(https://phabricator.wikimedia.org/T273278) (owner: 10Filippo Giunchedi) [11:10:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1166 (re)pooling @ 100%: Slowly repool db1166', diff saved to https://phabricator.wikimedia.org/P14433 and previous config saved to /var/cache/conftool/dbconfig/20210222-111032-root.json [11:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:18] (03PS1) 10Elukey: bigtop: add the hadoop group to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/666092 (https://phabricator.wikimedia.org/T231067) [11:12:23] !log restart prometheus on prometheus2004 to apply changes - T273278 [11:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:50] there will be prometheus alerts for prometheus restarts, that's expected [11:14:47] (03CR) 10Hashar: "recheck after deployment of the CI config change https://gerrit.wikimedia.org/r/666086" [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [11:14:59] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:15] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1024_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:15:29] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:44] there will be also one/several "varnish traffic traffic drop", expected [11:16:00] ack [11:18:49] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 19 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28155/console" [puppet] - 10https://gerrit.wikimedia.org/r/666092 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [11:20:27] (03CR) 10Urbanecm: [C: 03+2] "to prepare for B&C" [extensions/FlaggedRevs] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666013 (https://phabricator.wikimedia.org/T275293) (owner: 10Urbanecm) [11:21:25] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [11:21:33] !log roll restart prometheus on prometheus* [11:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:57] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:22:06] 10SRE, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) >>! In T244843#6840321, @Joe wrote: > In order to catch calls to mediawiki th... 
[11:22:11] (03Abandoned) 10Effie Mouzeli: hiera: install memcached 1.6 on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/666005 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [11:22:24] !log roll restart prometheus on cloudmetrics* [11:22:24] 10SRE, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:57] PROBLEM - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [11:23:49] PROBLEM - Prometheus prometheus4001/ops restarted: beware possible monitoring artifacts on prometheus4001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops [11:24:13] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:25:11] PROBLEM - ensure kvm processes are running on cloudvirt1039 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:25:40] (03Merged) 10jenkins-bot: Grant sysops review and unreviewed pages right by default [extensions/FlaggedRevs] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666013 (https://phabricator.wikimedia.org/T275293) (owner: 10Urbanecm) [11:25:50] that was quicker than expected [11:25:55] 
will deploy during B&C [11:26:23] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances [11:26:24] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances [11:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:53] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances [11:26:54] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances [11:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:11] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 27 hosts with reason: Restarting cloudcanary instances [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:21] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 27 hosts with reason: Restarting cloudcanary instances [11:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:47] (03CR) 10Muehlenhoff: [C: 03+1] "Confirmed the key, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/662661 (owner: 10Awight) [11:27:52] (03PS4) 10Muehlenhoff: New 2FA key for awight [puppet] - 10https://gerrit.wikimedia.org/r/662661 (owner: 10Awight) [11:28:45] PROBLEM - Prometheus cloudmetrics1001/labs restarted: beware possible monitoring artifacts on cloudmetrics1001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [11:30:00] (03PS2) 10Jbond: wmflib: drop to_seconds and to_milliseconds [puppet] - 10https://gerrit.wikimedia.org/r/661415 (https://phabricator.wikimedia.org/T273743) [11:30:02] (03PS2) 10Giuseppe Lavagetto: restbase: remove references to the non-https LVS [puppet] - 10https://gerrit.wikimedia.org/r/665090 (https://phabricator.wikimedia.org/T244843) [11:30:04] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1130). [11:30:25] PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1024_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:30:40] (03CR) 10jerkins-bot: [V: 04-1] restbase: remove references to the non-https LVS [puppet] - 10https://gerrit.wikimedia.org/r/665090 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [11:30:46] <_joe_> effie: is the downtime expired? ^^ [11:32:09] "Wikimedia Portals Update" - skipping that today due a build error... 
[11:32:57] (03CR) 10Jbond: [C: 03+2] nutcracker: drop use of to_milliseconds function [puppet] - 10https://gerrit.wikimedia.org/r/661414 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [11:33:01] (03CR) 10Jbond: [C: 03+2] wmflib: drop to_seconds and to_milliseconds [puppet] - 10https://gerrit.wikimedia.org/r/661415 (https://phabricator.wikimedia.org/T273743) (owner: 10Jbond) [11:33:42] (03PS3) 10Jbond: wmflib: drop to_seconds and to_milliseconds [puppet] - 10https://gerrit.wikimedia.org/r/661415 (https://phabricator.wikimedia.org/T273743) [11:33:44] (03CR) 10Muehlenhoff: [C: 03+2] New 2FA key for awight [puppet] - 10https://gerrit.wikimedia.org/r/662661 (owner: 10Awight) [11:33:55] PROBLEM - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [11:33:59] moritzm: happy for me to merge yuors [11:34:07] please do :-) [11:34:18] merging now [11:34:47] (03CR) 10Hnowlan: [C: 03+2] Add simple blubber image [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [11:34:47] PROBLEM - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [11:34:53] PROBLEM - Prometheus prometheus2003/k8s-staging restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s-staging [11:35:09] (03PS2) 10Klausman: 
analytics/camus: Add job to export ATSKafka events to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) [11:35:16] sorry about the spam/noise [11:35:23] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:35:29] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:35:29] (03Merged) 10jenkins-bot: Add simple blubber image [software/tegola] - 10https://gerrit.wikimedia.org/r/664564 (owner: 10Hnowlan) [11:36:09] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:36:19] PROBLEM - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [11:36:33] PROBLEM - Prometheus prometheus1003/analytics restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [11:37:02] silenced ^ [11:40:03] (03CR) 10Elukey: [C: 03+1] "LGTM, I added a little note about the mappers, but then you can proceed.. Maybe running this every 10 mins might be too aggressive, but we" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [11:42:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={jmx_wdqs_streaming_updater,webperf_arclamp} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:43:11] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [11:46:10] (03CR) 10Effie Mouzeli: [C: 03+2] hiera: install memcached 1.6 on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/664778 (https://phabricator.wikimedia.org/T270315) (owner: 10Effie Mouzeli) [11:46:13] RECOVERY - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [11:47:57] RECOVERY - Prometheus prometheus4001/ops restarted: beware possible monitoring artifacts on prometheus4001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops [11:48:26] (03CR) 10Volans: [C: 03+2] gitignore: added vim swapfiles [software/cumin] - 10https://gerrit.wikimedia.org/r/665364 (owner: 10David Caro) [11:49:38] (03PS3) 10Giuseppe Lavagetto: restbase: remove references to the non-https LVS [puppet] - 10https://gerrit.wikimedia.org/r/665090 (https://phabricator.wikimedia.org/T244843) [11:50:19] !log upgrading python3-wmflib fleet wide to 0.0.7-1+deb10u1 [11:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:56] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=analytics file=debian_version.prom instance=an-worker1101 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [11:53:46] !log upgrading memecached to 1.6 on mc1036 [11:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:12] (03PS3) 10Klausman: analytics/camus: Add job to export ATSKafka events to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) [11:54:45] (03CR) 10Klausman: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [11:54:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1036.eqiad.wmnet [11:54:46] RECOVERY - Prometheus cloudmetrics1001/labs restarted: beware possible monitoring artifacts on cloudmetrics1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [11:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:26] (03CR) 10Klausman: analytics/camus: Add job to export ATSKafka events to HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [11:55:48] (03PS4) 10Klausman: analytics/camus: Add job to export ATSKafka events to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) [11:56:12] (03Merged) 10jenkins-bot: gitignore: added vim swapfiles [software/cumin] - 10https://gerrit.wikimedia.org/r/665364 (owner: 10David Caro) [11:56:45] (03CR) 10Klausman: [C: 03+2] analytics/camus: Add job to export ATSKafka events to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [11:56:49] (03CR) 10Klausman: [V: 03+2 C: 03+2] analytics/camus: Add job to export ATSKafka events to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/665321 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [11:57:43] (03PS2) 10KartikMistry: Adjust CX MT threshold to 90 for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665238 (https://phabricator.wikimedia.org/T275121) [11:59:02] RECOVERY - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [11:59:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1036.eqiad.wmnet [11:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:12] RECOVERY - Prometheus prometheus2003/k8s-staging restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s-staging [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1200). [12:00:05] kart_ and Urbanecm: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:14] * kart_ here [12:00:34] I can deploy today [12:00:41] kart_: or you can self service if you wish [12:00:50] RECOVERY - Prometheus prometheus1003/analytics restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [12:00:58] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) is CRITICAL: Test Get value for key returned the unexpected status 500 (expecting: 200) https://www.mediawiki.org/wiki/Kask [12:01:13] _joe_: this was an acked alert as far as I remember [12:01:16] I will take care of it [12:01:51] (03CR) 10Urbanecm: [C: 03+2] Adjust CX MT threshold to 90 for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665238 (https://phabricator.wikimedia.org/T275121) (owner: 10KartikMistry) [12:02:12] !log installing openldap security updates on serpens/seaborgium [12:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:24] Urbanecm: oh thanks :) Lost the tab. [12:02:30] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [12:02:36] kart_: np [12:02:42] so, i guess I'm deploying it [12:02:45] (03Merged) 10jenkins-bot: Adjust CX MT threshold to 90 for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665238 (https://phabricator.wikimedia.org/T275121) (owner: 10KartikMistry) [12:03:20] Urbanecm: Yeah, please go ahead. [12:03:21] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance=mc2024 site=codfw tunnel=mc1024_v4 Effie Mouzeli Its counterpart, mc1024, is resting in peace. We have ordered new servers, this is fine. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:03:24] kart_: pulled to mwdebug1001 if you can test it there [12:03:55] Testing. [12:04:12] Urbanecm: Yes. All good. Value is updated. 
[12:04:16] syncing [12:05:53] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4775fb63e79501c3dba7ae4b9c3b1172d92dc0d0: Adjust CX MT threshold to 90 for Vietnamese Wikipedia (T275121) (duration: 00m 57s) [12:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:00] T275121: Adjust the threshold for Vietnamese to prevent publishing when overall unmodified content is higher than 90% - https://phabricator.wikimedia.org/T275121 [12:06:01] (03PS7) 10Urbanecm: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [12:06:07] (03CR) 10Urbanecm: [C: 03+2] Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [12:07:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/665365 (owner: 10David Caro) [12:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1175 for schema change', diff saved to https://phabricator.wikimedia.org/P14434 and previous config saved to /var/cache/conftool/dbconfig/20210222-120717-marostegui.json [12:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:50] (03Merged) 10jenkins-bot: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [12:08:17] kart_: pulled to mwdebug1001, please test [12:09:19] Testing.. [12:09:54] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [12:10:21] Urbanecm: looks good! [12:10:25] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28158/console" [puppet] - 10https://gerrit.wikimedia.org/r/665090 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:10:26] syncing [12:10:26] Urbanecm: Please deploy. [12:10:48] (03CR) 10Volans: [C: 04-1] "Although this LGTM, py39-unit-min and py39-man-min fails locally for me to setup the venv, I'll dig it further later today and report back" [software/cumin] - 10https://gerrit.wikimedia.org/r/665365 (owner: 10David Caro) [12:10:56] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [12:10:59] Urbanecm: I'm here if we find any errors afterwards.. [12:11:05] gokay, will ping you [12:11:09] is that a new feature? 
[12:11:20] (03PS1) 10Klausman: Fix time limit on Camus job [puppet] - 10https://gerrit.wikimedia.org/r/666103 [12:11:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14435 and previous config saved to /var/cache/conftool/dbconfig/20210222-121139-root.json [12:11:41] (03PS2) 10Klausman: Fix time limit on Camus job [puppet] - 10https://gerrit.wikimedia.org/r/666103 (https://phabricator.wikimedia.org/T254317) [12:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a1f8ce48249ad457d79c57e27836ee492eb00427: Enable Section Translation on Bengali Wikipedia (T271397) (duration: 00m 56s) [12:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:02] kart_: synced. anything else? [12:12:03] T271397: Enable Section Translation on Bengali Wikipedia - https://phabricator.wikimedia.org/T271397 [12:12:08] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [12:12:14] Urbanecm: All done. Thanks a lot. 
[12:12:17] np [12:12:27] (03PS2) 10Urbanecm: ukwikivoyage: Enable block AbuseFilter action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665526 (https://phabricator.wikimedia.org/T275271) [12:12:31] (03CR) 10Urbanecm: [C: 03+2] ukwikivoyage: Enable block AbuseFilter action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665526 (https://phabricator.wikimedia.org/T275271) (owner: 10Urbanecm) [12:13:06] RECOVERY - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [12:13:28] (03Merged) 10jenkins-bot: ukwikivoyage: Enable block AbuseFilter action [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665526 (https://phabricator.wikimedia.org/T275271) (owner: 10Urbanecm) [12:13:51] <_joe_> godog: lmk when you're done with prometheus, I have a theoretically-mundane-but-maybe-dangerous change to merge, and I'll wait for prometheus to be 100% ok befor doing so [12:15:26] RECOVERY - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is OK: SSL OK - Certificate sessionstore2001-a valid until 2023-02-22 11:12:13 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:15:36] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: 391900b8db9ffdee8565d82c38c089843876a27b: ukwikivoyage: Enable block AbuseFilter action (T275271) (duration: 00m 55s) [12:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:44] T275271: Enable AbuseFilter block action for Ukrainian Wikivoyage - https://phabricator.wikimedia.org/T275271 [12:15:52] (03PS2) 10Urbanecm: Add inaturalist-open-data.s3.amazonaws.com to copyupload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665656 (https://phabricator.wikimedia.org/T275318) [12:15:57] (03CR) 10Urbanecm: 
[C: 03+2] Add inaturalist-open-data.s3.amazonaws.com to copyupload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665656 (https://phabricator.wikimedia.org/T275318) (owner: 10Urbanecm) [12:16:24] RECOVERY - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is OK: SSL OK - Certificate sessionstore2003-a valid until 2023-02-22 11:12:18 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:16:44] (03Merged) 10jenkins-bot: Add inaturalist-open-data.s3.amazonaws.com to copyupload list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665656 (https://phabricator.wikimedia.org/T275318) (owner: 10Urbanecm) [12:17:34] RECOVERY - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [12:18:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7bd26dc6160a5bc3ba9235ce93c01e7ab9744487: Add inaturalist-open-data.s3.amazonaws.com to copyupload list (T275318) (duration: 00m 56s) [12:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:29] T275318: Add inaturalist-open-data.s3.amazonaws.com to copyupload list - https://phabricator.wikimedia.org/T275318 [12:20:48] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/FlaggedRevs/extension.json: a4cd98e7a581fe18634da05ba04eaf8035023c26: Grant sysops review and unreviewed pages right by default (T275293) (duration: 00m 55s) [12:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:54] T275293: FlaggedRevs does not give sysops the review right - https://phabricator.wikimedia.org/T275293 [12:21:01] (03PS4) 10Urbanecm: Add a throttle rule for for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665474 
(https://phabricator.wikimedia.org/T275237) (owner: 10Zoranzoki21) [12:21:05] (03CR) 10Urbanecm: [C: 03+2] Add a throttle rule for for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: 10Zoranzoki21) [12:21:17] RECOVERY - cassandra-a SSL 10.192.32.101:7001 on sessionstore2002 is OK: SSL OK - Certificate sessionstore2002-a valid until 2023-02-22 11:12:16 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:21:33] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=sessionstore [12:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:49] !log repool sessionstore in codfw after sessionstore certificate refresh. T274564 [12:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:56] T274564: sessionstore certificates will expire soon - https://phabricator.wikimedia.org/T274564 [12:22:04] !log depool sessionstore in eqiad for sessionstore certificate refresh. 
T274564 [12:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:14] (03Merged) 10jenkins-bot: Add a throttle rule for for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: 10Zoranzoki21) [12:24:18] !log urbanecm@deploy1001 Synchronized wmf-config//throttle.php: d806f3a986244f8027aba730e72d99babe3b37e9: Add a throttle rule for for edit-a-thon (T275237) (duration: 00m 54s) [12:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:24] T275237: Request for temporary lift of IP cap for edit-a-thon (2020-02-24 - 2020-03-05) - https://phabricator.wikimedia.org/T275237 [12:26:40] * Urbanecm done [12:26:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14436 and previous config saved to /var/cache/conftool/dbconfig/20210222-122643-root.json [12:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:20] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:19] ^ heads up this is a lot of requests switching from 1 DC to the other. 
It will probably be completely undetectable, just pointing it out [12:31:22] (03CR) 10Klausman: [C: 03+2] Fix time limit on Camus job [puppet] - 10https://gerrit.wikimedia.org/r/666103 (https://phabricator.wikimedia.org/T254317) (owner: 10Klausman) [12:31:33] (03PS1) 10KartikMistry: CX3 Build 0.1.0+20210216 [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 [12:32:10] (03PS2) 10KartikMistry: CX3 Build 0.1.0+20210216 [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 [12:32:12] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore [12:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:43] (03CR) 10Jbond: [C: 03+1] "lgtm see comment inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [12:34:55] (03Abandoned) 10KartikMistry: CX3 Build 0.1.0+20210216 [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 (owner: 10KartikMistry) [12:41:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14437 and previous config saved to /var/cache/conftool/dbconfig/20210222-124146-root.json [12:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:18] (03PS1) 10Muehlenhoff: Add dummy keytab for cuminunpriv1001 [labs/private] - 10https://gerrit.wikimedia.org/r/666106 [12:54:05] RECOVERY - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is OK: SSL OK - Certificate sessionstore1001-a valid until 2023-02-22 11:12:05 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [12:56:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14438 and 
previous config saved to /var/cache/conftool/dbconfig/20210222-125650-root.json [12:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker::baseimages: weekly rebuild [puppet] - 10https://gerrit.wikimedia.org/r/665991 (owner: 10Giuseppe Lavagetto) [13:00:23] (03Restored) 10KartikMistry: CX3 Build 0.1.0+20210216 [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 (owner: 10KartikMistry) [13:00:25] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add dummy keytab for cuminunpriv1001 [labs/private] - 10https://gerrit.wikimedia.org/r/666106 (owner: 10Muehlenhoff) [13:02:08] (03CR) 10Jbond: "> Patch Set 5:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) (owner: 10Jbond) [13:02:40] (03CR) 10KartikMistry: "recheck" [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 (owner: 10KartikMistry) [13:02:48] (03PS6) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (https://phabricator.wikimedia.org/T274338) [13:05:32] (03CR) 10Jbond: [C: 03+2] admin: add Angie Muigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [13:05:38] (03PS6) 10Jbond: admin: add Angie Muigai to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665367 (https://phabricator.wikimedia.org/T275140) (owner: 10Dzahn) [13:06:03] Urbanecm: is it possible to deploy (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/666074 as emergency fix now or should I wait for tomorrow? 
[13:06:05] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 157 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:06:24] kart_: looking [13:06:42] kart_: mind linking the task it fixes? [13:07:44] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10jbond) [13:08:19] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28159/console" [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:08:21] RECOVERY - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is OK: SSL OK - Certificate sessionstore1002-a valid until 2023-02-22 11:12:08 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [13:08:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10jbond) 05Open→03Resolved a:03jbond This has now been enabled please allow upto 30minutes for the change to fully propagate [13:08:35] Urbanecm: fixing missing bits from: https://phabricator.wikimedia.org/T271397 [13:08:48] aha [13:08:54] jouncebot: now [13:08:54] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [13:08:57] jouncebot: next [13:08:57] In 1 hour(s) and 51 minute(s): New wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1500) [13:09:00] let's do it [13:09:12] (03CR) 10Urbanecm: [C: 03+2] "per kart's IRC req" [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 (owner: 10KartikMistry) [13:09:18] will ping you once ready [13:09:44] Urbanecm: thanks! 
[13:09:48] no [13:09:50] *np [13:11:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: Slowly repool db1175', diff saved to https://phabricator.wikimedia.org/P14439 and previous config saved to /var/cache/conftool/dbconfig/20210222-131153-root.json [13:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:03] RECOVERY - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is OK: SSL OK - Certificate sessionstore1003-a valid until 2023-02-22 11:12:10 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [13:13:32] (03CR) 10Jbond: "lgtm however see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:16:39] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 27 hosts with reason: Restarting cloudcanary instances [13:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:50] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 27 hosts with reason: Restarting cloudcanary instances [13:16:53] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:56] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances [13:16:58] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on cloudvirt-wdqs[1001-1003].eqiad.wmnet with reason: Restarting cloudcanary instances [13:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:05] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/665995 (https://phabricator.wikimedia.org/T275354) (owner: 10David Caro) [13:23:34] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+1] "+1 from me and PCC is ok, but see inline for John's comment. I like that approach." [puppet] - 10https://gerrit.wikimedia.org/r/665459 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:23:57] (03PS1) 10Hnowlan: postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) [13:24:30] (03CR) 10jerkins-bot: [V: 04-1] postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [13:26:12] _joe_: yep all good, go ahead [13:26:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [13:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:02] actually I'll take this chance to reboot two more prometheus hosts [13:27:16] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus3001.esams.wmnet [13:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:56] (03CR) 10Amire80: "🎉🎈🥂🍾" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [13:29:02] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=sessionstore [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] !log repool sessionstore in eqiad after sessionstore certificate refresh. 
T274564 [13:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:21] T274564: sessionstore certificates will expire soon - https://phabricator.wikimedia.org/T274564 [13:30:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [13:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:52] !log reset-failed ifup@ens14 on prometheus3001 - T273026 [13:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:58] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [13:32:24] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus4001.ulsfo.wmnet [13:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:00] (03PS2) 10Hnowlan: postgres: add script for automatic resyncing [puppet] - 10https://gerrit.wikimedia.org/r/666110 (https://phabricator.wikimedia.org/T275381) [13:33:27] (03Merged) 10jenkins-bot: CX3 Build 0.1.0+20210216 [extensions/ContentTranslation] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/666074 (owner: 10KartikMistry) [13:34:16] (03PS1) 10Hnowlan: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) [13:34:35] pulled to mwdebug1001 kart_ [13:35:15] (03PS5) 10Jbond: profile::wmcs::instance: replace hiera_include with lookup [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:35:29] Urbanecm: testing.. 
[13:35:33] (03PS2) 10JMeybohm: api-gateway: Update envoy, fluent-bit and ratelimit images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) [13:35:55] (03CR) 10Jbond: "LGTM but needs cloud sign off," [puppet] - 10https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [13:36:19] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/FlaggedRevs/extension.json: a4cd98e7a581fe18634da05ba04eaf8035023c26: Grant sysops review and unreviewed pages right by default (apparently i forgot to rebase the first time, resync; T275293) (duration: 00m 57s) [13:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:25] T275293: FlaggedRevs does not give sysops the review right - https://phabricator.wikimedia.org/T275293 [13:36:39] (03CR) 10jerkins-bot: [V: 04-1] postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [13:37:12] Urbanecm: Thanks. Seems fixing what we wanted. 
[13:37:36] !log installing openldap security updates on corp replicas [13:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:27] thanks, syncing [13:38:39] (03PS2) 10Hnowlan: postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) [13:38:41] (03PS2) 10Alexandros Kosiaris: Remove graphoid deployment references [puppet] - 10https://gerrit.wikimedia.org/r/663814 (https://phabricator.wikimedia.org/T242855) [13:38:43] (03PS2) 10Alexandros Kosiaris: graphoid: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/663815 (https://phabricator.wikimedia.org/T242855) [13:38:45] (03PS2) 10Alexandros Kosiaris: graphoid: Switch to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/663816 (https://phabricator.wikimedia.org/T242855) [13:38:47] (03PS2) 10Alexandros Kosiaris: graphoid: Remove conftool data [puppet] - 10https://gerrit.wikimedia.org/r/663817 (https://phabricator.wikimedia.org/T242855) [13:38:49] (03PS2) 10Alexandros Kosiaris: graphoid: Remove LVS IP from scb [puppet] - 10https://gerrit.wikimedia.org/r/663818 (https://phabricator.wikimedia.org/T242855) [13:38:51] (03PS2) 10Alexandros Kosiaris: graphoid: Remove all puppet references [puppet] - 10https://gerrit.wikimedia.org/r/663819 (https://phabricator.wikimedia.org/T242855) [13:40:13] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3001.esams.wmnet [13:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:20] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/ContentTranslation/app/: f9e823e: CX3 Build 0.1.0+20210216 (fixes missing bits in T271397) (duration: 00m 55s) [13:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:27] T271397: Enable Section Translation on Bengali Wikipedia - https://phabricator.wikimedia.org/T271397 
[13:41:24] Thanks again Urbanecm [13:41:55] np [13:42:23] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Medium error reported for sda on elastic2045 - https://phabricator.wikimedia.org/T275345 (10Papaul) a:03Papaul [13:42:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove graphoid deployment references [puppet] - 10https://gerrit.wikimedia.org/r/663814 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [13:42:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] graphoid: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/663815 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [13:45:21] (03PS1) 10JMeybohm: deployment_server: Update default envoy image to 1.15.1-4 [puppet] - 10https://gerrit.wikimedia.org/r/666116 (https://phabricator.wikimedia.org/T268612) [13:45:23] (03PS1) 10JMeybohm: deployment_server: Update default prometheus-statsd-exporter image to 0.0.9 [puppet] - 10https://gerrit.wikimedia.org/r/666117 (https://phabricator.wikimedia.org/T268612) [13:47:35] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4001.ulsfo.wmnet [13:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:49] PROBLEM - gdnsd checkconf on dns4001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:52:20] PROBLEM - gdnsd checkconf on dns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:54:32] heads up: I'm going to depool swift codfw for reads for https://phabricator.wikimedia.org/T267338 [13:56:16] PROBLEM - gdnsd checkconf on dns5002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [13:56:43] (03CR) 10Ottomata: "HTTPS should be fine! This is just a direct http request to the MW API using e.g. 
https://www.javadoc.io/doc/org.apache.httpcomponents/ht" [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [13:57:02] !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [13:57:04] (03CR) 10Ottomata: "Maybe add both?" [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [13:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:38] XioNoX: ^ FYI swift depooled from codfw [13:58:20] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Traffic, 10netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (10fgiunchedi) Just depooled swift from codfw (for reads) `confctl --object-type discovery select 'dnsdisc=swift,name=codfw' set/pooled=false` [13:58:33] I will be checking potential app level performance issues, just in case [13:58:38] 10Puppet, 10SRE: using the include function can trigger false positives with puppet-lint-wmf_styleguide - https://phabricator.wikimedia.org/T275387 (10jbond) [13:58:57] 10Puppet, 10SRE: using the include function can trigger false positives with puppet-lint-wmf_styleguide - https://phabricator.wikimedia.org/T275387 (10jbond) >>! In T275387#6848677, @Aklapper wrote: > @jbond: Which codebase is this about? 
Sorry updated description and added tags [13:58:58] someone else should be checking at traffic/network bottlenecks, if any [13:59:44] (03PS1) 10Muehlenhoff: Add missing keytab metadata to unpriv Cumin role [puppet] - 10https://gerrit.wikimedia.org/r/666119 [14:00:58] (03PS1) 10Ottomata: Don't run Camus checker for atskafka webrerquest job [puppet] - 10https://gerrit.wikimedia.org/r/666121 (https://phabricator.wikimedia.org/T254317) [14:01:02] (03PS3) 10JMeybohm: api-gateway: Update mutiple sidecar images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) [14:01:04] (03PS1) 10JMeybohm: wikifeeds: Update envoy to 1.15.1-4 and statsd exporter to 0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/666122 (https://phabricator.wikimedia.org/T274254) [14:01:06] (03PS1) 10JMeybohm: Remove deployment local definitions of statsd exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/666123 (https://phabricator.wikimedia.org/T274254) [14:01:12] PROBLEM - gdnsd checkconf on dns3002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:02:09] (03PS4) 10Elukey: Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) [14:02:22] PROBLEM - gdnsd checkconf on dns3001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:02:51] (03PS2) 10Muehlenhoff: Add missing keytab metadata to unpriv Cumin role [puppet] - 10https://gerrit.wikimedia.org/r/666119 [14:03:04] (03CR) 10Elukey: "Added 443, I see that there is an envoy process listening on it on all mws :)" [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [14:03:16] PROBLEM - gdnsd checkconf on dns1002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:04:12] 
PROBLEM - gdnsd checkconf on dns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:04:13] (03PS2) 10JMeybohm: wikifeeds: Update envoy to 1.15.1-4 and statsd exporter to 0.0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/666122 (https://phabricator.wikimedia.org/T274254) [14:04:15] (03PS2) 10JMeybohm: Remove deployment local definitions of statsd exporter version [deployment-charts] - 10https://gerrit.wikimedia.org/r/666123 (https://phabricator.wikimedia.org/T274254) [14:05:00] (03CR) 10Ottomata: [C: 03+2] Don't run Camus checker for atskafka webrerquest job [puppet] - 10https://gerrit.wikimedia.org/r/666121 (https://phabricator.wikimedia.org/T254317) (owner: 10Ottomata) [14:08:24] (03CR) 10Muehlenhoff: [C: 03+2] Add missing keytab metadata to unpriv Cumin role [puppet] - 10https://gerrit.wikimedia.org/r/666119 (owner: 10Muehlenhoff) [14:11:00] PROBLEM - gdnsd checkconf on dns4002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:12:42] PROBLEM - gdnsd checkconf on dns5001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:13:01] (03PS1) 10JMeybohm: wikifeeds: Switch back to the global tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/666124 (https://phabricator.wikimedia.org/T274254) [14:13:15] (03CR) 10Volans: [C: 03+1] "Assuming that /usr/local/bin/resync_replica does the right thing, LGTM here." [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [14:14:23] godog: relevent ^ ? [14:14:36] relevant* [14:14:53] XioNoX: which ? [14:15:03] gdnsd errors [14:15:10] ah no, I think that's graphoid [14:15:13] is it linked to the swift link? 
[14:15:15] ah ok [14:15:18] XioNoX, I will add you to the paste [14:15:43] akosiaris: ^ [14:16:38] !log roll restarting kafkamon hosts for updates [14:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:05] ^and I think that is totally unrelated too (session.scope) [14:19:00] PROBLEM - gdnsd checkconf on authdns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:20:02] (03PS1) 10Ottomata: Refine - Remove override for spark2 assembly jar [puppet] - 10https://gerrit.wikimedia.org/r/666125 [14:23:15] is anyone already fixing the gdnsd config issue? Invalid resource name 'disc-graphoid' [14:23:59] I'm looking at it, but dunno where to look exactly [14:24:05] volans: if you know go for it :) [14:25:36] https://gerrit.wikimedia.org/r/663814 by akosiaris earlier [14:25:38] re: gdnsd [14:25:39] XioNoX: it's a geo-resource, I'm checking the puppet side of it [14:25:41] seems likely related [14:25:52] ah no [14:26:03] the other patches in that stack [14:26:24] also https://phabricator.wikimedia.org/T242855 [14:26:47] akosiaris: the ops/dns repo needs a patch too for graphoid, gdnsd config is currently invalid [14:27:47] PROBLEM - gdnsd checkconf on dns2002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:28:15] (03CR) 10Ottomata: [C: 03+2] Refine - Remove override for spark2 assembly jar [puppet] - 10https://gerrit.wikimedia.org/r/666125 (owner: 10Ottomata) [14:30:11] (03PS1) 10Ottomata: Refine - re-add accidentally removed spark conf from last patch [puppet] - 10https://gerrit.wikimedia.org/r/666127 (https://phabricator.wikimedia.org/T274384) [14:30:22] (03PS2) 10Ottomata: Refine - re-add accidentally removed spark conf from last patch 
[puppet] - 10https://gerrit.wikimedia.org/r/666127 (https://phabricator.wikimedia.org/T274384) [14:30:25] (03CR) 10jerkins-bot: [V: 04-1] Refine - re-add accidentally removed spark conf from last patch [puppet] - 10https://gerrit.wikimedia.org/r/666127 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [14:32:29] (03CR) 10Ottomata: [C: 03+2] Refine - re-add accidentally removed spark conf from last patch [puppet] - 10https://gerrit.wikimedia.org/r/666127 (https://phabricator.wikimedia.org/T274384) (owner: 10Ottomata) [14:33:29] (03PS1) 10Muehlenhoff: Include profile::kerberos::keytabs in profile::cumin::unprivmaster [puppet] - 10https://gerrit.wikimedia.org/r/666129 [14:35:31] volans: there a ton of patches incoming as well, just waiting for monitoring to clear [14:36:27] (03CR) 10David Caro: [C: 03+2] wmf-auto-restart: Added some help to the script [puppet] - 10https://gerrit.wikimedia.org/r/665995 (https://phabricator.wikimedia.org/T275354) (owner: 10David Caro) [14:36:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] graphoid: Switch to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/663816 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [14:36:57] akosiaris: ack, i think that without https://gerrit.wikimedia.org/r/c/operations/dns/+/663822 gdnsd will not be happy [14:37:10] 10SRE, 10SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (10Angel777342) a:03Angel777342 [14:37:30] volans: yeah that's next right after killing LVS [14:40:39] (03PS2) 10Elukey: bigtop: add the hadoop/hdfs/mapred/yarn groups to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/666092 (https://phabricator.wikimedia.org/T231067) [14:40:41] (03PS1) 10Elukey: admin: reserve gid/uid for various Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/666133 (https://phabricator.wikimedia.org/T231067) [14:40:43] (03PS1) 10Elukey: bigtop: set uid/gid for yarn/hdfs/mapred/hadoop 
user/groups for Buster [puppet] - 10https://gerrit.wikimedia.org/r/666134 (https://phabricator.wikimedia.org/T231067) [14:40:45] (03PS1) 10Elukey: druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) [14:41:23] (03PS2) 10Alexandros Kosiaris: graphoid: Remove all RRs for it [dns] - 10https://gerrit.wikimedia.org/r/663822 (https://phabricator.wikimedia.org/T242855) [14:41:50] (03CR) 10Alexandros Kosiaris: [C: 03+2] graphoid: Remove all RRs for it [dns] - 10https://gerrit.wikimedia.org/r/663822 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [14:43:20] 10SRE, 10SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (10Majavah) a:05Angel777342→03None [14:43:52] (03PS2) 10Elukey: druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) [14:44:19] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:24] Would be nice to add what to do to the linked doc when that alert triggers: https://wikitech.wikimedia.org/wiki/DNS [14:44:33] RECOVERY - gdnsd checkconf on dns3001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:44:33] RECOVERY - gdnsd checkconf on dns5002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:01] RECOVERY - gdnsd checkconf on dns4002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:13] RECOVERY - gdnsd checkconf on dns4001 is OK: OK: gdnsd -S checkconf success 
https://wikitech.wikimedia.org/wiki/gdnsd [14:45:13] RECOVERY - gdnsd checkconf on dns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:13] RECOVERY - gdnsd checkconf on dns3002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:21] RECOVERY - gdnsd checkconf on dns5001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:23] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:45:25] RECOVERY - gdnsd checkconf on dns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:33] RECOVERY - gdnsd checkconf on dns2002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:35] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.15:19000]) https://wikitech.wikimedia.org/wiki/PyBal [14:45:39] RECOVERY - gdnsd checkconf on dns1002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:45:53] (03CR) 10David Caro: transport.clustershell: handle str when reporting commands (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [14:46:29] (03CR) 10Hnowlan: [C: 03+1] api-gateway: Update mutiple sidecar images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:46:37] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.15:19000]) https://wikitech.wikimedia.org/wiki/PyBal [14:47:49] (03CR) 10Muehlenhoff: [C: 03+2] 
Include profile::kerberos::keytabs in profile::cumin::unprivmaster [puppet] - 10https://gerrit.wikimedia.org/r/666129 (owner: 10Muehlenhoff) [14:48:38] moritzm: so it is finally happenning! Kerberos is spreading! Winter is coming :D [14:48:42] 10SRE, 10SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (10jbond) [14:49:43] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:49:51] RECOVERY - gdnsd checkconf on authdns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:50:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 19 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28161/console" [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [14:50:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:50:17] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy-future: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664854 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:51:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Assuming you tested this change, the dockerfile is correct." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664864 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:51:46] (03CR) 10Giuseppe Lavagetto: [C: 03+1] nutcracker: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664865 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:51:53] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:52:57] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:53:06] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Patch-For-Review, 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) [14:53:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] api-gateway: Update mutiple sidecar images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:53:47] 10Puppet, 10SRE: using the include function can trigger false positives with puppet-lint-wmf_styleguide - https://phabricator.wikimedia.org/T275387 (10jbond) p:05Triage→03Low [14:54:23] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:55:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:57:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wikifeeds: Update envoy to 1.15.1-4 and statsd exporter to 0.0.9 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/666122 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [14:57:41] (03CR) 10Elukey: "This change needs to be done with puppet disabled on buster nodes, doing chown on each of them" [puppet] - 10https://gerrit.wikimedia.org/r/666134 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [15:00:05] Urbanecm and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for New wikis . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1500). [15:00:07] (03CR) 10Elukey: [C: 04-1] "Will amend this to include only data.yaml changes, more clear." [puppet] - 10https://gerrit.wikimedia.org/r/666133 (https://phabricator.wikimedia.org/T231067) (owner: 10Elukey) [15:03:15] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:33] I'm here [15:03:35] Amir1: ^ [15:06:12] * Urbanecm is going to init stuff [15:06:12] (03PS1) 10AGueyte: Enable SecurePoll logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) [15:07:19] (03CR) 10jerkins-bot: [V: 04-1] Enable SecurePoll logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) (owner: 10AGueyte) [15:09:29] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1002 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:11:33] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664853 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:11:44] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy-future: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664854 
(https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:11:53] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] prometheus-statsd-exporter: Run as nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664864 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:11:58] (03Abandoned) 10AGueyte: Enable SecurePoll logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) (owner: 10AGueyte) [15:12:02] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] nutcracker: Run as user nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/664865 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:12:26] (03PS1) 10Urbanecm: Initial configuration for altwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666141 (https://phabricator.wikimedia.org/T271980) [15:16:20] (03PS1) 10Urbanecm: Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) [15:17:05] (03PS2) 10Urbanecm: Initial configuration for altwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666141 (https://phabricator.wikimedia.org/T271980) [15:17:43] (03Restored) 10AGueyte: Enable SecurePoll logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) (owner: 10AGueyte) [15:17:56] (03PS2) 10AGueyte: Enable SecurePoll logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) [15:20:15] o/ [15:20:35] (03PS2) 10Urbanecm: Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) [15:20:40] hey Amir1 [15:20:41] ready? [15:20:49] Ready [15:21:07] Amir1: can you please look at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/666142/ and check if I copied meta namespace well? 
My browser is having issues displaying those fonts :/ [15:21:58] I do have that problem as well [15:22:26] :( [15:22:41] (03PS3) 10Urbanecm: Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) [15:23:02] mojibake looks the same though [15:23:09] good [15:23:10] (the unicode numbers [15:23:18] so probably good to try? [15:24:09] (03PS4) 10Urbanecm: Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) [15:28:32] going to merge config for the alternative Wikipedia [15:28:34] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for altwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666141 (https://phabricator.wikimedia.org/T271980) (owner: 10Urbanecm) [15:28:42] those new lang codes are really cool [15:29:19] (03PS1) 10Urbanecm: Initial configuration for mniwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666145 (https://phabricator.wikimedia.org/T273457) [15:29:39] (03Merged) 10jenkins-bot: Initial configuration for altwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666141 (https://phabricator.wikimedia.org/T271980) (owner: 10Urbanecm) [15:29:51] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - https://phabricator.wikimedia.org/T207804 (10JMeybohm) [15:29:52] (03CR) 10JMeybohm: [C: 03+2] api-gateway: Update mutiple sidecar images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:29:55] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) 05Open→03Resolved I think it's safe to say this is done now with the admin_ng using helm3 and the updates to puppet. 
[15:30:06] pulling to mwmaint1002 [15:30:24] and creating the DB [15:30:46] and...here we go with fatal Amir1 [15:30:50] addwiki is broken again [15:30:56] YAY [15:31:08] What is the error [15:31:13] pasting to phab [15:31:40] Amir1: https://phabricator.wikimedia.org/P14443 [15:31:53] 10SRE, 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10herron) 05Open→03Resolved Thanks @ayounsi it's been re-enabled and puppet has been run [15:31:57] (03Merged) 10jenkins-bot: api-gateway: Update mutiple sidecar images [deployment-charts] - 10https://gerrit.wikimedia.org/r/664523 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:32:04] the good thing is the DB is there, and it has the right tables AFAIK [15:32:34] yeah this is for main page [15:32:38] yup [15:32:43] the other good thing is that it is at the right shard [15:32:44] do we have anything after creating main page? [15:32:53] looking [15:32:56] (03PS3) 10Cwhite: mw_rc_irc: add check_prometheus alert on no messages being relayed [puppet] - 10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) [15:32:59] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:33:02] 10SRE, 10observability, 10serviceops, 10cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10herron) [15:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:12] (03CR) 10Cwhite: mw_rc_irc: add check_prometheus alert on no messages being relayed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [15:33:18] 10SRE, 10serviceops, 10cloud-services-team (Kanban): hosts failing puppet compile due to missing secrets - https://phabricator.wikimedia.org/T274392 (10herron) [15:33:25] Amir1: extensions/WikimediaMaintenance/filebackend/setZoneAccess.php [15:33:33] clearing massmessage cache [15:33:35] > Expected User to belong to 'altwiki', but it belongs to the local wiki [15:33:35] but altwiki is the local wiki [15:33:53] the first one is for upload [15:33:56] yup [15:34:05] it should be mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=altwiki, right? [15:34:25] oh we have one for that, nice [15:34:28] let's run that [15:34:37] yeah, it runs a child [15:34:49] meh, no version entry for `altwiki`. [15:34:53] hacking wikiversions.php [15:35:34] (03PS1) 10Muehlenhoff: Enable dns_canonicalize_hostname for sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/666146 [15:35:37] and per https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/addWiki.php#L199, I'm also supposed to put --backend=local-multiwrite [15:35:39] is that right, Amir1 ? [15:35:53] mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=altwiki --backend=local-multiwrite ? 
[15:36:01] yeah [15:36:05] let's try [15:36:14] no errors on that one [15:36:24] standard output https://www.irccloud.com/pastebin/KLdAL3vn/ [15:36:36] 10SRE, 10OTRS, 10Security: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Majavah) [15:37:09] \o/ [15:37:28] and i can also edit the wiki on mwdebug [15:37:43] Amir1: I'd still prefer to fix the mainpage issue through before going with the other wikis. Any idea how to do it? [15:38:05] at worse we can add a flag --no-edits, so it still finishes successfully, and create it manually [15:38:32] yeah, I was looking into it [15:38:39] oh, cool [15:38:42] It should use the maintenance user [15:39:13] any objections to syncing altwiki and creating mainpage etc. manually? [15:39:37] (03PS1) 10JMeybohm: api-gateway: Pin nutcracker version to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/666148 (https://phabricator.wikimedia.org/T274254) [15:41:30] (03PS5) 10Ottomata: Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [15:41:32] I don't. 
Just populate sites table too [15:41:38] (03CR) 10Ottomata: [C: 03+1] Add a mediawiki-api term to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/665814 (https://phabricator.wikimedia.org/T274951) (owner: 10Elukey) [15:41:41] it's also after creating main page [15:41:52] ah, will do [15:42:07] oh set fundraising link is there too w.wiki/$ [15:42:17] yeah [15:42:23] will do manually, once the wiki is live [15:42:47] funnily enough, this also needs fixing [15:43:34] Urbanecm: suggestion, hack addWiki.php and move creating main page at the end [15:43:37] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating altwiki (T271980) (duration: 00m 56s) [15:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:43] T271980: Create Wikipedia Altai - https://phabricator.wikimedia.org/T271980 [15:44:03] (and fundraising link because this one will break too) [15:44:29] (03CR) 10Hnowlan: [C: 03+1] api-gateway: Pin nutcracker version to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/666148 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:45:55] (03CR) 10JMeybohm: [C: 03+2] api-gateway: Pin nutcracker version to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/666148 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:46:40] would work Amir1 [15:46:50] at mwmaint, or backport it regularly? 
[15:47:11] PROBLEM - Ensure local MW versions match expected deployment on mwmaint1002 is CRITICAL: CRITICAL: 1 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [15:47:20] ^^^that is me^^^ [15:47:42] (03Merged) 10jenkins-bot: api-gateway: Pin nutcracker version to 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/666148 (https://phabricator.wikimedia.org/T274254) (owner: 10JMeybohm) [15:48:21] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:43] RECOVERY - ensure kvm processes are running on cloudvirt1039 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:50:37] I can't say this would fix it but it's worth a try if you want to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/666154 [15:50:43] Urbanecm: ^ [15:51:11] no backporting, we need to fix this properly [15:51:20] true [15:51:42] (03Abandoned) 10Alexandros Kosiaris: linkrecommendation: Point to dyna.w.o [dns] - 10https://gerrit.wikimedia.org/r/659315 (https://phabricator.wikimedia.org/T269581) (owner: 10Alexandros Kosiaris) [15:51:48] (03Abandoned) 10Alexandros Kosiaris: linkrecommendation: Add linkrecommendation.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/659314 (https://phabricator.wikimedia.org/T269581) (owner: 10Alexandros Kosiaris) [15:53:03] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:27] (03PS1) 10Jgiannelos: Multiple fixes [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 [15:53:38] Amir1: and/or https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/666158, to workaround it for those wikis, and think about a proper fix for later (MF-W said he'll approve a few more wikis this week, so we'll have way to test anyway) [15:54:07] PROBLEM - Ensure local MW versions match expected deployment on mwdebug1001 is CRITICAL: CRITICAL: 1 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [15:54:15] ^^this is also me^^ [15:54:22] either way is fine for me [15:54:36] (03PS2) 10Jgiannelos: Multiple fixes [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 [15:54:52] I really want to clean up addWiki.php [15:55:37] Amir1: ack. I'll upload my version of addwiki to mwmaint then [15:55:55] Virtual +2 [15:56:03] thx [15:56:36] Amir1: maybe we want to merge it anyway, to make it easy to skip it when this bug reoccurs (it already has skipclusters, and i guess that was introduced under similar conditions) [15:56:38] (03PS2) 10Elukey: admin: reserve gid/uid for various Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/666133 (https://phabricator.wikimedia.org/T231067) [15:56:40] (03PS2) 10Elukey: bigtop: set uid/gid for yarn/hdfs/mapred/hadoop user/groups for Buster [puppet] - 10https://gerrit.wikimedia.org/r/666134 (https://phabricator.wikimedia.org/T231067) [15:56:41] but I'll leave it up to you [15:56:42] (03PS3) 10Elukey: druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) [15:57:13] (03PS3) 10Jgiannelos: Fix tegola building pipeline [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 [15:57:44] !log Temporarily replace 
/srv/mediawiki/php-1.36.0-wmf.31/extensions/WikimediaMaintenance/addWiki.php with /home/urbanecm/addWiki.php at mwmaint1002 to unbreak addWiki.php [15:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:08] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating altwiki (T271980) (duration: 00m 59s) [15:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:14] T271980: Create Wikipedia Altai - https://phabricator.wikimedia.org/T271980 [16:00:06] (03CR) 10Jgiannelos: "I tried it locally while working on a draft helm chart and it looks like even though build steps are passing on jenkins, I don't think the" [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 (owner: 10Jgiannelos) [16:00:13] !log urbanecm@deploy1001 Synchronized dblists: Creating altwiki (T271980) (duration: 00m 54s) [16:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:59] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating altwiki (T271980) [16:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:23] (03PS3) 10Elukey: admin: reserve gid/uid for various Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/666133 (https://phabricator.wikimedia.org/T231067) [16:02:25] (03PS3) 10Elukey: bigtop: set uid/gid for yarn/hdfs/mapred/hadoop user/groups for Buster [puppet] - 10https://gerrit.wikimedia.org/r/666134 (https://phabricator.wikimedia.org/T231067) [16:02:27] (03PS4) 10Elukey: druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) [16:02:52] my problem is that when it breaks, we can't run addWiki.php again. Can we? 
[16:03:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating altwiki (T271980) (duration: 00m 55s) [16:03:05] RECOVERY - Ensure local MW versions match expected deployment on mwmaint1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:15] Amir1: we can't [16:03:16] so it really needs fixing or we just remove this part and let the creator do those [16:03:30] but if it's broken for 1 wiki, and I'm creating 3, I can add that param for the other two [16:03:41] instead of livehacking addwiki after each scap pull [16:04:08] we obviously need a real fix, too [16:04:33] 10SRE, 10OTRS, 10Security: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10grin) As a sidenote: I have checked a lot of alternatives in the past and have had a second round when otrs ag pulled the plug but found no real replacement. Som... 
[16:05:03] RECOVERY - Ensure local MW versions match expected deployment on mwdebug1001 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [16:05:06] (03PS5) 10Urbanecm: Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) [16:05:13] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) (owner: 10Urbanecm) [16:06:10] (03Merged) 10jenkins-bot: Initial configuration for mniwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666142 (https://phabricator.wikimedia.org/T273456) (owner: 10Urbanecm) [16:07:22] (03PS4) 10Elukey: bigtop: set uid/gid for hadoop user/groups for Buster [puppet] - 10https://gerrit.wikimedia.org/r/666134 (https://phabricator.wikimedia.org/T231067) [16:07:24] (03PS5) 10Elukey: druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) [16:07:40] Amir1: made the mainpage and https://alt.wikipedia.org/wiki/MediaWiki:Sitesupport-url, hope it's good [16:08:52] !log urbanecm@deploy1001 Synchronized langlist: Creating altwiki (T271980) (duration: 00m 55s) [16:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:59] T271980: Create Wikipedia Altai - https://phabricator.wikimedia.org/T271980 [16:09:53] the livehacked version worked fine for mniwiki [16:11:00] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating mniwiki (T273456) (duration: 00m 56s) [16:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:07] T273456: Create Wikipedia Meitei - https://phabricator.wikimedia.org/T273456 [16:12:01] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating mniwiki (T273456) (duration: 00m 55s) [16:12:07] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:13] !log urbanecm@deploy1001 Synchronized dblists: Creating mniwiki (T273456) (duration: 00m 57s) [16:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:32] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating mniwiki (T273456) [16:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:51] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating mniwiki (T273456) (duration: 00m 55s) [16:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:01] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: Creating mniwiki (T273456) (duration: 00m 56s) [16:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:07] T273456: Create Wikipedia Meitei - https://phabricator.wikimedia.org/T273456 [16:17:13] Amir1: if you have time for quick CR: https://github.com/Ladsgroup/Phabricator-maintenance-bot/pull/25, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/666149 [16:18:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating mniwiki (T273456) (duration: 00m 56s) [16:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:59] {{merged}} [16:19:09] thx [16:19:23] !log urbanecm@deploy1001 Synchronized langlist: Creating mniwiki (T273456) (duration: 00m 54s) [16:19:27] the second seems unmerged through :) [16:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:02] (03PS2) 10Urbanecm: Initial configuration for mniwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666145 (https://phabricator.wikimedia.org/T273457) [16:20:06] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for mniwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666145 
(https://phabricator.wikimedia.org/T273457) (owner: 10Urbanecm) [16:21:06] (03Merged) 10jenkins-bot: Initial configuration for mniwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666145 (https://phabricator.wikimedia.org/T273457) (owner: 10Urbanecm) [16:21:23] let's do the last wiki now [16:22:23] okay... [16:22:27] ...we have an issue now... [16:22:33] :'( [16:22:39] ...I created the DB in s3 by mistake [16:23:05] Amir1: marostegui: Can someone of you help me get rid of mniwiktionary created in s3 unintentionally? [16:23:21] this is going to be complicated [16:23:27] I'm afraid so :( [16:23:31] sorry :( [16:24:11] but the canonical version exists in s5? [16:24:13] create a ticket, deleting a db needs some careful checks [16:24:15] (so two wikis) [16:24:21] (03PS1) 10Urbanecm: Point mniwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666178 (https://phabricator.wikimedia.org/T273457) [16:25:50] I can delete it in labsdb hosts only if replication breaks there [16:25:55] actually, maybe we're good [16:26:00] https://www.irccloud.com/pastebin/23Uh6ENk/ [16:26:05] db1100 is s5 master, right? 
[16:26:28] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@1f3bce1]: Deploy ArcLamp fixes for T273565 and T273640 [16:26:33] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@1f3bce1]: Deploy ArcLamp fixes for T273565 and T273640 (duration: 00m 05s) [16:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:35] T273565: Invalid /metrics output after adding new pipeline - https://phabricator.wikimedia.org/T273565 [16:26:35] T273640: NULL in stack frame causes SVG to be unreadable - https://phabricator.wikimedia.org/T273640 [16:26:48] Urbanecm, https://noc.wikimedia.org/db.php [16:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:19] (03PS5) 10Elukey: bigtop: set uid/gid for hadoop user/groups for Buster [puppet] - 10https://gerrit.wikimedia.org/r/666134 (https://phabricator.wikimedia.org/T231067) [16:27:21] (03PS6) 10Elukey: druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward [puppet] - 10https://gerrit.wikimedia.org/r/666135 (https://phabricator.wikimedia.org/T231067) [16:27:29] (03CR) 10Volans: [C: 04-1] "So, the failure I get on macos is that it fails to compile pyyaml 3.13 (and also 3.12 fwiw) with python 3.9 (I tried both 3.9.1 and 3.9.2)" [software/cumin] - 10https://gerrit.wikimedia.org/r/665365 (owner: 10David Caro) [16:28:29] (03CR) 10Hnowlan: [C: 03+1] Fix tegola building pipeline [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 (owner: 10Jgiannelos) [16:29:03] Amir1: in which esxxx host should i expect the right mniwiktionary ES database? 
[16:29:27] (03CR) 10MSantos: [C: 03+1] postgres: use remote script on replica to resync [cookbooks] - 10https://gerrit.wikimedia.org/r/666113 (https://phabricator.wikimedia.org/T275381) (owner: 10Hnowlan) [16:30:01] I don't know from top of my head [16:30:04] :( [16:30:14] I actually forgot how es works, it's rusty [16:30:16] it should be on all es4 and es5 hosts [16:30:21] thanks jynus [16:30:51] 10SRE, 10SRE-Access-Requests: Add Pau Giner to analytics-privatedata-users - https://phabricator.wikimedia.org/T275138 (10MNovotny_WMF) Approved! [16:30:53] it's on both es4 and es5 master [16:31:42] Amir1: so, merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/666178, execute rest of the script in shell.php, and sync? [16:32:17] it definitely needs doing [16:32:29] then let's check if everything is fine or we have explosions [16:32:58] at least the placement of the databases _looks_ correct, AFAICS [16:33:04] (03CR) 10Urbanecm: [C: 03+2] Point mniwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666178 (https://phabricator.wikimedia.org/T273457) (owner: 10Urbanecm) [16:33:07] so, let's merge it [16:34:00] (03Merged) 10jenkins-bot: Point mniwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666178 (https://phabricator.wikimedia.org/T273457) (owner: 10Urbanecm) [16:34:24] it failed on #8 /srv/mediawiki/php-1.36.0-wmf.31/extensions/WikimediaMaintenance/addWiki.php(187): Wikibase\Lib\Maintenance\PopulateSitesTable->execute() [16:34:28] let me check if it exists on s3 [16:34:33] it doesn't, i already checked [16:34:39] oh cool [16:34:48] why populatesites table fail [16:35:01] because it looked to s3 [16:35:07] and the database _wasn't_ there [16:35:29] so once i fix db-* files, it should be possible to rerun it from there [16:36:59] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[16:36:59] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:37:01] okay okay [16:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:42] (03CR) 10Tchanders: "Looks good to me (aside from whitespace nitpick)." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) (owner: 10AGueyte) [16:38:13] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:44] Hmm, why s5? [16:39:33] James_F: DBAs want to have all new wikis on s5 [16:39:41] Oh. Interesting. [16:39:49] We should probably re-work the scripts then. [16:39:55] we did [16:40:02] In mediawiki-config. [16:40:15] Rather than have things fall back to s3. [16:40:20] we should make db-* use the dblists somehow (either directly or indirectly) [16:40:35] No no no. [16:40:39] We absolutely should not. :-) [16:40:41] why not [16:40:45] anyway, the rest of the addwiki script successfully executed [16:40:47] The dblists are very expensive to parse. [16:40:58] James_F: that's why I'm saying "or indirectly" [16:41:12] i.e. have some script that regenerates the mapping in db-*.php files [16:41:13] OK, I'll say "we should make the DB config work with magic" then. :-) [16:41:21] we really should migrate everything to yaml T223602 [16:41:22] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [16:41:30] agreed [16:41:36] Amir1: Yes, well, good luck. :-( [16:41:44] All I want for Christmas is seeing T223602 resolved [16:42:02] how can I help? 
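The "have some script that regenerates the mapping in db-*.php files" idea from 16:41:12 could look roughly like the following. This is only a sketch under stated assumptions, not the actual mediawiki-config tooling: it assumes one dblist-style text per section (one database name per line, `#` starting a comment), and the function names `parse_dblist`, `build_section_map` and `render_php_entries` are hypothetical, as is the rendered PHP entry format.

```python
from typing import Dict, List


def parse_dblist(text: str) -> List[str]:
    """Parse dblist-style text: one database name per line, '#' starts a comment.

    The file format is an assumption made for this sketch.
    """
    names = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            names.append(name)
    return names


def build_section_map(dblists: Dict[str, str]) -> Dict[str, str]:
    """Map every wiki to its section, refusing wikis listed in two sections."""
    mapping: Dict[str, str] = {}
    for section, text in dblists.items():
        for wiki in parse_dblist(text):
            if wiki in mapping and mapping[wiki] != section:
                raise ValueError(
                    f"{wiki} is listed in both {mapping[wiki]} and {section}"
                )
            mapping[wiki] = section
    return mapping


def render_php_entries(mapping: Dict[str, str]) -> str:
    """Render the mapping as PHP array entries (hypothetical output format)."""
    return "\n".join(
        f"'{wiki}' => '{section}'," for wiki, section in sorted(mapping.items())
    )


if __name__ == "__main__":
    # Made-up input: the three wikis created in this log, all on s5.
    example = {"s5": "# new wikis\naltwiki\nmniwiki\nmniwiktionary\n"}
    print(render_php_entries(build_section_map(example)))
```

Running the expensive parse once at generation time, and committing the result, is one way to read "indirectly": the request path would only ever see the static mapping.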
[16:42:04] syncing the wikt config [16:42:35] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating mniwiktionary (T273457) (duration: 00m 55s) [16:42:36] James_F: the thing is that s3 should still be the default as it's 900 wikis and s5 is like 20 [16:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:41] T273457: Create Wiktionary Meitei - https://phabricator.wikimedia.org/T273457 [16:42:54] Amir1: But we should explicitly configure it, not have it fall back at all. [16:43:13] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:28] or at least have CI yell at me for not configuring db-* properly [16:44:10] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating mniwiktionary (T273457) (duration: 00m 56s) [16:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:29] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [16:44:30] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [16:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:44] yeah [16:44:59] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:07] James_F: regarding the yaml config. Is there anything I can do to push it forward? What's it stuck on? 
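The "have CI yell at me for not configuring db-* properly" suggestion from 16:43:28 might be sketched as a check like the one below. It is an illustration only: how the list of all wikis and the explicit wiki-to-section mapping would actually be loaded from the dblists and the db-*.php files is not shown in this log, so both are passed in as plain Python values, and the name `find_unconfigured_wikis` is hypothetical.

```python
from typing import Dict, Iterable, List


def find_unconfigured_wikis(
    all_wikis: Iterable[str], section_by_wiki: Dict[str, str]
) -> List[str]:
    """Return wikis with no explicit section.

    These are the wikis that would silently fall back to the default
    section, which is the situation described for mniwiktionary above.
    """
    return sorted(w for w in set(all_wikis) if w not in section_by_wiki)


if __name__ == "__main__":
    # Made-up state resembling the moment before the "Point mniwiktionary
    # to s5" change was merged.
    all_wikis = ["altwiki", "mniwiki", "mniwiktionary"]
    section_by_wiki = {"altwiki": "s5", "mniwiki": "s5"}
    for wiki in find_unconfigured_wikis(all_wikis, section_by_wiki):
        print(f"{wiki}: no explicit section, would fall back to the default")
```

A CI job could fail the build whenever the returned list is non-empty. While s3 stays the implicit default for the roughly 900 existing wikis, the same check could be restricted to a list of newly added wikis instead of every wiki.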
[16:45:10] !log urbanecm@deploy1001 Synchronized dblists: Creating mniwiktionary (T273457) (duration: 00m 56s) [16:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:10] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664686 (owner: 10CRusnov) [16:46:25] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/665410 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [16:46:34] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating mniwiktionary (T273457) [16:46:37] Amir1: No idea, sorry. I'm not in RelEng any more. [16:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:08] James_F: where did you move to, if i may ask? [16:47:15] Abstracting Wikipedia [16:47:24] Wikipediaing Abstraction. [16:47:29] i see [16:47:37] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating mniwiktionary (T273457) (duration: 00m 56s) [16:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:42] T273457: Create Wiktionary Meitei - https://phabricator.wikimedia.org/T273457 [16:47:47] lol [16:48:02] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/665410 (https://phabricator.wikimedia.org/T265084) (owner: 10CRusnov) [16:48:21] so, just iwiki cache now [16:48:33] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666181 [16:48:35] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666181 (owner: 10Urbanecm) [16:49:44] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666181 (owner: 10Urbanecm) [16:50:48] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 
22s) [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:09] !log Run scap pull on mwmaint1002 to clear any local changes [16:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:32] Amir1: I think we're done :) [16:51:37] YAY [16:51:49] I run the populate sites for all wikis for wikidata support [16:51:52] thanks [16:52:35] but first, lunch [16:52:38] "lunch" [16:52:46] enjoy your meal Amir1 :) [16:54:02] (03CR) 10Hnowlan: [C: 03+2] Fix tegola building pipeline [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 (owner: 10Jgiannelos) [16:54:05] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [16:54:53] (03Merged) 10jenkins-bot: Fix tegola building pipeline [software/tegola] - 10https://gerrit.wikimedia.org/r/666159 (owner: 10Jgiannelos) [16:58:36] (03PS1) 10AGueyte: Enable SecurePoll logging on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666183 (https://phabricator.wikimedia.org/T273990) [17:03:10] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28164/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:04:27] (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/666184 [17:05:55] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:15] (03PS41) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:09:16] (03CR) 10Klausman: [C: 03+1] Don't run Camus checker for atskafka webrerquest job [puppet] - 10https://gerrit.wikimedia.org/r/666121 (https://phabricator.wikimedia.org/T254317) (owner: 10Ottomata) [17:13:20] (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/666184 (owner: 10Ahmon Dancy) [17:14:14] (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/666184 (owner: 10Ahmon Dancy) [17:14:59] !log ppchelko@deploy1001 Started deploy [restbase/deploy@c5c4b2d] (dev-cluster): remove graphoid [17:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:03] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:16:41] (03PS3) 10David Caro: transport.clustershell: handle str when reporting commands [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) [17:18:08] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c5c4b2d] (dev-cluster): remove graphoid (duration: 03m 09s) [17:18:11] (03PS4) 10David Caro: transport.clustershell: handle str when reporting commands [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) [17:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:19] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:18:45] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:33:46] (03CR) 10Bstorm: [C: 03+2] pbuilder: create apt-cache directory before running pbuilder init [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [17:40:55] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE)) [17:42:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:45:25] Is there a problem on uploading to Commons? I've been trying to upload a ~300MB video, and it gets stuck in the "queued" phase post-upload until eventually it times out. [17:46:57] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.76 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:47:09] (several times now) [17:49:01] abartov: Nothing specifically I don't think. Are you using UW or the "default" MW upload form? Or some other tool/similar? [17:49:23] UW [17:49:40] which I use every week to upload such videos (these are recorded lessons I give), without issue. [17:50:01] It never lingers on the "queued" for more than a second or two, before moving to "publishing" [17:50:41] I can supply the filename if anyone is able to and interested in grepping the logs. 
[17:51:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [17:51:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:52:30] abartov: I assume it's likely this T274589 [17:52:31] T274589: No atomic section is open (got LocalFile::lockingTransaction) - https://phabricator.wikimedia.org/T274589 [17:56:23] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.492 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:57:19] (03CR) 10David Caro: [C: 03+2] transport.clustershell: handle str when reporting commands [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [17:57:35] (just retried again) [17:57:56] Amir1: so should I just wait until that ticket is resolved? No point in trying any more, right? [17:58:09] I think so [17:58:13] We can probably do it server side for you [17:58:15] Maybe [17:58:29] (if that isn't broken too) [17:58:29] yeah, there have been cases like that before [17:58:46] but as Sam said, it might be broken there too [17:59:43] (03CR) 10Giuseppe Lavagetto: "Removing my -2 as requested." [puppet] - 10https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: 10Dzahn) [18:00:05] ryankemper: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy .
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1800). [18:00:47] Reedy: that would be lovely, actually, as I have some students waiting on it (people who can't make the live lesson and watch the recordings). Shall I wetransfer it to you, or...? [18:01:28] Yeah, whatever works for you. Can also use google drive or similar to upload it if you want [18:01:43] Have you also written the description? Of course, that can be fixed after, so nbd there [18:02:28] (03Merged) 10jenkins-bot: transport.clustershell: handle str when reporting commands [software/cumin] - 10https://gerrit.wikimedia.org/r/665366 (https://phabricator.wikimedia.org/T275210) (owner: 10David Caro) [18:03:41] Reedy: what e-mail should I send the link to? [18:03:52] reedy@wikimedia.org wfm :) [18:04:21] thank you! [18:04:46] what username do you want it uploading as? [18:07:27] my volunteer account, [[User:Ijon]] [18:07:36] (that's a capital I, not a lowercase L) [18:09:42] (03CR) 10Bstorm: [C: 03+1] toolforge: front proxy: drop non-TLS support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662942 (https://phabricator.wikimedia.org/T274123) (owner: 10Arturo Borrero Gonzalez) [18:11:39] abartov: https://commons.wikimedia.org/wiki/File:Latin_for_Beginners_%E2%80%93_contra_pestilentiam_%E2%80%93_Lesson_031b.webm [18:11:41] easy! [18:12:00] (also, good to know server side apparently works fine) [18:12:52] Reedy: wonderful, thank you! [18:39:34] 10SRE, 10OTRS, 10Security: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Keegan) FWIW, Znuny has helped us before, in 2013 (T24622). Znuny was founded by Martin Edenhofer, the original author of OTRS. Our OTRS install was stuck on so... 
[18:39:55] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 56.65 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:46:51] (03CR) 10Volans: [C: 04-1] "One small issue inline, looks sane otherwise. Consider it a +1 once that's fixed." (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664686 (owner: 10CRusnov) [18:51:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:51:22] jouncebot: next [18:51:22] In 0 hour(s) and 8 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1900) [18:55:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:56:39] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.604 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:59:43] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T1900). 
[19:00:04] Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:15] I'll self-service [19:00:32] (03PS4) 10Urbanecm: Enable GrowthExperiments on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) [19:00:36] (03CR) 10Urbanecm: [C: 03+2] Enable GrowthExperiments on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) (owner: 10Urbanecm) [19:00:57] (03PS3) 10CRusnov: reports: Update for 2.10 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/664686 [19:00:59] (03PS2) 10CRusnov: customscripts: Make minor changes to port to 2.10 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/665410 (https://phabricator.wikimedia.org/T265084) [19:01:26] mutante: I see you used the jouncebot: next, should I ping you once done? [19:01:30] (03Merged) 10jenkins-bot: Enable GrowthExperiments on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) (owner: 10Urbanecm) [19:02:21] 10SRE, 10Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409 (10Legoktm) For reference, https://wiki.mozilla.org/Security/DOH-resolver-policy has links to the privacy policies of the Mozilla-approved resolvers. > Additionally, maybe it's a good idea to try to en... [19:02:41] Urbanecm: it's fine, not needed. 
it's me who just needs to scap pull when done [19:03:03] well, optionally :) thanks [19:03:07] okay, just asking in case you want to do something :) [19:03:16] happy to ping you, that's like the least i can do [19:03:18] just reimaging the last 10% [19:03:26] codfw is already 100% buster [19:03:36] ok, ping me:) [19:03:47] kk [19:03:55] and congrats for having 100% buster codfw [19:04:11] Congrats [19:05:11] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:32] thanks [19:06:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 902b6854b5d56fde9fbf5d2c779282049bf7288a: Enable GrowthExperiments on thwiki (T274646) (duration: 00m 56s) [19:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:21] T274646: Growth tools deployment on Thai Wikipedia - https://phabricator.wikimedia.org/T274646 [19:08:25] !log urbanecm@deploy1001 Synchronized dblists/growthexperiments.dblist: 902b6854b5d56fde9fbf5d2c779282049bf7288a: Enable GrowthExperiments on thwiki (T274646) (duration: 00m 54s) [19:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:03] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:09] mutante: done with this one patch, will have another one in ~15 minutes [19:11:17] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "lgtm! by the way, see this:" [puppet] - 10https://gerrit.wikimedia.org/r/665471 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [19:12:14] Urbanecm: ack, thanks!
[19:14:30] 10SRE, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10aaron) [19:17:15] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1316.eqiad.wmnet with reason: REIMAGE [19:19:14] (03PS1) 10Urbanecm: Enable GrowthExperiments on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666198 (https://phabricator.wikimedia.org/T275130) [19:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:58] 10SRE, 10Language-Team, 10Performance-Team, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Gilles) The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not famili... 
[19:22:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1316.eqiad.wmnet with reason: REIMAGE [19:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1315.eqiad.wmnet with reason: REIMAGE [19:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1315.eqiad.wmnet with reason: REIMAGE [19:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:16] (03CR) 10Urbanecm: [C: 03+2] Enable GrowthExperiments on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666198 (https://phabricator.wikimedia.org/T275130) (owner: 10Urbanecm) [19:28:13] (03Merged) 10jenkins-bot: Enable GrowthExperiments on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666198 (https://phabricator.wikimedia.org/T275130) (owner: 10Urbanecm) [19:28:20] 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10pkang) Hi all, per my slack conversation with David, the T&S team is okay with continuing to use the default reply address. Please feel free to close this ticket as no... 
[19:28:29] 10SRE, 10Language-Team, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Gilles) [19:31:44] 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10Legoktm) 05Open→03Declined [19:33:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fc7b071b98b2c14d45259212bd6bea858e3f5aa7: Enable GrowthExperiments on rowiki (T275130; 1/3) (duration: 00m 55s) [19:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:47] T275130: Deploy Growth features on Romanian Wikipedia - https://phabricator.wikimedia.org/T275130 [19:35:04] !log urbanecm@deploy1001 Synchronized dblists/growthexperiments.dblist: fc7b071b98b2c14d45259212bd6bea858e3f5aa7: Enable GrowthExperiments on rowiki (T275130; 2/3) (duration: 00m 55s) [19:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:55] !log urbanecm@deploy1001 Synchronized wmf-config/config/rowiki.yaml: fc7b071b98b2c14d45259212bd6bea858e3f5aa7: Enable GrowthExperiments on rowiki (T275130; 3/3) (duration: 00m 55s) [19:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:40] 10SRE, 10Language-Team, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Reedy) >>! In T275319#6850077, @Gilles wrote: > The fact that each character takes twice the storage space shouldn't affect pa... [19:46:06] 10SRE, 10Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409 (10ssingh) >>! In T275409#6849975, @Legoktm wrote: > For reference, https://wiki.mozilla.org/Security/DOH-resolver-policy has links to the privacy policies of the Mozilla-approved resolvers. Ah right,... 
[19:48:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:02:22] 10SRE, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10ssingh) We decided to file https://github.com/citizenlab/test-lists/pull/730 so that we can get some test data. (Thanks Chris for suggesting the use of this... [20:03:49] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1316.eqiad.wmnet'] ` an... [20:04:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1349.eqiad.wmnet with reason: REIMAGE [20:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1349.eqiad.wmnet with reason: REIMAGE [20:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1315.eqiad.wmnet'] ` an... 
[20:08:36] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1315.eqiad.wmnet [20:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1315.eqiad.wmnet [20:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:13] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:16:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1316.eqiad.wmnet [20:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1316.eqiad.wmnet [20:19:22] (03PS1) 10Dzahn: cloud/devtools: add mediawiki::sites lookup for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/666202 [20:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:47] (03CR) 10jerkins-bot: [V: 04-1] cloud/devtools: add mediawiki::sites lookup for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/666202 (owner: 10Dzahn) [20:20:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:25:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1349.eqiad.wmnet'] ` an... [20:25:45] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1349.eqiad.wmnet [20:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:21] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1349.eqiad.wmnet [20:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:29:29] !log mw1279 (canary) - reimaging to stretch [20:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:40] doh.. not stretch [20:29:52] !log mw1279 (canary) - reimaging to buster [20:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:13] (03PS2) 10Dzahn: cloud/devtools: add empty mediawiki::sites for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/666202 [20:35:20] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add empty mediawiki::sites for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/666202 (owner: 10Dzahn) [20:39:25] Hey, can anyone advice me what the right dashboard to see how frequently a particular job gets (not) executed is? 
I never can find the right one in Grafana :-( [20:39:41] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T273463 T271985 T273468) [20:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:50] T273468: Add Wikidata support for mniwiki - https://phabricator.wikimedia.org/T273468 [20:39:50] T271985: Add Wikidata support for altwiki - https://phabricator.wikimedia.org/T271985 [20:39:50] T273463: Add Wikidata support for mniwiktionary - https://phabricator.wikimedia.org/T273463 [20:39:52] thanks Amir1 :) [20:40:19] nah, this poor node is doing the job, I'm just running a command [20:40:26] hehe [20:40:46] you're the commander, and I thank you for commanding that node (and solving issues it complains with :)) [20:41:10] Amir1: by any chance, do you know the answer to the job queue question few lines above? :D [20:41:21] let me see [20:41:46] oh I know [20:41:52] that's great [20:42:38] Urbanecm: https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?viewPanel=1&orgId=1 [20:42:43] thanks! [20:44:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1314.eqiad.wmnet with reason: REIMAGE [20:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1314.eqiad.wmnet with reason: REIMAGE [20:46:13] Amir1: is it expected for the normal job processing rate graph to be zero? https://grafana-rw.wikimedia.org/d/000000400/jobqueue-eventbus?viewPanel=2&orgId=1 [20:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:13] nope. ottomata do you know why ^? [20:47:20] Did we break something? [20:48:00] Amir1: I'm not sure. 
Tldr I'm looking at T275429, and it seems that UserOptionsUpdateJob does not get executed [20:48:00] T275429: Homepage mentor is not stored persistently at Romanian Wikipedia - https://phabricator.wikimedia.org/T275429 [20:48:04] but maybe some others don't too [20:48:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1312.eqiad.wmnet with reason: REIMAGE [20:48:12] Also it would be great if we clean up pre-eventbus grafana board It takes a lot to find the right dashboard [20:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:23] (or all, if that graph is right) [20:48:49] and T275432 sounds to be jobqueue related too [20:48:49] T275432: MassMessage not delivering - https://phabricator.wikimedia.org/T275432 [20:49:53] This looks bad [20:49:54] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?viewPanel=15&orgId=1&from=now-90d&to=now [20:50:10] (03PS1) 10Dzahn: cloud/devtools: mediawiki::sites needs to be array not hash [puppet] - 10https://gerrit.wikimedia.org/r/666206 [20:50:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1312.eqiad.wmnet with reason: REIMAGE [20:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:23] I'm surprised we don't have alerts for that one [20:51:28] Hey all - I'd like to deploy a quick sec patch to .31 for T274883. [20:52:48] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: mediawiki::sites needs to be array not hash [puppet] - 10https://gerrit.wikimedia.org/r/666206 (owner: 10Dzahn) [20:53:00] looking at logstash, it runs jobs, so it's not 100% broken but I can't find out why it's not sending metrics to grafana [20:53:07] Urbanecm: that dashboard says that useroptionsupdate queue is empty [20:53:09] Urbanecm: can you file a ticket? 
[20:53:28] and some other jobs like category updates have definitely been running today [20:53:30] Amir1: I already filled T275429 for the growthexperiments-specific issue I noticed. [20:53:30] T275429: Homepage mentor is not stored persistently at Romanian Wikipedia - https://phabricator.wikimedia.org/T275429 [20:53:41] Should I fill some more generic one, about jobqueue (partially) not executing? [20:54:48] Urbanecm: we also have a problem with altwiki it seems [20:54:48] https://logstash.wikimedia.org/goto/6574f7908e4f41b51c66b849ea1215df [20:55:05] that's more expected since it's new, heh [20:55:42] but it shouldn't fail? [20:56:09] Urbanecm: we should have ticket for general issue whether it's no metrics or no jobs being ran [20:56:19] Amir1: okay, filling [20:57:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1279.eqiad.wmnet with reason: REIMAGE [20:57:36] just noticed this https://usercontent.irccloud-cdn.com/file/fuUi2vaY/image.png [20:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:29] 10SRE: Application Servers / JobQueue EventBus :: Normal job processing rates says "Failed to fetch" - https://phabricator.wikimedia.org/T275436 (10Urbanecm_WMF) [20:59:35] (that specific issue filled as T275436) [20:59:36] T275436: Application Servers / JobQueue EventBus :: Normal job processing rates says "Failed to fetch" - https://phabricator.wikimedia.org/T275436 [20:59:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1279.eqiad.wmnet with reason: REIMAGE [20:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:53] !log Deployed security patch for T274883 [20:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T2100). [21:00:24] !log end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T273463 T271985 T273468) [21:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:39] T273468: Add Wikidata support for mniwiki - https://phabricator.wikimedia.org/T273468 [21:00:39] T271985: Add Wikidata support for altwiki - https://phabricator.wikimedia.org/T271985 [21:00:39] T273463: Add Wikidata support for mniwiktionary - https://phabricator.wikimedia.org/T273463 [21:01:17] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) [21:01:24] and filled ^^ for the generic job queue issue [21:01:54] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) [21:02:12] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) [21:02:34] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) [21:03:21] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) [21:03:54] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Ladsgroup) Some scary graphs: {F34119169} {F34119171} [21:04:14] (03PS3) 10AGueyte: Enable SecurePoll logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666140 (https://phabricator.wikimedia.org/T273990) [21:04:52] we really should look into altwiki jobs issue but after finding out about other jobs [21:04:56] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - 
https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) p:05Triage→03Unbreak! Preliminary prioritising this as UBN, unless we figure out this is not as serious as it looks like. [21:05:08] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Pchelolo) Need to delete the dashboard you've referenced, it's outdated. https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 is the correct one. The backlog of various jobs inde... [21:05:23] seems you didn't find the correct dashboard Amir1 🙂 [21:06:00] (and definitely, we should not forget the altwiki issue) [21:06:00] :( [21:06:06] Amir1: should i fill it, or will you? [21:07:37] reading backlog... [21:08:23] none of those dashboard links work for me! [21:08:23] I'm really done for the day, please subscribe me. I'll investigate tomorrow [21:08:41] Can we clean these graphs? [21:08:56] if you search "job" in grafana you find five different dashboards [21:09:38] dunno nuthin bout none of those :p [21:09:55] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Pchelolo) So, processMediaModeration job should be excluded - it's expected for it's backlog to grow this way. It's running in it's own queue, not affecting others, so even though it lo... [21:09:59] https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=1614006587510&to=1614028187510&var-dc=eqiad%20prometheus%2Fk8s looks like the main one [21:10:19] ottomata: Pchelolo said https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 should be the right one [21:10:27] ya makes sense [21:10:50] yeah, sorry, when moved to k8s we've duplicated the dashboard and I never deleted the old one. [21:11:15] 10SRE, 10WMF-JobQueue: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 (10Urbanecm) [21:11:22] I left some analysis on the task and from the backlog graphs it all seems to be ok. 
[21:11:24] Filed as T275438 Pchelolo, ottomata and Amir1 [21:11:25] T275438: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 [21:11:32] insert rate is much higher than the processing rate, is it just backlogged or is it not executing some jobs at all? [21:11:42] is there a dedicated grafana tag? [21:12:02] 10SRE, 10WMF-JobQueue: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 (10Pchelolo) 05Open→03Resolved a:03Pchelolo I've deleted the old dashboard. https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1 is the one and only dashboard. [21:12:30] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Ladsgroup) There are two reports of jobs not being processed above, one being massmessage and the other one being growth jobs. Maybe we are running out capacity because of depooling job... [21:13:11] lemme look at the subtasks [21:13:25] thanks Pchelolo [21:13:46] 10SRE, 10WMF-JobQueue: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 (10Urbanecm) 05Resolved→03Open There are apparently at least two other job dashboards loading in my browser: {F34119176} Can we clean those as well? [21:13:56] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Legoktm) >>! In T275437#6850546, @Ladsgroup wrote: > Maybe we are running out capacity because of depooling job runners for the buster upgrade? job runners are already 100% buster.
[21:14:29] 10SRE, 10WMF-JobQueue: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 (10Ladsgroup) @Pchelolo There are also https://grafana.wikimedia.org/d/000000105/job-queue-rate?orgId=1 and https://grafana.wikimedia.org/d/000000107/job-queue-health?orgId=1&refresh=1m [21:14:40] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Ladsgroup) That rules that out. Thanks. [21:16:31] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Majavah) Looking at the dashboard, the job insert rate is much higher than the processing rate, is is just backlogged/out of capacity or is it not executing jobs/job types at all? [21:16:33] 10SRE, 10WMF-JobQueue: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 (10Pchelolo) Both need to be deleted as well. [21:17:02] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=5&orgId=1&from=now-7d&to=now what is the media moderation stuff? [21:17:12] really makes it hard to see other stuff [21:17:28] legoktm: that is some job that automatically looks into "bad" media at commons, and inform T&S to deal with them [21:17:46] https://www.mediawiki.org/wiki/Extension:MediaModeration [21:18:16] that one is abusing job queue a bit - there's a maintenance script that triggers a lot of those jobs, and then they are slowly processed [21:18:45] hmm, maybe it should run at mwmaint directly [21:18:46] ah, so is the backlog just a one-time thing or is there always going to be a backlog? [21:19:16] it's a one time thing [21:19:19] anyways, this is a side thing, it just makes it hard to see what's genuinely backlogged [21:19:21] but for a while [21:19:36] there is a lot of media on commons :) [21:23:13] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Legoktm) >>! 
In T275437#6850519, @Pchelolo wrote: > The recent drastic increase in LocalGlobalUserPageCacheUpdateJob is interesting, but we've already been through this process with it... [21:23:25] Pchelolo: do you need help investigating? [21:24:54] legoktm: at this point I'm kinda randomly poking at various logs with no plan or purpose... donno if this could be helped :) [21:28:03] another hypothesis, buster is slower, we are hitting the capacity [21:28:30] with that I just fade away into the night, see you tomorrow! [21:28:51] ttyl Amir1 [21:31:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1314.eqiad.wmnet'] ` an... [21:31:26] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Pchelolo) This ticket is not a UBN. We've had issues with mass message before, and there's no indication of widespread jobs failures. [21:31:44] for later: both links on https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue#Monitoring don't work [21:33:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1312.eqiad.wmnet'] ` an... [21:36:22] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Urbanecm_WMF) >>! In T275437#6850596, @Pchelolo wrote: > This ticket is not a UBN. We've had issues with mass message before, and there's no indication of widespread jobs failures. Fee... 
[21:36:51] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Pchelolo) p:05Unbreak!→03Triage [21:37:54] 10SRE, 10WMF-JobQueue: Clean graphana graphs about job queue - https://phabricator.wikimedia.org/T275438 (10Pchelolo) 05Open→03Resolved [21:42:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1279.eqiad.wmnet'] ` an... [21:45:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1312.eqiad.wmnet [21:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1279.eqiad.wmnet [21:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:21] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1314.eqiad.wmnet [21:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:04] Reedy and sbassett: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210222T2200) [22:05:50] (03PS2) 10Krinkle: mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - 10https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075) [22:08:10] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:08] (03PS1) 10Gergő Tisza: Add GrowthExperiments tables to private_tables [puppet] - 10https://gerrit.wikimedia.org/r/666216 (https://phabricator.wikimedia.org/T266913) [22:11:41] (03PS1) 10Krinkle: fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) [22:12:01] (03PS3) 10Krinkle: mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - 10https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075) [22:12:02] Pchelolo: is it safe to assume events will be in EventBus.log once they're scheduled by MediaWiki? [22:12:06] or is that a wrong assumption? 
[22:12:53] Urbanecm: we don't log all events in a log file, that would be too much [22:13:01] Aha [22:13:12] (03CR) 10jerkins-bot: [V: 04-1] fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) (owner: 10Krinkle) [22:13:15] (03PS2) 10Krinkle: fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) [22:13:15] in order to debug what's actually getting into the queue, there's a trick [22:13:30] was going to ask if there's a way to confirm we're actually scheduling the job [22:13:34] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:37] (03CR) 10Krinkle: [C: 03+2] fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) (owner: 10Krinkle) [22:14:15] Krinkle: i assume you know jenkins was not happy with you the first time? 
[22:14:18] (03CR) 10jerkins-bot: [V: 04-1] fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) (owner: 10Krinkle) [22:15:52] (03PS3) 10Krinkle: fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) [22:16:03] Urbanecm: yeah, but forgot to take a look [22:16:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1314.eqiad.wmnet [22:16:11] I assumed my edit addressed whatever feedback it had [22:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:37] let's see [22:17:15] (03PS4) 10Krinkle: mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - 10https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075) [22:18:08] Pchelolo: still curious to hear the trick (sorry if you're still writing) [22:18:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1312.eqiad.wmnet [22:18:15] yeah, sorry. [22:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:22] so the trick is to go to mwmaint [22:18:29] and do something like kafkacat -b kafka-main1001.eqiad.wmnet -t "eqiad.mediawiki.job.userOptionsUpdate" [22:18:41] then you can see all the jobs as they get in [22:18:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1279.eqiad.wmnet [22:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:56] PROBLEM - SSH on analytics1058.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:18:57] before all deduplication or anything. [22:19:02] that's nice trick [22:19:03] thanks [22:19:07] is it documented anywhere? 
[22:20:05] (03PS5) 10Krinkle: mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - 10https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075) [22:20:09] (03CR) 10Krinkle: [C: 03+2] fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) (owner: 10Krinkle) [22:20:32] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10Jclark-ctr) @fgiunchedi ms-be1017 has been swapped into place of ms-be1034. Will we be resurrecting the name ms-be1017 or renaming to ms-be1034? This will most likely need a re imaged. [22:20:40] Urbanecm: Pchelolo: I don't know which issue this is, but in theory if EventBus failed to deliver a message it would log to Logstash that it failed. [22:21:10] yeah, Krinkle. in this case the job is delivered but then mysteriously not processed [22:21:17] ok :) [22:21:37] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:21:59] (03Merged) 10jenkins-bot: fatal-error.php: Add support for from=destruct and from=shutdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666217 (https://phabricator.wikimedia.org/T275075) (owner: 10Krinkle) [22:23:07] (03CR) 10Effie Mouzeli: create placeholder role/profile for gitlab VMs [puppet] - 10https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: 10Dzahn) [22:24:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:25:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:31:32] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [22:31:32] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [22:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:53] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace sata cables for cloudvirt1024 - https://phabricator.wikimedia.org/T275215 (10Jclark-ctr) Created dell Service ticket Confirmed: Service Request 1052329386 was successfully submitted. 
[22:36:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [22:41:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE [22:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:17] 10SRE, 10Graphite: Application Servers / JobQueue EventBus :: Normal job processing rates says "Failed to fetch" - https://phabricator.wikimedia.org/T275436 (10Aklapper) [22:42:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE [22:42:15] !log krinkle@deploy1001 Synchronized w/fatal-error.php: df694d695 (duration: 00m 56s) [22:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE [22:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:31] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Jclark-ctr) a:05Jclark-ctr→03RobH @RobH replaced cable. 
link light on aqs1014 [22:45:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE [22:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:34] (03PS1) 10Mholloway: WikimediaEvents: Enable session tick instrument on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666221 (https://phabricator.wikimedia.org/T274172) [22:49:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1286.eqiad.wmnet with reason: REIMAGE [22:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:59] (03CR) 10Mholloway: [C: 04-2] "Hold 'til tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666221 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [22:50:31] !log disabling puppet on mwdebug1001 to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/664903 [22:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:13] Krinkle: copied into place on mwdebug1001 and reloaded php7.2-fpm [22:52:45] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1286.eqiad.wmnet with reason: REIMAGE [22:52:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10jijiki) @MoritzMuehlenhoff do you think it makes sense to keep 1 api and 1 app in s... 
[22:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:17] (03PS1) 10Greg Grossmeier: admin: update gjg's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/666222 [22:55:40] legoktm: ok, testing [22:56:02] (03PS2) 10Greg Grossmeier: admin: update gjg's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/666222 [22:57:32] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1103 - https://phabricator.wikimedia.org/T275266 (10Jclark-ctr) @Marostegui Swapped Bad SSD @wiki_willy we did have one new in box same size same model ect. it originally came from HP [22:57:34] greg-g: how did you get your hands on a Thinkpad X2? :p [22:58:19] oh [22:58:41] and here I thought they were going with "x1 nitrogen" next :p [23:00:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1410.eqiad.wmnet'] ` an... [23:00:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1410.eqiad.wmnet [23:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1412.eqiad.wmnet'] ` an... 
[23:02:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1412.eqiad.wmnet [23:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:34] legoktm: heh, my last x1 carbon (my first) had a hostname of x1, I didn't want to reuse, and I'm not very creative :P [23:03:57] better than x1-2 [23:04:19] x1002, obviously, to match prod naming scheme [23:04:29] mine is an 5thgen-x1 [23:05:39] (03PS3) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [23:06:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1412.eqiad.wmnet [23:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:33] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1410.eqiad.wmnet [23:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:44] (03PS4) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [23:09:24] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [23:09:24] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [23:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:50] legoktm: confirmed that it has the expected effect via mwdebug1001 [23:11:18] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1103 - https://phabricator.wikimedia.org/T275266 (10wiki_willy) Thanks @Jclark-ctr >>! In T275266#6850886, @Jclark-ctr wrote: > @Marostegui Swapped Bad SSD @wiki_willy we did have one new in box same size same model ect. 
it originally came from HP [23:11:51] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` aqs1014.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021022223... [23:12:16] Krinkle: ok, ready to be merged then? [23:12:26] legoktm: yep [23:12:54] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove duplicate "in X:Z" from php7-fatal-error.php message [puppet] - 10https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075) (owner: 10Krinkle) [23:15:16] ok, I'll let puppet roll it out gradually [23:15:30] did a manual run on mwdebug1001 though [23:18:40] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [23:18:40] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [23:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:48] !log milimetric@deploy1001 Started deploy [analytics/refinery@3de01b5]: Fix camus [23:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:27] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [23:22:27] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . 
[23:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:31] (03PS1) 10Ppchelko: Enable helmfile recreatePods for changeprop installations [deployment-charts] - 10https://gerrit.wikimedia.org/r/666225 [23:25:45] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE [23:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:51] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1014.eqiad.wmnet with reason: REIMAGE [23:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:22] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10IKhitron) >>! In T275437#6850610, @Urbanecm_WMF wrote: >>>! In T275437#6850596, @Pchelolo wrote: >> This ticket is not a UBN. We've had issues with mass message before, and there's no i... [23:33:52] !log milimetric@deploy1001 Finished deploy [analytics/refinery@3de01b5]: Fix camus (duration: 14m 03s) [23:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:17] !log milimetric@deploy1001 Started deploy [analytics/refinery@3de01b5] (thin): Fix camus [23:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:22] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10Pchelolo) Ok, restarting jobqueue change propagation service might have resolved the problem. Will investigate the root cause of this. 
[23:34:24] !log milimetric@deploy1001 Finished deploy [analytics/refinery@3de01b5] (thin): Fix camus (duration: 00m 07s) [23:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1286.eqiad.wmnet'] ` an... [23:35:02] RECOVERY - MegaRAID on db1103 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:35:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['aqs1014.eqiad.wmnet'] ` and were **ALL** successful. [23:35:49] (03PS1) 10Ppchelko: Increase concurrency for cdnPurge job [deployment-charts] - 10https://gerrit.wikimedia.org/r/666228 [23:36:16] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1286.eqiad.wmnet [23:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:41] (03PS2) 10Ppchelko: Increase concurrency for cdnPurge job [deployment-charts] - 10https://gerrit.wikimedia.org/r/666228 [23:37:41] 10SRE, 10WMF-JobQueue: Jobs are not getting executed or executed really slowly - https://phabricator.wikimedia.org/T275437 (10IKhitron) >>! In T275437#6850970, @IKhitron wrote: >>>! In T275437#6850610, @Urbanecm_WMF wrote: >>>>! In T275437#6850596, @Pchelolo wrote: >>> This ticket is not a UBN. We've had issue... 
[23:37:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1286.eqiad.wmnet [23:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:04] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.233 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [23:43:16] dpifke: on webperf1002/2002 service excimer-log is marked as failed [23:43:53] Looking. [23:44:23] ack,tx [23:47:03] Is https://maps.wikimedia.org/ down intentionally? [23:47:18] Oh works now xD [23:47:37] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@1f3bce1]: Revert https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/664600 [23:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:42] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@1f3bce1]: Revert https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/664600 (duration: 00m 05s) [23:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:09] Oh it never was [23:48:17] Interesting thing [23:48:56] I clicked my own link via Facebook, and got that tiles can only be shown to wikimedia sites blah blah blah error [23:49:58] Cladis: indeed. The SREs decided to restrict maps.wikimedia.org only to Wikimedia projects and official affiliates [23:52:16] this is because the maps cluster is underprovisioned (in terms of both people, and hardware), and the SREs came to a conclusion that we can't afford to support a map server for the public internet. Prior to this change, much of the maps traffic was non-WM related, including for-profit companies. 
[23:52:23] see https://phabricator.wikimedia.org/T261424 [23:52:46] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Base) Perhaps a known issue and not strictly related to this ticket, but it is referenced in the error, is that when I follow... [23:52:48] !log stat1004 - systemctl reset-failed to clear icinga alerts for systemd state caused by jupyterhub singleuser services [23:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:36] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:48] !log stat1007 - same problem and alerts as stat1004 [23:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:59] ryankemper: kibana.service on relforge1004 is failed, if that is you and (not) known [23:57:13] just clearing icinga systemd alerts here and above [23:57:52] (03PS5) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [23:59:21] !log logstash2031 - systemctl reset-failed [23:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:46] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state