[00:00:20] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:22] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:48] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:56] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:01:42] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:38] (CR) Tim Starling: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: Marostegui)
[02:38:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:42:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:08:50] (PS3) Bmansurov: Add recommendation-api helmfile stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230)
[03:11:25] (CR) Bmansurov: "Thanks. I was able to get the logs."
[deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[03:39:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:41:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:42:05] (PS1) Andrew Bogott: ns-recursor1.openstack.codfw1dev.wikimediacloud.org to 208.80.153.118 [dns] - https://gerrit.wikimedia.org/r/606848
[03:43:16] (CR) Andrew Bogott: [C: +2] ns-recursor1.openstack.codfw1dev.wikimediacloud.org to 208.80.153.118 [dns] - https://gerrit.wikimedia.org/r/606848 (owner: Andrew Bogott)
[04:11:45] Operations, MediaWiki-General, Patch-For-Review, Sustainability (Incident Prevention): Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378 (tstarling) Open→Resolved Considering that the connection errors a...
[04:21:18] Operations, MediaWiki-General, serviceops, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (tstarling) Open→Resolved All done, I think.
[04:47:03] (PS1) Marostegui: db1134: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606849 (https://phabricator.wikimedia.org/T254462)
[04:48:10] (CR) Marostegui: [C: +2] db1134: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606849 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[04:48:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11612 and previous config saved to /var/cache/conftool/dbconfig/20200622-044853-marostegui.json
[04:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:22] !log Deploy schema change on s3 primary master with a big sleep between wikis - T250066
[04:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:33] T250066: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066
[04:52:33] (CR) Marostegui: [C: -2] "Thank you Tim!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: Marostegui)
[04:59:55] (CR) Marostegui: "Let's get a compiler run just to be sure this works as intended" [puppet] - https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[05:03:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11613 and previous config saved to /var/cache/conftool/dbconfig/20200622-050259-marostegui.json
[05:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11614 and previous config saved to /var/cache/conftool/dbconfig/20200622-051720-marostegui.json
[05:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1134', diff saved to https://phabricator.wikimedia.org/P11615 and previous config saved to /var/cache/conftool/dbconfig/20200622-051730-marostegui.json
[05:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1134', diff saved to https://phabricator.wikimedia.org/P11616 and previous config saved to /var/cache/conftool/dbconfig/20200622-053104-marostegui.json
[05:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:09] (PS1) Marostegui: install_server: Reimage db1118 to Buster.
[puppet] - https://gerrit.wikimedia.org/r/606850 (https://phabricator.wikimedia.org/T254462)
[05:33:35] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1118 for reimage and InnoDB compression', diff saved to https://phabricator.wikimedia.org/P11617 and previous config saved to /var/cache/conftool/dbconfig/20200622-053334-marostegui.json
[05:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:53] (CR) Marostegui: [C: +2] install_server: Reimage db1118 to Buster. [puppet] - https://gerrit.wikimedia.org/r/606850 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[05:43:55] !log Stop haproxy on dbproxy1008 - T255406
[05:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:59] T255406: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406
[05:46:19] Operations, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui)
[05:49:04] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[05:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:37] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:45] !log Compress InnoDb on db1118 T254462
[05:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:49] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462
[06:03:05] (PS1) Marostegui: dbstore1005: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606851 (https://phabricator.wikimedia.org/T254870)
[06:04:31] (PS2) Marostegui: install_server: Remiage dbstore1005 to Buster [puppet] - https://gerrit.wikimedia.org/r/606851 (https://phabricator.wikimedia.org/T254870)
[06:05:10] (CR) Marostegui: [C: +2]
install_server: Remiage dbstore1005 to Buster [puppet] - https://gerrit.wikimedia.org/r/606851 (https://phabricator.wikimedia.org/T254870) (owner: Marostegui)
[06:06:09] !log Stop MySQL on dbstore1005 for reimage to Buster - T254870
[06:06:11] elukey: ^
[06:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:14] T254870: Upgrade analytics dbstore databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254870
[06:24:01] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:36] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:42] (PS1) Marostegui: dbstore1005: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606853 (https://phabricator.wikimedia.org/T254870)
[06:41:53] (CR) Marostegui: [C: +2] dbstore1005: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606853 (https://phabricator.wikimedia.org/T254870) (owner: Marostegui)
[06:46:56] (CR) Alexandros Kosiaris: [C: +1] "Isn't it weird that we call the protocol "replication" but the roles master/slave? Anyway, LGTM" [puppet] - https://gerrit.wikimedia.org/r/606801 (owner: CDanis)
[06:56:18] (CR) Alexandros Kosiaris: "May I ask what the configuration changes were? We probably need to amend the chart to account for those."
[deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[06:59:55] (CR) Marostegui: "MySQL calls them master/slave: https://dev.mysql.com/doc/refman/8.0/en/replication.html and so does mariadb https://mariadb.com/kb/en/sett" [puppet] - https://gerrit.wikimedia.org/r/606801 (owner: CDanis)
[07:02:32] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) Emailed AMS-IX NOC to schedule turning up LACP on that link (if able).
[07:03:36] (CR) Alexandros Kosiaris: [C: +1] "> MySQL calls them master/slave: https://dev.mysql.com/doc/refman/8.0/en/replication.html and so does mariadb https://mariadb.com/kb/en/se" [puppet] - https://gerrit.wikimedia.org/r/606801 (owner: CDanis)
[07:07:51] (CR) ArielGlenn: "> MySQL calls them master/slave: https://dev.mysql.com/doc/refman/8.0/en/replication.html" [puppet] - https://gerrit.wikimedia.org/r/606801 (owner: CDanis)
[07:13:36] (PS1) Marostegui: db1117: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606943
[07:13:38] (CR) Jcrespo: [C: +1] s/slave/replica/ in visible parts of MariaDB alerts [puppet] - https://gerrit.wikimedia.org/r/606801 (owner: CDanis)
[07:15:27] (CR) Marostegui: [C: +2] db1117: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606943 (owner: Marostegui)
[07:16:21] !log Reimage db1117 (irc haproxy alerts will be triggered)
[07:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:34] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:20:08] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:20:22] (Abandoned) Alexandros Kosiaris: sudo: recursively manage /etc/sudoers.d [puppet] -
https://gerrit.wikimedia.org/r/180513 (owner: Faidon Liambotis)
[07:20:32] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:20:40] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:21:13] (CR) Muehlenhoff: [C: +2] Update various references/comments to jessie [puppet] - https://gerrit.wikimedia.org/r/606688 (owner: Muehlenhoff)
[07:21:53] ^ all those haproxy alerts are expected
[07:23:06] ok
[07:23:24] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:23:24] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:23:40] just to be sure, we haven't had any (sre team wide) actual pages in the last couple days right?
because I've got none either email or sms
[07:24:15] apergos: we had one during the weekend but not in EU tz
[07:24:27] ahhhh that's why I didn't get it, whew
[07:24:37] I've gotten pages from victorops before but still paranoid
[07:30:24] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[07:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:00] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:34:39] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:35:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:36:29] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:36:53] marostegui: ^ looks like a lot of dbproxy servers are in critical state
[07:36:55] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:01] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:29] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:35] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:47] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:37:57] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:38:33] kormat: yeah, I mentioned earlier, it was expected due to the reimage of db1117
[07:38:58] ahh ok. i didn't read the scrollback
[07:41:09] (PS1) Marostegui: db1117: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606950 (https://phabricator.wikimedia.org/T254556)
[07:43:11] Operations, ops-codfw, DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Kormat) a:Papaul
[07:43:38] Operations, ops-codfw, DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Kormat) Hi @Papaul, can we get this disk replaced please? It should still be under warranty with dell.
[07:43:53] Operations, ops-codfw, DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Kormat) idrac logs are here: https://phabricator.wikimedia.org/P11619
[07:45:01] (CR) Hashar: [C: +1] Switch CI to profile::java (1 comment) [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[07:45:29] (PS6) Hashar: Switch CI to profile::java [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[07:45:44] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[07:47:20] (CR) Marostegui: [C: +2] db1117: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606950 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[07:50:03] (CR) Hashar: [C: +1] gerrit: Add option to mark gerrit servers as upgraded [puppet] - https://gerrit.wikimedia.org/r/606530 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[07:50:27] (CR) Hashar: [C: +1] gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[07:53:36] (PS1) Ammarpad: Cleanup some redundant rows.
[mediawiki-config] - https://gerrit.wikimedia.org/r/606951
[07:53:51] (CR) Hashar: "Those authorized_keys and known_hosts, there is a better puppet friendly way to do it using sshkey {} as we found out for CI with https://" [puppet] - https://gerrit.wikimedia.org/r/606532 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[07:54:02] (CR) Hashar: [C: +1] gerrit: Add dedicated home dir for new Gerrit version [puppet] - https://gerrit.wikimedia.org/r/606532 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[07:54:41] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:54:57] (CR) Hashar: [C: +1] gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[07:56:22] (CR) Hashar: [C: +1] gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606533 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[07:57:29] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:59:47] Operations, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:00:24] (PS1) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556)
[08:02:09] !log upload trafficserver 8.0.8~rc0-1wm1 to apt.wm.o (buster)
[08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:36] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[08:02:50] !log upgrade to trafficserver
8.0.8~rc0-1wm1 on cp4026 and cp4032
[08:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:19] RECOVERY - mysqld processes #page on db1088 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:04:47] RECOVERY - MariaDB read only s6 on db1088 is OK: Version 10.1.43-MariaDB, Uptime 104s, read_only: True, read_only: True, 15.56 QPS, connection latency: 0.004054s, query latency: 0.001311s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:06:04] Operations, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[08:11:48] (PS1) ArielGlenn: the dumps exception checker should not start if it's already running [puppet] - https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856)
[08:12:09] (CR) jerkins-bot: [V: -1] the dumps exception checker should not start if it's already running [puppet] - https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856) (owner: ArielGlenn)
[08:13:24] !log extend prometheus codfw ops filesystem to 1TB
[08:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:44] PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100%
[08:19:45] ACKNOWLEDGEMENT - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100% Kormat Known
[08:19:52] RECOVERY - Host db1088 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[08:19:56] dear icinga - i had already downtime'd that host :/
[08:20:14] (PS1) JMeybohm: WIP: chartmusum: Add initial module, profile and role [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843)
[08:20:47] (PS1) Filippo Giunchedi: smart: ignore stderr from facter [puppet] - https://gerrit.wikimedia.org/r/606957
[08:21:27] (CR) Filippo
Giunchedi: "Fixes smart-data-dump on Ganeti hosts" [puppet] - https://gerrit.wikimedia.org/r/606957 (owner: Filippo Giunchedi)
[08:21:29] (PS2) ArielGlenn: the dumps exception checker should not start if it's already running [puppet] - https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856)
[08:27:59] RECOVERY - MariaDB Slave IO: s6 #page on db1088 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:28:18] (CR) Hashar: [C: +1] "I have just added a dependency between jenkins and the java profile class to ensure Java gets installed before Jenkins. Else it will fail " [puppet] - https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: Muehlenhoff)
[08:28:21] RECOVERY - MariaDB Slave SQL: s6 #page on db1088 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:29:13] (PS3) Alexandros Kosiaris: wikifeeds: Enabling paging [puppet] - https://gerrit.wikimedia.org/r/537135 (https://phabricator.wikimedia.org/T170455)
[08:30:57] Operations, ops-eqiad, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat)
[08:32:50] Operations, ops-eqiad, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) DCOps: The BBU on this machine has failed. Do you have a spare BBU in the DC, or if not, can we please order a replacement? Cheers. ` /system1/log1/record7 Targets Properties number=7 severity...
[08:33:50] !log reimaging cumin1001 to buster T245114
[08:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:55] T245114: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114
[08:36:45] Operations, ops-eqiad, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Mysql is started and catching up on replication.
Once that's completed we'll perform a data consistency check.
[08:40:05] (PS4) Alexandros Kosiaris: wikifeeds: Enabling paging [puppet] - https://gerrit.wikimedia.org/r/537135 (https://phabricator.wikimedia.org/T170455)
[08:41:16] (CR) Alexandros Kosiaris: [C: +2] wikifeeds: Enabling paging [puppet] - https://gerrit.wikimedia.org/r/537135 (https://phabricator.wikimedia.org/T170455) (owner: Alexandros Kosiaris)
[08:43:32] Operations, ops-eqiad, serviceops, Sustainability (Incident Prevention): (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (akosiaris)
[08:43:49] Operations, ops-codfw, serviceops, Sustainability (Incident Prevention): (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (akosiaris)
[08:44:29] (PS1) Filippo Giunchedi: pontoon: first skeleton [puppet] - https://gerrit.wikimedia.org/r/606961
[08:45:19] (CR) jerkins-bot: [V: -1] pontoon: first skeleton [puppet] - https://gerrit.wikimedia.org/r/606961 (owner: Filippo Giunchedi)
[08:45:23] (PS1) Kormat: mariadb: Silence notifications for db1088 [puppet] - https://gerrit.wikimedia.org/r/606962 (https://phabricator.wikimedia.org/T255927)
[08:46:25] (CR) Marostegui: [C: +1] mariadb: Silence notifications for db1088 [puppet] - https://gerrit.wikimedia.org/r/606962 (https://phabricator.wikimedia.org/T255927) (owner: Kormat)
[08:46:54] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.04169 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[08:49:19] (CR) Kormat: [C: +2] mariadb: Silence notifications for db1088 [puppet] - https://gerrit.wikimedia.org/r/606962 (https://phabricator.wikimedia.org/T255927) (owner: Kormat)
[08:49:57] akosiaris: seems that the puppet
failures above are related to your change
[08:50:01] Function Call, 'wmflib::service::validate' parameter 'catalog' entry 'wikifeeds' entry 'monitoring' expects a value for key 'critical'
[08:52:14] ah, it doesn't default anymore to true. Ok fixing
[08:53:22] at least for once the error message was self-explanatory :D
[08:54:44] (PS1) Alexandros Kosiaris: Revert "wikifeeds: Enabling paging" [puppet] - https://gerrit.wikimedia.org/r/606964
[08:55:02] (CR) Alexandros Kosiaris: [C: +2] Revert "wikifeeds: Enabling paging" [puppet] - https://gerrit.wikimedia.org/r/606964 (owner: Alexandros Kosiaris)
[08:55:48] volans: fix merged
[08:56:35] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:56] thanks!
[08:56:59] (PS2) Filippo Giunchedi: pontoon: first skeleton [puppet] - https://gerrit.wikimedia.org/r/606961
[08:57:51] (CR) jerkins-bot: [V: -1] pontoon: first skeleton [puppet] - https://gerrit.wikimedia.org/r/606961 (owner: Filippo Giunchedi)
[08:58:41] Operations, ORES, Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (elukey) @Halfak ping :)
[08:59:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:47] Operations, ops-codfw, DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Marostegui) p:Triage→Medium
[09:10:21] (CR) Filippo Giunchedi: "> Patch Set 2: Verified-1" [puppet] - https://gerrit.wikimedia.org/r/606961 (owner: Filippo Giunchedi)
[09:16:34] (PS1) Marostegui: install_server: Do not allow db1140 reimage [puppet] - https://gerrit.wikimedia.org/r/606967
[09:17:10] (CR) Ladsgroup: betacluster: Apply global abuse filters from metawiki
instead of deploymentwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:21:01] (CR) RhinosF1: betacluster: Apply global abuse filters from metawiki instead of deploymentwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:21:20] Amir1: https://meta.wikimedia.beta.wmflabs.org/wiki/Special:MyLanguage/Main_Page
[09:21:43] RhinosF1: aaaah, metawiki in beta cluster
[09:21:50] Got it, it's confusing
[09:22:00] Amir1: no problen
[09:22:06] * RhinosF1 can't spell
[09:22:56] (CR) Ladsgroup: [C: +2] betacluster: Apply global abuse filters from metawiki instead of deploymentwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:23:46] (Merged) jenkins-bot: betacluster: Apply global abuse filters from metawiki instead of deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:23:47] beta cluster patches can go in outside of BACC, specially to make it less crowded
[09:24:06] (CR) Ladsgroup: [C: +2] "noop for production" [mediawiki-config] - https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:25:03] (PS2) Ladsgroup: betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:25:16] (CR) Ladsgroup: [C: +2] betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:26:03] (Merged) jenkins-bot: betacluster: Apply Global Blocks at metawiki instead of
deploymentwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[09:27:11] what happening? :D
[09:27:24] just came back from lunch
[09:27:33] Majavah: magic
[09:28:06] * Majavah reads backscroll
[09:28:11] Patches in beta cluster can go in outside of BACC window (specially right now there are lots of patches waiting there)
[09:28:50] we have another problem, I'm rebasing them on deploy1001 and there's a unrebased patch from Friday by ottomata
[09:29:03] cc elukey it's this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606749
[09:29:11] do you want to deploy it?
[09:30:15] (PS3) Filippo Giunchedi: pontoon: first skeleton [puppet] - https://gerrit.wikimedia.org/r/606961
[09:30:29] (CR) Filippo Giunchedi: [C: +2] logstash: bump pipeline workers [puppet] - https://gerrit.wikimedia.org/r/606647 (https://phabricator.wikimedia.org/T255243) (owner: Filippo Giunchedi)
[09:31:34] Amir1: thanks for the ping, I am not sure what Andrew was deploying, to be sure let's not do it
[09:31:57] !log roll-restart logstash in codfw/eqiad to apply configuration change
[09:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:30] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.0006317 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:32:35] elukey: sure can I revert it then?
[09:32:47] because otherwise it'll be deployed by the next sync :D
[09:32:47] Amir1: sorry I am checking the SAL, I saw https://sal.toolforge.org/log/Zz3GzXIBv7KcG9M-DEUa
[09:33:25] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1094 for reimage', diff saved to https://phabricator.wikimedia.org/P11621 and previous config saved to /var/cache/conftool/dbconfig/20200622-093323-marostegui.json
[09:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:28] haha, I bet it didn't work (I do it sometimes, I forget to rebase and deploy)
[09:34:25] Amir1: ah ok, I am a little ignorant about how to deploy mw stuff, if it didn't get deployed let's revert and wait for Andrew
[09:34:29] is there a way to check?
[09:34:39] (PS1) Marostegui: install_server: Reimage db1094 to Buster [puppet] - https://gerrit.wikimedia.org/r/606970
[09:35:09] (PS2) Jforrester: Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/606295 (owner: Bartosz Dziewoński)
[09:35:11] (PS1) Jforrester: dblists: Introduce lockeddown, to replace nonbetafeatures [mediawiki-config] - https://gerrit.wikimedia.org/r/606971
[09:35:13] (PS1) Jforrester: Switch uses from nonbetafeatures to lockeddown [mediawiki-config] - https://gerrit.wikimedia.org/r/606972
[09:35:15] (PS1) Jforrester: dblists: Drop nonbetafeatures, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/606973
[09:35:23] (CR) Marostegui: [C: +2] install_server: Reimage db1094 to Buster [puppet] - https://gerrit.wikimedia.org/r/606970 (owner: Marostegui)
[09:35:38] elukey: on deploy1001, it's "git log -p HEAD..@{u}"
[09:35:55] (PS3) ArielGlenn: the dumps exception checker should not start if it's already running [puppet] - https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856)
[09:36:08] (CR) jerkins-bot: [V: -1] dblists: Introduce lockeddown, to replace
nonbetafeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606971 (owner: 10Jforrester) [09:36:12] (03CR) 10jerkins-bot: [V: 04-1] Switch uses from nonbetafeatures to lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606972 (owner: 10Jforrester) [09:36:26] Hey, everything OK? [09:36:37] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) [09:36:48] (03PS1) 10Ladsgroup: Revert "Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606974 [09:36:55] (03CR) 10Ladsgroup: [C: 03+2] Revert "Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606974 (owner: 10Ladsgroup) [09:37:12] Amir1: while you're at it, can you merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606755 too [09:37:14] Amir1: nono I mean if it got deployed [09:37:29] if not, let's revert it [09:37:50] (03Merged) 10jenkins-bot: Revert "Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606974 (owner: 10Ladsgroup) [09:37:58] ack :) [09:38:00] thanks a lot! [09:38:03] ottomata: --^ [09:38:13] (03PS2) 10Jforrester: dblists: Introduce lockeddown, to replace nonbetafeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606971 [09:38:14] elukey: I'd say either check it on your side (If you're receiving schema with new version I guess?) or login to a random mw node and check /srv/mediawiki/wmf-config/IS.php [09:38:43] James_F: yeah, someone forgot to rebase mw config on deploy1001 and now we have merged but undeployed changes [09:38:51] Fun. :-( [09:38:59] Need me to help? [09:39:19] Amir1: yes I can do it, I asked some help because I have zero knowledge of the part that andrew is following :) [09:39:31] reverted it and we all are good [09:39:36] OK. 
[09:40:01] elukey: me neither :( I have no idea what these schemas are. Sorry :( [09:40:08] Majavah: let's see [09:41:15] (03PS4) 10ArielGlenn: the dumps exception checker should not start if it's already running [puppet] - 10https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856) [09:41:22] (03CR) 10Ladsgroup: [C: 03+2] betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah) [09:41:28] (03PS4) 10Ladsgroup: betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah) [09:41:35] (03CR) 10Ladsgroup: [C: 03+2] betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah) [09:42:23] (03Merged) 10jenkins-bot: betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah) [09:43:04] (03PS5) 10ArielGlenn: the dumps exception checker should not start if it's already running [puppet] - 10https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856) [09:43:08] some days I should just go back to bed :-/ [09:44:02] Majavah: done, thanks! 
[09:45:11] (03PS2) 10Jforrester: Switch uses from nonbetafeatures to lockeddown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606972 [09:45:13] (03PS2) 10Jforrester: dblists: Drop nonbetafeatures, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606973 [09:45:15] (03PS3) 10Jforrester: Use 'lockeddown' dblist more instead of listing both wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606295 (owner: 10Bartosz Dziewoński) [09:45:55] (03PS6) 10ArielGlenn: the dumps exception checker should not start if it's already running [puppet] - 10https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856) [09:46:06] Production now clear? [09:46:15] (03PS1) 10Filippo Giunchedi: prometheus: require class node_exporter for node textfile scripts [puppet] - 10https://gerrit.wikimedia.org/r/606977 [09:46:48] apergos: I get that. The last few days have not been fun here. [09:47:34] (03CR) 10ArielGlenn: [C: 03+2] the dumps exception checker should not start if it's already running [puppet] - 10https://gerrit.wikimedia.org/r/606955 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [09:48:47] beta cluster testwiki (which the last patch just tried to fix) now complains that it can't access the database server :( https://test.wikimedia.beta.wmflabs.org [09:52:31] Majavah: that trace mentions testwikidatawiki - should it not be wikidatawiki [09:52:53] it's working now [09:53:14] RhinosF1: now it works. it was "(Cannot access the database: Cannot access the database: Unknown error (172.16.4.147:3306))" a few moments ago [09:53:23] I saw [09:54:09] The logo is buggy though [09:54:13] (03PS1) 10Elukey: cumin: reduce scope of the 'hadoop' alias [puppet] - 10https://gerrit.wikimedia.org/r/606979 [09:54:18] https://usercontent.irccloud-cdn.com/file/hMhwaO0r/IMG_6043.PNG [09:54:50] * RhinosF1 goes off to file a bug [09:55:02] (03PS7) 10Kormat: mariadb: Add monitoring for lag spikes. 
[puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) [09:55:04] that's a separate problem [09:55:42] (03CR) 10Marostegui: [C: 03+1] mariadb: Add monitoring for lag spikes. [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [09:56:02] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [09:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:39] (03CR) 10Kormat: [C: 03+2] mariadb: Add monitoring for lag spikes. [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [09:58:27] Majavah: filed as https://phabricator.wikimedia.org/T255978 [09:58:40] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:38] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [10:06:13] (03PS2) 10Elukey: [WIP] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [10:07:58] the icinga config error is notify-service|host-by-irc-databases specified for contact irc-databases is not defined anywhere [10:08:12] kormat ^ known/expected? [10:08:14] moritzm: i added that to the private repo before merging the CR [10:08:21] so, not expected by me, at least [10:08:37] ok, this probably needs a puppet run on icinga1001, doing that npw [10:10:00] 🤞 [10:11:34] moritzm: ah. 
sounds like i should have run puppet on icinga1001 between the private change and the puppet repo merge [10:13:25] I guess so, the new contact gets applied by a puppet run and with that not happened it, it complained [10:13:56] icinga-wm leaving and rejoining was the effect of that change, I think it should be okay now [10:15:48] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [10:15:56] moritzm: thanks :) [10:21:56] (03PS1) 10Kormat: Revert "mariadb: Add monitoring for lag spikes." [puppet] - 10https://gerrit.wikimedia.org/r/606984 [10:22:37] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) >>! In T254939#6239427, @BGerdemann wrote: > @jbond Yes, 31 August is correct. I can update this threa... [10:22:40] 10Operations, 10Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) With Robot Comments, you can also provide a [proposed fix](https://gerrit-test.wikimedia.org/r/Documentation/rest-api-changes.html#app... [10:23:22] (03CR) 10Kormat: [C: 03+2] Revert "mariadb: Add monitoring for lag spikes." [puppet] - 10https://gerrit.wikimedia.org/r/606984 (owner: 10Kormat) [10:23:28] (03PS3) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [10:28:50] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606985 (https://phabricator.wikimedia.org/T128546) [10:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T1030). [10:30:05] (03PS2) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606985 (https://phabricator.wikimedia.org/T128546) [10:31:24] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606985 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:32:15] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606985 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:57] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:606985| Bumping portals to master (606985)]] (duration: 01m 12s) [10:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:606985| Bumping portals to master (606985)]] (duration: 00m 56s) [10:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:36] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [10:42:07] (03PS1) 10Hashar: cumin: fix python3 version on Buster [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) [10:42:27] kormat: it's now missing the databases-testing contact group [10:42:40] * kormat facepalms [10:43:22] (03CR) 10Hashar: "Spotted on deployment-cumin.deployment-prep.eqiad.wmflabs:" [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [10:44:06] moritzm: what did i do wrong this time? :) [10:44:45] not sure, is that group referenced in a new patch you merged? 
[10:45:01] it was referenced in the patch which i have reverted [10:45:19] i wonder i should have run puppet on all db hosts after the revert [10:46:14] probably, puppet run on icinga is running, we'll see in a bit :-) [10:46:26] i ran it multiple times on icinga [10:46:37] some of those times it showed a check being removed for specific hosts [10:46:45] (03CR) 10Volans: [C: 04-1] "You need also the component for the tqdm version, see for example:" [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [10:49:04] (I also ran puppet on puppetmaster) [10:49:38] (03PS2) 10Ammarpad: IS: Cleanup some redundant rows. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606951 [10:53:50] (03CR) 10RhinosF1: [C: 03+1] IS: Cleanup some redundant rows. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606951 (owner: 10Ammarpad) [10:54:47] (03PS1) 10Arturo Borrero Gonzalez: UNTESTED: openstack: neutron: refresh API policy to allow port management [puppet] - 10https://gerrit.wikimedia.org/r/606991 (https://phabricator.wikimedia.org/T255670) [10:55:58] ok, i've run puppet on all db hosts. re-running it once last time (hopefully) on icinga. [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T1100). Please do the needful. [11:00:04] Jhs, Majavah, VulpesVulpes825, and Ammarpad: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:10] o/ [11:00:43] (03PS1) 10ArielGlenn: alert on high load on xml dumps nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/606994 (https://phabricator.wikimedia.org/T254856) [11:01:17] At a meeting right now :( [11:01:42] (03CR) 10jerkins-bot: [V: 04-1] alert on high load on xml dumps nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/606994 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [11:02:01] same meeting [11:03:11] i can deploy today [11:03:50] (03PS1) 10Mvolz: Revert "Allow generic params to be passed to getWikitextFragment" [extensions/VisualEditor] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606995 (https://phabricator.wikimedia.org/T255785) [11:03:59] (03CR) 10Arturo Borrero Gonzalez: "This patch is untested, just a PoC. For additional context and evaluation, please see https://phabricator.wikimedia.org/T255670" [puppet] - 10https://gerrit.wikimedia.org/r/606991 (https://phabricator.wikimedia.org/T255670) (owner: 10Arturo Borrero Gonzalez) [11:04:17] (03CR) 10Urbanecm: [C: 03+2] Disable NS_USER(_TALK) search engine indexing on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606374 (https://phabricator.wikimedia.org/T255538) (owner: 10Majavah) [11:04:27] Jhs: VulpesVulpes825: Are you around? 
[11:04:35] (03PS1) 10Marostegui: db1094: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606996 [11:04:48] Urbanecm: Yes [11:05:05] (03Merged) 10jenkins-bot: Disable NS_USER(_TALK) search engine indexing on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606374 (https://phabricator.wikimedia.org/T255538) (owner: 10Majavah) [11:05:20] (03PS2) 10ArielGlenn: alert on high load on xml dumps nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/606994 (https://phabricator.wikimedia.org/T254856) [11:05:30] Majavah: your patch is available at mwdebug1001 [11:05:46] Urbanecm: working, thanks [11:05:50] syncing [11:07:03] (03CR) 10Marostegui: [C: 03+2] db1094: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606996 (owner: 10Marostegui) [11:07:05] Too late to add stuff to european midday backport window? Maybe morning sf instead? [11:07:26] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: defa81e: Disable NS_USER(_TALK) search engine indexing on trwiki (T255538) (duration: 00m 58s) [11:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:32] T255538: Disable search engine indexing (with noindex) in specific namespaces of Turkish Wikipedia - https://phabricator.wikimedia.org/T255538 [11:07:40] mvolz: no, add it to the calendar :) [11:08:08] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P11622 and previous config saved to /var/cache/conftool/dbconfig/20200622-110806-marostegui.json [11:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:10] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:08:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 567 (alerts on 50) - 
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:09:04] (03CR) 10jerkins-bot: [V: 04-1] Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:09:16] VulpesVulpes825: please look at the jenkins bot response :) [11:09:30] (03CR) 10Urbanecm: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:09:48] (and my comments) [11:09:49] moritzm: i've run all the puppets, but icinga still isn't happy [11:09:59] Urbanecm: Interesting, checking console and your comment. [11:10:06] (03PS2) 10Urbanecm: Add import sources for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604642 (https://phabricator.wikimedia.org/T255098) (owner: 10Jon Harald Søby) [11:10:10] kormat: let me check the current error message [11:10:22] moritzm: where do you look for that btw? 
[11:10:34] * Urbanecm is going to deploy Jhs's patch now [11:10:40] (03PS2) 10Jbond: admin: add shell account for lmata and add to ops group [puppet] - 10https://gerrit.wikimedia.org/r/603950 (https://phabricator.wikimedia.org/T254818) [11:10:44] (03CR) 10Urbanecm: [C: 03+2] Add import sources for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604642 (https://phabricator.wikimedia.org/T255098) (owner: 10Jon Harald Søby) [11:10:50] 10Operations, 10netops, 10cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (10aborrero) [11:11:20] Urbanecm, yeah, I'm around, sorry [11:11:24] kormat: that's sudo /usr/sbin/icinga -v /etc/icinga/icinga.conf on the active Icinga host [11:11:28] forgot the time [11:11:38] Jhs: no problem, waiting for your patch to get merged now :) [11:11:39] (03Merged) 10jenkins-bot: Add import sources for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/604642 (https://phabricator.wikimedia.org/T255098) (owner: 10Jon Harald Søby) [11:11:43] currently icinga1001, wherever the "icinga" CNAME currently points to [11:11:46] awesome Urbanecm :) [11:12:01] Urbanecm: can I put a spanner in VulpesVulpes825's patch? [11:12:16] Module/Module Talk translations should be done in the extension repo [11:12:25] wdym RhinosF1 ? [11:12:27] Urbanecm: thanks, done, the revert is merged in master, but the cherry pick to the branch is not yet - do we need to wait for that? [11:13:12] Urbanecm: the aliases for Module namespace on all projects should be in the extension code shouldn't they [11:13:17] mvolz: yes, I'm going to +2 it now, and ping you once merged :) [11:13:21] RhinosF1: yes [11:13:24] RhinosF1, no [11:13:26] (03CR) 10Urbanecm: [C: 03+2] "B&C" [extensions/VisualEditor] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606995 (https://phabricator.wikimedia.org/T255785) (owner: 10Mvolz) [11:13:30] or no? 
[11:13:31] RhinosF1, the translations are already in the repo [11:13:43] RhinosF1: Chinese Project want the namespace to be in English [11:13:48] RhinosF1, but the Chinese Wikimedia projects specifically have chosen to have all their namespace names in English [11:13:53] Ok [11:13:55] probably because of conversion issues [11:13:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10jbond) Hi Leo, Faidon pointed out that this wasn't actully merged. it is now and should be working sorry for the confusion/delay [11:14:02] ah, so repo has Chinese, and config in English, right? [11:14:06] yup [11:14:18] gotcha, going to merge when jenkins is ready [11:14:27] So Wikimedia will be neutral on Simplified Chinese or Traditional Chinese [11:14:28] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:14:50] Jhs: sorry, at mwdebug1001 now :) [11:14:59] Urbanecm: Fixing the extra line on 8897 [11:15:04] ack [11:15:13] Urbanecm, verified [11:15:17] thanks, syncing [11:15:58] (03PS4) 10VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) [11:16:10] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:16:34] (03CR) 10Urbanecm: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:16:38] !log urbanecm@deploy1001 
Synchronized wmf-config/InitialiseSettings.php: 1301fd4: Add import sources for gomwiktionary (T255098) (duration: 00m 57s) [11:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:42] T255098: Define import sources for the Konkani Wiktionary - https://phabricator.wikimedia.org/T255098 [11:16:44] Jhs: here you go! :-) [11:16:57] great, thank you Urbanecm [11:17:02] no problem! [11:17:09] Ammarpad: around? [11:17:22] (03PS5) 10Urbanecm: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:17:27] (03CR) 10Urbanecm: [C: 03+2] Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:17:37] @Urbanecm Yes [11:17:48] VulpesVulpes825: will ping you once your patch is available at mwdebug1001 for testing [11:18:06] (03PS3) 10Ammarpad: IS: Cleanup some redundant rows. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606951 [11:18:13] (03Merged) 10jenkins-bot: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [11:18:20] (you need to install an extension to be able to do so, see https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions) [11:18:22] VulpesVulpes825: ^ [11:18:41] Urbanecm: Reading instruction... 
[11:19:23] Ammarpad: ack [11:19:34] VulpesVulpes825: your patch is at mwdebug1001 now [11:20:02] Urbanecm: Testing [11:20:35] (03PS2) 10Hashar: cumin: fix labs install on Buster [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) [11:21:25] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [11:22:32] Urbanecm: LGTM, all Simplified Chinese and Traditional Chinese module name is working and display in English in the url. [11:22:43] excellent, going to sync the patch then! [11:22:45] thanks [11:23:00] (03CR) 10Hashar: "Stupid puppet does not downgrade even with priority 1002.." [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [11:23:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I'll merge this in a bit." [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [11:24:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: db952ba: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects (T165593) (duration: 00m 56s) [11:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:21] T165593: Modification of the default alias for namespace 828 "模块:" of Zh Projects - https://phabricator.wikimedia.org/T165593 [11:24:21] VulpesVulpes825: done :) [11:24:52] Urbanecm: Thank you so much and sorry for the additional blank line. 
[11:24:52] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P11623 and previous config saved to /var/cache/conftool/dbconfig/20200622-112451-marostegui.json [11:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:04] no problem, it happens :) [11:25:21] (03PS4) 10Urbanecm: IS: Cleanup some redundant rows. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606951 (owner: 10Ammarpad) [11:25:54] Ammarpad: going to deploy yours now, once synced, please spot-check the value didn't change somewhere :) [11:26:02] (03CR) 10Urbanecm: [C: 03+2] IS: Cleanup some redundant rows. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606951 (owner: 10Ammarpad) [11:26:53] (03Merged) 10jenkins-bot: IS: Cleanup some redundant rows. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606951 (owner: 10Ammarpad) [11:26:56] Urbanecm OK [11:29:01] !log Run namespaceDupes.php for zh* projects (T165593) [11:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:58] (03Merged) 10jenkins-bot: Revert "Allow generic params to be passed to getWikitextFragment" [extensions/VisualEditor] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606995 (https://phabricator.wikimedia.org/T255785) (owner: 10Mvolz) [11:30:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 74e8295: IS: Cleanup some redundant rows (duration: 00m 56s) [11:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:21] (03CR) 10Volans: [C: 03+1] "LGTM, thx for the fix" [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [11:31:51] (03CR) 10Hashar: "Cherry picked for both deployment-prep and integration. 
The cumin instances now each have python3-tqdm 4.23.4-1~wmf1" [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [11:33:38] mvolz: i see it's merged now, do you want to deploy it yourself, or should I? [11:34:02] !log marostegui@cumin2001 dbctl commit (dc=all): 'Slowly repool db1094', diff saved to https://phabricator.wikimedia.org/P11625 and previous config saved to /var/cache/conftool/dbconfig/20200622-113401-marostegui.json [11:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:06] Urbanecm: I have no idea how to deploy it :) [11:34:15] ah, I'll do it then :) [11:35:23] mvolz: could you test it at mwdebug1001, please? [11:37:23] !log volans@deploy1001 Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster [11:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:52] !log volans@deploy1001 Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster (duration: 00m 28s) [11:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:01] Urbanecm: I'm not exactly sure how to do that either - does it only do curl requests? [11:38:12] mvolz: you can do that in your browser [11:38:30] at https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions, there are extensions for both Chrome and Firefox [11:40:14] you need select the debug server there, and make sure the error isn't present there, just as you would do at live servers. Does that make sense? [11:40:58] !log draining ganeti2008 for eventual reboot [11:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:40] mwdebug1001.eqiad.wmnet gives me server ip address can't be found. Is that the right address? I do have the browser extension, and have selected 1001... Urbanecm still pretty confused :). 
[11:43:21] mvolz: you install the extension, select mwdebug1001, and then go to (for example) en.wikipedia.org and test [11:43:54] ok, I've done that and it still seems broken to me - :/ [11:44:31] oh wait [11:44:41] yes? [11:44:48] no it works. just had have to refresh js cache [11:44:53] 10Operations, 10Patch-For-Review: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10hashar) I have: * created `deployment-cumin.deployment-prep.eqiad.wmflabs` and `integration-cumin.integration.eqiad.wmflabs`. * Cherry picked a python3.7 fix from https://gerrit.wikimedia.org/r/#/c/op... [11:44:54] give me a few more seconds to make sure for sure [11:45:02] mvolz: sure, ping me once you're sure :) [11:45:18] (03PS4) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [11:45:56] !log marostegui@cumin2001 dbctl commit (dc=all): 'Fully repool db1094', diff saved to https://phabricator.wikimedia.org/P11627 and previous config saved to /var/cache/conftool/dbconfig/20200622-114554-marostegui.json [11:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:53] Urbanecm: ok everything looks good to me. Regression is fixed with mwdebug1001 and still very broken with it off :). [11:48:11] (03CR) 10Muehlenhoff: [C: 03+2] cumin: fix labs install on Buster [puppet] - 10https://gerrit.wikimedia.org/r/606986 (https://phabricator.wikimedia.org/T245114) (owner: 10Hashar) [11:48:12] mvolz: okay, great. 
Going to sync that now :) [11:48:55] great [11:49:50] (03CR) 10Muehlenhoff: cumin: reduce scope of the 'hadoop' alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606979 (owner: 10Elukey) [11:50:07] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.37/extensions/VisualEditor/modules/: Backport: 0a08066: Revert "Allow generic params to be passed to getWikitextFragment" (T255785) (duration: 00m 58s) [11:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:11] T255785: Template insertion fails in 2017 wikitext editor (including citation templates via citoid) - https://phabricator.wikimedia.org/T255785 [11:50:18] mvolz: it should be live now :) [11:51:04] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "What the …? I don't find it acceptable to silently remove a -1 that is well explained." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [11:51:35] Urbanecm: woo! looks good! Thank you! [11:51:43] happy to help! [11:53:08] !log EU B&C window done [11:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:00] 10Operations, 10Patch-For-Review: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10hashar) On integration I am hitting a wall, the keyholder refuses to grant access: For integration, the agent refuses to sign: `counterexample integration-cumin:~$ sudo -H SSH_AUTH_SOCK=/run/keyholde... [11:59:05] puppet all settled out now? 
[12:00:38] (03CR) 10Elukey: cumin: reduce scope of the 'hadoop' alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606979 (owner: 10Elukey) [12:02:46] (03PS5) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [12:05:50] (03PS2) 10Elukey: cumin: reduce scope of the 'hadoop' alias [puppet] - 10https://gerrit.wikimedia.org/r/606979 [12:06:02] (03PS3) 10ArielGlenn: alert on high load on xml dumps nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/606994 (https://phabricator.wikimedia.org/T254856) [12:07:47] (03CR) 10ArielGlenn: [C: 03+2] alert on high load on xml dumps nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/606994 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn) [12:12:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606979 (owner: 10Elukey) [12:17:03] (03CR) 10Gilles: "It's not well explained, it's a series of vague statements not repeated here in the context of the patch. The suggestion of exploring a gr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [12:18:05] (03CR) 10Elukey: [C: 03+2] cumin: reduce scope of the 'hadoop' alias [puppet] - 10https://gerrit.wikimedia.org/r/606979 (owner: 10Elukey) [12:21:18] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10Volans) @hashar I've `systemctl restart keyholder-proxy.service` and ssh seems to work fine now and I can run cumin too. 
[12:22:17] !log volans@deploy1001 Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster [12:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] !log volans@deploy1001 Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster (duration: 00m 03s) [12:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:50] !log volans@deploy1001 Started deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster [12:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:15] !log volans@deploy1001 Finished deploy [homer/deploy@e9acec8]: Release v0.2.3 on cumin1001 now on buster (duration: 01m 25s) [12:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:31] !log failover logstash2023 from ganeti2007->ganeti2023 for migration_downtime change to apply [12:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:24] 10Operations, 10SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10lmata) Thank you! [12:38:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:04] (03CR) 10Elukey: [C: 04-1] Switch CI to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [12:39:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add support to pull datapoints from Kafka [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/600295 (owner: 10Elukey) [12:40:22] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "I refuse to respond to something that aggressive." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [12:42:01] (03CR) 10Elukey: "To keep archives happy - the log database has been imported to HDFS and we will not need to keep it around on mariadb, so I am going to ch" [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [12:42:22] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:26] (03PS1) 10Ayounsi: Don't use virtual sub-interfaces for basic interfaces (.0) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/607011 [12:43:25] (03PS2) 10Ayounsi: Don't use virtual sub-interfaces for basic interfaces (.0) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/607011 [12:45:24] !log draining ganeti2007 for eventual reboot [12:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:36] (03PS5) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [12:53:58] !log upgrade to trafficserver 8.0.8~rc0-1wm1 on cp5006 and cp5012 [12:53:58] (03PS6) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:52] (03PS2) 10JMeybohm: WIP: chartmusum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) [12:56:22] (03CR) 10Gilles: "The fact that this is a lossy conversion is right in the title of the patch since the beginning. 
The point of this patch has always been a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [12:58:20] (03PS1) 10Ssingh: wikidough: add comment about private data [puppet] - 10https://gerrit.wikimedia.org/r/607012 (https://phabricator.wikimedia.org/T252132) [13:00:29] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23348/ no code change." [puppet] - 10https://gerrit.wikimedia.org/r/607012 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:04:43] Amir1: hello! [13:05:00] in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/606974 you wrote "never got rebased" [13:05:04] did I miss a step? [13:05:06] (03CR) 10Volans: [C: 03+1] "Looks sane for the limited understanding I have of the data structure that is mangled." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/607011 (owner: 10Ayounsi) [13:05:15] i'm pretty sure I merged it on deploy1001 [13:05:24] is that what you mean or something else? [13:05:57] (03PS1) 10Ottomata: Revert "Revert "Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607013 [13:07:08] (03CR) 10CDanis: [C: 03+2] s/slave/replica/ in visible parts of MariaDB alerts [puppet] - 10https://gerrit.wikimedia.org/r/606801 (owner: 10CDanis) [13:10:21] 10Operations, 10Patch-For-Review: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10hashar) It is complete for `integration`. On deployment-prep I kept around the old instance just in case but I guess I will delete it at the end of this week. [13:10:44] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All cumin hosts in production are now running Buster. 
[13:10:46] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10MoritzMuehlenhoff) [13:10:49] (03PS1) 10Kormat: nagios_common: Add data persistence irc bot config [puppet] - 10https://gerrit.wikimedia.org/r/607014 (https://phabricator.wikimedia.org/T253120) [13:11:26] !log Stop MySQL on db2078 instances [13:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:26] (03PS2) 10Ottomata: Revert "Revert "Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607013 [13:15:59] (03PS1) 10Ottomata: Set wgEventLoggingServiceUri for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607017 (https://phabricator.wikimedia.org/T238230) [13:16:48] (03CR) 10Ottomata: [C: 03+2] Revert "Revert "Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607013 (owner: 10Ottomata) [13:17:39] (03CR) 10Ottomata: [C: 03+2] Set wgEventLoggingServiceUri for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607017 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [13:19:27] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt and set wgEventLoggingServiceUri for all wikis - T238230 (duration: 00m 58s) [13:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:31] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [13:23:14] (03PS1) 10Marostegui: multiinstance.pp: Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607020 (https://phabricator.wikimedia.org/T250666) [13:29:08] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.2.4 [software/homer] - 
10https://gerrit.wikimedia.org/r/607021 [13:32:28] (03CR) 10Jcrespo: "@cdanis Let me know if you need help re-applying the manual alert disabling that will be lost due to the name change." [puppet] - 10https://gerrit.wikimedia.org/r/606801 (owner: 10CDanis) [13:32:45] (03PS1) 10Muehlenhoff: Remove banner now that the server is reimaged [puppet] - 10https://gerrit.wikimedia.org/r/607022 [13:33:18] jynus: I think we'll be okay? for db1088 and db1118 the downtimes are at the host level [13:33:27] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10elukey) >>! In T255520#6235758, @Jclark-ctr wrote: > @elukey 10g or 1g? 1g thanks! [13:33:40] cdanis: I think they are mostly the backup source hosts [13:34:13] You got the list? [13:36:46] (03CR) 10BBlack: [C: 03+1] deploy-check: fix detection of need reload [dns] - 10https://gerrit.wikimedia.org/r/606542 (https://phabricator.wikimedia.org/T255748) (owner: 10Volans) [13:37:52] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607022 (owner: 10Muehlenhoff) [13:37:54] so far I'm yet to find an alert that isn't already hit by a host-level downtime [13:38:03] (03CR) 10Kormat: [C: 03+1] "This was bothering me too :)" [puppet] - 10https://gerrit.wikimedia.org/r/607020 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [13:38:08] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) https://github.com/unbit/uwsgi/issues/2189 [13:38:16] (03CR) 10Marostegui: [C: 03+2] multiinstance.pp: Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607020 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [13:39:31] (03PS1) 10Ssingh: cescout: ensure that /var/run/postgresql exists at boot time [puppet] - 
10https://gerrit.wikimedia.org/r/607025 (https://phabricator.wikimedia.org/T247273) [13:40:14] ah, okay, found one: looks like db2099 for specifically s4 lag was downtimed, but it's currently not firing (lag of 0.12 seconds), and there's no expiry on the downtime?? [13:40:19] (03CR) 10Muehlenhoff: [C: 03+2] Remove banner now that the server is reimaged [puppet] - 10https://gerrit.wikimedia.org/r/607022 (owner: 10Muehlenhoff) [13:40:21] that seems more like a mistake than anything else [13:40:29] not downtime [13:40:35] disabled notifications [13:41:18] there was quite a few strategic ones that we needed [13:41:19] 10Operations, 10DNS, 10Traffic, 10netbox, 10Patch-For-Review: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (10BBlack) +1 on the latest patch, looks like the right fix. As an aside though: > We didn't dare restarting gdnsd. You can generally dare kicking gdnsd pre... [13:41:25] strategic? [13:41:38] yes, because we had no puppet code to replace them [13:42:05] as it needed a complete refactoring [13:42:09] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23350/" [puppet] - 10https://gerrit.wikimedia.org/r/607025 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [13:42:22] (03CR) 10Marostegui: [C: 03+1] nagios_common: Add data persistence irc bot config [puppet] - 10https://gerrit.wikimedia.org/r/607014 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [13:42:31] (03PS2) 10Volans: deploy-check: fix detection of need reload [dns] - 10https://gerrit.wikimedia.org/r/606542 (https://phabricator.wikimedia.org/T255748) [13:44:10] do we have a backup of icinga overrides to list those? [13:44:17] (03CR) 10Volans: [C: 03+2] deploy-check: fix detection of need reload [dns] - 10https://gerrit.wikimedia.org/r/606542 (https://phabricator.wikimedia.org/T255748) (owner: 10Volans) [13:45:31] jynus: they should be able to be extracted from the status file [13:45:51] but even for deleted ones? 
[13:46:06] (03PS1) 10Elukey: profile::mediawiki::mcrouter_wancache: send probe after 60s [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) [13:46:30] 10Operations, 10DNS, 10Traffic, 10netbox, 10Patch-For-Review: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (10Volans) The fix has been deployed, I'll check with @ayounsi if there is any new data in any of the PoPs to test it with. [13:46:39] could you get that list? - I can apply them manually if needed [13:46:44] I am looking [13:47:13] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::mcrouter_wancache: send probe after 60s [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [13:47:28] fwiw the status file is copied over to the passive host every hour around minute 33 IIRC [13:47:40] so icinga2001? [13:47:49] volans: yeah, so, already happened post-merge [13:48:03] yes, icinga2001 [13:48:05] assuming that is very small, should we just add to bacula just in case? 
[13:48:12] *that [13:48:25] I will check and prepare a patch [13:48:41] (03PS2) 10Elukey: profile::mediawiki::mcrouter_wancache: send probe after 60s [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) [13:48:56] (03PS1) 10Privacybatm: transferpy: Use logging package instead of print statements [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) [13:48:57] jynus: that file changed every 10s, not sure what's the purpose of backing it up [13:49:04] *changes [13:49:25] (03CR) 10jerkins-bot: [V: 04-1] transferpy: Use logging package instead of print statements [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: 10Privacybatm) [13:50:09] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23351/mw1345.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [13:50:37] also I don't think if something must be in a specific state that that info should be on icinga unpuppetized [13:50:59] sure 100%, but again, reality makes things non-idea [13:51:03] *ideal [13:51:09] jynus: okay, I think the retention.dat file does not retain servicedowntimes for deleted services, sorry [13:51:13] or at least, I did not find any there [13:51:16] don't worry [13:51:32] I know some of the hosts, but not all [13:51:33] !log disable Puppet in codfw to reduce puppetdb2002 memory activity, unblocking the migration of the Ganeti instance for a reboot [13:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:37] ah wait, maybe this is wrong, I was searching with the wrong case [13:51:38] just a moment [13:52:05] Jenkins is posting that `Post-merge build succeeded.` on changes that are still open - has this already been reported? 
[13:52:21] volans: technically this is a (small) data loss, my job is to prevent that, and adding an extra backup file may not be a huge overhead? [13:52:57] I will ask obs. team [13:53:48] DannyS712: should be reported to #wikimedia-releng [13:54:43] jynus: okay, I've found only downtimes for db1088 and db1118 left behind, which I think isn't comprehensive and isn't helpful :) [13:54:45] ack, reported [13:54:50] :-( [13:55:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:55:11] ok, please forgive if there is some extra alert spam in the coming days [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:21] jynus: sure, I was not saying we shouldn't, but there are some tricky bits. The file is around 75MB atm and icinga deletes and writes it continuously (it's in tmpfs) and there is no guarantee that a copy of the file will not be truncated. [13:55:22] as I may miss some of those manual overrides [13:55:26] yeah, sorry, did not realize this would be the case [13:55:34] IIRC we added some safety check on the sync that happens every hour [13:55:35] not your fault, I did not warn, I +1 [13:55:43] so my fault too [13:55:57] my only ask is please be permissive if there is some extra spam [13:56:02] had not realized there were un-puppetized not-just-whole-host-level long-term downtimes [13:56:12] not downtimes [13:56:14] yeah, some spam will be okay [13:56:16] disabled alert [13:56:22] it will be nothing serious [13:56:35] (03CR) 10RLazarus: [C: 03+1] "Check me -- this would still flap when a shard is under sustained load, it would just flap (and produce a micro-outage) once per minute in" [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [13:56:38] as it is for things that we had yet to properly tune [13:56:55] I think mostly multi-instance hosts for which lag was not parametrized [13:57:01] cdanis: today is spam day.
check out #-databases for proof ;) [13:57:07] kormat: lol I saw [13:57:08] and kormat knows why that is delayed [13:57:12] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:26] because it requires a log of refactoring, agree, kormat? [13:57:29] *lot [13:57:55] so we had a few short-term disablings there [13:58:18] (03PS1) 10Filippo Giunchedi: prometheus: import availability aggregation rules from Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/607031 (https://phabricator.wikimedia.org/T233956) [13:58:19] I think the core issue is icinga not being smart enough [13:58:23] kubetcd2004 is the ganeti2007 reboot, uses a plain disk [13:58:47] I will ping marostegui to redowntime db1118 [13:58:56] I will have a look at some of the rest [13:58:57] Doing it now [13:59:04] !log re-enabling Puppet in codfw [13:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:17] done! [13:59:30] maybe let's document "renaming messages side effects" too? [13:59:57] sure, the ones with host-level downtimes (like db1118 and db1088) are still all downtimed [14:00:04] Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Creating Shan Wiktionary. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T1400). 
[14:00:05] no, they got lost [14:00:12] the host one was kept [14:00:19] but the service level got lost [14:00:31] o/ [14:00:32] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [14:00:38] this is quite relevant because I wanted to do https://gerrit.wikimedia.org/r/c/operations/puppet/+/448503 [14:00:51] some time ago, but it was too complicated precisely due to this issue [14:01:24] (03CR) 10Elukey: "> Check me -- this would still flap when a shard is under sustained" [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [14:02:47] yeah, it was only kept on those applied through puppet [14:03:02] e.g. "MariaDB Replica Lag: test-s4" on db1077 [14:03:07] (03PS1) 10Volans: Remove support for older Python versions [software/homer] - 10https://gerrit.wikimedia.org/r/607032 [14:03:18] ottomata: sorry I was at lunch, yeah, you did "git fetch" but you didn't "git rebase" it [14:03:33] hm i usually do git fetch and then get merge [14:03:41] git merge [14:03:43] I am going to change the roles of the source servers and that will help [14:03:45] so when I tried to rebase something on top, it showed your commit [14:04:10] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers [14:04:11] i follow this [14:04:11] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#In_your_own_repo_via_gerrit [14:04:31] you should follow this one for config changes ^ [14:04:36] uh huh [14:04:36] ok! [14:05:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:28] hm Amir1 maybe docs should be updated?
[14:05:30] i got to my link via [14:05:31] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Getting_configuration_changes_on_the_deployment_host [14:05:55] (03CR) 10Kormat: [C: 03+2] nagios_common: Add data persistence irc bot config [puppet] - 10https://gerrit.wikimedia.org/r/607014 (https://phabricator.wikimedia.org/T253120) (owner: 10Kormat) [14:06:12] ^ icinga config will be broken for a few minutes while this gets merged [14:07:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. Note that our Puppet integration only applies tmpfiles on boot, but you can also run "systemd-tmpfiles --create" manually." [puppet] - 10https://gerrit.wikimedia.org/r/607025 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:07:46] (03PS2) 10Filippo Giunchedi: prometheus: import availability aggregation rules from Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/607031 (https://phabricator.wikimedia.org/T233956) [14:10:16] ottomata: hmm, sure [14:10:32] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice. Looks pretty good already.
A few comments inline" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [14:12:25] PROBLEM - Host kubestagetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:25] PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:35] ^ ganeti1013 reboot, expected [14:12:48] cdanis: I identified the machines we had manual overrides as being role mariadb::dbstore_multiinstance [14:12:58] so that should be easy to handle [14:13:04] no harm was done [14:13:57] RECOVERY - Check systemd state on dumpsdata1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:14] (03PS1) 10Marostegui: core/multiinstance.pp: Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607034 (https://phabricator.wikimedia.org/T250666) [14:15:39] RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [14:15:49] RECOVERY - Host kubestagetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [14:16:29] (03CR) 10jerkins-bot: [V: 04-1] core/multiinstance.pp: Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607034 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [14:16:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:29] (03PS2) 10Marostegui: core/multiinstance.pp: Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607034 (https://phabricator.wikimedia.org/T250666) [14:19:18] (03CR) 10Marostegui: "Expected output: https://puppet-compiler.wmflabs.org/compiler1001/23353/" [puppet] - 10https://gerrit.wikimedia.org/r/607034 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [14:20:07] (03CR) 10Kormat: [C: 03+1] core/multiinstance.pp: 
Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607034 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [14:21:47] (03CR) 10Ssingh: [C: 03+2] cescout: ensure that /var/run/postgresql exists at boot time [puppet] - 10https://gerrit.wikimedia.org/r/607025 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:24:10] (03PS4) 10Reedy: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:24:12] (03PS1) 10Reedy: [beta] Update http to https for url shorteners [puppet] - 10https://gerrit.wikimedia.org/r/607038 [14:26:21] (03PS5) 10Ladsgroup: Initial config for shnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597132 (https://phabricator.wikimedia.org/T253029) (owner: 10Jon Harald Søby) [14:27:41] (03PS1) 10Kormat: mariadb: Add monitoring for lag spikes (v2) [puppet] - 10https://gerrit.wikimedia.org/r/607039 (https://phabricator.wikimedia.org/T253120) [14:27:46] (03CR) 10Ladsgroup: [C: 03+2] Initial config for shnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597132 (https://phabricator.wikimedia.org/T253029) (owner: 10Jon Harald Søby) [14:28:41] (03Merged) 10jenkins-bot: Initial config for shnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597132 (https://phabricator.wikimedia.org/T253029) (owner: 10Jon Harald Søby) [14:30:51] (03PS1) 10Ladsgroup: Add shnwiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607040 (https://phabricator.wikimedia.org/T253029) [14:31:12] (03CR) 10DannyS712: [C: 03+1] Add shnwiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607040 (https://phabricator.wikimedia.org/T253029) (owner: 10Ladsgroup) [14:31:17] (03CR) 10Ladsgroup: [C: 03+2] Add shnwiktionary to wikiversions.json [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/607040 (https://phabricator.wikimedia.org/T253029) (owner: 10Ladsgroup) [14:31:59] (03Merged) 10jenkins-bot: Add shnwiktionary to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607040 (https://phabricator.wikimedia.org/T253029) (owner: 10Ladsgroup) [14:34:03] (03CR) 10Marostegui: [C: 03+2] core/multiinstance.pp: Change the mariadb basedir depending on the OS [puppet] - 10https://gerrit.wikimedia.org/r/607034 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [14:36:18] !log ladsgroup@deploy1001 Synchronized dblists: Creating shnwiktionary (T253029) (duration: 00m 58s) [14:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:23] T253029: Create Shan Wiktionary - https://phabricator.wikimedia.org/T253029 [14:37:54] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: Creating shnwiktionary (T253029) [14:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:52] works on mwdebug1001, syncing the rest [14:39:05] (03CR) 10CDanis: [C: 03+1] "+1 from me for being better than the status quo, even if far from ideal" [puppet] - 10https://gerrit.wikimedia.org/r/607026 (https://phabricator.wikimedia.org/T255511) (owner: 10Elukey) [14:39:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:58] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating shnwiktionary (T253029) (duration: 00m 56s) [14:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:49] (03PS3) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [14:41:09] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: Creating shnwiktionary (T253029) (duration: 00m 56s) 
[14:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:18] (03CR) 10RLazarus: [C: 03+2] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:41:33] (03CR) 10RLazarus: [C: 03+2] [beta] Update http to https for url shorteners [puppet] - 10https://gerrit.wikimedia.org/r/607038 (owner: 10Reedy) [14:41:41] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607046 [14:41:43] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607046 (owner: 10Ladsgroup) [14:41:57] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [14:42:13] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [14:42:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:07] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607046 (owner: 10Ladsgroup) [14:43:13] (03PS2) 10Reedy: [beta] Update http to https for url shorteners [puppet] - 10https://gerrit.wikimedia.org/r/607038 [14:44:16] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 58s) [14:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:20] (03CR) 10CDanis: [C: 03+1] prometheus: import availability aggregation rules from Prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/607031 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) 
[14:46:45] (03Abandoned) 10Hashar: contint: fix git cloning of docroot for integration.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [14:47:30] !log creating shnwiktionary is done [14:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:59] Urbanecm: it was so quick due to the steps outlined automatically in the ticket :D [14:48:32] !log installing mutt security updates [14:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] Amir1: thanks for the wiki, and see you in 10 mins :) [14:50:53] See you soon! [14:52:15] So I made the first read edit to the new site :) but it was fixing a list gap in the main page. Can anyone review https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMaintenance/+/607050/ to ensure that the list gap isn't added to sites that are created in the future? [14:52:55] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606957 (owner: 10Filippo Giunchedi) [14:54:23] DannyS712: that sentence above is wordy as well: "Please do not start editing this new site. This site has a test project on the Wikimedia Incubator, Beta Wikiversity or Old Wikisource) and it will be imported to here." flows so much better. [14:55:27] * RhinosF1 wonders why https://shn.wiktionary.org/wiki/%E1%81%B6%E1%80%AD%E1%80%AF%E1%81%B5%E1%80%BA%E1%82%89%E1%80%90%E1%80%BD%E1%81%BC%E1%80%BA%E1%80%B8:%E1%80%9C%E1%80%BD%E1%80%84%E1%80%BA%E1%82%88%E1%81%B6%E1%80%9D%E1%80%BA%E1%82%88%E1%82%81%E1%80%B0%E1%80%99%E1%80%BA%E1%82%88/127.0.0.1 doesn't use a system account [14:56:00] no one coded it yet! I'm quite certain there's no technical reason behind it :) [14:56:23] @Amir1 https://shn.wiktionary.org/wiki/ၼႃႈႁူဝ်ႁႅၵ်ႈ links to https://aa.wiktionary.org/wiki/ၼႃႈႁူဝ်ႁႅၵ်ႈ under "In other languages" - why? 
[14:56:55] DannyS712: Cognate, we had the same thing with gomwiktionary too (IIRC) [14:57:19] Urbanecm: that sounds like the perfect idea for a bug! [14:58:08] There already is a bug [14:58:13] Cognate tries to find pages with the same name across wikis (in wiktionaries). You can either go like enwiktionary and add other wikis manually (and override Cognate) or move the page to another ns and connect it to wikidata (like arwiktionary) [14:58:34] https://phabricator.wikimedia.org/T255507 [14:58:38] RhinosF1: well I've been trying to get more into MediaWiki core/extension development, so thanks for giving me more stuff to do :D [14:59:03] addWiki uses `WikiPage::doEditContent` and falls back to `$wgUser` which is the ip [14:59:03] Majavah: np [15:00:00] Majavah: https://phabricator.wikimedia.org/T256007 [15:00:36] DannyS712: that's the cause not fix [15:00:58] Is it not? [15:01:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:34] the underlying cause is not having a user specified [15:02:57] Something like https://github.com/miraheze/CreateWiki/blob/master/maintenance/populateMainPage.php#L21 might be useful [15:03:07] PROBLEM - Host kubestagetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:29] (03PS1) 10Ssingh: cescout: enable proxy access for the postgres service [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) [15:03:56] what account do we want to use for that? MediaWiki default? MediaWiki maintenance? 
something else [15:05:08] Majavah: Either [15:05:20] Urbanecm, Amir1: thoughts --^ [15:05:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:13] RECOVERY - Host kubestagetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [15:07:03] @Amir I made https://ia.wiktionary.org/w/index.php?title=Pagina_principal&diff=43464&oldid=41621 to try and override cognate, but can't update on wikidata because it hasn't been added yet [15:07:09] Fix @Amir1 [15:07:33] (03PS1) 10Hashar: Add fake ssh key pair for integration/docroot deployment [labs/private] - 10https://gerrit.wikimedia.org/r/607053 (https://phabricator.wikimedia.org/T256005) [15:08:12] (03PS2) 10Ssingh: cescout: enable proxy access for the postgres service [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) [15:08:12] DannyS712: sorry if this is a dumb question, but how does changing method signatures work? wouldn't that break existing uses? [15:08:33] (03PS1) 10Andrew Bogott: Dummy passwords for galera backup [labs/private] - 10https://gerrit.wikimedia.org/r/607054 [15:08:50] DannyS712: meeting atm [15:08:51] (03PS4) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [15:09:06] (03CR) 10Hashar: "I have updated the 3 puppet compiler hosts." [labs/private] - 10https://gerrit.wikimedia.org/r/607053 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [15:09:58] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [15:10:50] that's why it's tricky - got to support both.
See, eg, some of the patches at https://phabricator.wikimedia.org/T249561 - https://gerrit.wikimedia.org/r/#/c/586459/ changes the signature to support the new calling method, and https://gerrit.wikimedia.org/r/#/c/589791 deprecates the old signature [15:11:46] (03PS5) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [15:13:04] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/23357/cescout1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:13:31] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [15:16:22] (03PS1) 10Hashar: scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) [15:17:18] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [15:17:31] (03CR) 10jerkins-bot: [V: 04-1] scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [15:18:23] do you have any small tasks that someone with little experience with MW codebase (like me) can get started? 
:P [15:19:17] (03PS2) 10Hashar: scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) [15:20:25] (03CR) 10jerkins-bot: [V: 04-1] scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [15:25:51] (03CR) 10Ayounsi: [C: 03+1] "👍" [software/homer] - 10https://gerrit.wikimedia.org/r/607032 (owner: 10Volans) [15:26:01] (03CR) 10Muehlenhoff: cescout: enable proxy access for the postgres service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:26:32] (03PS2) 10Privacybatm: transferpy: Use logging package instead of print statements [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) [15:27:04] (03CR) 10Volans: [C: 03+2] Remove support for older Python versions [software/homer] - 10https://gerrit.wikimedia.org/r/607032 (owner: 10Volans) [15:27:05] 10Operations, 10ops-eqiad, 10DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10wiki_willy) a:03Jclark-ctr @Jclark-ctr - I think there are some bbu's leftover from the last time you requested some spares to be ordered, but let me know if not. 
Thanks, Willy [15:28:58] (03Merged) 10jenkins-bot: Remove support for older Python versions [software/homer] - 10https://gerrit.wikimedia.org/r/607032 (owner: 10Volans) [15:29:26] (03CR) 10Ayounsi: [C: 03+1] CHANGELOG: add changelogs for release v0.2.4 [software/homer] - 10https://gerrit.wikimedia.org/r/607021 (owner: 10Volans) [15:33:20] (03PS2) 10Volans: CHANGELOG: add changelogs for release v0.2.4 [software/homer] - 10https://gerrit.wikimedia.org/r/607021 [15:34:51] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.2.4 [software/homer] - 10https://gerrit.wikimedia.org/r/607021 (owner: 10Volans) [15:35:59] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.2.4 [software/homer] - 10https://gerrit.wikimedia.org/r/607021 (owner: 10Volans) [15:36:15] (03PS3) 10Ssingh: cescout: enable proxy access for the postgres service [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) [15:38:43] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/23358/cescout1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:40:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:41:23] (03CR) 10Ssingh: [C: 03+2] cescout: enable proxy access for the postgres service [puppet] - 10https://gerrit.wikimedia.org/r/607052 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:42:19] hasharAway: ok to merge your change? [15:43:00] (skipped for now) [15:51:14] (03CR) 10Dzahn: [C: 03+2] gerrit: Drop empty unused Git config file [puppet] - 10https://gerrit.wikimedia.org/r/606783 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [15:53:14] sukhe: which change? 
[15:54:02] hashar: sorry, it was this: "Antoine Musso: Add fake ssh key pair for integration/docroot deployment (ea407ab)" [15:54:47] sukhe: well it is merged isn't it? [15:54:56] it still shows up for me for some reason [15:55:37] it is merged https://gerrit.wikimedia.org/r/#/c/labs/private/+/607053/ [15:55:38] ;) [15:56:07] ha ok. sorry for the noise! [15:56:28] it appears as merged on gerrit and now it needs to be merged on the puppetmaster [15:56:37] same as the production puppet repo [15:57:29] I'm merging it BTW [15:58:05] does the puppetmaster have labs/private? [15:58:25] indeed [15:59:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 54 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:05:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:07:31] (03PS3) 10Hashar: scap configuration for integration/docroot [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) [16:08:06] sukhe: sorry I wasn't aware of the extra step involved :] [16:08:38] Urbanecm: you should have all the rights now [16:08:54] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [16:09:38] Amir1: confirmed, thanks [16:10:46] there is a new Google Groups interface, great!
[16:11:14] 10Operations, 10SRE-swift-storage, 10serviceops: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (10JMeybohm) [16:11:46] (03PS3) 10Dzahn: gerrit: Add option to mark gerrit servers as upgraded [puppet] - 10https://gerrit.wikimedia.org/r/606530 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:11:50] yup, better than the mailman we have :D [16:12:05] Urbanecm: if you need navigating through the wiki, let me know [16:12:08] what about the newer version of mailman you're pursuing? :) [16:12:39] I need to puppetize it first, it's going to be LOTS of work [16:12:58] mailman3 is so different from mailman2 [16:13:04] Urbanecm: Start from https://techconduct.wikimedia.org/wiki/CoC_committee:Community_portal [16:13:10] okay, thanks :) [16:14:38] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23359/" [puppet] - 10https://gerrit.wikimedia.org/r/606530 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:17:31] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10mpopov) a:03mpopov [16:17:54] (03CR) 10Hashar: "We need first need scap to be configured in the source repo: https://gerrit.wikimedia.org/r/#/c/integration/docroot/+/607055/" [puppet] - 10https://gerrit.wikimedia.org/r/607056 (https://phabricator.wikimedia.org/T256005) (owner: 10Hashar) [16:19:22] (03CR) 10Dzahn: [C: 03+2] gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - 10https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:21:29] 10Operations, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review, 10Product-Analytics (Kanban): Creation of a new POSIX group and system user for the Product Analytics team - https://phabricator.wikimedia.org/T255039 (10mpopov)
@elukey: so who needs to do what? it looks like SRE needs to review... [16:24:26] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23360/" [puppet] - 10https://gerrit.wikimedia.org/r/606532 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:24:39] (03PS4) 10Dzahn: gerrit: Add dedicated home dir for new Gerrit version [puppet] - 10https://gerrit.wikimedia.org/r/606532 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:29:50] (03PS4) 10Dzahn: gerrit: Stop setting up a database for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:30:44] (03CR) 10Paladox: "This can be done anytime, since drafts were dropped from 2.15." [puppet] - 10https://gerrit.wikimedia.org/r/606533 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:31:13] (03CR) 10Paladox: [C: 03+1] gerrit: Stop setting up a database for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:32:34] (03CR) 10Paladox: [C: 03+1] gerrit: Drop empty unused Git config file [puppet] - 10https://gerrit.wikimedia.org/r/606783 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:32:46] (03CR) 10Paladox: [C: 03+1] gerrit: Enable git protocol v2 on new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606784 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:34:46] (03CR) 10Paladox: [C: 03+1] gerrit: Use `replica` instead of `slave` for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:35:01] (03CR) 10Paladox: [C: 03+1] gerrit: Remove old Polymer <2 styles for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606840 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [16:40:42] (03PS1) 10Hashar: ci: remove Apache config for nightlies [puppet] - 
10https://gerrit.wikimedia.org/r/607075 [16:40:44] (03PS1) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T256005) [16:41:13] (03CR) 10jerkins-bot: [V: 04-1] ci: remove Apache config for nightlies [puppet] - 10https://gerrit.wikimedia.org/r/607075 (owner: 10Hashar) [16:41:17] (03PS2) 10Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) [16:42:39] (03PS2) 10Hashar: ci: remove Apache config for nightlies [puppet] - 10https://gerrit.wikimedia.org/r/607075 [16:42:50] (03PS6) 10Dzahn: gerrit: Stop setting up a database for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [16:49:49] !log volans@cumin1001 START - Cookbook sre.dns.netbox [16:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:56] (03PS1) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [16:57:59] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Dummy passwords for galera backup [labs/private] - 10https://gerrit.wikimedia.org/r/607054 (owner: 10Andrew Bogott) [16:58:07] (03CR) 10jerkins-bot: [V: 04-1] wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [16:58:31] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] gehel and onimisionipe: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T1700). 
[17:01:56] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) I am having an issue with these. While running the script it gives me an error about IPMI. These are HP servers and I do not know of... [17:01:57] (03PS2) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [17:03:08] (03CR) 10jerkins-bot: [V: 04-1] wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:04:39] (03PS3) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [17:07:37] (03CR) 10Dzahn: "with the diff between PS 4 and PS6 this became a noop on the old versions" [puppet] - 10https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:09:20] (03CR) 10Ladsgroup: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [17:14:18] !log gerrit1002 (gerrit-test): re-enabled puppet, restarted gerrit service [17:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:16:51] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/23363/" [puppet] - 10https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:17:13] RECOVERY - Prometheus jobs reduced availability on 
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:19:05] 10Operations, 10DNS, 10Traffic, 10netbox: Netbox DNS change not effective in gdns - https://phabricator.wikimedia.org/T255748 (10Volans) p:05High→03Medium [17:19:33] !log gerrit1002 - let puppet remove [database] section from config; restart gerrit another time [17:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:50] (03CR) 10Dzahn: [C: 03+2] gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606533 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:20:59] (03PS5) 10Dzahn: gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606533 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:27:08] (03PS4) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [17:29:49] (03CR) 10Andrew Bogott: "@elukey, this is using/copying some backup work you did in Analytics so I'm interested in your opinion :)" [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [17:31:07] (03CR) 10Dzahn: [C: 03+2] gerrit: Update its-phabricator templates for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606781 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:31:16] (03PS3) 10Dzahn: gerrit: Update its-phabricator templates for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606781 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:32:40] (03CR) 10Paladox: [C: 03+1] gerrit: Update its-phabricator templates for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606781
(https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:32:46] (03PS3) 10Dzahn: gerrit: Update email templates for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606782 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:32:54] (03CR) 10Paladox: [C: 03+1] gerrit: Update email templates for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606782 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:33:12] (03CR) 10Paladox: [C: 03+1] gerrit: Switch header styling for new Gerrits from component to style [puppet] - 10https://gerrit.wikimedia.org/r/606841 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [17:33:22] (03CR) 10Paladox: [C: 03+1] gerrit: Use colored header bar also in dark theme for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606842 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [17:33:32] (03CR) 10Paladox: [C: 03+1] gerrit: Have a proper light and dark style for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606843 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [17:35:25] (03CR) 10Dzahn: [C: 03+2] gerrit: Update email templates for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606782 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:41:39] (03PS3) 10Dzahn: gerrit: Drop empty unused Git config file [puppet] - 10https://gerrit.wikimedia.org/r/606783 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [17:43:03] (03CR) 10Jforrester: "recheck" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/573759 (owner: 10Paladox) [17:58:06] (03PS4) 10Dzahn: gerrit: Enable git protocol v2 on new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606784 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T1800) [18:02:41] (03CR) 10Krinkle: [C: 03+1] "LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/598292 (owner: 10Aaron Schulz) [18:05:48] (03PS3) 10Krinkle: arclamp: add svgs for some key entrypoint/singleton methods calls [puppet] - 10https://gerrit.wikimedia.org/r/598292 (https://phabricator.wikimedia.org/T253679) (owner: 10Aaron Schulz) [18:09:37] (03CR) 10Dzahn: [C: 03+2] gerrit: Enable git protocol v2 on new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606784 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [18:15:25] (03PS2) 10Dzahn: gerrit: Use `replica` instead of `slave` for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [18:20:56] (03CR) 10Dzahn: [C: 04-1] "on the current prod slave this becomes: " daemon -d /var/lib/gerrit2/review_site--slave --enable-http" without a space before --slave" [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [18:22:37] (03PS16) 10Dzahn: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 (https://phabricator.wikimedia.org/T254648) (owner: 10Paladox) [18:23:31] (03CR) 10Dzahn: [C: 03+2] Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 (https://phabricator.wikimedia.org/T254648) (owner: 10Paladox) [18:24:05] (03PS9) 10Dzahn: Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [18:26:22] (03CR) 10Dzahn: [C: 03+2] Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: 10Paladox) [18:27:58] (03PS2) 10Dzahn: gerrit: Remove old Polymer 
<2 styles for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606840 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:33:06] (03CR) 10Dzahn: [C: 03+2] gerrit: Remove old Polymer <2 styles for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606840 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:34:09] (03CR) 10Bstorm: [C: 04-1] cloud nfs: only run nfs-exportd on the current active node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: 10Bstorm) [18:34:42] (03PS2) 10Dzahn: gerrit: Switch header styling for new Gerrits from component to style [puppet] - 10https://gerrit.wikimedia.org/r/606841 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:36:21] (03CR) 10Dzahn: [C: 03+2] gerrit: Switch header styling for new Gerrits from component to style [puppet] - 10https://gerrit.wikimedia.org/r/606841 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:37:51] (03CR) 10Dzahn: [C: 03+2] gerrit: Use colored header bar also in dark theme for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606842 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:38:00] (03PS2) 10Dzahn: gerrit: Use colored header bar also in dark theme for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606842 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:46:36] (03PS2) 10Dzahn: gerrit: Have a proper light and dark style for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606843 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:47:34] (03CR) 10Dzahn: [C: 03+2] gerrit: Have a proper light and dark style for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606843 (https://phabricator.wikimedia.org/T227509) (owner: 10QChris) [18:47:51] (03PS1) 10Ssingh: cescout: enable proxy access (improves 16ac24a82c) [puppet] - 10https://gerrit.wikimedia.org/r/607089 
(https://phabricator.wikimedia.org/T247273) [18:51:01] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23367/cescout1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [18:52:00] (03CR) 10Dzahn: "curious: is there a specific reason to use systemd::unit{} and service{} instead of systemd::service{} ?" [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [18:52:51] (03PS5) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [18:53:45] (03PS2) 10Bstorm: cloud nfs: only run nfs-exportd on the current active node [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) [18:54:03] (03CR) 10jerkins-bot: [V: 04-1] wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [18:54:52] (03CR) 10Dzahn: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [18:55:05] (03PS9) 10Dzahn: Gerrit: Update soy templates for gerrit 2.16 [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [18:55:12] (03PS6) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [19:00:09] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:03:01] (03PS2) 10Ssingh: cescout: enable proxy 
access (improves 16ac24a82c) [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) [19:03:58] (03CR) 10Ssingh: "> curious: is there a specific reason to use systemd::unit{} and" [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:04:17] (03CR) 10jerkins-bot: [V: 04-1] cescout: enable proxy access (improves 16ac24a82c) [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:05:07] (03PS3) 10Ssingh: cescout: enable proxy access (improves 16ac24a82c) [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) [19:06:46] (03CR) 10Dzahn: [C: 03+1] cescout: enable proxy access (improves 16ac24a82c) [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:07:18] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23371/cescout1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:07:58] (03CR) 10Dzahn: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:08:18] (03CR) 10Ssingh: [C: 03+2] cescout: enable proxy access (improves 16ac24a82c) [puppet] - 10https://gerrit.wikimedia.org/r/607089 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:08:50] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [19:09:09] (03PS10) 10Dzahn: Gerrit: Update soy templates for gerrit 2.16 [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:09:25] (03CR) 10Paladox: [C: 04-1] Gerrit: Update soy templates for gerrit 2.16 [puppet] - 
10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:09:43] (03CR) 10Paladox: [C: 04-1] "@Dzahn @Qchris has already updated them." [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:09:59] (03Abandoned) 10Paladox: Gerrit: Update soy templates for gerrit 2.16 [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:10:01] (03CR) 10Dzahn: "PS10: - rebased, moved files from homedir to homedir-new, removed draft template" [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:10:19] (03CR) 10Hashar: "I have confirmed the existing DocumentRoot is clean:" [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [19:10:36] (03CR) 10Dzahn: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:11:03] (03CR) 10Paladox: "> > Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:11:15] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [19:11:20] 10Operations, 10Wikidata, 10serviceops: mw1384 is misbehaving - https://phabricator.wikimedia.org/T255282 (10Krinkle) [19:11:32] (03CR) 10QChris: "I'll have to read up on <%- vs <% and -%> vs %> and can upload a new version then." 
[puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [19:11:39] (03PS5) 10Paladox: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [19:12:16] (03CR) 10Dzahn: "> Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:13:14] (03CR) 10Paladox: "> > Patch Set 10:" [puppet] - 10https://gerrit.wikimedia.org/r/473264 (owner: 10Paladox) [19:14:04] (03CR) 10Dzahn: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [19:15:16] (03PS3) 10QChris: gerrit: Use `replica` instead of `slave` for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) [19:15:45] (03CR) 10Krinkle: [C: 04-1] gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [19:15:55] (03CR) 10jerkins-bot: [V: 04-1] gerrit: Use `replica` instead of `slave` for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [19:16:26] (03CR) 10Paladox: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [19:16:30] (03CR) 10QChris: [C: 03+1] "Sorry. PS3 should look better. My Puppet foo is weak :-(" [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [19:22:25] (03CR) 10Dzahn: [C: 03+1] "This is old (that's why I'm looking) but it looks good to merge. The deprecated role class still exists, the old one doesn't. 
The linked t" [puppet] - 10https://gerrit.wikimedia.org/r/389295 (owner: 10Paladox) [19:23:00] (03PS6) 10Dzahn: mediawiki_vagrant: Update role name used for if defined check [puppet] - 10https://gerrit.wikimedia.org/r/389295 (owner: 10Paladox) [19:23:45] (03CR) 10Bstorm: cloud nfs: only run nfs-exportd on the current active node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: 10Bstorm) [19:24:24] (03CR) 10Dzahn: [C: 03+2] "openstack-browser shows only these are used: role::labs::mediawiki_vagrant , role::labs::vagrant_lxc" [puppet] - 10https://gerrit.wikimedia.org/r/389295 (owner: 10Paladox) [19:26:43] (03CR) 10Dzahn: "Hi @Paladox, this is old and a draft, can we abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/509173 (owner: 10Paladox) [19:26:59] (03CR) 10Bstorm: wmcs galera: add daily backups of each OpenStack db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [19:27:07] (03Abandoned) 10Paladox: Gerrit: Rename some gerrit logs [puppet] - 10https://gerrit.wikimedia.org/r/509173 (owner: 10Paladox) [19:27:12] (03CR) 10Paladox: "> Hi @Paladox, this is old and a draft, can we abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/509173 (owner: 10Paladox) [19:27:48] (03CR) 10Dzahn: "Hi Paladox, another test draft, ok to abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/539956 (owner: 10Paladox) [19:29:02] (03Abandoned) 10Paladox: test [puppet] - 10https://gerrit.wikimedia.org/r/539956 (owner: 10Paladox) [19:29:08] (03CR) 10Paladox: "> Hi Paladox, another test draft, ok to abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/539956 (owner: 10Paladox) [19:29:10] (03CR) 10Dzahn: "The linked ticket is resolved. Does this mean this change is not needed anymore?" 
[puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [19:29:43] (03PS4) 10Dzahn: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/492314 (owner: 10Paladox) [19:29:45] (03CR) 10Paladox: "> The linked ticket is resolved. Does this mean this change is not" [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [19:30:04] (03CR) 10jerkins-bot: [V: 04-1] ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/492314 (owner: 10Paladox) [19:31:03] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/606784 enabled this for new gerrits now" [puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [19:31:18] (03Abandoned) 10Paladox: Gerrit: Support git protocol version 2 [puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [19:31:34] (03PS4) 10QChris: gerrit: Use `replica` instead of `slave` for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) [19:32:07] (03CR) 10Dzahn: "This seems to be seperate from jgit config in https://gerrit.wikimedia.org/r/c/operations/puppet/+/606784 though. adding qchris." [puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [19:32:47] (03CR) 10Dzahn: "@Paladox Could you move the .gitconfig to homedir-new and add an "if" in the erb template so it only applies to new gerrits?" 
[puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [19:33:09] (03PS5) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/492314 [19:33:48] (03CR) 10Paladox: "@Dzahn i've abandoned this because @Qchris has done this already :)" [puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [19:36:13] (03CR) 10QChris: "I think the confusion comes from where Gerrit picks up configuration" [puppet] - 10https://gerrit.wikimedia.org/r/473643 (owner: 10Paladox) [19:38:02] (03CR) 10Dzahn: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [19:39:34] (03CR) 10Dzahn: "I know we want the new keys (and do this after the upgrade). Linking T240266 ,right?" [puppet] - 10https://gerrit.wikimedia.org/r/556265 (owner: 10Paladox) [19:39:51] (03PS9) 10Dzahn: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [19:40:08] (03PS11) 10Dzahn: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [19:40:33] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [19:41:16] (03PS10) 10Paladox: Gerrit: Rename ssh_host_key to ssh_host_rsa_key [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) [19:41:21] (03CR) 10Dzahn: "What was this for? To fix scap for gerrit in cloud? Something else?" 
[puppet] - 10https://gerrit.wikimedia.org/r/565713 (owner: 10Paladox) [19:41:28] (03PS12) 10Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) [19:41:46] (03PS13) 10Paladox: Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) [19:41:58] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add ed25519 and ecdsa ssh host keys [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [19:43:47] (03CR) 10Dzahn: [C: 04-1] "i think we will just use simple auth in cloud and forget about trying to have an LDAP server there" [puppet] - 10https://gerrit.wikimedia.org/r/539211 (owner: 10Paladox) [19:45:58] (03CR) 10Dzahn: [C: 04-1] "How did we get around this since Phab is up and running in devtools." [puppet] - 10https://gerrit.wikimedia.org/r/565712 (owner: 10Paladox) [19:46:46] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10jrbs) >>! In T255733#6236574, @Dzahn wrote: > To clarify: Do you want the mail to still be in OTRS but additionally also forward t... 
[19:47:05] (03CR) 10Dzahn: [C: 03+1] gerrit: drop old redirect [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [19:47:39] (03PS3) 10Dzahn: gerrit: drop old redirect as workaround for broken browser detection [puppet] - 10https://gerrit.wikimedia.org/r/606434 (owner: 10Paladox) [19:49:06] 10Operations, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Aklapper) [19:49:15] 10Operations, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Aklapper) [19:56:32] (03CR) 10Dzahn: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606432 (owner: 10Paladox) [19:57:58] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Prevention): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Krinkle) [19:59:05] (03PS6) 10Paladox: gerrit: Redirect /r/(#/)?projects/(.+),dashboards/(.+) to /r/p/$2/+/dashboard/$3 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [19:59:21] (03PS7) 10Paladox: gerrit: Redirect /r/projects/(.+),dashboards/(.+) to /r/p/$1/+/dashboard/$2 [puppet] - 10https://gerrit.wikimedia.org/r/606432 [20:00:04] halfak and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T2000). [20:05:16] (03CR) 10Dzahn: [C: 03+2] "thanks! now noop on prod servers. 
https://puppet-compiler.wmflabs.org/compiler1001/23376/" [puppet] - 10https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158) (owner: 10QChris) [20:10:05] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 70.17 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:17:29] (03CR) 10Dzahn: "what's an example of a permission issue, please. Is the webserver user trying to write to config files?" [puppet] - 10https://gerrit.wikimedia.org/r/606824 (owner: 10Ladsgroup) [20:24:03] (03PS1) 10Dzahn: gerrit (cloud): set gerrit::server::is_new_version: true [puppet] - 10https://gerrit.wikimedia.org/r/607108 (https://phabricator.wikimedia.org/T254158) [20:25:16] (03PS2) 10Dzahn: gerrit (cloud): set gerrit::server::is_new_version: true [puppet] - 10https://gerrit.wikimedia.org/r/607108 (https://phabricator.wikimedia.org/T254158) [20:26:11] (03CR) 10Dzahn: [C: 03+2] gerrit (cloud): set gerrit::server::is_new_version: true [puppet] - 10https://gerrit.wikimedia.org/r/607108 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn) [20:27:16] (03PS2) 10Dzahn: gerrit (cloud): remove SQL database hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/606550 [20:36:14] (03Abandoned) 10Dzahn: gerrit (cloud): remove SQL database hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/606550 (owner: 10Dzahn) [20:36:17] (03PS3) 10Dzahn: gerrit: remove all database parameters / support [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) [20:36:45] (03CR) 10Herron: [C: 03+1] "LGTM! The pontoon instances in the monitoring project have been super useful. 
Excited to see test environments for further stacks built u" [puppet] - 10https://gerrit.wikimedia.org/r/606961 (owner: 10Filippo Giunchedi) [20:37:47] (03CR) 10Herron: [C: 03+1] smart: ignore stderr from facter [puppet] - 10https://gerrit.wikimedia.org/r/606957 (owner: 10Filippo Giunchedi) [20:38:33] (03CR) 10Herron: [C: 03+1] prometheus: require class node_exporter for node textfile scripts [puppet] - 10https://gerrit.wikimedia.org/r/606977 (owner: 10Filippo Giunchedi) [20:53:16] (03PS3) 10Bstorm: cloud nfs: only run nfs-exportd on the current active node [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) [20:54:09] (03CR) 10Bstorm: "PCC exposed a problem in the way I was doing this https://puppet-compiler.wmflabs.org/compiler1003/23377/ in patchset 2" [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: 10Bstorm) [20:57:45] (03CR) 10Bstorm: "This version does all the things I want: https://puppet-compiler.wmflabs.org/compiler1003/23378/" [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: 10Bstorm) [20:59:19] (03PS1) 10Dzahn: gerrit: allow for 3 different methods to get TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/607116 [21:00:04] Reedy and sbassett: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T2100). [21:02:28] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Puppet failure on deployment-cache-text06 - https://phabricator.wikimedia.org/T256064 (10Ottomata) [21:06:11] jouncebot: now [21:06:12] For the next 1 hour(s) and 53 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T2100) [21:17:30] (03CR) 10Herron: "I think this is looking good overall, but not tested personally. 
One comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [21:24:14] (03PS5) 10Bstorm: unattendedupgrades: allow configurable kernel cleanup [puppet] - 10https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) [21:29:01] (03PS7) 10Andrew Bogott: wmcs galera: add daily backups of each OpenStack db [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) [21:31:07] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera: add daily backups of each OpenStack db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607078 (https://phabricator.wikimedia.org/T242455) (owner: 10Andrew Bogott) [21:31:32] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1003/23379/" [puppet] - 10https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) (owner: 10Bstorm) [21:37:42] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10Jclark-ctr) @ayounsi can we close this ticket? or anything i can do? 
[21:39:41] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10Jclark-ctr) @akosiaris what times work best for you i am usually on site tuesday and thursday [21:40:24] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10wiki_willy) a:03Jclark-ctr [21:40:48] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10wiki_willy) a:03Jclark-ctr [21:43:57] PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:44:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:44:30] andrewbogott: ^ [21:44:36] (03CR) 10Bstorm: [C: 03+2] "I think this is safe to merge following finding my own error and PCC runs." [puppet] - 10https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) (owner: 10Bstorm) [21:44:53] I'll ack [21:45:32] ACKNOWLEDGEMENT - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir andrew bogott broken by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607078/ Im investigating https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [21:46:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Jclark-ctr) @Andrew not sure if you noticed my last comment? 
[21:47:55] 10Operations, 10ops-eqiad, 10DC-Ops: decomission oresrdb100[12] - https://phabricator.wikimedia.org/T254238 (10Jclark-ctr) a:03Jclark-ctr [21:48:31] (03PS1) 10Andrew Bogott: wmcs galera backups: move to weekly [puppet] - 10https://gerrit.wikimedia.org/r/607128 [21:50:48] (03CR) 10Andrew Bogott: [C: 03+2] wmcs galera backups: move to weekly [puppet] - 10https://gerrit.wikimedia.org/r/607128 (owner: 10Andrew Bogott) [21:51:50] 10Operations, 10ops-eqiad, 10DC-Ops: apply hostname labels to bast1002/WMF4749 - https://phabricator.wikimedia.org/T186625 (10wiki_willy) a:03Jclark-ctr [21:58:22] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10Volans) I've created this one-off script and run it on the af-netbox test instance to cleanup ifaces and addresses from existing offline devic... [21:58:42] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@6e7f9f7]: bump glent jar to 0.2.2 [21:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:00] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@6e7f9f7]: bump glent jar to 0.2.2 (duration: 00m 18s) [21:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:29] (03PS1) 10Andrew Bogott: Revert "wmcs galera backups: move to weekly" [puppet] - 10https://gerrit.wikimedia.org/r/607131 [22:03:15] (03CR) 10Andrew Bogott: [C: 03+2] Revert "wmcs galera backups: move to weekly" [puppet] - 10https://gerrit.wikimedia.org/r/607131 (owner: 10Andrew Bogott) [22:05:21] (03PS1) 10Volans: scripts: fix offline script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607132 [22:11:42] (03PS1) 10CDanis: check_prometheus: rewrite 'instance' to 'host' w/o port numbers [puppet] - 10https://gerrit.wikimedia.org/r/607134 [22:12:34] !log cleanup interfaces and addresses in Netbox for offline servers - 
T233183 [22:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:38] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 [22:12:45] (03CR) 10CRusnov: "heh lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607132 (owner: 10Volans) [22:12:52] (03CR) 10CRusnov: [C: 03+1] scripts: fix offline script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607132 (owner: 10Volans) [22:13:14] (03CR) 10Volans: [C: 03+2] scripts: fix offline script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607132 (owner: 10Volans) [22:14:33] (03CR) 10CDanis: "This patch is a suggestion / a very rough first pass, but I think the overall idea is reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/607134 (owner: 10CDanis) [22:15:39] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@f2002c8]: bump glent jar to 0.2.2 [22:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:35] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@f2002c8]: bump glent jar to 0.2.2 (duration: 00m 56s) [22:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:36] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10Volans) Got it reviewed by @crusnov, run on production with P11632 In case of any immediate issue there is a backup on netboxdb1001 taken rig... 
[22:18:28] (03PS1) 10Andrew Bogott: galera/bacula: giving this one more try [puppet] - 10https://gerrit.wikimedia.org/r/607135 [22:18:30] (03CR) 10CDanis: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/607134 (owner: 10CDanis) [22:19:57] (03CR) 10Andrew Bogott: [C: 03+2] galera/bacula: giving this one more try [puppet] - 10https://gerrit.wikimedia.org/r/607135 (owner: 10Andrew Bogott) [22:20:33] (03PS1) 10Jdlrobson: Enable click tracking in Vector on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607136 (https://phabricator.wikimedia.org/T250282) [22:23:21] RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [22:25:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:33:30] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm now!" 
[puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: 10Bstorm) [22:35:24] !log volans@cumin1001 START - Cookbook sre.dns.netbox [22:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:47] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:38] !log downtimed labstore1005 to prevent an alert during puppet merge T253353 [22:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:43] T253353: Add cluster-awareness to nfs-exportd - https://phabricator.wikimedia.org/T253353 [22:40:14] (03CR) 10Bstorm: [C: 03+2] cloud nfs: only run nfs-exportd on the current active node [puppet] - 10https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: 10Bstorm) [22:40:37] * RhinosF1 has B&C patches and is here [22:52:09] (03PS1) 10Dzahn: bacula: remove unneeded lint-ignores [puppet] - 10https://gerrit.wikimedia.org/r/607139 [22:54:54] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/606824 (owner: 10Ladsgroup) [22:55:31] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) [22:56:18] (03CR) 10Dzahn: "I see the directory is being created, but it doesn't really move files around, does it?" [puppet] - 10https://gerrit.wikimedia.org/r/606824 (owner: 10Ladsgroup) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200622T2300). [23:00:04] RhinosF1: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:09] o/ [23:01:31] Ooohh, stickers are nice [23:01:34] (03PS1) 10Bstorm: cloud nfs: clean up some of the secondary cluster materials [puppet] - 10https://gerrit.wikimedia.org/r/607142 (https://phabricator.wikimedia.org/T224747) [23:01:51] QuIRC: I got to find a deployer first :) [23:02:08] I can do it [23:03:01] thanks RoanKattouw, https://gerrit.wikimedia.org/r/c/606144/ is a straight sync. The others could do with some verification especially https://gerrit.wikimedia.org/r/c/605978/ [23:03:43] (03CR) 10Catrope: [C: 03+2] Add numerous domains to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) (owner: 10RhinosF1) [23:03:53] OK I'll do the CopyUploads one first [23:09:13] RoanKattouw: any clue why no merge? It's given the V+2 - need rebase, I think? [23:09:20] Ugh [23:09:33] (03PS5) 10Catrope: Add numerous domains to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) (owner: 10RhinosF1) [23:09:38] (03CR) 10Catrope: Add numerous domains to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) (owner: 10RhinosF1) [23:09:41] (03CR) 10Catrope: [C: 03+2] Add numerous domains to the wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) (owner: 10RhinosF1) [23:09:47] I hate it when that happens [23:09:50] * RhinosF1 wonders if Jenkins could do that itself [23:09:57] It says "Merge Conflict", but that's clearly a lie because it rebases cleanly [23:10:08] RoanKattouw: yep [23:10:11] Gerrit should do the rebase itself, but for some reason it thinks there's a conflict [23:10:25] strange [23:10:31] (03Merged) 10jenkins-bot: Add numerous domains to the
wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606144 (https://phabricator.wikimedia.org/T255336) (owner: 10RhinosF1) [23:10:45] yey [23:11:42] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) Hi @jrbs understood! This is surprisingly complex because so far wikidata.org isn't a domain that has any special cases... [23:11:57] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (netbox-dev2001, ...), No backups: 3 (cloudcontrol2001-dev, ...), Fresh: 95 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:13:05] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) Would you want to just have privacy@ mail for ANY of our domains and have NONE of them be sent to OTRS anymore in general? [23:13:10] (03PS1) 10Cwhite: Filter the files watched for modifications to only those configured by `-logs` and `-progs`. [debs/mtail] (cross_dist_build) - 10https://gerrit.wikimedia.org/r/607144 (https://phabricator.wikimedia.org/T255776) [23:13:30] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10jrbs) Thank you for the detailed breakdown here! Let me ask the Privacy team what would be best for them and get back to you here.... [23:13:52] RoanKattouw: sync? 
[23:14:02] On it, sorry got distracted [23:14:18] I'll try to speed up cause it's past midnight for you [23:14:25] np [23:14:41] (03CR) 10Cwhite: [C: 03+2] smart: ignore stderr from facter [puppet] - 10https://gerrit.wikimedia.org/r/606957 (owner: 10Filippo Giunchedi) [23:15:12] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 2 (netbox-dev2001, ...), No backups: 3 (cloudcontrol2001-dev, ...), Fresh: 95 jobs andrew bogott I broke bacula director for a bit this is probably a side-effect of that and should recover on its own. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:15:50] (03CR) 10Cwhite: [C: 03+1] prometheus: require class node_exporter for node textfile scripts [puppet] - 10https://gerrit.wikimedia.org/r/606977 (owner: 10Filippo Giunchedi) [23:16:03] (03PS4) 10Catrope: Add localised sitename for bewikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599760 (https://phabricator.wikimedia.org/T253962) (owner: 10RhinosF1) [23:16:13] (03CR) 10Catrope: [C: 03+2] Add localised sitename for bewikibooks. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599760 (https://phabricator.wikimedia.org/T253962) (owner: 10RhinosF1) [23:16:20] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add domains to wgCopyUploadsDomains (T255336, T255363, T255386, T255313) (duration: 01m 01s) [23:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:28] T255336: Add http://pashaei.studio/ to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T255336 [23:16:28] T255313: Add ww2db.com to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T255313 [23:16:28] T255386: Add cdm16022.contentdm.oclc.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T255386 [23:16:28] T255363: Add parliamentdiagram.toolforge.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T255363 [23:17:06] (03Merged) 10jenkins-bot: Add localised sitename for bewikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599760 (https://phabricator.wikimedia.org/T253962) (owner: 10RhinosF1) [23:19:37] 10Operations, 10Data-Services, 10cloud-services-team (Kanban): Convert labstore cluster configuration to hiera and profiles - https://phabricator.wikimedia.org/T161835 (10Bstorm) 05Open→03Resolved a:03Bstorm This is actually done now. There is room for more passes, but that doesn't mean this task shoul... 
[23:19:42] 10Operations, 10Data-Services, 10Tracking-Neverending, 10cloud-services-team (Kanban): overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (10Bstorm) [23:20:15] RhinosF1: OK bewikibooks site name is ready for testing [23:20:27] RoanKattouw: working [23:20:32] (03PS5) 10Catrope: Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [23:20:36] (03CR) 10Catrope: [C: 03+2] Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [23:21:46] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add localized sitename for bewikibooks (T253962) (duration: 00m 57s) [23:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:50] T253962: Change $wgSitename for Belarusian Wikibooks - https://phabricator.wikimedia.org/T253962 [23:23:23] (03Merged) 10jenkins-bot: Create 'rollbacker' user group for elwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606146 (https://phabricator.wikimedia.org/T255569) (owner: 10RhinosF1) [23:23:39] RhinosF1: elwiktionary rollbacker is ready to test [23:24:31] RoanKattouw: working [23:26:15] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create rollbacker group on elwiktionary (T225569) (duration: 00m 56s) [23:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:19] T225569: Blazegraph flaky tests - https://phabricator.wikimedia.org/T225569 [23:27:39] (03PS5) 10Catrope: close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) (owner: 10RhinosF1) [23:28:14] RoanKattouw: T255569 you meant [23:28:15] T255569: Create 'rollbackers' user group for elwikt - https://phabricator.wikimedia.org/T255569 [23:28:19] noted 
on task [23:28:43] !log Synchronized wmf-config/InitialiseSettings.php: Create rollbacker group on elwiktionary (T255569) (typoed the task number before) [23:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:52] Thanks for catching that [23:29:57] (03CR) 10Catrope: [C: 03+2] close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) (owner: 10RhinosF1) [23:30:42] (03Merged) 10jenkins-bot: close trwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605978 (https://phabricator.wikimedia.org/T247330) (owner: 10RhinosF1) [23:32:25] RhinosF1: Alright, close trwikinews is ready for testing. Is there anything else I need to run to make that happen? [23:32:38] RoanKattouw: I don't think so. [23:33:38] RhinosF1: Could you also do a follow-up removing all trwikinews settings from InitialiseSettings.php [23:33:42] ? [23:34:07] RoanKattouw: now or whenever? [23:34:10] Whenever [23:34:20] I can do. [23:34:23] The documentation says to do it but I don't think it hurts to do it later [23:34:23] https://wikitech.wikimedia.org/wiki/Close_a_wiki [23:34:39] In any case, lemme know when you've tested and I'll sync [23:35:13] RoanKattouw: edit seems revoked from everyone.
[23:35:22] docs say to touch IS.php if not done [23:35:27] Yeah I will [23:35:48] Syncing dblists first, then I'll touch+sync InitialiseSettings.php [23:36:04] ok [23:36:39] !log catrope@deploy1001 Synchronized dblists/: Close trwikinews (T247330) (duration: 00m 58s) [23:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:43] T247330: Close Turkish Wikinews - https://phabricator.wikimedia.org/T247330 [23:40:59] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: touch for T247330 (duration: 00m 56s) [23:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:16] (03CR) 10Ladsgroup: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/606824 (owner: 10Ladsgroup) [23:42:52] RoanKattouw: looks good, will do the follow up tomorrow. [23:43:08] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 75 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:43:26] Thanks! [23:43:30] Have a good night [23:43:59] RoanKattouw: ty [23:44:08] * RhinosF1 has had 3 late nights in a row [23:48:55] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas