[00:00:04] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T0000).
[00:05:31] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:09:27] (PS1) Ahmon Dancy: Only report errors from production mw servers [puppet] - https://gerrit.wikimedia.org/r/673171
[00:10:36] (PS2) Ahmon Dancy: logspam.pl: Only process errors from production mw servers [puppet] - https://gerrit.wikimedia.org/r/673171
[00:13:20] (CR) Krinkle: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:14:27] (CR) Sharvaniharan: "@ottomata, @MHolloway any idea what the tab v/s space error is about? I did use tabs.. not sure what's wrong" [mediawiki-config] - https://gerrit.wikimedia.org/r/673005 (owner: Sharvaniharan)
[00:17:58] SRE, Services, Patch-For-Review, Performance-Team (Radar), Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (Krinkle)
[00:18:57] (CR) Ahmon Dancy: [C: -1] "Holding due to Krinkle's comments." [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:19:27] (CR) Brennen Bearnes: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:19:39] dancy: any specific noise that led to this?
[00:19:44] might be able to shed some light
[00:19:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:20:55] yes, this: `2021-03-17 23:12:39 [37c59da3060210298617bec3] mwdebug1001 enwiki 1.36.0-wmf.34 error WARNING: [37c59da3060210298617bec3] [no req] ErrorException: PHP Notice: Writing to directory /home/urbanecm/.config/psysh is not allowed. `
[00:21:29] dancy: that's me running shell.php on mwdebug1001
[00:21:40] Yeah, eval/shell.php are excluded in Logstash for that reason
[00:21:45] We had an exclusion for mwmaint* already but this showed up today so I tried coming at it from a different direction.
[00:21:51] can be run on any mw* server, including mw1234
[00:22:06] sorry, I'll try to not run shell.php elsewhere :)
[00:22:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:14] well, I do, and have to.
[00:22:40] OK. I can change it to an eval/shell.php filter.
[00:23:02] (CR) Krinkle: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[00:23:42] dancy: alright. if anything not eval/shell related does pop up, happy to take look. there might be something else we need to fix or avoid by other means.
[00:24:08] Sounds good. Thanks all. I'll update the commit tomorrow.
[00:34:25] (PS8) Krinkle: arclamp: serve SVGs, compressed logs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[00:34:44] (CR) Krinkle: [C: +1] "This is ready to go afaics." [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[01:13:02] (CR) Brennen Bearnes: logspam.pl: Only process errors from production mw servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/673171 (owner: Ahmon Dancy)
[01:14:59] (PS2) Aaron Schulz: Use $region for default mcrouter routes [puppet] - https://gerrit.wikimedia.org/r/654330
[01:32:29] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[01:32:33] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[01:35:39] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:42:57] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:19] (CR) Aaron Schulz: "Compiler shows diffs for me: https://puppet-compiler.wmflabs.org/compiler1002/28659/" [puppet] - https://gerrit.wikimedia.org/r/654330 (owner: Aaron Schulz)
[01:46:03] (PS1) Dzahn: parsoid::testreduce: switch mysql data dir to /srv/data/mysql [puppet] - https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580)
[01:48:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={eqiad,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:50:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:07:05] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[03:09:31] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[03:26:45] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:35:33] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:31] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.591 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:29] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:03] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[03:42:05] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[03:42:53] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:53] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:44:47] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:25] !log restarting slapd on seaborgium, serpens, and r-o ldap replicas (we're getting irregular connection failures)
[03:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:07] RECOVERY - Check systemd state on ml-serve2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:51] RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:29] PROBLEM - Check systemd state on ml-serve2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:11] PROBLEM - Check systemd state on ml-serve2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:31] (CR) Subramanya Sastry: [C: +1] "Will the current data in /var/lib/mysql be copied over separately after?" [puppet] - https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: Dzahn)
[04:10:44] (CR) Legoktm: [C: -1] arclamp: serve SVGs, compressed logs from Swift (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[04:11:58] (PS1) Andrew Bogott: Nova vendordata first boot: try to work around a puppet race [puppet] - https://gerrit.wikimedia.org/r/673178
[04:12:37] (CR) Andrew Bogott: [C: +2] Nova vendordata first boot: try to work around a puppet race [puppet] - https://gerrit.wikimedia.org/r/673178 (owner: Andrew Bogott)
[04:13:55] (CR) Legoktm: [C: -1] "Ping me tomorrow (Thursday) and we can sync on deploying this?" [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[04:20:35] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup1002), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:29:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:32:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:52:31] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[04:52:31] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[05:13:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:36:21] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:44:55] RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:55] PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 for schema change', diff saved to https://phabricator.wikimedia.org/P14940 and previous config saved to /var/cache/conftool/dbconfig/20210318-060445-marostegui.json
[06:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:43] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[06:11:47] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:18:12] (PS1) Marostegui: db1084: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673186 (https://phabricator.wikimedia.org/T276302)
[06:18:55] (CR) Marostegui: [C: +2] db1084: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673186 (https://phabricator.wikimedia.org/T276302) (owner: Marostegui)
[06:20:45] (PS1) Marostegui: instances.yaml: Add db1161 to dbctl [puppet] - https://gerrit.wikimedia.org/r/673187 (https://phabricator.wikimedia.org/T258361)
[06:21:21] (CR) Marostegui: [C: +2] instances.yaml: Add db1161 to dbctl [puppet] - https://gerrit.wikimedia.org/r/673187 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[06:22:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2120', diff saved to https://phabricator.wikimedia.org/P14941 and previous config saved to /var/cache/conftool/dbconfig/20210318-062201-marostegui.json
[06:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:06] SRE, ops-eqiad, DBA, DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (Marostegui) Looking good: ` [06:29:42] marostegui@cumin1001:~$ sudo cumin 'db11[76-84].eqiad.wmnet' 'free -g ; echo ; df -hT /srv; echo ; pvs ; echo ; megacli -LdPdInfo -a0 | eg...
[06:32:29] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:32:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1161 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14942 and previous config saved to /var/cache/conftool/dbconfig/20210318-063241-marostegui.json
[06:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:50] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[06:33:35] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1161 is now on dbctl but depooled. Won't pool till Monday
[06:48:43] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 83, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:52:35] (PS1) Ladsgroup: flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/673189
[06:53:29] (PS1) Marostegui: db11[77-84]: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673190 (https://phabricator.wikimedia.org/T275633)
[06:54:20] (CR) Marostegui: [C: +2] db11[77-84]: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/673190 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui)
[06:59:31] (PS1) Marostegui: install_server: Reimage db1156 as stretch [puppet] - https://gerrit.wikimedia.org/r/673191 (https://phabricator.wikimedia.org/T258361)
[07:00:18] (CR) Marostegui: [C: +2] install_server: Reimage db1156 as stretch [puppet] - https://gerrit.wikimedia.org/r/673191 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[07:01:35] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1156.eqiad.wmnet'] ` The log ca...
[07:01:44] Puppet, Beta-Cluster-Infrastructure: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (Majavah) See also: {T161675}
[07:02:18] (PS1) ArielGlenn: Dumps: continue restructuring page content batches [dumps] - https://gerrit.wikimedia.org/r/673192 (https://phabricator.wikimedia.org/T252396)
[07:02:46] (CR) jerkins-bot: [V: -1] Dumps: continue restructuring page content batches [dumps] - https://gerrit.wikimedia.org/r/673192 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[07:05:04] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) For db1165 which will replace db1085 in s6, what I will do: - Do not reimage db1165 to Stretch, instead will leave it as Buster and...
[07:05:20] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:07:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:09:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:13:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
[07:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE
[07:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14943 and previous config saved to /var/cache/conftool/dbconfig/20210318-071747-root.json
[07:17:50] RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:32] !log Deploy schema change on s4 codfw master, lag will appear - T276150 T276156
[07:19:35] (PS2) ArielGlenn: Dumps: continue restructuring page content batches [dumps] - https://gerrit.wikimedia.org/r/673192 (https://phabricator.wikimedia.org/T252396)
[07:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:40] T276150: Schema change to make rc_id unsigned and rc_timestamp BINARY - https://phabricator.wikimedia.org/T276150
[07:19:41] T276156: Drop default of rc_timestamp - https://phabricator.wikimedia.org/T276156
[07:20:30] !log depooling & restarting blazegraph on wdqs1005
[07:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:34] PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:00] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.079 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:22:54] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1156.eqiad.wmnet'] ` and were **ALL** successful.
[07:23:48] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:27:10] <_joe_> uhm why is tcpircbot alerting?
[07:27:15] <_joe_> not again, sigh
[07:27:56] <_joe_> I see dbctl works though
[07:28:28] <_joe_> yeah it'
[07:28:33] <_joe_> s the alert that's wrong
[07:32:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14944 and previous config saved to /var/cache/conftool/dbconfig/20210318-073250-root.json
[07:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:46] ACKNOWLEDGEMENT - tcpircbot_service_running on alert1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py Giuseppe Lavagetto False positive the service is running fine https://wikitech.wikimedia.org/wiki/Logmsgbot
[07:34:43] (PS1) ArielGlenn: Name the batch files lock retry variables better [dumps] - https://gerrit.wikimedia.org/r/673194 (https://phabricator.wikimedia.org/T252396)
[07:36:13] (PS1) Marostegui: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336)
[07:36:27] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[07:38:20] (PS1) Marostegui: wmnet: Update s7-master cname [dns] - https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336)
[07:39:36] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[07:40:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:41:06] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:41:08] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[07:41:18] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[07:42:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:42:12] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:44:12] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[07:47:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14945 and previous config saved to /var/cache/conftool/dbconfig/20210318-074754-root.json
[07:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:39] (PS1) ArielGlenn: worker.py --batches requires one job name be specified [dumps] - https://gerrit.wikimedia.org/r/673198 (https://phabricator.wikimedia.org/T252396)
[07:53:08] (PS2) ArielGlenn: update worker scripts to loop in secondary batch worker mode [dumps] - https://gerrit.wikimedia.org/r/638043 (https://phabricator.wikimedia.org/T252396)
[07:59:19] (PS1) Alexandros Kosiaris: ml_k8s::worker: Use new kubernetes/calico [puppet] - https://gerrit.wikimedia.org/r/673199
[08:01:33] (CR) Alexandros Kosiaris: [C: +2] "Gonna merge this as I am trying to debug the creation of the docker volume_group. It should NOT be related but it's causing noise and a di" [puppet] - https://gerrit.wikimedia.org/r/673199 (owner: Alexandros Kosiaris)
[08:02:58] !log reimage ml-serve1004 to debug a docker volume_group issue
[08:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14946 and previous config saved to /var/cache/conftool/dbconfig/20210318-080258-root.json
[08:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:09] akosiaris: thanks, lemme know if I can help
[08:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:17] (I was asking about the calico packages in #serviceops :)
[08:05:43] ah, we only have them for stretch?
[08:05:51] remind me again, why are you targeting buster?
[08:07:20] akosiaris: simply because we thought that it would have been easier than later on with prod traffic flowing, but we assumed it wouldn't have caused pain to others :) We can revert to Stretch in case
[08:08:10] I think it's the inverse. Per the buster migration task, it's actually quite more difficult. You are pioneering a bit there
[08:08:45] and btw, we are pondering whether for the services cluster it makes sense to skip straight to bullseye
[08:09:01] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:07] akosiaris: the pioneering part is fine, that painful part is pinging you daily to ask for guidance, this is why I said that :)
[08:09:31] services/main/the k8s cluster up to now. We need to name it now that we have >1 I guess
[08:11:26] one thing that you will definitely need to rebuild is our rsyslog package for buster
[08:11:43] we have the rsyslog-kubernetes package which IIRC is not on buster
[08:11:47] akosiaris: what magic did you use to make dockerd running? The lvm partitions were not created by the storage profile
[08:11:48] * akosiaris double checking
[08:12:12] I thought there was a race condition in puppet but now I am confused
[08:12:15] elukey: No magic, I just merged the puppet change you saw above
[08:12:26] well no it's actually magic, cause I don't understand why it failed
[08:12:37] this is bashable ^ :P
[08:12:39] ahahahah
[08:13:02] my impression was that docker.io was installed and dockerd started before the lvm class in profile::docker::engine
[08:13:04] but seriously, I kind of went on a hunch seeing the diff in the catalog
[08:13:05] err storage
[08:13:24] it tried to start but failed, but that was not the issue
[08:13:36] the problem was the puppet wasn't trying to create the docker lvm volume_group
[08:13:44] as to why it wasn't trying to do that, there, you got me
[08:13:53] I even got the catalog and it clearly referenced it
[08:13:56] ah it wasn't even trying?
[08:14:26] yeah, in a way that made 0 sense to me.
[08:14:52] at some point I thought it was because of dependencies, but no indication of that
[08:15:08] I was probably way too tired and missed something yesterday night, but this is puzzling to say the least
[08:15:31] akosiaris: sorry to re-ask, but isn't it because docker.io was installed by puppet before the lvm volumes were created in profile::docker::storage?
[08:15:45] RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:05] the only explanation that I can have is that your recent change caused a different puppet execution order
[08:16:13] otherwise I feel lost
[08:16:42] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
[08:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:27] elukey: my feeling exactly yesterday.
[08:17:36] it's not better right now. but at least I got a lead now
[08:17:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,routinator} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:17:59] (PS1) Majavah: Enable CentralAuth IRC feed in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/673201 (https://phabricator.wikimedia.org/T277432)
[08:18:13] RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:53] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1004.eqiad.wmnet with reason: REIMAGE
[08:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:00] elukey: you might be right, but I should be getting some error about failed dependencies creating the lvm volumes and I did not
[08:20:19] plus... how on earth could kubernetes-node and calico be related to the docker profile class.
[08:21:02] RECOVERY - DPKG on ml-serve1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:21:08] akosiaris: I stopped wondering why when dealing with puppet a long time ago :D
[08:21:37] (due also to my ignorance about its internals, but I chose mental sanity instead)
[08:22:05] yeah, that's where you got me. I went down the road of stitching puppet catalogs yesterday
[08:22:19] thanks a lot for all the work btw :)
[08:22:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:22:34] and it's clear that something was up with the catalog and it would not create the volume_group, but I am still not sure what
[08:22:42] or how 2 hiera variables fixed it
[08:23:58] maybe we could add an explicit dep between profile::docker::engine and ::storage
[08:24:50] RECOVERY - Check systemd state on ml-serve1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:27:48] !log swift eqiad-prod: less weight for ms-be[1019-1026] - T272836
[08:27:50] it's probably the calico change that did the diff. It does rely on docker
[08:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:57] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836
[08:28:07] whereas on version 3 it does not
[08:28:40] ah good point
[08:28:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:29:29] SRE, SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (JAllemandou) Thanks for letting me know @Ottomata :) @cmassaro : Let's sync on the work you wish to accomplish, as wikitext-history is really big and I might have some h...
[08:31:40] (CR) DCausse: [C: +1] rdf-streaming-updater:create helmfile.d structure [deployment-charts] - https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: Mstyles)
[08:31:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:32:34] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 67 probes of 602 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:34:08] SRE, Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (fgiunchedi) >>! In T224579#6920712, @fgiunchedi wrote: > And connections from Prometheus kept piling up. AFAIK the service/exporter is not owned ATM, I've restarted the exporter but this is obviou...
[08:34:54] RECOVERY - Check systemd state on ml-serve2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:38:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 602 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:38:44] PROBLEM - Check systemd state on snapshot1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:18] RECOVERY - DPKG on ml-serve1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:40:48] RECOVERY - DPKG on ml-serve2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:44:48] RECOVERY - DPKG on ml-serve1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:45:13] SRE, Patch-For-Review: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (Majavah)
[08:46:10] RECOVERY - Check systemd state on ml-serve2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:49:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:51:27] RECOVERY - DPKG on ml-serve2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:53:05] RECOVERY - DPKG on ml-serve2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:54:33] <_joe_> apergos: snapshot1005 is in a very bad state [08:54:46] <_joe_> I can't get dmesg output and the fs is read-only [08:56:03] (03PS1) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 [08:56:51] <_joe_> yeah gonna try to reboot it, but I doubt it will come up cleanly [08:57:22] <_joe_> Mar 18 08:33:01 snapshot1005 kernel: [2677270.299698] sd 0:1:0:0: [sda] tag#611 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [08:57:23] (03PS1) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396) [08:57:35] ouch [08:57:45] nothing is going on over there so I dunno why [08:58:10] sda orilly [08:58:15] well that is dead as dead all right [08:58:19] (03CR) 10jerkins-bot: [V: 04-1] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [08:58:31] it's out of warranty, the new boxes were ordered to replace snapshot1005,6,7 [08:58:43] and expected Jan 31 but no word from Dell, they are just gone [08:58:55] I saw an escalation to willy on the task overnight [08:58:56] <_joe_> wut [08:59:09] <_joe_> so, ok to reboot? 
[08:59:10] in the meantime I have a testbed host I can assign the production role to, for the next round [08:59:25] <_joe_> worst that can happen is it doesn't come back [08:59:26] oh sure, actually lemme see if I can get on the host first [08:59:34] <_joe_> yes you can [08:59:42] <_joe_> ssh works, sudo works, that's about it :P [09:00:33] lol /usr/bin/various are broken [09:00:38] ps axuww worked :-D [09:00:51] nothing of interest happening there so lemme get off [09:01:01] reboot away [09:02:03] I'm amazed that's the one thing that alerted in icinga (the systemd unit) [09:02:56] <_joe_> yeah... [09:03:55] doo dee doo dee doo [09:04:08] <_joe_> so the only clean way to reboot was using systemd [09:04:14] <_joe_> because it's in-memory and running [09:04:20] <_joe_> take that systemd-haters [09:04:27] hahahaha [09:04:34] you were a systemd hater once [09:04:39] <_joe_> nope [09:04:41] <_joe_> never been [09:04:49] don't make me look in the logs [09:04:53] <_joe_> !log attempted reboot of snapshot1005, read-only filesystem and probably disks are broken beyond repair [09:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:03] <_joe_> I'm pretty sure I've never been :) [09:05:11] <_joe_> you're confusing me and brandon [09:05:25] nice attempt at a save :-P :-D [09:05:27] <_joe_> I might have expressed contempt towards Lennart and his attitude [09:05:37] you and brandon don't even vaguely kinda look alike [09:05:42] if it was Seddon, well... [09:05:59] PROBLEM - Host snapshot1005 is DOWN: PING CRITICAL - Packet loss = 100% [09:06:00] <_joe_> well we both do ramble like old men a lot [09:06:03] heh [09:06:06] <_joe_> hey icinga, keep up [09:06:06] and you're both not [09:06:21] I'm an old geezer, the rest of you younguns are just wannabes :-P [09:06:45] I assume you're watching the console?
[09:06:59] <_joe_> no I'm not [09:07:14] <_joe_> my plan was to wait a bit first, and if it doesn't come back, go to the console [09:07:25] I think there's two os disks in raid 1 (ssd) [09:07:30] oh ok, that's fine too [09:08:10] yep hw raid 1 two ssds for the os indeed [09:08:27] <_joe_> hw raid and it broke? wtf [09:08:43] unclear [09:09:10] <_joe_> ugh, it's an ilo? [09:09:16] these are hp proliant boxes [09:09:19] <_joe_> sigh [09:09:24] the three we are replacing, incl this one [09:13:22] <_joe_> !log hard reboot of snapshot1005 [09:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of inline comments" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [09:15:55] <_joe_> apergos: it looks like the failure is in the bios controller [09:16:04] lovely [09:16:19] any message that can be screenshotted or copy-pastad into a task? [09:16:55] <_joe_> it's booting now though [09:17:00] er wut [09:17:09] how is it booting if the... >_< [09:17:12] <_joe_> yeah the message was about a *previous* failure [09:17:17] oh. hrm [09:17:21] <_joe_> I think the hard reset might have done the trick [09:17:25] welllll [09:17:35] <_joe_> [FAILED] Failed to mount /mnt/dumpsdata.
[09:17:35] maybe I'll just turn that into the testbed host anyways to be safe [09:17:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:17:47] huh [09:17:49] RECOVERY - Host snapshot1005 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [09:17:56] <_joe_> please take a look yourself at this point :) [09:18:35] why would it fail to mount a frickin nfs share [09:19:00] (03CR) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [09:19:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:20:37] it's there now :-/ [09:20:43] RECOVERY - Check systemd state on snapshot1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:24:06] mount.nfs: Failed to resolve server dumpsdata1003.eqiad.wmnet: Name or service not known [09:24:08] ok really? [09:24:53] but two minutes later it was ok during the puppet run after reboot [09:25:35] (03CR) 10Kormat: mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui) [09:25:36] <_joe_> apergos: i guess that systemd unit needs to add a dependency on network.target maybe? 
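(Note on the boot-time failure above: `mount.nfs` could not resolve dumpsdata1003.eqiad.wmnet during boot, yet the same mount worked about two minutes later, which suggests the mount was attempted before name resolution was available. On systemd hosts the usual way to express "wait until the network is actually up" is ordering after network-online.target rather than network.target. The sketch below is only an illustration of the same idea in Python, not anything used on snapshot1005: it retries the lookup a few times before giving up. The function name `wait_for_dns` and its parameters are hypothetical.)

```python
import socket
import time
from typing import Callable


def wait_for_dns(
    host: str,
    attempts: int = 5,
    delay: float = 2.0,
    resolve: Callable[[str], object] = lambda h: socket.getaddrinfo(h, None),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as `host` resolves, False after `attempts` failures.

    `resolve` and `sleep` are injectable so the retry logic can be exercised
    without touching real DNS or actually waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host)
            return True
        except socket.gaierror:
            # Same class of error as "Name or service not known" from mount.nfs.
            if attempt < attempts:
                sleep(delay)
    return False
```

(A wrapper could call `wait_for_dns("dumpsdata1003.eqiad.wmnet")` and only attempt the mount when it returns True. This is a workaround pattern; fixing the ordering in the mount configuration itself would be the cleaner option.)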
[09:25:43] ah maybe so [09:26:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:26:49] (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui) [09:27:13] no indication in the logs of how it died, nothing on the reboot to indicate issues that I can see [09:27:20] still going to switch it out for another host though [09:27:20] (03CR) 10Kormat: [C: 03+1] wmnet: Update s7-master cname [dns] - 10https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui) [09:27:31] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui) [09:27:38] (03PS2) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) [09:32:38] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: 10Marostegui) [09:32:57] not really a systemd unit, I think it's just from the nfs mount being in pass 0 in /etc/fstab (which is likely my fault someway or other)... still looking around [09:34:28] hmm no. meh [09:35:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:37:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:37:40] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10JMeybohm) [09:44:24] lol snapshot1005 is already the testbed. so I guess I don't have to move it :-D [09:46:06] (03CR) 10Jbond: [C: 03+2] P:debmonitor: Make profile compatible with cloud environments [puppet] - 10https://gerrit.wikimedia.org/r/673050 (owner: 10Jbond) [09:46:15] (03CR) 10Jbond: [C: 03+2] hiera - cloud: add config for pki-debmon [puppet] - 10https://gerrit.wikimedia.org/r/673048 (owner: 10Jbond) [09:46:34] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Majavah) [09:47:08] (03PS1) 10Marostegui: install_server: Reimage db1181 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/673217 (https://phabricator.wikimedia.org/T275633) [09:47:35] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) [09:47:45] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1181 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/673217 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [09:48:44] (03PS1) 10Kormat: hiera: Reenable notifications for db2089+db2137 [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) [09:49:32] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10JMeybohm) p:05Triage→03Medium [09:49:42] (03CR) 10Marostegui: "The other two will be handled by jaime?" 
[puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat) [09:50:09] (03CR) 10Kormat: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat) [09:50:19] (03CR) 10Marostegui: [C: 03+1] hiera: Reenable notifications for db2089+db2137 [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat) [09:51:09] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [09:51:16] 10SRE, 10Prod-Kubernetes, 10SRE-tools, 10serviceops: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10JMeybohm) The cookbook does not seem to work (tried during the kubernetes codfw reinit): * It did not allow multiple services a... [09:51:51] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [09:55:03] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10Volans) @CGlenn access granted. Please confirm that it's working and that you've read the following notice: ` If you believe your Google account delegated for Google Webmaster Too... 
[09:56:50] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Services, 10Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (10Jgiannelos) [09:57:20] (03PS1) 10Jcrespo: dbbackups: Move s5 from db2139 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) [09:57:40] (03PS2) 10Jcrespo: dbbackups: Move s5 from db2139 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) [09:57:55] (03PS2) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 [09:57:57] (03CR) 10Kormat: [C: 03+2] hiera: Reenable notifications for db2089+db2137 [puppet] - 10https://gerrit.wikimedia.org/r/673219 (https://phabricator.wikimedia.org/T277632) (owner: 10Kormat) [09:58:04] (03PS3) 10Jcrespo: dbbackups: Move s5 from db2139 to db2101 [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) [10:00:04] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [10:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1000). [10:00:07] (03CR) 10jerkins-bot: [V: 04-1] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [10:04:18] (03PS3) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 [10:04:44] (03CR) 10Marostegui: [C: 03+1] dbbackups: Move s5 from db2139 to db2101 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [10:06:34] (03CR) 10Jcrespo: [C: 03+2] "I was aware." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673220 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [10:11:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) [10:13:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatdata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) p:05Triage→03Medium @cmassaro I can't find your signature on L3, could you please double check you've signed it? [10:13:55] (03CR) 10Giuseppe Lavagetto: check_systemd_state: improve alerting message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [10:15:18] (03PS1) 10Jbond: systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 [10:16:49] (03CR) 10jerkins-bot: [V: 04-1] systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond) [10:19:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-user for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) [10:19:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) [10:19:45] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10akosiaris) > After a chat with Filippo, IIUC 8.2008.0-1 is used only on centrallog nodes (hence the component) but we might want to use 8.19011 provided on Buster and add the custom bits for rs... [10:23:20] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:10] (03PS1) 10Klausman: helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 [10:27:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [10:28:02] PROBLEM - mysqld processes on db2101 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:28:15] (03CR) 10Klausman: [C: 03+1] ml_k8s::worker: Use new kubernetes/calico [puppet] - 10https://gerrit.wikimedia.org/r/673199 (owner: 10Alexandros Kosiaris) [10:28:49] jynus: ^ is that one of the backup ones? [10:29:45] (03PS13) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [10:29:47] (03PS15) 10Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [10:30:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman) [10:30:08] yep, it is a backup source [10:30:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:30:43] oh, I guess there is a race condition with alert1001 [10:30:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] WIP: Update cxserver to 2021-03-15-131520-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/672386 (https://phabricator.wikimedia.org/T271711) (owner: 10KartikMistry) [10:30:58] I will run puppet on icinga so it is disabled [10:31:43] 
10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) [10:32:11] jynus: in case it's relevant: once you add `profile::base::notifications: disabled`, you need to run puppet on the affected machine once, and then on alert1001 *twice* [10:32:29] yeah, it is not great [10:32:31] (don't ask me _why_. i've just discovered this empherically) [10:32:51] *empirically [10:33:03] not sure if you have to run it twice or just "wait an undeterminite amount of minute between puppet runs" [10:33:09] https://json-schema.org/understanding-json-schema/structuring.html [10:33:10] I guess for the backend to get updated [10:33:19] ops, wrong tab :D [10:33:32] (03PS2) 10Jbond: systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 [10:33:43] jynus: i run it twice in succession, and that works. [10:33:50] volans, ther right tab is: https://learnxinyminutes.com/docs/yaml/ [10:34:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28661/console" [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [10:34:28] kormat, sure. I just think there is a wait, for example, on the reimaging scripts like that [10:34:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:35:39] (03CR) 10Klausman: [C: 03+2] helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman) [10:35:55] (03CR) 10Elukey: "All right all puppet-style comments resolved, now I am going to only modify the capacity scheduler settings :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [10:36:58] (03Merged) 10jenkins-bot: helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman) [10:37:03] (03PS1) 10Volans: admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) [10:37:16] (03CR) 10Klausman: [V: 03+2 C: 03+2] helm: Add calico/cordns/values config for ML k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/673222 (owner: 10Klausman) [10:37:42] (03CR) 10Volans: [C: 04-1] "Waiting for L3 to be signed" [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans) [10:37:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28662/console" [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond) [10:44:58] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) I am a little confused from the debian/stretch-wikimedia branch of the rsyslog repo, because the debian changelog seems to mention `8.38.0-1~bpo9+1wmf1` (it corresponds to a commit from... 
[10:48:37] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) [10:56:57] (03PS1) 10Kosta Harlan: Remove variant C from list of valid variants [extensions/GrowthExperiments] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673107 (https://phabricator.wikimedia.org/T277727) [10:57:34] (03PS1) 10Gergő Tisza: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673224 (https://phabricator.wikimedia.org/T277727) [10:57:51] (03PS3) 10Jbond: systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 [10:57:53] (03PS1) 10Jbond: systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1100). [11:00:04] Majavah: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] i can deploy today [11:00:15] here, beta only patch [11:00:18] kostajh: around? 
[11:00:28] (03CR) 10jerkins-bot: [V: 04-1] systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [11:00:35] Urbanecm: indeed [11:00:44] (03CR) 10Urbanecm: [C: 03+2] Enable CentralAuth IRC feed in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673201 (https://phabricator.wikimedia.org/T277432) (owner: 10Majavah) [11:00:45] just finished updating the calendar with the two patches [11:01:00] (03PS2) 10Mvolz: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 [11:01:02] (03PS2) 10Jbond: systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 [11:01:04] (03PS2) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 (owner: 10PipelineBot) [11:01:06] Majavah: done, it will apply automagically within 30 minutes. Shout at me if it doesn't ;) [11:01:14] Urbanecm: suree, thanks! [11:01:23] (03CR) 10Urbanecm: [C: 03+2] Remove variant C from list of valid variants [extensions/GrowthExperiments] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673107 (https://phabricator.wikimedia.org/T277727) (owner: 10Kosta Harlan) [11:01:43] (03Merged) 10jenkins-bot: Enable CentralAuth IRC feed in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673201 (https://phabricator.wikimedia.org/T277432) (owner: 10Majavah) [11:01:51] kostajh: please remind me, does the config depend on backport? 
[11:02:56] (03CR) 10jerkins-bot: [V: 04-1] systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [11:03:22] (03PS3) 10Jbond: systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 [11:03:42] Urbanecm: it does not [11:03:58] kostajh: okay, let's do the config first then [11:04:04] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673224 (https://phabricator.wikimedia.org/T277727) (owner: 10Gergő Tisza) [11:04:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28665/console" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [11:05:13] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 (owner: 10PipelineBot) [11:05:21] (03Merged) 10jenkins-bot: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673224 (https://phabricator.wikimedia.org/T277727) (owner: 10Gergő Tisza) [11:05:52] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/669967 (owner: 10PipelineBot) [11:06:02] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/670839 (owner: 10PipelineBot) [11:06:14] kostajh: can you test on mwdebug1001 please? [11:06:23] Urbanecm: the config patch? 
[11:06:26] yup [11:06:28] if possible [11:06:37] Urbanecm: beta scaps seem to be stuck looking at https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ :-/ [11:06:38] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/672723 (owner: 10PipelineBot) [11:06:48] Majavah: lemme restart the job [11:06:49] Urbanecm: alright [11:07:51] Majavah: restarted [11:08:22] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: NOOP: e7f5eac: Enable CentralAuth IRC feed in beta cluster (T277432) (duration: 01m 12s) [11:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:40] T277432: Enable CentralAuth IRC feed in beta cluster - https://phabricator.wikimedia.org/T277432 [11:09:10] Urbanecm: thanks! does that require some Jenkins UI access or could I have done that myself on deployemnt-deploy01 somehow? [11:09:36] Majavah: it requires access in jenkins UI [11:09:45] (precisely, be in https://ldap.toolforge.org/group/nda IIRC) [11:09:45] Urbanecm: seems fine [11:09:50] thanks kostajh, syncing [11:10:00] Majavah: (or in wmf LDAP group) [11:10:27] Urbanecm: okay, thanks [11:10:41] If there's time, can this be done too? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/673189 [11:11:04] I can check it on mwdebug to make sure the wiki doesn't break [11:11:35] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0005676e704cad907655a4a0bca7bd2164714b1c: GrowthExperiments: set $wgGEHomepageNewAccountVariants to D only (T277727) (duration: 01m 10s) [11:11:38] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . 
[11:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:43] T277727: hewiki users seem to get variant C on desktop which breaks the UI - https://phabricator.wikimedia.org/T277727 [11:11:47] config done kostajh [11:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:51] (03PS3) 10Mvolz: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 [11:11:59] Urbanecm: cool :) [11:12:16] Amir1: sure [11:12:22] (03PS7) 10David Caro: wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) [11:12:24] (03PS5) 10David Caro: wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) [11:12:26] (03PS3) 10David Caro: wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) [11:12:54] (03Merged) 10jenkins-bot: Remove variant C from list of valid variants [extensions/GrowthExperiments] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673107 (https://phabricator.wikimedia.org/T277727) (owner: 10Kosta Harlan) [11:13:11] Thanks! [11:14:27] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[11:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:41] kostajh: backport now at mwdebug1001 [11:14:43] * Urbanecm testing as well [11:14:48] Urbanecm: thx, looking [11:14:56] (03PS1) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) [11:15:44] Urbanecm: hmm, I'm not seeing the SE module for a user who has their variant set to C [11:15:57] kostajh: my test acc from yesterday has this https://usercontent.irccloud-cdn.com/file/PTF84Zae/image.png [11:16:27] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:36] and the same appears when I run `new mw.Api().saveOption('growthexperiments-homepage-variant', 'C').done( function() { window.location.reload() });` again [11:16:44] Urbanecm: that looks good but the account I created after you synced the config patch has https://imgur.com/a/K1AWDUE [11:16:55] * kostajh looks at logs [11:17:07] kostajh: that looks like frwiki? [11:17:21] Urbanecm: oops. right [11:17:28] you need to test at hewiki, because that's the only Wikipedia we target that's in group1 [11:17:34] wrong group [11:17:44] yep looking again :) [11:17:52] thx [11:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:18:42] Urbanecm: yeah seems good to me [11:18:45] cool [11:18:46] let's sync [11:19:03] 10SRE, 10Services, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10akosiaris) Hi, Thanks for this request. 
Couple of quick questions and pointers * Is xhgui stateless? More specifically ** Does xhg... [11:19:16] (03CR) 10Gergő Tisza: [C: 03+1] flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup) [11:19:23] (03PS1) 10Jbond: P:tcpircbot: drop monitoring of service [puppet] - 10https://gerrit.wikimedia.org/r/673228 [11:19:25] (03PS1) 10Jbond: P:tcpircbot: delete a previoulsy absented resource [puppet] - 10https://gerrit.wikimedia.org/r/673229 [11:19:27] (03PS1) 10Jbond: P:tcpircbot: fix minor style violations [puppet] - 10https://gerrit.wikimedia.org/r/673230 [11:19:39] (03PS2) 10Urbanecm: flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup) [11:19:43] (03CR) 10Urbanecm: [C: 03+2] flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup) [11:20:08] kostajh: should we backport to wmf.34 as well, to fix the frwiki blank homepage? [11:20:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:20:33] (03CR) 10Jbond: "i think this looks fine however i think we should drop this check in favour of the generic check_systemd_state check, see https://gerrit.w" [puppet] - 10https://gerrit.wikimedia.org/r/673077 (owner: 10CRusnov) [11:20:34] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/GrowthExperiments/includes/HomepageHooks.php: 3b2aa1aa28e9d204f32ae937a84ec211137cbb2e: Remove variant C from list of valid variants (T277727) (duration: 01m 09s) [11:20:41] anyway, live for wmf.35 [11:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:44] T277727: hewiki users seem to get variant C on desktop which breaks the UI - https://phabricator.wikimedia.org/T277727 [11:20:51] (03Merged) 10jenkins-bot: flaggedrevs: Disable multiple dimensions in hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673189 (owner: 10Ladsgroup) [11:21:13] Amir1: pulled ^^onto mwdebug1001 [11:21:30] ack [11:22:00] (03CR) 10Mvolz: [C: 03+2] Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 (owner: 10Mvolz) [11:22:30] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10akosiaris) Hi, Is this still something that might happen? I don't see any activity on this task for the last 5 months. Note that it will require a per... 
[11:23:14] Urbanecm: looks good, please proceed [11:23:18] syncing [11:23:23] (03Merged) 10jenkins-bot: Update Zotero to 2021-03-12-015945-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/673011 (owner: 10Mvolz) [11:23:33] Urbanecm: patch was finally synced to beta and seems to be working, thanks [11:23:51] cool Majavah [11:24:57] !log urbanecm@deploy1002 Synchronized wmf-config/flaggedrevs.php: 896c9f019b17d1ad3a1589d377158ca2fb91ebaa: flaggedrevs: Disable multiple dimensions in hewikisource (duration: 01m 09s) [11:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:05] Amir1: done [11:25:10] anything else? [11:25:14] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [11:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:27] nah, thanks! [11:25:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/654336 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:25:49] cool :) [11:27:01] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/662765 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:27:25] (03CR) 10Giuseppe Lavagetto: [C: 03+2] check_systemd_state: improve alerting message [puppet] - 10https://gerrit.wikimedia.org/r/673209 (owner: 10Giuseppe Lavagetto) [11:31:24] (03CR) 10Jbond: check_keystone_roles.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670925 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:32:42] (03PS1) 10Giuseppe Lavagetto: check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234 [11:32:46] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge.etcd: Added cookbook to depool and remove a node [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [11:32:52] (03PS2) 
10Giuseppe Lavagetto: check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234 [11:32:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to add a new etcd node [cookbooks] - 10https://gerrit.wikimedia.org/r/668090 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [11:32:57] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: add cookbook to create an instance of a prefix [cookbooks] - 10https://gerrit.wikimedia.org/r/667214 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [11:33:24] (03CR) 10Jbond: wmcs-spreadcheck.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:33:47] (03PS3) 10Giuseppe Lavagetto: check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234 [11:34:11] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] check_systemd_state: fix variable to format [puppet] - 10https://gerrit.wikimedia.org/r/673234 (owner: 10Giuseppe Lavagetto) [11:34:11] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [11:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:35] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:58] (03CR) 10Jbond: wmcs-webproxy.py: Port to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:37:24] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[11:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:27] (03CR) 10Jbond: wmcs-webproxy.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [11:42:37] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:28] (03PS1) 10Volans: openldap: improve cross-validate-accounts [puppet] - 10https://gerrit.wikimedia.org/r/673241 [11:46:50] (03CR) 10Volans: "Tested with my user and the WMCS key and it printed:" [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans) [11:47:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:52:30] (03PS14) 10Elukey: hadoop: add a profile to deploy the capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) [11:52:31] (03PS16) 10Elukey: hadoop: set the Yarn capacity scheduler for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/672654 (https://phabricator.wikimedia.org/T277062) [11:53:27] (03CR) 10Jbond: "lgtm but i there are a few dependencies between required and not required params" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans) [11:53:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:57:30] (03CR) 10Volans: "replies inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans) [11:57:48] (03PS1) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) [11:58:02] (03PS2) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) [11:58:33] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [11:58:37] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [11:59:28] (03PS3) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) [12:00:03] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [12:00:10] (03PS4) 10Jcrespo: dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) [12:00:13] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] dbbackups: Reenable notifications on db2139 after removing s5 instance [puppet] - 10https://gerrit.wikimedia.org/r/673246 (https://phabricator.wikimedia.org/T277632) (owner: 10Jcrespo) [12:01:55] (03CR) 10Jbond: [C: 03+1] 
"> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:03:43] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/670937 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:03:48] (03CR) 10Jbond: [C: 03+1] proxylistener.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670937 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:05:59] (03PS1) 10Majavah: swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) [12:06:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) a:05Ottomata→03cmassaro [12:07:54] (03CR) 10Jbond: [C: 03+1] "lgtm few minor nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670938 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:10:40] (03CR) 10Jbond: [C: 03+1] pybal-eval-check.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670952 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [12:10:57] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) Hello! Yes, we would love to have this service deployed. Although, over the course of the last 5 months, we have developed a newer iteration o... [12:16:33] (03CR) 10Jbond: "ack thanks for the info lgtm" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans) [12:21:34] (03CR) 10Majavah: "cc'ing people listed on todays puppet request window, requesting review instead of cherrypicking on beta (recommended by https://wikitech." 
[puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah) [12:31:59] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [12:33:01] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [12:34:23] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) @JMeybohm thanks for the task, this is surely something we want to add support for. There's also a catch that I'm not sure how to solve right now, because the since serv... [12:39:52] (03PS2) 10Jbond: swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah) [12:40:10] Majavah: you ok for me to merge ^^^ now or do you want to wait for the window?
[12:42:47] (03CR) 10Jbond: "fyi i made a few minor changes" [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah) [12:42:51] (03CR) 10Jbond: [C: 03+1] swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah) [12:49:17] (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.3.2: create 6.3.2 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/670483 (owner: 10Jbond) [12:53:15] (03PS1) 10Ladsgroup: languageLabelDescriptionAliases: use getLanguageNameByCode [extensions/Wikibase] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673108 (https://phabricator.wikimedia.org/T275611) [12:54:15] (03CR) 10Ladsgroup: [C: 03+2] "UBN" [extensions/Wikibase] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673108 (https://phabricator.wikimedia.org/T275611) (owner: 10Ladsgroup) [12:54:32] I'm deploying a fix for the train (UBN) [12:55:56] jbond42: hi, around now, feel free to merge [12:56:09] Majavah: ack will merge now [12:56:25] (03CR) 10Jbond: [C: 03+2] swift: compare kernel version directly [puppet] - 10https://gerrit.wikimedia.org/r/673250 (https://phabricator.wikimedia.org/T276179) (owner: 10Majavah) [12:57:31] Majavah: merged and deployed to deployment-puppetmaster04 [12:57:56] jbond42: thanks!
I'll run puppet manually on the affected deployment-prep instances and will report back if there are any problems [12:58:44] !log upload cas_6.3.2 to apt buster-wikimedia [12:58:46] Majavah: ack [12:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:40] (03PS1) 10Jbond: idp: failover for upgrade [dns] - 10https://gerrit.wikimedia.org/r/673265 (https://phabricator.wikimedia.org/T271684) [13:08:07] jbond42: the patch correctly detected the kernel version, it updated fstab but even after systemctl daemon-reload systemctl has not picked up the new params, trying to reboot a host now to see if systemctl picks up the changes [13:08:25] ack [13:08:45] 10SRE, 10ops-eqiad: analytics1063 interface errors - https://phabricator.wikimedia.org/T277633 (10Cmjohnson) 05Open→03Resolved This has been completed [13:08:50] (03CR) 10Jbond: [C: 03+2] idp: failover for upgrade [dns] - 10https://gerrit.wikimedia.org/r/673265 (https://phabricator.wikimedia.org/T271684) (owner: 10Jbond) [13:09:56] (03PS1) 10Jbond: Revert "idp: failover for upgrade" [dns] - 10https://gerrit.wikimedia.org/r/673109 [13:10:41] jbond42: found the issue, it does the options wrong way around, I'll make a patch to fix [13:11:07] ack ping me when its in [13:12:02] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:12:09] (03PS1) 10Majavah: swift: fix kernel version check [puppet] - 10https://gerrit.wikimedia.org/r/673266 [13:12:10] there ^ [13:12:17] jbond42: ^ [13:12:25] did that cause that prod puppet failure check or other problems? 
[13:12:29] Majavah: ack one sec im just checking on the icinga error [13:12:35] possibly [13:14:14] (03CR) 10Jbond: [C: 03+2] swift: fix kernel version check [puppet] - 10https://gerrit.wikimedia.org/r/673266 (owner: 10Majavah) [13:14:19] jbond42: the alert is barely above the threshold, could it be that the ml-servexxxx nodes (WIP) are contributing? [13:15:02] elukey: yes it could be [13:15:39] i think the change above may have caused a transient error on some of the swift nodes which pushed it over the edge [13:16:21] indeed we got 'mount point not mounted or bad option.' [13:16:38] Majavah: that fix has been deployed [13:16:50] jbond42: thanks, seems to be working on beta [13:16:55] oops, sorry for that [13:17:34] Majavah: no probs, my fault i actually thought it was wrong but somehow convinced myself it wasn't [13:17:49] either way no harm done [13:18:00] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:18:46] also beta swift works now again! systemctl needed a manual daemon-reload [13:19:15] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) >>! In T277739#6924397, @elukey wrote: > I am a little confused from the debian/stretch-wikimedia branch of the rsyslog repo, because the debian changelog seems to mention `8.38.0-1~bpo...
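Editor's note on the kernel version check discussed above (changes 673250 and 673266): the log does not show what the patches contain, so the following is only a minimal Python sketch of one general pitfall with such checks, namely that comparing version strings lexically gives the wrong answer once a component reaches two digits. The function names `kernel_tuple` and `kernel_at_least` and the `4.19` threshold are hypothetical and chosen purely for illustration; they are not taken from the puppet repository.

```python
import re
from typing import Tuple


def kernel_tuple(release: str) -> Tuple[int, ...]:
    """Turn a release string such as '4.19.0-14-amd64' into (4, 19, 0).

    Hypothetical helper: only the leading dotted-number part is used.
    """
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def kernel_at_least(release: str, minimum: str) -> bool:
    """Compare numerically, component by component."""
    return kernel_tuple(release) >= kernel_tuple(minimum)


# Illustrative threshold only; the real one is not stated in the log.
THRESHOLD = "4.19"

newer = kernel_at_least("4.19.0-14-amd64", THRESHOLD)
older = kernel_at_least("4.9.0-13-amd64", THRESHOLD)

# A plain string comparison sorts '4.9.0' after '4.19', which is the
# kind of inverted result that selects the wrong branch.
lexical = "4.9.0" >= THRESHOLD
```

Whichever branch such a check selects, the two outcomes need to be mapped to the right mount options; the conversation above suggests the first patch had them the wrong way around.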
[13:19:46] (03Merged) 10jenkins-bot: languageLabelDescriptionAliases: use getLanguageNameByCode [extensions/Wikibase] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673108 (https://phabricator.wikimedia.org/T275611) (owner: 10Ladsgroup) [13:21:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: base: stop using LVM [puppet] - 10https://gerrit.wikimedia.org/r/673267 (https://phabricator.wikimedia.org/T272114) [13:21:35] it's on mwdebug1001 and confirming that it works, syncing to the world [13:22:51] (03CR) 10Jbond: [C: 03+2] Revert "idp: failover for upgrade" [dns] - 10https://gerrit.wikimedia.org/r/673109 (owner: 10Jbond) [13:23:47] !log ladsgroup@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/Wikibase/repo: [[gerrit:673108|languageLabelDescriptionAliases: use getLanguageNameByCode]] (T275611 T277722) (duration: 01m 14s) [13:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:56] T277722: TypeError: this._languageCodes is undefined at getLanguageNameMap - https://phabricator.wikimedia.org/T277722 [13:23:56] T275611: Termbox v2 uses the wrong list of languages - https://phabricator.wikimedia.org/T275611 [13:25:48] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10akosiaris) >>! In T277739#6924819, @elukey wrote: >>>! In T277739#6924397, @elukey wrote: >> I am a little confused from the debian/stretch-wikimedia branch of the rsyslog repo, because the deb... [13:26:47] (03CR) 10Ottomata: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/658396 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [13:27:48] (03CR) 10Ottomata: "I see some spaces on the opening bracket of your config entry, I bet that's it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [13:29:21] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) Thanks @Volans! Looks like I can get further than before, but I now see the following error when attempting to run a query: `Error while compiling statement: FAILED: RuntimeExce... [13:33:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10jijiki) @RobH thank you! @Jclark-ctr, mc1039-mc1054 can be racked in Q4, unless we have more mc* victims. Thank you! [13:34:14] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ladsgroup) I assume you need kerberos setup and if you have it, you need to initialize it with "kinit" first [13:34:23] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ladsgroup) https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide [13:35:21] 10SRE, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (10Gilles) [13:35:42] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) This might have got lost in all the comments, but to query mediawiki_history Sam will need to be in the analytics-privatedata-users group. Hm, we don't have this case covered in... 
[13:36:23] (03CR) 10Andrew Bogott: "This may be a more complete solution: https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456" [puppet] - 10https://gerrit.wikimedia.org/r/673267 (https://phabricator.wikimedia.org/T272114) (owner: 10Arturo Borrero Gonzalez) [13:38:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 88 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:40:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:42:33] (03PS1) 10Ottomata: Add samwalton as a posix user in analytics-privatedata-users, but no ssh [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) [13:42:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:43:46] (03PS1) 10Volans: interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 [13:44:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 49 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:44:44] (03CR) 10jerkins-bot: [V: 04-1] interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 (owner: 10Volans) [13:44:59] (03CR) 10Ayounsi: [C: 03+1] interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 (owner: 10Volans) [13:46:10] PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:46:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) Signed! 
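Editor's note on the RIPE Atlas alerts above: the messages report a count of failed probes and a threshold ("alerts on 65"). A minimal sketch of that evaluation, assuming the check goes CRITICAL when the failed count is strictly greater than the threshold (the log does not say whether a count of exactly 65 would alert), with the hypothetical function name `atlas_status`:

```python
def atlas_status(failed: int, alert_threshold: int) -> str:
    """Return the check state for a given number of failed probes.

    Assumption: strictly more failures than the threshold is CRITICAL.
    """
    return "CRITICAL" if failed > alert_threshold else "OK"


# Values taken from the two alert messages in the log.
at_problem = atlas_status(88, 65)
at_recovery = atlas_status(49, 65)
```

With the numbers from the log, 88 failed probes of 604 is above the threshold and 49 is below it, matching the PROBLEM and RECOVERY messages.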
[13:47:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) a:05cmassaro→03Ottomata [13:48:35] (03PS2) 10Volans: interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 [13:49:20] !log reboot analytics1066 [13:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:58] (03CR) 10Volans: [C: 03+2] interface automation: fix support for cloud-hosts [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673271 (owner: 10Volans) [13:50:12] 10SRE, 10CAS-SSO: Update CAS to 6.3 - https://phabricator.wikimedia.org/T271684 (10jbond) Production, staging and the cloud idp have all now been upgraded to cas 6.3.2 [13:53:54] (03CR) 10Volans: "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata) [13:55:44] (03PS1) 10Jbond: P:pki::client: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/673275 [13:56:25] (03CR) 10Jbond: [C: 03+2] P:pki::client: add missing hiera key [puppet] - 10https://gerrit.wikimedia.org/r/673275 (owner: 10Jbond) [13:57:22] RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:02:38] (03CR) 10Ottomata: Add samwalton as a posix user in analytics-privatedata-users, but no ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata) [14:03:00] (03PS1) 10Gergő Tisza: Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) [14:04:06] (03CR) 10David Caro: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/667183 (https://phabricator.wikimedia.org/T274497) (owner: 10David Caro) [14:04:08] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata) [14:06:24] (03PS1) 10Volans: interface automation: 2nd fix for cloud-hosts VLAN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673276 [14:07:31] (03CR) 10Ayounsi: [C: 03+1] interface automation: 2nd fix for cloud-hosts VLAN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673276 (owner: 10Volans) [14:10:01] (03CR) 10Volans: [C: 03+2] interface automation: 2nd fix for cloud-hosts VLAN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/673276 (owner: 10Volans) [14:10:59] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10fgiunchedi) >>! In T277739#6924365, @akosiaris wrote: >> After a chat with Filippo, IIUC 8.2008.0-1 is used only on centrallog nodes (hence the component) but we might want to use 8.19011 provi... 
[14:12:46] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [14:13:00] looking [14:13:22] (03CR) 10Joal: "A bunch of comments - happy to talk more :)" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672373 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [14:13:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [14:13:54] looking [14:14:13] what's going on between codfw and ulsfo ? https://librenms.wikimedia.org/device/device=89/tab=port/port=16787/ [14:14:16] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) This is what I did on deneb before reading Filippo's answer :) * `apt-get source rsyslog -t buster` * applied Filippo's patch manually (git diff HEAD~1 HEAD on the debian/stretch-wikim... [14:14:24] XioNoX: https://librenms.wikimedia.org/bill/bill_id=10/ [14:14:44] yep [14:15:18] cdanis: all through equinix [14:15:42] let's see netflow [14:15:52] XioNoX: https://w.wiki/36x$ [14:16:13] cdanis: and it's not being cashed [14:16:15] cached [14:16:28] as it comes from a transport link [14:16:35] yeah [14:16:39] PROBLEM - LibreNMS has a critical alert #page on alert1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr1-codfw.wikimedia.org) https://bit.ly/wmf-librenms [14:16:48] there's the actual page :) [14:17:04] splunk loging failed... [14:17:08] can someone ack it... 
[14:17:15] it's upload-lb unsurprisingly [14:17:19] acked [14:17:20] hey [14:17:28] I can't stay though, interviewing shortly [14:17:32] acked [14:17:36] I'm here too [14:17:46] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [14:18:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [14:18:57] RECOVERY - LibreNMS has a critical alert #page on alert1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms [14:21:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:56] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) Update after a chat with Filippo: * a fleetwide update of rsyslog is painful, we can avoid it with the component solution (extremely wise point) * update the rsyslog repo seems good -... 
[14:25:39] (03PS1) 10BBlack: Ratelimit applebot temporarily [puppet] - 10https://gerrit.wikimedia.org/r/673281 [14:27:29] (03CR) 10BBlack: [C: 03+2] Ratelimit applebot temporarily [puppet] - 10https://gerrit.wikimedia.org/r/673281 (owner: 10BBlack) [14:29:18] !log Restarting CI Jenkins for plugin upgrade [14:29:22] (03CR) 10Ottomata: [C: 03+2] Add samwalton as a posix user in analytics-privatedata-users, but no ssh [puppet] - 10https://gerrit.wikimedia.org/r/673270 (https://phabricator.wikimedia.org/T277298) (owner: 10Ottomata) [14:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) @Samwalton9 try now! [14:34:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) `org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec... [14:35:21] (03PS1) 10Ottomata: Don't echo unset in env_vars.sh on deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/673284 [14:35:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Don't echo unset in env_vars.sh on deactivate [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/673284 (owner: 10Ottomata) [14:37:49] !log repooling wdqs1005 [14:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:20] (03CR) 10Hashar: [C: 03+1] "I still don't know which PHP version to use, most probably will aim at using php 7.2 to be aligned with MediaWiki prod. 
But that can be a" [puppet] - 10https://gerrit.wikimedia.org/r/673027 (owner: 10Jbond) [14:40:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:doc: use the correct php version for each debian distro [puppet] - 10https://gerrit.wikimedia.org/r/673027 (owner: 10Jbond) [14:43:09] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:16] (03PS1) 10Hashar: contint: remove erroneous hiera setting for labs [puppet] - 10https://gerrit.wikimedia.org/r/673286 (https://phabricator.wikimedia.org/T269354) [14:48:35] (03PS2) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) [14:49:34] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [14:49:54] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10hashar) I just misunderstood how hiera lookup works on WMCS. The bulk of it is that parameters specific to... [14:50:10] (03PS2) 10Hashar: contint: remove erroneous hiera setting for labs [puppet] - 10https://gerrit.wikimedia.org/r/673286 (https://phabricator.wikimedia.org/T277526) [14:50:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:02] (03CR) 10Filippo Giunchedi: "Do we know why the host field is causing the exceptions ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite) [14:51:33] (03PS3) 10Klausman: helm: Make ML k8s clusters visible to helm [deployment-charts] - 10https://gerrit.wikimedia.org/r/673227 (https://phabricator.wikimedia.org/T272918) [14:54:14] (03CR) 10Cwhite: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite) [14:54:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:57:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 87 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:58:06] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add late-stage host field type validation for ecs events [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite) [14:59:07] (03PS1) 10Jbond: admin: clean up yaml lint issues [puppet] - 10https://gerrit.wikimedia.org/r/673287 [14:59:09] (03PS1) 10Jbond: admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 [14:59:47] (03CR) 10jerkins-bot: [V: 04-1] admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond) [15:00:12] (03PS2) 10Jbond: admin: clean up yaml lint issues [puppet] - 10https://gerrit.wikimedia.org/r/673287 [15:00:31] (03PS2) 10Jbond: admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 [15:00:35] 10Puppet, 10SRE, 
10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS standalone puppet master does not lookup cherry picked hiera change - https://phabricator.wikimedia.org/T277526 (10hashar) 05Open→03Resolved a:03hashar Thank you @jbond ! [15:01:08] (03CR) 10jerkins-bot: [V: 04-1] admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond) [15:02:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) access=WRITE ... ? What is the query you are trying to run? [15:03:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 50 probes of 604 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:07:38] (03CR) 10Volans: "L3 signed now" [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans) [15:08:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) a:05Ottomata→03Volans [15:11:16] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Samwalton9) 05Open→03Resolved a:03Samwalton9 >>! In T277298#6925220, @Ottomata wrote: > access=WRITE ... ? > > What is the query you are trying to run? I'm doing a simple test query... 
[15:15:46] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans) [15:16:16] (03PS3) 10Jbond: admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 [15:18:53] (03CR) 10Volans: [C: 03+2] admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans) [15:19:00] (03PS1) 10Jbond: DO NOT MERGE: Test CI picks up duplicate yaml keys [puppet] - 10https://gerrit.wikimedia.org/r/673291 [15:19:47] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: Test CI picks up duplicate yaml keys [puppet] - 10https://gerrit.wikimedia.org/r/673291 (owner: 10Jbond) [15:20:34] (03PS2) 10Volans: admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) [15:20:39] (03CR) 10Jbond: [C: 03+2] admin: clean up yaml lint issues [puppet] - 10https://gerrit.wikimedia.org/r/673287 (owner: 10Jbond) [15:20:43] (03CR) 10Jbond: [C: 03+2] admin: add yaml lint to admin data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond) [15:21:41] (03PS4) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 [15:21:43] (03PS1) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:21:52] (03PS2) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:21:55] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) [15:22:15] (03CR) 10jerkins-bot: [V: 04-1] Add sql with the empty database structure to the 
repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:22:17] (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo) [15:27:05] 10SRE, 10SRE-Access-Requests: Add samwalton to analytics-privatedata-users - https://phabricator.wikimedia.org/T277298 (10Ottomata) Hm, maybe your user had to be created on the host that runs Hue first. I did not manually do that after I merged, so it may have just done that now. Great!@ [15:28:53] (03CR) 10Volans: [C: 03+2] admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans) [15:29:05] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db11[76-84] - https://phabricator.wikimedia.org/T273566 (10RobH) 05Open→03Resolved [15:30:12] (03PS3) 10Volans: admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) [15:30:16] damn, 2 conflicts in 10 minutes :d [15:31:22] (03Abandoned) 10Jbond: DO NOT MERGE: Test CI picks up duplicate yaml keys [puppet] - 10https://gerrit.wikimedia.org/r/673291 (owner: 10Jbond) [15:32:20] (03CR) 10Volans: [C: 03+2] admin: add apine to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/673223 (https://phabricator.wikimedia.org/T277692) (owner: 10Volans) [15:32:41] volans: ahh that could have been me reformating the few low hanging yamllint errors [15:33:00] you and otto, but it's fine, no worries :D [15:33:04] :) [15:33:10] !log clean up dead letter queue and restart all logstashes [15:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Support building a grid-exec node with cinder or flavor-defined storage [puppet] - 10https://gerrit.wikimedia.org/r/672456 
(https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [15:38:40] (03PS3) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:38:57] 10SRE, 10Services, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10dpifke) >>! In T277483#6924456, @akosiaris wrote: > * Is xhgui stateless? More specifically > ** Does xhgui in any way persist anythi... [15:39:13] (03CR) 10jerkins-bot: [V: 04-1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:39:18] (03PS4) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:39:48] (03CR) 10jerkins-bot: [V: 04-1] Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:40:52] (03PS5) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:41:06] (03PS6) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:42:29] (03CR) 10Cwhite: [C: 03+2] logstash: remove type setting on dlq input [puppet] - 10https://gerrit.wikimedia.org/r/673131 (https://phabricator.wikimedia.org/T277080) (owner: 10Cwhite) [15:42:36] (03CR) 10Jcrespo: "This will require later debian changes to install the .sql on the filesystem." 
[software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:42:51] (03CR) 10Cwhite: [C: 03+2] logstash: add late-stage host field type validation for ecs events [puppet] - 10https://gerrit.wikimedia.org/r/673142 (owner: 10Cwhite) [15:43:40] (03CR) 10Cwhite: [C: 03+2] ensure host field is the correct type in late-stage ecs filter [software/ecs] - 10https://gerrit.wikimedia.org/r/673148 (owner: 10Cwhite) [15:43:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) Created kerberos principal: ` krb1001 ~$ sudo manage_principals.py create apine --email_address=cmassaro@wikimedia.org Principal successf... [15:44:05] (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: grid: base: stop using LVM [puppet] - 10https://gerrit.wikimedia.org/r/673267 (https://phabricator.wikimedia.org/T272114) (owner: 10Arturo Borrero Gonzalez) [15:44:07] (03Merged) 10jenkins-bot: ensure host field is the correct type in late-stage ecs filter [software/ecs] - 10https://gerrit.wikimedia.org/r/673148 (owner: 10Cwhite) [15:50:05] (03PS7) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:50:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) @cmassaro kerberos activated and patch with your access merged, please follow https://wikitech.wikimedia.org/wiki/Production_access#Settin... 
[15:51:06] (03Abandoned) 10Ahmon Dancy: logspam.pl: Only process errors from production mw servers [puppet] - 10https://gerrit.wikimedia.org/r/673171 (owner: 10Ahmon Dancy) [15:51:47] (03PS2) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396) [15:52:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) And feel free to resolve this task once it's all working as expected. [15:53:28] (03PS1) 10Gilles: Add edge cache hostname to Server-Timing header [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) [15:55:00] (03CR) 10Volans: "post-merge optional nit ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond) [15:56:02] (03CR) 10Jcrespo: "Like tendril, the database deployment for metadata db backups was never documented, as we handled it with puppet/backup infrastructure." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:56:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Volans) a:05Volans→03None @Ottomata is there anything to be done on the analytics side to sync the user for the intended usage? 
[15:57:05] (03PS8) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [15:57:32] (03PS1) 10Jbond: admin: data_validate use Path().parent vs Path.parents[0] [puppet] - 10https://gerrit.wikimedia.org/r/673296 [15:57:47] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Ottomata) Nope, that's it! Thank you! [15:58:25] (03PS9) 10Jcrespo: Add sql with the empty database structure to the repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673292 (https://phabricator.wikimedia.org/T138562) [16:00:04] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1600). [16:00:04] Majavah: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:42] ^ already done [16:01:52] :) thanks [16:02:06] (03CR) 10Volans: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/673296 (owner: 10Jbond) [16:02:21] 10SRE, 10Analytics, 10Analytics-Kanban, 10Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (10fdans) 05Open→03Resolved [16:02:37] (03CR) 10Jbond: [C: 03+2] admin: data_validate use Path().parent vs Path.parents[0] [puppet] - 10https://gerrit.wikimedia.org/r/673296 (owner: 10Jbond) [16:03:43] (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673288 (owner: 10Jbond) [16:04:17] (03CR) 10Volans: [C: 03+2] openldap: improve cross-validate-accounts [puppet] - 10https://gerrit.wikimedia.org/r/673241 (owner: 10Volans) [16:06:17] (03PS1) 10Jon Harald Søby: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 [16:06:34] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10akosiaris) fwiw, a possibly desired UX would be something like `$ sre.downtime.service 'service1|service2|service3'` or `$ sre.downtime.service service1 [service2] [service3]`... [16:07:40] (03PS2) 10Jforrester: Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza) [16:07:53] (03PS1) 10Jbond: P:debmonitor::server: make logback and monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/673299 [16:08:44] (03CR) 10jerkins-bot: [V: 04-1] Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby) [16:09:10] (03CR) 10Urbanecm: [C: 04-1] "thanks for the patch!" 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 (owner: 10Jon Harald Søby) [16:10:14] (03PS2) 10Jon Harald Søby: Add alt, bcl, diq, mad, mni, mnw, nia, skr, tay and trv to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673297 [16:10:58] (03CR) 10Volans: [C: 03+1] "LGTM if PCC is happy" [puppet] - 10https://gerrit.wikimedia.org/r/673299 (owner: 10Jbond) [16:11:18] (03CR) 10CRusnov: [C: 03+1] "seems good, unless the extra test was in place in order to differentiate from generic systemd alerts." [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [16:13:14] (03CR) 10Jforrester: "@Urbanecm, want to deploy this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [16:14:06] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) Doh, I think we have naming clash here :) - service: as in Icinga single service belonging to an Icinga host - service: as in a WMF .svc. service but treated as a Ho... [16:15:17] I'll deploy a beta-only patch. [16:15:37] tgr_: Ha, beat me to it. 
[16:15:47] (03CR) 10Jforrester: [C: 03+1] Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza) [16:16:17] (03CR) 10Gergő Tisza: [C: 03+2] Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza) [16:17:36] (03PS2) 10Jbond: P:debmonitor::server: make logback and monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/673299 [16:18:37] (03Merged) 10jenkins-bot: Revert "[beta] Disable captchas while they are completely broken" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673112 (https://phabricator.wikimedia.org/T276176) (owner: 10Gergő Tisza) [16:19:20] (03PS2) 10Jforrester: Drop ability to use graphoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654954 (https://phabricator.wikimedia.org/T242855) [16:19:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28668/console" [puppet] - 10https://gerrit.wikimedia.org/r/673299 (owner: 10Jbond) [16:19:26] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [16:19:57] (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657696 (https://phabricator.wikimedia.org/T269712) [16:22:03] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:debmonitor::server: make logback and monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/673299 (owner: 10Jbond) [16:23:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:26:20] 
RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:33:12] (03PS3) 10ArielGlenn: distinguish between "no wikis with batches available" and "no wikis left to run" [dumps] - 10https://gerrit.wikimedia.org/r/673210 (https://phabricator.wikimedia.org/T252396) [16:37:30] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2239.codfw.wmnet [16:37:36] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2240.codfw.wmnet [16:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:41] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2241.codfw.wmnet [16:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:44] (03PS4) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [16:37:46] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2242.codfw.wmnet [16:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2239.codfw.wmnet [16:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:17] (03CR) 10jerkins-bot: [V: 04-1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [16:40:24] (03CR) 10Mholloway: Add event stream config for android.image_recommendations_interaction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [16:40:44] (03CR) 10Dzahn: "We would 
leave the data import/export to you like last time. This was more to point out there is this path to change once it's time for it" [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn) [16:45:21] (03CR) 10Mholloway: Add event stream config for android.image_recommendations_interaction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [16:47:50] 10SRE, 10Analytics-Radar: Upgrade to Kafka MirrorMaker 2 - https://phabricator.wikimedia.org/T277467 (10fdans) [16:50:31] (03CR) 10Dzahn: [C: 03+1] "I think the biggest advantage is that a user on IRC will immediately see what the actual problem is instead of some generic "systemd state" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [16:51:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2239.codfw.wmnet [16:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:42] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:53:02] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:53:28] (03CR) 10RLazarus: [C: 03+1] "Oh, I love this. 
I'd been hoping we could have this sort of unit-specific monitoring -- the problem Daniel mentions has been bugging me, b" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [16:54:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2240.codfw.wmnet [16:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:52] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:57:28] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 3.771 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:57:30] (03CR) 10Jbond: [V: 03+1] "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [16:57:38] (03CR) 10Jbond: [C: 03+2] systemd::monitor: create a generic nrpe systemd check [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond) [16:57:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd::service: Add ability to monitor services [puppet] - 10https://gerrit.wikimedia.org/r/673225 (owner: 10Jbond) [16:58:24] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/673221 (owner: 10Jbond) [16:59:57] (03PS1) 10Bstorm: toolforge: set up an "opt out" of the infrastructure profile [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) [17:00:04] chrisalbon and accraze: Dear deployers, time to do the Services – Graphoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1700). 
[17:00:56] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:01:49] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10calbon) Sounds great! Can you post here when the model is up on toolforge? I'd love to take a look. [17:02:24] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:02:48] 10SRE, 10Analytics, 10Traffic: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (10fdans) p:05Triage→03Medium a:03razzi cc @ema [17:03:36] PROBLEM - AQS root url on aqs1010 is CRITICAL: connect to address 10.64.0.40 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [17:03:39] (03PS1) 10Jforrester: FlaggedRevs: Stop setting wgFlaggedRevsWhitelist, now ignored [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673306 [17:04:04] aqs1010 is a new host, downtime expired for sure [17:07:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2240.codfw.wmnet [17:07:40] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2241.codfw.wmnet [17:09:46] (03PS1) 10Andrew-WMDE: Enable bracket matching on group0 and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673312 (https://phabricator.wikimedia.org/T273591) [17:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:49] (03PS6) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [17:09:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) @robh these are ready for you, had a delay in getting them set up because the netbox script didn't work for them, the... [17:10:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) [17:10:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) a:05Cmjohnson→03RobH [17:10:52] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:11:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @Papaul I just noticed this host has status "offline" in netbox. But should be "decom" state. 
[17:13:25] 10ops-codfw, 10serviceops: decom 7 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) [17:15:13] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670973 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:17:19] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) @calbon we have hit some snags in deploying the model on toolforge and are in the process of resolving those. But in the meantime you can have... [17:17:39] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:18:37] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670977 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:23:49] (03PS1) 10Urbanecm: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) [17:24:13] jouncebot: next [17:24:13] In 0 hour(s) and 35 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1800) [17:26:05] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:26:24] (03PS4) 10Giuseppe Lavagetto: [WiP] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [17:27:06] (03CR) 10David Caro: "You can ignore my 'nit' comments." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:27:42] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [17:27:44] (03CR) 10Legoktm: "See Change-Id: I51ba05f2537b4c068a0150c22fc00920a9f70edb :)" [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:28:14] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2241.codfw.wmnet [17:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:33] (03CR) 10David Caro: "Same comments as the other patch (https://gerrit.wikimedia.org/r/c/operations/puppet/+/670933)" [puppet] - 10https://gerrit.wikimedia.org/r/670928 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:28:47] (03CR) 10Jbond: [C: 03+1] rabbitmqadmin.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670970 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:29:38] (03PS1) 10Urbanecm: simplewiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673319 (https://phabricator.wikimedia.org/T277550) [17:31:01] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:31:17] (03CR) 10David Caro: [C: 03+1] "The 'nit' can be ignored." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) (owner: 10Bstorm) [17:31:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2242.codfw.wmnet [17:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:21] (03CR) 10David Caro: paws: block using the Jupyterhub from Tor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [17:32:29] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:33:15] PROBLEM - mediawiki-installation DSH group on mw2242 is CRITICAL: Host mw2242 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:34:30] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673321 [17:35:23] (03PS1) 10Urbanecm: tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491) [17:35:48] (03CR) 10Urbanecm: [C: 03+2] simplewiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673319 (https://phabricator.wikimedia.org/T277550) (owner: 10Urbanecm) [17:36:36] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673321 (owner: 10PipelineBot) [17:37:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Cmjohnson) [17:37:59] (03Merged) 10jenkins-bot: simplewiki: Enable Growth team features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673319 (https://phabricator.wikimedia.org/T277550) (owner: 10Urbanecm) [17:38:23] (03CR) 10Bstorm: paws: block using the Jupyterhub from Tor (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/671286 (https://phabricator.wikimedia.org/T276615) (owner: 10Bstorm) [17:39:34] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673321 (owner: 10PipelineBot) [17:40:14] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Cmjohnson) a:05Cmjohnson→03Jgreen Assigning this to @Jgreen to complete the installs. All the on-site work has been completed, network ports are set up and enabled so please ins... [17:40:19] !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [17:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:12] (03PS2) 10Urbanecm: tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491) [17:41:14] (03CR) 10Subramanya Sastry: [C: 03+1] "Okay, in that case, let us wait till we finish any rt testing we need to do for next week's train before rolling this out." 
[puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn) [17:41:16] (03CR) 10Bstorm: toolforge: set up an "opt out" of the infrastructure profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) (owner: 10Bstorm) [17:41:18] (03CR) 10Urbanecm: [C: 03+2] tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491) (owner: 10Urbanecm) [17:41:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:42:02] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 04342e9bb0765a6a58ad78bd7eaa380d4167f0c1: simplewiki: Enable Growth team features in stealth mode (T277550) (duration: 01m 10s) [17:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:10] T277550: Deploy Growth features on Simple English Wikipedia - https://phabricator.wikimedia.org/T277550 [17:42:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) So far I have found out nothing This is a PCI error, in the past, this would mean a blown capacitor but this is the first I've seen of this error since we left... [17:43:10] (03CR) 10Dzahn: "Ok, sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/673175 (https://phabricator.wikimedia.org/T277580) (owner: 10Dzahn) [17:44:20] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/670981 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:44:37] (03Merged) 10jenkins-bot: tewiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673323 (https://phabricator.wikimedia.org/T277491) (owner: 10Urbanecm) [17:45:00] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) [17:45:11] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 04342e9bb0765a6a58ad78bd7eaa380d4167f0c1: simplewiki: Enable Growth team features in stealth mode (T277550) (duration: 01m 09s) [17:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:38] (03CR) 10Jbond: get-raid-status-megacli.py: Port to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670973 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:46:49] (03PS1) 10Zabe: Use of Article::getId was deprecated in MediaWiki 1.35 [extensions/LiquidThreads] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673114 (https://phabricator.wikimedia.org/T277772) [17:47:30] (03CR) 10Jcrespo: "If it works, ship it, but you may want to deploy at the same time the Swift and DB owners are around, as they will probably have ongoing R" [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:47:48] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarting mcrouter and/or it crashing on the nod... 
[17:48:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 55aa6cb: tewiki: Enable Growth features in stealth mode (T277491; 1/2) (duration: 01m 10s) [17:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:31] T277491: Request to implement Growth experiments on Telugu Wikipedia (Tewiki) - https://phabricator.wikimedia.org/T277491 [17:48:39] (03CR) 10Bstorm: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/28670/ happy with that so merging" [puppet] - 10https://gerrit.wikimedia.org/r/673304 (https://phabricator.wikimedia.org/T277756) (owner: 10Bstorm) [17:49:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/670977 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:49:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:50:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2242.codfw.wmnet [17:50:17] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 55aa6cb: tewiki: Enable Growth features in stealth mode (T277491; 2/2) (duration: 01m 08s) [17:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:01] 10SRE, 10Wikimedia-Mailing-lists, 10vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (10Legoktm) a:03Legoktm [17:51:14] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal: Research storage solutions for media backups - https://phabricator.wikimedia.org/T264190 (10jcrespo) 05Open→03Resolved Research (analysis) and Design finished for now, we are now in implementation phase: T276442 and T276445. 
Documentation... [17:51:18] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10jcrespo) [17:51:27] (03PS1) 10Andrew-WMDE: Enable CodeMirror accessibility colors on initial wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673326 (https://phabricator.wikimedia.org/T276346) [17:53:19] (03CR) 10Andrew Bogott: "I built a test VM, toolsbeta-sgeexec-0902.toolsbeta.eqiad1.wikimedia.cloud, which looks OK. Swap is turned off for the toolsbeta-sgeexec " [puppet] - 10https://gerrit.wikimedia.org/r/672456 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [17:53:55] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1001.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:54:30] (03CR) 10Jbond: "lgtm but see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670978 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:55:52] (03CR) 10Legoktm: [V: 03+1] docker_registry_ha: Require authentication from k8s nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [17:56:05] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:56:12] 10SRE, 10ops-eqiad: elastic1062 interface errors - https://phabricator.wikimedia.org/T277634 (10Cmjohnson) 05Open→03Resolved replaced the production cable [17:56:52] (03CR) 10Jbond: wmcs-webproxy.py: Port to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670933 
(https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [17:56:55] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frqueue1001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T277171 (10Cmjohnson) a:05Cmjohnson→03Papaul [17:58:51] !log disabled puppet on registry* for rolling out https://gerrit.wikimedia.org/r/672537 [17:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:33] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1800). [18:00:04] Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:09] (03PS1) 10Urbanecm: mswiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673329 (https://phabricator.wikimedia.org/T277562) [18:00:19] I'll self-service [18:00:36] (03CR) 10Urbanecm: [C: 03+2] mswiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673329 (https://phabricator.wikimedia.org/T277562) (owner: 10Urbanecm) [18:00:48] (03PS2) 10Urbanecm: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) [18:00:52] (03CR) 10Urbanecm: [C: 03+2] hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [18:01:51] (03Merged) 10jenkins-bot: mswiki: Enable Growth features in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673329 (https://phabricator.wikimedia.org/T277562) (owner: 10Urbanecm) [18:01:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:02:05] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:02:08] (03CR) 10Legoktm: [V: 03+1 C: 03+2] docker_registry_ha: Require authentication from k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/672537 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [18:02:18] (03PS3) 10Urbanecm: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) [18:02:23] (03CR) 10Urbanecm: [C: 03+2] hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [18:02:53] (03CR) 10Jbond: mwgrep.py: Port to Python 3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670975 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:03:15] (03Merged) 10jenkins-bot: hrwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673316 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [18:06:06] (03CR) 10Jbond: [C: 04-1] "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670981 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [18:07:23] (03PS3) 1001miki10: Disable ContentTranslation New article campaign in fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672416 (https://phabricator.wikimedia.org/T277473) [18:07:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=ircd site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:51] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:09:07] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:09:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:10:01] what's the right reaction to that?^ [18:10:42] I remember seeing something about linkrecommendation in phab, one sec [18:10:45] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 179d9e5: mswiki: Enable Growth features in stealth mode (T277562; 1/2) (duration: 01m 11s) [18:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:53] T277562: Deploy Growth features on Malay Wikipedia - https://phabricator.wikimedia.org/T277562 [18:11:43] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Majavah) [18:11:57] https://phabricator.wikimedia.org/T277297 [18:12:05] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, 
kubernetes1001.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1006.eqiad. [18:12:05] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:12:13] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 179d9e5: mswiki: Enable Growth features in stealth mode (T277562; 2/2) (duration: 01m 08s) [18:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:33] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 5.943 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:12:34] legoktm, thanks [18:12:46] If it has been happening for days, probably not an emergency [18:12:52] https://phabricator.wikimedia.org/T277297 [18:12:57] I don't think the service is in active use yet [18:13:04] ah, much better [18:13:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10wiki_willy) Hi @Dzahn - we typically change the status to "offline" after the server is unracked. 
[18:13:09] :-) [18:13:15] that is the part I didn't know [18:13:53] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:16:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:17:18] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 44eddcc: hrwiki: Deploy Growth features to newcomers (T275684) (duration: 01m 08s) [18:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:26] T275684: Deploy Growth features on Croatian Wikipedia - https://phabricator.wikimedia.org/T275684 [18:17:43] 10SRE, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Cmjohnson) a:05Cmjohnson→03ayounsi @ayounsi I verified all of the ports listed in https://librenms.wikimedia.org/ports/state=down/hostname=asw/format=list_basic/ are not in service at the moment. There were 2 i... [18:18:08] left a comment for now [18:18:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10Dzahn) @wiki_willy Oh, right, I got confused here myself and compared it to the servers that have been decom'ed but are still physically in racks. All is good the... [18:19:44] jynus: legoktm: linkrecommendation is supposed to be used...soon :) [18:19:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Andrew) Thanks Chris [18:19:55] yeah, no problem [18:20:13] it is just that when seeing lvs complain, normally it is a very bad thing :-) [18:20:58] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) >>! 
In T277711#6926081, @Joe wrote: > As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: restarti... [18:21:19] definitely :) [18:23:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) I am able to access the bastion. I am not able to access stat1006.equiad.wmnet, though. I can provide the output from `ssh -v` if that would help. [18:23:46] Urbanecm: Hey, I have a question if you have time. If I want to submit a patch for backporting to wmf/1.36.0-wmf.35, does the patch for the master branch already have to be merged, or is that not important? [18:24:08] unless there are exceptional circumstances, it should already be merged into master [18:24:57] do you have a specific example in mind? [18:25:26] Zabe: it SHOULD be merged [18:25:30] (really really really should) [18:25:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Urbanecm) >>! In T277692#6926262, @cmassaro wrote: > I am able to access the bastion. I am not able to access stat1006.equiad.wmnet, though. I can provide the output fr... [18:25:53] ok thx [18:26:01] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1001.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:26:26] Zabe: Given that the warning only showed up once, let's wait for the normal processes. We can do a backport during the train window if needed.
[18:26:37] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:26:40] (which is in about 30 minutes) [18:27:09] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1016.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:28:04] !log re-enabled puppet on registry* [18:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:15] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:28:18] Urbanecm: sadly RFC 6919 has defined "REALLY SHOULD NOT" but not "really really really should" :D [18:28:29] :D [18:29:02] (03PS6) 10Jbond: (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [18:29:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) Got it. I did set up my SSH config as prescribed there, and `ssh bast1002.wikimedia.org` works as a result. When I try `ssh stat1006.eqiad.wmnet`, it looks li... [18:29:53] 10SRE, 10netops, 10Patch-For-Review: Auhoritative ports list - https://phabricator.wikimedia.org/T277146 (10jbond) >>! In T277146#6913871, @Kormat wrote: > Just in case it's relevant, we use a range of ports for mariadb. Most (but not all) of them are [[ https://github.com/wikimedia/puppet/blob/da54cc6f29deb... 
[18:30:26] (03CR) 10jerkins-bot: [V: 04-1] (WIP): netbase: first pass at parsing service::catalogue ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [18:31:33] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:31:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10Dzahn) @cmassaro It seems to be a typo in the host name. found in auth.log on bast1002 ` error: connect_to stat1006.equiad.wmnet: unknown host (Name or service... [18:35:16] (03CR) 10Effie Mouzeli: create helmfile.d structure (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [18:35:37] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 193 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:36:33] (03CR) 10Effie Mouzeli: "There are a few more things that we should look into, but let's fix them in the next iteration:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [18:36:56] (03CR) 10Legoktm: hiera: add dummy secrets for ML k8s workers (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/672455 (https://phabricator.wikimedia.org/T272918) (owner: 10Klausman) [18:37:28] (03CR) 10Effie Mouzeli: [C: 04-1] create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [18:37:56] (03PS1) 10Legoktm: ml_k8s: Remove docker registry password [labs/private] - 10https://gerrit.wikimedia.org/r/673333 [18:38:08] (03PS1) 10Dzahn: 
site/conftool-data: decom mw2239 through mw2242, rack A4 [puppet] - 10https://gerrit.wikimedia.org/r/673334 (https://phabricator.wikimedia.org/T277119) [18:38:35] (03CR) 10Legoktm: [V: 03+2 C: 03+2] ml_k8s: Remove docker registry password [labs/private] - 10https://gerrit.wikimedia.org/r/673333 (owner: 10Legoktm) [18:42:03] (03PS5) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [18:44:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: decom mw2256 (was: mw2256 - CPU/board hardware issue) - https://phabricator.wikimedia.org/T263065 (10wiki_willy) No worries @Dzahn, thanks for checking. =) [18:44:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) Ahhhh sorry, that is embarrassing. It's all good now. Thank you! [18:44:53] (03PS1) 10Bstorm: prometheus: re-order sudo access for the cron [puppet] - 10https://gerrit.wikimedia.org/r/673336 [18:45:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) >>! In T277692#6924028, @JAllemandou wrote: > Thanks for letting me know @Ottomata :) > @cmassaro : Let's sync on the work you wish to accomplish, as wikitex... 
[18:45:10] (03CR) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [18:46:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] prometheus: re-order sudo access for the cron [puppet] - 10https://gerrit.wikimedia.org/r/673336 (owner: 10Bstorm) [18:46:46] (03CR) 10Bstorm: [C: 03+2] prometheus: re-order sudo access for the cron [puppet] - 10https://gerrit.wikimedia.org/r/673336 (owner: 10Bstorm) [18:46:59] (03PS6) 10Sharvaniharan: Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 [18:52:13] (03CR) 10Ottomata: [C: 03+1] Add event stream config for android.image_recommendations_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673005 (owner: 10Sharvaniharan) [18:56:01] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krinkle) [18:57:07] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) p:05Medium→03High [18:57:53] 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (10Krinkle) [18:58:31] (03CR) 10Dzahn: [C: 03+2] site/conftool-data: decom mw2239 through mw2242, rack A4 [puppet] - 10https://gerrit.wikimedia.org/r/673334 (https://phabricator.wikimedia.org/T277119) (owner: 10Dzahn) [18:59:59] (03PS2) 10Majavah: Remove deploymentwiki configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/649609 (https://phabricator.wikimedia.org/T198673) [19:00:04] dancy and brennen: Dear deployers, time to do the Mediawiki train - American Version deploy. 
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T1900). [19:00:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Cory Massaro - https://phabricator.wikimedia.org/T277692 (10cmassaro) 05Open→03Resolved [19:01:37] (03PS1) 10Ahmon Dancy: group2 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673340 [19:01:39] (03CR) 10Ahmon Dancy: [C: 03+2] group2 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673340 (owner: 10Ahmon Dancy) [19:02:45] (03Merged) 10jenkins-bot: group2 wikis to 1.36.0-wmf.35 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673340 (owner: 10Ahmon Dancy) [19:04:35] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.36.0-wmf.35 [19:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:33] (03PS1) 10Jgreen: add A/PTR records for payments100[5-8].frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/673342 (https://phabricator.wikimedia.org/T266481) [19:15:02] (03CR) 10Jgreen: [C: 03+2] add A/PTR records for payments100[5-8].frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/673342 (https://phabricator.wikimedia.org/T266481) (owner: 10Jgreen) [19:17:20] (03CR) 10Jgreen: [C: 03+2] Add ssl check for frdata2001 [puppet] - 10https://gerrit.wikimedia.org/r/673132 (https://phabricator.wikimedia.org/T260183) (owner: 10Dwisehaupt) [19:19:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={ircd,netbox_device_statistics} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:21:05] 10SRE, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) [19:21:56] RECOVERY - Prometheus jobs reduced 
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:51] (03PS3) 10Legoktm: sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) [19:25:03] (03CR) 10Legoktm: [C: 03+2] sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [19:26:05] (03PS1) 10Ryan Kemper: elasticsearch: combined plugin upgrade + reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) [19:28:59] (03CR) 10Ryan Kemper: "This implements the combined plugin upgrade + reboot functionally as a new cookbook. We'll want to circle back and refactor this because t" [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [19:35:54] RECOVERY - Check systemd state on ml-serve2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:46] PROBLEM - Check systemd state on ml-serve2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:10] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [19:46:34] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: combined plugin upgrade + reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [19:50:19] (03PS1) 10Andrew Bogott: nova vendordata.txt: try to fix signing of the 
wikimedia apt repo [puppet] - 10https://gerrit.wikimedia.org/r/673349 (https://phabricator.wikimedia.org/T271273) [19:51:21] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata.txt: try to fix signing of the wikimedia apt repo [puppet] - 10https://gerrit.wikimedia.org/r/673349 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [19:56:20] Zabe: Can you nag someone for a review on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/673325 ? [20:00:05] (03PS1) 10Andrew Bogott: nova vendordata: install gpg and dirmngr earlier in the cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673351 (https://phabricator.wikimedia.org/T271273) [20:01:25] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: install gpg and dirmngr earlier in the cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673351 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [20:05:11] (03PS1) 10Herron: add dummy grafana api key to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/673352 [20:06:24] (03CR) 10Herron: [V: 03+2 C: 03+2] add dummy grafana api key to pacify PCC [labs/private] - 10https://gerrit.wikimedia.org/r/673352 (owner: 10Herron) [20:07:53] (03PS1) 10Andrew Bogott: nova vendordata: install gnupg instead of gpg [puppet] - 10https://gerrit.wikimedia.org/r/673353 (https://phabricator.wikimedia.org/T271273) [20:09:14] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: install gnupg instead of gpg [puppet] - 10https://gerrit.wikimedia.org/r/673353 (https://phabricator.wikimedia.org/T271273) (owner: 10Andrew Bogott) [20:10:51] (03CR) 10Gehel: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [20:11:06] ryankemper: ^ [20:11:20] gehel: thanks [20:15:49] (03PS2) 10Kosta Harlan: linkrecommendation: Bump memory limit and image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/673006 (https://phabricator.wikimedia.org/T277297) [20:16:25] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/28673/" [puppet] - 10https://gerrit.wikimedia.org/r/671283 (owner: 10Herron) [20:19:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:19:31] DannyS712: See ^. Do you have time to look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/673325 ? [20:20:00] (03CR) 10DannyS712: [C: 03+1] "LGTM pending deployment of the flagged revs patch to all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673306 (owner: 10Jforrester) [20:20:36] +2'd [20:20:48] legoktm: thx [20:21:05] LQT is like, whatever comes after living on life support but not dead yet [20:21:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:18] (03CR) 10Legoktm: [C: 03+2] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [20:22:40] legoktm: maybe they should have used a short Gerrit URL without the project name visible :P [20:22:51] hehe [20:25:21] dancy: patch is merged [20:25:59] Awesome. Do you have a before/after test to verify it? 
[20:27:02] (03CR) 10Ahmon Dancy: [C: 03+2] Use of Article::getId was deprecated in MediaWiki 1.35 [extensions/LiquidThreads] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673114 (https://phabricator.wikimedia.org/T277772) (owner: 10Zabe) [20:27:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) >>! In T272403#6898639, @Jclark-ctr wrote: > @Cmjohnson > cloudgw1001 c8 u29 ports13/19 cableid #5322 > cloudgw1002. d5... [20:27:55] no, because I don't really know how to test if there still is this deprecation warning. [20:28:32] hmm. ok. I +2'd the .35 cherry pick. Waiting for it to merge. [20:31:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10RobH) a:05RobH→03aborrero The networking requirements for this request are wholly unclear at the time of this comment. The... [20:33:09] (03Merged) 10jenkins-bot: Use of Article::getId was deprecated in MediaWiki 1.35 [extensions/LiquidThreads] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673114 (https://phabricator.wikimedia.org/T277772) (owner: 10Zabe) [20:34:30] based on the stacktrace on the task it looks like it shows up if you reply to a post [20:37:02] !log dancy@deploy1002 Synchronized php-1.36.0-wmf.35/extensions/LiquidThreads/classes/Thread.php: (no justification provided) (duration: 01m 05s) [20:37:08] Zabe: deployed [20:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:28] dancy: Thanks for your help [20:38:41] Thanks for the fix! 
[20:42:31] (03CR) 10jerkins-bot: [V: 04-1] sre.ganeti.makevm: Automatically generate fqdn from hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/668867 (https://phabricator.wikimedia.org/T276516) (owner: 10Legoktm) [20:43:13] (03PS1) 10Andrew Bogott: nova vendordata: disable ssh password logins with cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673357 [20:44:00] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: disable ssh password logins with cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673357 (owner: 10Andrew Bogott) [20:50:11] (03PS1) 10Andrew Bogott: Revert "nova vendordata.txt: try to fix signing of the wikimedia apt repo" [puppet] - 10https://gerrit.wikimedia.org/r/673359 [20:51:02] (03CR) 10Andrew Bogott: [C: 03+2] Revert "nova vendordata.txt: try to fix signing of the wikimedia apt repo" [puppet] - 10https://gerrit.wikimedia.org/r/673359 (owner: 10Andrew Bogott) [21:07:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refine_sanitize_eventlogging_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:21] (03PS1) 10Andrew Bogott: nova vendordata: further attempts to get gnupg installed up front [puppet] - 10https://gerrit.wikimedia.org/r/673362 [21:12:19] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: further attempts to get gnupg installed up front [puppet] - 10https://gerrit.wikimedia.org/r/673362 (owner: 10Andrew Bogott) [21:13:51] (03PS27) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [21:14:45] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [21:17:23] (03CR) 10Mstyles: create helmfile.d structure (0310 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [21:22:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:59] (03PS1) 10Andrew Bogott: nova vendordata: fix apt-get command [puppet] - 10https://gerrit.wikimedia.org/r/673364 [21:35:00] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: fix apt-get command [puppet] - 10https://gerrit.wikimedia.org/r/673364 (owner: 10Andrew Bogott) [21:41:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) You can start installing new servers in rack A3 in place of mw2215 through mw2238: https://netbox.wikimedia.org/dcim/devices/?q=mw2&rack_id=45&m... [21:46:23] (03PS7) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [21:49:12] 10SRE, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T277602 (10CGlenn) Hello @Volans ! Thank you for pointing that out to me. Just to confirm, I can access am.wikipedia.org in GSC. Is possible we can add am.m.wikipedia.org as well? Or would... 
[22:00:55] (03PS1) 10Dzahn: site/conftool-data: turn mw2251,mw2252 into canaries [puppet] - 10https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) [22:00:57] (03PS1) 10Dzahn: site/conftool-data: decom mw2244,mw2245, former canary servers [puppet] - 10https://gerrit.wikimedia.org/r/673368 (https://phabricator.wikimedia.org/T277780) [22:04:23] (03PS28) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [22:05:07] 10SRE, 10netops, 10Patch-For-Review: Authoritative ports list - https://phabricator.wikimedia.org/T277146 (10Reedy) [22:05:13] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [22:10:51] (03PS1) 10Brennen Bearnes: ActorStore::getActorById - fall back to master. [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) [22:13:50] (03CR) 10Ppchelko: [C: 03+1] "The commit message is a bit misleading now, cause there's no TODO anymore, but overall it should fix the prod error." [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) (owner: 10Brennen Bearnes) [22:16:28] Pchelolo: i can go ahead and sling above out, assuming zuul is happy - is there any kind of mwdebug testing possible / needed? i guess my assumption is it's not plausible to reproduce... 
[22:17:23] jouncebot now [22:17:23] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [22:17:38] (03CR) 10Brennen Bearnes: [C: 03+2] "> Patch Set 1: Code-Review+1" [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) (owner: 10Brennen Bearnes) [22:18:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:18:50] brennen: mind if i do a blubberoid deploy? it shouldn't be very eventful [22:18:59] marxarelli: go for it [22:19:04] cool [22:19:35] brennen: no clue how to reproduce it reliably [22:19:38] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673371 [22:20:34] Pchelolo: yeah, figured. guess we'll just keep an eye for errors to drop off. [22:20:35] I think the best way is to just wait and see if logstash errors are gone [22:20:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:21:46] (03CR) 10Dduvall: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673371 (owner: 10PipelineBot) [22:22:22] * brennen twiddles thumbs and waits for zuul. 
[22:23:42] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/673371 (owner: 10PipelineBot) [22:23:49] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/636081 (owner: 10PipelineBot) [22:23:57] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/656521 (owner: 10PipelineBot) [22:24:03] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/658682 (owner: 10PipelineBot) [22:24:12] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/667045 (owner: 10PipelineBot) [22:24:38] sorry ^ just cleanup [22:25:25] !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [22:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:16] (03PS2) 10Dzahn: site/conftool-data: turn mw2251,mw2252 into canaries [puppet] - 10https://gerrit.wikimedia.org/r/673367 (https://phabricator.wikimedia.org/T277780) [22:29:09] (03PS29) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [22:29:29] jouncebot next [22:29:30] In 0 hour(s) and 30 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T2300) [22:29:45] !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[22:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:07] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [22:30:53] !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [22:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:04] (03PS7) 10Mstyles: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [22:41:55] (03PS1) 10Jdlrobson: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 [22:43:45] (03CR) 10jerkins-bot: [V: 04-1] Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (owner: 10Jdlrobson) [22:48:43] (03Merged) 10jenkins-bot: ActorStore::getActorById - fall back to master. [core] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/673115 (https://phabricator.wikimedia.org/T277795) (owner: 10Brennen Bearnes) [22:48:56] (03CR) 10Jdlrobson: "I can't backport this today Legoktm, Greg but it seems reasonable that the WMF logo should only apply to office wiki, and not be the defau" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (owner: 10Jdlrobson) [22:49:05] (03PS2) 10Jdlrobson: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 [22:50:14] (03CR) 10jerkins-bot: [V: 04-1] Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (owner: 10Jdlrobson) [22:51:43] going ahead with that backport. 
[22:52:52] (03PS3) 10Jdlrobson: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 [22:53:04] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.35/includes/specials/SpecialContributions.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 01m 07s) [22:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:12] T277795: User not found by actor ID: [id] - https://phabricator.wikimedia.org/T277795 [22:53:27] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) [22:54:43] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 6 out of 8 are jobrunners. Maybe best to wait for T274171 to have started and turn some new servers in A3 into jobrunners, then remove these in A4 af... [22:55:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10Dzahn) [22:55:20] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) [22:55:23] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Dzahn) 05Open→03Stalled p:05Triage→03High [22:57:24] (03CR) 10Dzahn: "Well, I have nothing against this but also I can't help to think "but he literally just merged the systemd::service class" which would be " [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [22:59:00] (03CR) 10Dzahn: "I don't think the notes URL is a difference here, but the IRC part is. 
It's nice to know early what it is about rather than first having t" [puppet] - 10https://gerrit.wikimedia.org/r/673228 (owner: 10Jbond) [22:59:11] (03CR) 10Bstorm: [C: 03+2] "The #1 request so far is that I remove the nested ternary expressions. On that note, since I know that part is working from the last time " [puppet] - 10https://gerrit.wikimedia.org/r/672540 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [22:59:26] (03PS1) 10Dduvall: pipeline: Use build environment HTTP proxy for APT sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210318T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:02:05] (03CR) 10Krinkle: Don't define a default icon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (owner: 10Jdlrobson) [23:02:58] (03CR) 10Jdlrobson: Don't define a default icon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (owner: 10Jdlrobson) [23:03:57] (03PS4) 10Jdlrobson: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) [23:06:00] (03CR) 10Krinkle: Don't define a default icon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:06:39] (03PS30) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [23:06:41] (03CR) 10Krinkle: Don't define a default icon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:06:44] !log train status: 
1.36.0-wmf.35 (T274939) stable on all wikis after deploy of hotfix for T277795 [23:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:54] T277795: User not found by actor ID: [id] - https://phabricator.wikimedia.org/T277795 [23:06:54] T274939: 1.36.0-wmf.35 deployment blockers - https://phabricator.wikimedia.org/T274939 [23:07:31] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [23:07:36] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:08:22] (03PS31) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [23:09:21] (03CR) 10Jbond: netbase: add new module to manage /etc/services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [23:13:20] (03PS5) 10Jdlrobson: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 [23:13:33] (03CR) 10Jdlrobson: Don't define a default icon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (owner: 10Jdlrobson) [23:14:16] (03PS32) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [23:14:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:14:47] (03CR) 10Dduvall: [C: 03+1] "Tested successfully by running https://releases-jenkins.wikimedia.org/job/mediawiki-config-pipeline-wmf-publish/28/console and verifying t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) (owner: 10Dduvall) [23:15:07] (03PS1) 10Legoktm: docker_registry_ha: Add documentation to profile class [puppet] - 10https://gerrit.wikimedia.org/r/673376 [23:15:09] (03CR) 10jerkins-bot: [V: 04-1] netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [23:17:20] (03CR) 10Jeena Huneidi: [C: 03+2] pipeline: Use build environment HTTP proxy for APT sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) (owner: 10Dduvall) [23:18:15] (03Merged) 10jenkins-bot: pipeline: Use build environment HTTP proxy for APT sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673375 (https://phabricator.wikimedia.org/T277109) (owner: 10Dduvall) [23:18:53] (03PS33) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [23:21:52] since the current backports window is empty, i'm going to deploy ^. it's a noop pipeline-only mediawiki-config change. cc: longma [23:22:08] hi [23:22:09] :thumbsup: [23:22:12] marxarelli: ack. 
[23:25:39] !log dduvall@deploy1002 Synchronized .pipeline: config: [[gerrit:673375|Use build environment HTTP proxy for APT sources (T277109)]] (duration: 01m 02s) [23:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:48] T277109: Containers on releases hosts cannot update apt cache from non-WMF sources - https://phabricator.wikimedia.org/T277109 [23:26:06] legoktm, brennen: done. thanks! [23:26:24] I'm not sure I did anything :p [23:26:34] :P [23:26:57] haha, well you have my thanks nonetheless :p [23:27:04] deal with it [23:29:12] moral support. :) [23:30:24] (03PS1) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [23:30:30] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10RandomCanadian) Would it be possible to implement the temporary solutions as described at [[ https://en.wiki... [23:30:49] (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [23:33:11] gah, i think i may have synced the wrong file earlier. [23:33:48] (03PS2) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [23:33:55] yep. going to rectify. [23:34:16] (03CR) 10jerkins-bot: [V: 04-1] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [23:35:42] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. 
(T277795)]] (duration: 00m 58s) [23:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:50] T277795: User not found by actor ID: [id] - https://phabricator.wikimedia.org/T277795 [23:38:51] !log brennen@deploy1002 Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 00m 57s) [23:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:19] (03PS3) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [23:46:43] (03PS6) 10Legoktm: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:46:54] brennen: OK if I sync a config change? [23:47:05] legoktm: yeah, you're clear. [23:47:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:47:28] i'm signing off for the day, train looks good. [23:48:02] (03PS7) 10Legoktm: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:48:41] (03CR) 10Legoktm: "PS5: Fixed ordering of special projects list to be alphabetical, and left a pointer to the phab task about why the default is null." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:49:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:50:54] (03CR) 10Legoktm: [C: 03+2] Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:51:45] (03Merged) 10jenkins-bot: Don't define a default icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673372 (https://phabricator.wikimedia.org/T274199) (owner: 10Jdlrobson) [23:52:33] marxarelli: I merged your change in /srv/mediawiki-staging, so I believe I've officially earned the thanks now ;) [23:53:22] tested config patch on mwdebug1002, the correct Meta logo is back [23:53:32] legoktm: oh. which change? [23:53:35] (03CR) 10Cwhite: [C: 03+1] "first steps LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/671283 (owner: 10Herron) [23:53:53] the pipeline one apt proxy one [23:54:11] er... wasn't it already merge? [23:54:14] merged [23:54:14] well, pulled it in, not merged [23:54:24] git fetch origin; git rebase origin/master [23:54:32] oh boy... haha, k thanks! [23:54:37] :D [23:54:50] i must have done `git log` to compare and then forgot to rebase :/ [23:55:29] np :)) [23:56:20] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Don't define a default icon (T274199) (duration: 00m 57s) [23:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:28] T274199: With new Vector skin and Timeless, Meta, Wikimania and Wikitech logos are replaced by the WM Foundation logo - https://phabricator.wikimedia.org/T274199 [23:57:55] (03CR) 10Dave Pifke: "Huh, I didn't realize this never got merged. Thanks for picking it up!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke)