[00:00:10] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:10] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:22] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:32] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:38] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:42] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 3765 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[00:50:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:04] PROBLEM - Long running screen/tmux on netbox1001 is CRITICAL: CRIT: Long running tmux process. (user: crusnov PID: 17784, 1735573s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[02:21:06] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 3725 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[02:38:21] Operations, ops-codfw, RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (Papaul) p:Triage→Medium
[02:42:09] Operations, ops-codfw, DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (Papaul) I will be on site tomorrow Monday, Please de-pool server. Thanks
[03:15:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:17:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:44:34] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 3799 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[04:09:20] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 4372 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[04:58:34] !log Stop MySQL on db2087 for on-site maintenance T258587
[04:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:41] T258587: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587
[05:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2087:3316, db2087:3317 for on-site maintenance T258587', diff saved to https://phabricator.wikimedia.org/P12042 and previous config saved to /var/cache/conftool/dbconfig/20200727-050058-marostegui.json
[05:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:09] Operations, ops-codfw, DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (Marostegui) Thank you @papaul - server depool and powered off. Once you are done, please power it back up Thanks!
[05:11:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12043 and previous config saved to /var/cache/conftool/dbconfig/20200727-051156-marostegui.json
[05:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:45] (PS1) Marostegui: wmnet: Change m2-master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/616341 (https://phabricator.wikimedia.org/T255408)
[05:39:54] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 3701 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[05:43:04] Operations, ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (Marostegui)
[05:43:13] Operations, ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (Marostegui) p:Triage→High
[06:07:40] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:06] PROBLEM - Check size of conntrack table on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:09:36] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:11:16] PROBLEM - MD RAID on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:12:40] PROBLEM - Check the NTP synchronisation status of timesyncd on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[06:14:32] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:16:10] PROBLEM - configured eth on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:17:25] <_joe_> this is I think proton again
[06:17:50] <_joe_> akosiaris: ^^ should we expedite merging michael's patch?
[06:24:20] PROBLEM - DPKG on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:25:38] (CR) Marostegui: "Questions about the PCC:" [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: Kormat)
[06:27:20] PROBLEM - Disk space on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[06:31:44] PROBLEM - dhclient process on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:36:55] !log apt-get clean on netbox1001 to free some space
[06:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:54] (PS5) Giuseppe Lavagetto: kubernetes::deployment_server::helmfile: use kube_env in .hfenv [puppet] - https://gerrit.wikimedia.org/r/613191
[06:40:17] Operations: netbox1001's root partition is filling up - https://phabricator.wikimedia.org/T258912 (elukey) p:Triage→High
[06:43:56] PROBLEM - IPMI Sensor Status on kubernetes2002 is CRITICAL: connect to address 10.192.16.42 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[06:44:33] !log truncate big log file on an-launcher1002 that is filling up the /srv partition
[06:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:24] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:46:34] RECOVERY - Check size of conntrack table on kubernetes2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:47:02] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:04] RECOVERY - configured eth on kubernetes2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:48:12] RECOVERY - Disk space on kubernetes2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=kubernetes2002&var-datasource=codfw+prometheus/ops
[06:50:40] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:53:04] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/24136/deploy1001.eqiad.wmnet/index.html verified this to work" [puppet] - https://gerrit.wikimedia.org/r/613191 (owner: Giuseppe Lavagetto)
[06:54:46] RECOVERY - MD RAID on kubernetes2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:55:12] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[06:55:12] RECOVERY - DPKG on kubernetes2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:00:40] !log Deploy schema change on s5 codfw T256682
[07:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:46] T256682: page_restrictions indexes have been majestically drifting from code - https://phabricator.wikimedia.org/T256682
[07:02:38] RECOVERY - dhclient process on kubernetes2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:14:26] RECOVERY - Check the NTP synchronisation status of timesyncd on kubernetes2002 is OK: OK: synced at Mon 2020-07-27 07:14:24 UTC. https://wikitech.wikimedia.org/wiki/NTP
[07:14:50] RECOVERY - IPMI Sensor Status on kubernetes2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[07:25:48] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (elukey) ` elukey@krb1001:~$ sudo manage_principals.py create agaduran --email_address=alberto.duran@epfl.ch Principal successfully created. Make sure...
[07:26:14] (PS1) Elukey: admin: add krb flag for agaduran [puppet] - https://gerrit.wikimedia.org/r/616442 (https://phabricator.wikimedia.org/T258214)
[07:27:19] (CR) Elukey: [C: +2] admin: add krb flag for agaduran [puppet] - https://gerrit.wikimedia.org/r/616442 (https://phabricator.wikimedia.org/T258214) (owner: Elukey)
[07:29:00] Operations, MassMessage, MediaWiki-JobQueue, Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Lea_Lacroix_WMDE) @Mohammed_Sadat_WMDE FYI, to check if it doesn't happen with the Weekly Summary
[07:30:09] <_joe_> elukey: uhm I thought our docs said that for privatedata-users kerberos wasn't strictly needed
[07:32:16] _joe_ Martin (the user's supervisor) specifically asked kerberos on the analytics chan, so I added it. In theory it is not needed if the user only wants to use dashboarding tools etc..
[07:32:25] <_joe_> oh ok
[07:32:35] <_joe_> I wanted to be sure our docs weren't wrong
[07:32:36] <_joe_> thanks
[07:33:11] nono please let me know if I have to change the wording, maybe it is better to specifically ask to users if they need kerberos
[07:34:29] (PS1) Ema: prometheus: whitelist atskafka runtime metrics [puppet] - https://gerrit.wikimedia.org/r/616444 (https://phabricator.wikimedia.org/T253551)
[07:38:58] (CR) Giuseppe Lavagetto: [C: +2] envoy: fix deprecated filter names [puppet] - https://gerrit.wikimedia.org/r/615162 (https://phabricator.wikimedia.org/T258140) (owner: Giuseppe Lavagetto)
[07:51:26] (CR) Kormat: [C: +2] mariadb::monitor::prometheus: Remove unused parameters [puppet] - https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: Kormat)
[07:53:23] (PS12) Kormat: mariadb: Refactor tendril+zarcillo [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566)
[07:57:10] (PS13) Kormat: mariadb: Refactor tendril+zarcillo [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566)
[08:04:05] (CR) Kormat: "> Questions about the PCC:" [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: Kormat)
[08:05:56] (CR) Marostegui: "> Patch Set 12:" [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: Kormat)
[08:08:06] (CR) Kormat: "> Patch Set 13:" [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: Kormat)
[08:08:25] (PS6) JMeybohm: Add helm-charts discovery record [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843)
[08:12:06] PROBLEM - puppet last run on otrs1001 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:13:06] Operations, DBA, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[08:13:50] Operations, DBA, User-Kormat: Remove unused parameters from profile::mariadb::monitor::prometheus - https://phabricator.wikimedia.org/T256879 (Kormat) Open→Resolved a:Kormat Closing this, refactoring of the mysql exporter puppet code will be covered by the parent task.
[08:13:53] Operations, DBA, User-Kormat, User-jbond: Refactor mariadb puppet code - https://phabricator.wikimedia.org/T256972 (Kormat)
[08:15:34] I'm briefly disabling puppet on mw* hosts
[08:17:50] (PS1) Elukey: eventlogging: remove MobileWebUIClickTracking event [puppet] - https://gerrit.wikimedia.org/r/616445
[08:18:02] (CR) Marostegui: [C: +1] mariadb: Refactor tendril+zarcillo [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: Kormat)
[08:18:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:20:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:19] (CR) Muehlenhoff: [C: +2] Revert "Remove lilypond for now" [puppet] - https://gerrit.wikimedia.org/r/615851 (owner: Tim Starling)
[08:21:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:25:11] (CR) JMeybohm: [C: +2] Add helm-charts discovery record [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:25:22] (CR) Elukey: [C: +1] wmnet: Change m2-master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/616341 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[08:26:02] elukey!!! <3
[08:26:04] (CR) Kormat: [C: +1] wmnet: Change m2-master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/616341 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[08:26:19] thanks kormat!
[08:26:30] (CR) Marostegui: [C: +2] wmnet: Change m2-master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/616341 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[08:26:33] you're vaguely welcome! :)
[08:26:38] (PS2) Marostegui: wmnet: Change m2-master to dbproxy1013 [dns] - https://gerrit.wikimedia.org/r/616341 (https://phabricator.wikimedia.org/T255408)
[08:28:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:29:43] (CR) Filippo Giunchedi: [C: +1] "LGTM! Would be nice to get a logstash filter test going too" [puppet] - https://gerrit.wikimedia.org/r/616116 (https://phabricator.wikimedia.org/T248181) (owner: Herron)
[08:29:44] RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:29:46] puppet on mw hosts is enabled again
[08:30:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:33:45] (PS1) DCausse: [wdqs] disable autoploys on wdqs1009 [puppet] - https://gerrit.wikimedia.org/r/616446
[08:38:02] (CR) Ayounsi: [C: +2] Add 185.71.138.0/24 to wikimedia4 [homer/public] - https://gerrit.wikimedia.org/r/616103 (owner: Ayounsi)
[08:38:59] (CR) Filippo Giunchedi: "I'm +1 on the idea/approach in general, though I think we should be able to drop the -service and -cluster suffixes to be consistent with " [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron)
[08:39:24] !log push "Add 185.71.138.0/24 to wikimedia4" to all routers
[08:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:05] (CR) Filippo Giunchedi: [C: +1] arclamp: Run & scrape Prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) (owner: Dave Pifke)
[08:45:12] (PS3) Volans: GC: fix reported counter [software/debmonitor] - https://gerrit.wikimedia.org/r/615788
[08:45:25] (CR) Volans: "Addressed comment" (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/615788 (owner: Volans)
[08:49:50] (CR) Volans: [C: +1] "Python wise looks ok to me. I can't speak about some of the logic but if you've tested it and the generated config is what you want go ahe" [software/homer/deploy] - https://gerrit.wikimedia.org/r/613642 (owner: Ayounsi)
[08:51:52] (CR) Muehlenhoff: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/615788 (owner: Volans)
[08:53:42] Operations: netbox1001's root partition is filling up - https://phabricator.wikimedia.org/T258912 (Volans) @elukey thanks for the task, this was already mentioned on IRC I think on Friday by Jaime and @crusnov ack'ed the ping. It's definitely the dumps and the fix is outlined in T231512
[08:54:01] Operations: netbox1001's root partition is filling up - https://phabricator.wikimedia.org/T258912 (Volans) a:crusnov
[08:59:13] (PS2) Filippo Giunchedi: profile: add alert on no logs ingested [puppet] - https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294)
[09:00:58] (CR) jerkins-bot: [V: -1] profile: add alert on no logs ingested [puppet] - https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294) (owner: Filippo Giunchedi)
[09:01:38] (PS3) Filippo Giunchedi: profile: add alert on no logs ingested [puppet] - https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294)
[09:03:54] (CR) Filippo Giunchedi: [C: +2] profile: add alert on no logs ingested [puppet] - https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294) (owner: Filippo Giunchedi)
[09:04:23] (PS1) JMeybohm: ATS: Add backend for helm-charts.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843)
[09:05:15] (CR) JMeybohm: "Do I need to add something to hieradata/role/common/cache/text.yaml for caching as well?" [puppet] - https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[09:08:36] RECOVERY - Disk space on prometheus1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[09:11:51] (CR) Vgutierrez: [C: +1] "looking good on the TLS side of things" [puppet] - https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[09:12:13] (CR) Alexandros Kosiaris: [C: +1] chartmuseum: Add systemd timer to package and push charts [puppet] - https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[09:13:17] (PS3) Hnowlan: Add discovery and disabled LVS components for API gateway [puppet] - https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908)
[09:13:27] jayme: the CR looks good, but yeah, I'd say you need to provide some caching rules but ema will confirm that
[09:14:40] Operations, SRE-Access-Requests, Wikidata, Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (Joe) p:Triage→High
[09:14:53] vgutierrez: thanks for taking a look!
[09:16:54] np
[09:17:42] Operations: Broken cumin aliases - https://phabricator.wikimedia.org/T258377 (Joe) Open→Resolved
[09:19:14] PROBLEM - cassandra-b service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:21:06] Operations, Wikimedia-Logstash, observability, User-fgiunchedi: Alert on no (or "few") logs indexed (was: No logs ingested in logstash7 since 2020-07-06 19:23) - https://phabricator.wikimedia.org/T257294 (fgiunchedi) Open→Resolved a:fgiunchedi All done! We're alerting on no/low logs i...
[09:22:06] (CR) Filippo Giunchedi: [C: +1] prometheus: whitelist atskafka runtime metrics [puppet] - https://gerrit.wikimedia.org/r/616444 (https://phabricator.wikimedia.org/T253551) (owner: Ema)
[09:25:42] RECOVERY - cassandra-b service on restbase2018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:25:54] PROBLEM - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is CRITICAL: connect to address 10.192.48.125 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[09:27:02] RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.036 second response time on 10.192.48.125 port 9042 https://phabricator.wikimedia.org/T93886
[09:27:11] (CR) Ema: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[09:29:18] Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (Joe) I'm not sure what the procedure would be for this, if there is one at all.
[09:30:44] Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (Joe) To be clear: I fully trust @Urbanecm and he also does code deployments, so I'll be bold and grant him ops status in the channel. We can alwasy roll it back and follow any procedur...
[09:32:53] Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (Peachey88) >>! In T258741#6336806, @Joe wrote: > I'm not sure what the procedure would be for this, if there is one at all. There is some documentation here: https://wikitech.wikimedi...
[09:33:26] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-10-18 09:02:07 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/
[09:33:32] RECOVERY - HTTPS-planet on en.planet.wikimedia.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-10-18 09:02:07 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[09:34:03] acme-chief deployed the new unified cert already :)
[09:35:02] Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (Joe) Open→Resolved a:Joe
[09:36:25] _joe_: seems it works :). Thanks
[09:36:52] <_joe_> cool
[09:37:54] Operations, serviceops: wtp1025's root partition full - https://phabricator.wikimedia.org/T258775 (Joe) p:Triage→High I want to check the rest of the wtp servers as well. We'll need to schedule reimaging it ASAP.
[09:37:57] (CR) Muehlenhoff: [C: +2] Add CAS support to Superset [puppet] - https://gerrit.wikimedia.org/r/615736 (owner: Muehlenhoff)
[09:40:56] Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) >>! In T252391#6328325, @aaron wrote: > Given the libketama-style consistent hashing in twemproxy and that, AFAIK, CentralAuth sessions can regenerate...
[09:41:13] Operations, Analytics-Clusters, User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (elukey) p:Triage→Medium
[09:41:37] (PS2) JMeybohm: ATS: Add backend for helm-charts.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843)
[09:41:59] (PS4) JMeybohm: New upstream version 2.16.9 [debs/helm] - https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773)
[09:42:19] (PS1) Jcrespo: mariadb-backups: Monitor for mariadb backups of matomo&analytics_meta [puppet] - https://gerrit.wikimedia.org/r/616452 (https://phabricator.wikimedia.org/T234826)
[09:43:52] Operations, serviceops: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (Joe)
[09:45:08] Operations, serviceops: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (Joe) I think this is a good chance for the newcomers to do some practice with the process of reinstalling a server in our infra - @JMeybohm and @hnowlan I'm thinking of you, but also...
[09:46:51] Operations, serviceops: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (Joe) The problem is also present in codfw, btw: ` $ sudo cumin 'wtp*' 'lvdisplay | fgrep -A 10 _placeholder | fgrep Size' 43 hosts will be targeted: wtp[2001-2004,2006-2020].codfw.w...
[09:51:16] Operations, LDAP-Access-Requests: Add DVrandecic to superuser and turnilo wmf group - https://phabricator.wikimedia.org/T258837 (Joe) p:Triage→Medium a:Joe
[09:51:24] (PS2) Muehlenhoff: Enable CAS for Superset [puppet] - https://gerrit.wikimedia.org/r/615754
[09:51:40] (PS1) Jcrespo: mariadb-backups: Setup db1108 as the source of backups for analytics dbs [puppet] - https://gerrit.wikimedia.org/r/616453 (https://phabricator.wikimedia.org/T234826)
[09:51:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:52:47] (CR) Elukey: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/616453 (https://phabricator.wikimedia.org/T234826) (owner: Jcrespo)
[09:53:25] (CR) Elukey: [C: +1] "from my limited understanding it looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/616452 (https://phabricator.wikimedia.org/T234826) (owner: Jcrespo)
[09:53:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:54:02] akosiaris: So, mobileapps / mcs now fully served by k8s and I can dismantle the CI stuff for the scb stuff?
[09:54:14] Operations, serviceops: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (hnowlan) I've already done a few reimagings so I'm happy to let someone else have the opportunity :)
[09:56:12] James_F: It looks like it yes. I was planning on waiting like 2-3 days more before dismantling it from scb as well. But that's probably me being overly cautious
[09:59:04] Operations, serviceops: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (JMeybohm) a:JMeybohm Happy to pick this up
[09:59:44] (PS1) Giuseppe Lavagetto: admin: add Denny Vrandecic to wmf ldap only users [puppet] - https://gerrit.wikimedia.org/r/616455 (https://phabricator.wikimedia.org/T258837)
[10:00:14] (CR) DCausse: [C: -1] Use correct UriScheme in Blazegraph (1 comment) [puppet] - https://gerrit.wikimedia.org/r/616110 (https://phabricator.wikimedia.org/T251497) (owner: ZPapierski)
[10:01:05] (CR) Giuseppe Lavagetto: [C: +2] admin: add Denny Vrandecic to wmf ldap only users [puppet] - https://gerrit.wikimedia.org/r/616455 (https://phabricator.wikimedia.org/T258837) (owner: Giuseppe Lavagetto)
[10:02:58] akosiaris: :-)
[10:03:57] Operations, LDAP-Access-Requests, Patch-For-Review: Add DVrandecic to superuser and turnilo wmf group - https://phabricator.wikimedia.org/T258837 (Joe) Open→Resolved You should be now able to access those tools and others. Let us know if you experience any issue.
[10:04:01] (PS1) Muehlenhoff: Correct usage of forwarded proto setting for Superset/CAS [puppet] - https://gerrit.wikimedia.org/r/616456
[10:06:02] Operations, LDAP-Access-Requests: Enable UF2 for Urbanecm's account - https://phabricator.wikimedia.org/T258781 (Joe) Open→Resolved p:Triage→Medium a:Joe This should be sorted out. LMK if anything doesn't work.
[10:08:39] (CR) Muehlenhoff: [C: +2] Correct usage of forwarded proto setting for Superset/CAS [puppet] - https://gerrit.wikimedia.org/r/616456 (owner: Muehlenhoff)
[10:08:45] _joe_: I'll merge your patch along to add Denny
[10:08:45] Operations, Pywikibot, Traffic, cloud-services-team (Kanban): http://pywikibot.org/ is displaying Wikimedia error page - https://phabricator.wikimedia.org/T257536 (Joe) p:Triage→Low Adding traffic as they're involved in managing both our DNS and the ncredir service.
[10:08:57] <_joe_> moritzm: ouch yes thanks
[10:09:09] <_joe_> I already added him to the ldap group a couple minutes ago
[10:14:23] (PS1) Muehlenhoff: Fix typo in Hiera variable name [puppet] - https://gerrit.wikimedia.org/r/616458
[10:15:43] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24143/" [puppet] - https://gerrit.wikimedia.org/r/615754 (owner: Muehlenhoff)
[10:19:37] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/616164 (owner: Dzahn)
[10:22:37] (CR) Muehlenhoff: "Adding Greg for comments as he's the staff sponsor for Chad's volunteer NDA. If we confirm this is no longer used/needed with Chad, we sho" [puppet] - https://gerrit.wikimedia.org/r/616164 (owner: Dzahn)
[10:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T1030).
[10:33:45] !log make cr*-ulsfo interfaces netbox compliant [10:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616463 (https://phabricator.wikimedia.org/T128546) [10:38:08] (03PS24) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [10:39:16] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [10:39:46] (03PS1) 10Jbond: CAS 6.1.7.1: upgrade cas software [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616464 [10:40:41] (03PS2) 10Jbond: CAS 6.1.7.1: upgrade cas software [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616464 (https://phabricator.wikimedia.org/T258911) [10:40:52] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616463 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:41:41] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616463 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:44:33] (03PS1) 10Hnowlan: api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) [10:45:46] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:616463| Bumping portals to master (616463)]] (duration: 01m 10s) [10:45:48] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: add helmfile.d configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/616467 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [10:45:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:50] 10Operations, 10Puppet, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10jbond) [10:46:07] 10Operations, 10Puppet, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10jbond) [10:46:52] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:616463| Bumping portals to master (616463)]] (duration: 01m 05s) [10:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:56] (03CR) 10Ema: [C: 03+2] Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [10:50:07] (03CR) 10Ema: [C: 03+2] Move to dh-golang [software/atskafka] - 10https://gerrit.wikimedia.org/r/616023 (owner: 10Ema) [10:50:18] (03CR) 10Ema: [C: 03+2] Go mod configuration [software/atskafka] - 10https://gerrit.wikimedia.org/r/616021 (owner: 10Ema) [10:50:29] (03CR) 10Ema: [C: 03+2] Use testify for testing [software/atskafka] - 10https://gerrit.wikimedia.org/r/616022 (owner: 10Ema) [10:50:39] (03CR) 10Ema: [C: 03+2] Release version 0.10 [software/atskafka] - 10https://gerrit.wikimedia.org/r/616046 (owner: 10Ema) [10:51:04] (03CR) 10Ema: [C: 03+2] prometheus: whitelist atskafka runtime metrics [puppet] - 10https://gerrit.wikimedia.org/r/616444 (https://phabricator.wikimedia.org/T253551) (owner: 10Ema) [10:53:26] (03CR) 10Ema: [C: 03+1] ATS: Add backend for helm-charts.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [10:54:31] !log upload atskafka 0.10 to buster-wikimedia, upgrade cp3050 T254317 [10:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:37] T254317: Compare logs produced by atskfafka with those produced by varnishkafka - 
https://phabricator.wikimedia.org/T254317 [10:58:36] (03PS1) 10Jbond: nutcracker: drop puppet < 3.5 support [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) [10:59:19] (03CR) 10JMeybohm: "Debian glue currently fails because go >1.11 is needed (BACKPORTS=yes) now." [debs/helm] - 10https://gerrit.wikimedia.org/r/616065 (https://phabricator.wikimedia.org/T258773) (owner: 10JMeybohm) [10:59:36] (03CR) 10JMeybohm: [C: 03+2] ATS: Add backend for helm-charts.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/616447 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T1100). [11:00:49] looks like nothing to deploy [11:01:14] (is the Bot under the Fountain a reference to anything in particular? I know Ecthelion *of* the Fountain…) [11:02:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "overall LGTM, see small issue in the comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [11:02:24] (03CR) 10JMeybohm: [C: 03+2] chartmuseum: Add systemd timer to package and push charts [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:02:26] (03CR) 10JMeybohm: [C: 03+2] modules/systemd: Allow to define EnvironmentFile for timers [puppet] - 10https://gerrit.wikimedia.org/r/613634 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:12:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM!" 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616464 (https://phabricator.wikimedia.org/T258911) (owner: 10Jbond) [11:16:46] (03PS2) 10Jbond: nutcracker: drop puppet < 3.5 support [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) [11:16:53] (03CR) 10Jbond: "Thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [11:18:08] (03PS3) 10Jbond: nutcracker: drop puppet < 3.5 support [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) [11:22:10] (03PS1) 10JMeybohm: chartmuseum: Run timer command (ExecStart) in a shell [puppet] - 10https://gerrit.wikimedia.org/r/616473 (https://phabricator.wikimedia.org/T253843) [11:26:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] chartmuseum: Run timer command (ExecStart) in a shell [puppet] - 10https://gerrit.wikimedia.org/r/616473 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:28:06] (03PS4) 10Jbond: nutcracker: drop puppet < 3.5 support [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) [11:28:55] !log installing an-tool1009 T258768 [11:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:00] T258768: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 [11:29:58] (03PS2) 10JMeybohm: chartmuseum: Run timer command (ExecStart) in a shell [puppet] - 10https://gerrit.wikimedia.org/r/616473 (https://phabricator.wikimedia.org/T253843) [11:31:11] (03CR) 10Elukey: "Added some minor comments!" 
(037 comments) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [11:34:33] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24147/" [puppet] - 10https://gerrit.wikimedia.org/r/616469 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [11:34:40] (03PS1) 10Jbond: nutcracker: ensure verbosity is an integer [puppet] - 10https://gerrit.wikimedia.org/r/616479 (https://phabricator.wikimedia.org/T258931) [11:39:54] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24148/" [puppet] - 10https://gerrit.wikimedia.org/r/616479 (https://phabricator.wikimedia.org/T258931) (owner: 10Jbond) [11:40:56] (03CR) 10Jbond: [V: 03+2 C: 03+2] CAS 6.1.7.1: upgrade cas software [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616464 (https://phabricator.wikimedia.org/T258911) (owner: 10Jbond) [11:43:33] (03PS1) 10Jbond: CAS 6.1.7: prepare release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616483 [11:45:03] (03PS3) 10Urbanecm: labs: Increase AbuseFilter's emergency disable vars for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) [11:45:05] PROBLEM - Check systemd state on chartmuseum1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:49] (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) (owner: 10Urbanecm) [11:46:49] PROBLEM - Check the last execution of helm-chartctl-package-all on chartmuseum1001 is CRITICAL: CRITICAL: Status of the systemd unit helm-chartctl-package-all https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:47:01] (03CR) 10Muehlenhoff: CAS 6.1.7: prepare release (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616483 (owner: 10Jbond) [11:47:21] thats me [11:48:02] (03PS2) 10Jbond: CAS 6.1.7.1: prepare release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616483 [11:48:35] (03Merged) 10jenkins-bot: labs: Increase AbuseFilter's emergency disable vars for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616081 (https://phabricator.wikimedia.org/T230305) (owner: 10Urbanecm) [11:52:21] (03CR) 10Jbond: [V: 03+2 C: 03+2] CAS 6.1.7.1: prepare release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/616483 (owner: 10Jbond) [11:52:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1138', diff saved to https://phabricator.wikimedia.org/P12047 and previous config saved to /var/cache/conftool/dbconfig/20200727-115258-marostegui.json [11:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615754 (owner: 10Muehlenhoff) [11:56:54] RECOVERY - Check systemd state on chartmuseum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:58] RECOVERY - Check the last execution of helm-chartctl-package-all on chartmuseum1001 is OK: OK: Status of the systemd unit helm-chartctl-package-all 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:57:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1138', diff saved to https://phabricator.wikimedia.org/P12048 and previous config saved to /var/cache/conftool/dbconfig/20200727-115739-marostegui.json [11:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1149 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12049 and previous config saved to /var/cache/conftool/dbconfig/20200727-115818-marostegui.json [11:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:47] !log Deploy MCR schema change on db1149 [11:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:03] (03PS1) 10Muehlenhoff: Assign the Hue role to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/616507 (https://phabricator.wikimedia.org/T258768) [12:01:21] (03PS3) 10JMeybohm: chartmuseum: Run timer command (ExecStart) in a shell [puppet] - 10https://gerrit.wikimedia.org/r/616473 (https://phabricator.wikimedia.org/T253843) [12:03:37] (03CR) 10JMeybohm: [C: 03+2] chartmuseum: Run timer command (ExecStart) in a shell [puppet] - 10https://gerrit.wikimedia.org/r/616473 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:04:04] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [12:04:24] (03CR) 10Urbanecm: [C: 04-1] "this should be fixed to use s5, per Manuel's comment at T257943#6337278" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:06:25] (03CR) 10Volans: [C: 03+2] GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 (owner: 10Volans) [12:07:15] (03CR) 10Jbond: "> Patch 
Set 2: -Code-Review" [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [12:07:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [12:08:23] (03Merged) 10jenkins-bot: GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 (owner: 10Volans) [12:11:58] (03PS4) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:12:00] (03PS3) 10Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) [12:13:53] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [12:13:59] (03CR) 10Elukey: [C: 03+1] Assign the Hue role to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/616507 (https://phabricator.wikimedia.org/T258768) (owner: 10Muehlenhoff) [12:14:17] (03CR) 10Hashar: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:14:26] (03CR) 10Hashar: [C: 03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:15:21] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [12:15:22] (03PS14) 10Kormat: mariadb: Refactor tendril+zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) [12:16:04] !log installing batik security updates [12:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:57] (03CR) 10Kormat: [C: 03+2] mariadb: Refactor tendril+zarcillo [puppet] - 
10https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [12:19:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 60 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:19:59] (03PS1) 10Volans: Release v0.2.7 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/616510 [12:21:12] !log installing ruby-json security updates [12:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:20] (03CR) 10Volans: [V: 03+2 C: 03+2] "Merging and releasing the GC fix." [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/616510 (owner: 10Volans) [12:23:16] !log A:cp rolling varnish-frontend restart to actually discard old VCL still pointing at varnishcheck/check T255015 T236754 [12:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:22] T255015: Varnish and ATS health-check improvements - https://phabricator.wikimedia.org/T255015 [12:23:22] T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754 [12:23:50] PROBLEM - Disk space on webperf1002 is CRITICAL: DISK CRITICAL - free space: /srv 11480 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [12:24:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 44 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:25:02] !log upload new cas package [12:25:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:10] !log upload new cas package to buster-wikimedia [12:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:21] !log volans@deploy1001 Started deploy [debmonitor/deploy@25dbd20]: Release v0.2.7 [12:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:48] !log volans@deploy1001 Finished deploy [debmonitor/deploy@25dbd20]: Release v0.2.7 (duration: 00m 27s) [12:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:48] (03CR) 10Hashar: [C: 03+1] "Rebased to fix a trivial conflict. The puppet diff is the same, the previously empty repository parameters is now explicitly set:" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:30:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:48] !log standardize cr2-codfw interfaces [12:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:22] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1))): scap configuration in puppet defaults to forge the git repo name with 'med... 
- https://phabricator.wikimedia.org/T257413 [12:33:40] (03PS4) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [12:33:49] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:35:26] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudcephmon1003 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616172 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [12:36:29] !log akosiaris@cumin1001 conftool action : set/weight=0; selector: dc=eqiad,service=mobileapps,name=scb1001.eqiad.wmnet [12:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:40] * akosiaris testing something [12:36:51] PROBLEM - MariaDB read only db_inventory on db1115 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.44-MariaDB, Uptime 948536s, event_scheduler: True, 346.00 QPS, connection latency: 0.002278s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:37:21] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=mobileapps,name=scb1001.eqiad.wmnet [12:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:35] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=mobileapps,name=scb1001.eqiad.wmnet [12:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] !log akosiaris@cumin1001 conftool action : set/weight=0; selector: dc=eqiad,service=mobileapps,name=scb1001.eqiad.wmnet [12:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:46] (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/616513 
(https://phabricator.wikimedia.org/T254462) [12:37:50] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb1001.eqiad.wmnet [12:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:54] (03PS5) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [12:38:08] ok, test complete [12:38:28] ACKNOWLEDGEMENT - MariaDB read only db_inventory on db1115 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.1.44-MariaDB, Uptime 948613s, event_scheduler: True, 3020.42 QPS, connection latency: 0.001981s Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:38:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 and pool db1105:3311 as vslow T254462', diff saved to https://phabricator.wikimedia.org/P12050 and previous config saved to /var/cache/conftool/dbconfig/20200727-123833-marostegui.json [12:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:39] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [12:38:42] !log disable puppet on idp1001/2001 [12:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:52] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:39:18] (03PS2) 10Muehlenhoff: Also return uid from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/615665 [12:39:56] (03PS6) 10Urbanecm: Initial configuration for avkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) [12:40:03] (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/616513 
(https://phabricator.wikimedia.org/T254462) (owner: 10Marostegui) [12:41:00] !log Compress innodb on db1106, this will generate lag on enwiki on labsdb hosts (wiki replicas) T254462 [12:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:07] (03CR) 10Muehlenhoff: [C: 03+2] Also return uid from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/615665 (owner: 10Muehlenhoff) [12:45:35] (03CR) 10Marostegui: "s5.dblist, db-eqiad.php and db-codfw.php look good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:49:26] (03PS3) 10Andrew Bogott: Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616156 (https://phabricator.wikimedia.org/T258826) [12:49:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311', diff saved to https://phabricator.wikimedia.org/P12051 and previous config saved to /var/cache/conftool/dbconfig/20200727-125045-marostegui.json [12:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3311 with less weight', diff saved to https://phabricator.wikimedia.org/P12052 and previous config saved to /var/cache/conftool/dbconfig/20200727-125207-marostegui.json [12:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311', diff saved to https://phabricator.wikimedia.org/P12053 and previous config saved to 
/var/cache/conftool/dbconfig/20200727-125351-marostegui.json [12:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:19] (03PS1) 10Kormat: mariadb: Fix read-only logic in db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/616515 (https://phabricator.wikimedia.org/T258566) [12:57:40] (03CR) 10Urbanecm: "Hey Amir, I'd appreciate your review of this one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [12:57:43] (03PS1) 10Muehlenhoff: Failover IDP to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/616516 [12:58:06] (03CR) 10Kormat: "A beautiful PCC run: https://puppet-compiler.wmflabs.org/compiler1001/24149/" [puppet] - 10https://gerrit.wikimedia.org/r/616515 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [12:58:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3311 with less weight', diff saved to https://phabricator.wikimedia.org/P12054 and previous config saved to /var/cache/conftool/dbconfig/20200727-125824-marostegui.json [12:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:35] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [13:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:03:24] (03PS1) 10JMeybohm: wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) [13:03:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:13] (03CR) 10Muehlenhoff: [C: 03+2] Failover IDP to idp1001 [dns] - 10https://gerrit.wikimedia.org/r/616516 (owner: 10Muehlenhoff) [13:05:36] (03CR) 10Marostegui: [C: 03+1] mariadb: Fix read-only logic in db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/616515 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [13:06:35] (03CR) 10Ottomata: [C: 03+1] profile::analytics::refinery::job::refine: disable monitor [puppet] - 10https://gerrit.wikimedia.org/r/616198 (owner: 10Elukey) [13:07:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311', diff saved to https://phabricator.wikimedia.org/P12055 and previous config saved to /var/cache/conftool/dbconfig/20200727-130713-marostegui.json [13:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:01] (03CR) 10Kormat: [C: 03+2] mariadb: Fix read-only logic in db_inventory [puppet] - 10https://gerrit.wikimedia.org/r/616515 (https://phabricator.wikimedia.org/T258566) (owner: 10Kormat) [13:09:15] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::refine: disable monitor [puppet] - 10https://gerrit.wikimedia.org/r/616198 (owner: 10Elukey) [13:09:19] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 562 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:09:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:09:38] (03CR) 10JMeybohm: [C: 03+2] wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:09:53] (03CR) 10Muehlenhoff: wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:09:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3311 with normal weight and pool db1089 into vslow', diff saved to https://phabricator.wikimedia.org/P12056 and previous config saved to /var/cache/conftool/dbconfig/20200727-130954-marostegui.json [13:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'More weight to db1089 in main traffic', diff saved to https://phabricator.wikimedia.org/P12057 and previous config saved to /var/cache/conftool/dbconfig/20200727-131123-marostegui.json [13:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:30] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Ottomata) First I've heard that WikimediaEvents uses memcached to track funnels. From https://meta.wikimedia.org/wiki/Schema_talk:EditorJourney, it looks like... 
[13:12:07] (03CR) 10JMeybohm: [C: 03+2] wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:14:18] (03CR) 10Muehlenhoff: wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:14:27] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 48 probes of 562 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:14:51] RECOVERY - MariaDB read only db_inventory on db1115 is OK: Version 10.1.44-MariaDB, Uptime 950817s, read_only: False, event_scheduler: True, 1648.14 QPS, connection latency: 0.002110s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:17:52] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10colewhite) utils/hiera_lookup shows me the same error. On the puppetmaster, I was not running with sudo and recieved: ` $ puppet lookup --ex... [13:19:52] !log standardize some cr2-esams interfaces [13:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:01] (03PS2) 10JMeybohm: wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) [13:21:14] (03CR) 10Elukey: [C: 03+1] "Let's do it!" 
[puppet] - 10https://gerrit.wikimedia.org/r/615754 (owner: 10Muehlenhoff) [13:21:27] (03PS1) 10Jbond: diamond: remove unused file [puppet] - 10https://gerrit.wikimedia.org/r/616518 (https://phabricator.wikimedia.org/T258943) [13:21:43] (03PS4) 10Elukey: Add PTR/AAAA records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) [13:21:56] (03CR) 10JMeybohm: wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:22:37] (03CR) 10Ottomata: [C: 03+1] ":)" [software/httpbb] - 10https://gerrit.wikimedia.org/r/615570 (owner: 10RLazarus) [13:23:22] (03CR) 10Elukey: [C: 03+2] Add PTR/AAAA records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:24:04] 10Operations, 10Puppet, 10Patch-For-Review: Audit sudo permissions - https://phabricator.wikimedia.org/T258943 (10jbond) [13:24:15] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Joe) Data in redis got evicted at furious rates for years before sessionstore, if we remove one server from the ring i don't expect any real issue. [13:24:17] 10Operations, 10Puppet, 10Patch-For-Review: Audit sudo permissions - https://phabricator.wikimedia.org/T258943 (10jbond) >>! In T258943#6337484, @Aklapper wrote: > @jbond: On which systems / who is we? (In other words: Could you add a project tag, please? Thanks!) thanks updated [13:26:38] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10Joe) >>! In T252391#6248986, @RLazarus wrote: > Side note: This question is also interesting from a DC switchover perspective (T243316) since that will also ef... 
[13:26:40] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:26:58] (03CR) 10JMeybohm: [C: 03+2] wtp: Change partition scheme to partman/custom/mw-raid1-lvm.cfg [puppet] - 10https://gerrit.wikimedia.org/r/616517 (https://phabricator.wikimedia.org/T258775) (owner: 10JMeybohm) [13:27:31] (03PS1) 10Andrew Bogott: Ceph: temporarily hack the public network to include both old and new [puppet] - 10https://gerrit.wikimedia.org/r/616519 (https://phabricator.wikimedia.org/T258826) [13:28:03] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: temporarily hack the public network to include both old and new [puppet] - 10https://gerrit.wikimedia.org/r/616519 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [13:28:57] (03PS2) 10Hashar: zuul: stop prefixing report with the job name [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) [13:29:00] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10jbond) > On the puppetmaster, I was not running with sudo and received I don't think puppet lookup will work without sudo as it tries to loo... [13:29:48] (03CR) 10Hashar: [C: 03+1] "Trivial rebase." [puppet] - 10https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: 10Hashar) [13:30:16] ACKNOWLEDGEMENT - MariaDB read only db_inventory on db2093 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.12-MariaDB-log, Uptime 954196s, event_scheduler: False, 13.60 QPS, connection latency: 0.003028s Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:34:15] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10kostajh) >>! 
In T252391#6337452, @Ottomata wrote: > First I've heard that WikimediaEvents uses memcached to track funnels. From https://meta.wikimedia.org/wik... [13:34:29] (03PS1) 10Andrew Bogott: Ceph: update ip for cloudcephmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/616520 (https://phabricator.wikimedia.org/T258826) [13:35:29] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: update ip for cloudcephmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/616520 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [13:35:33] 10Operations, 10DBA, 10User-Kormat: Refactor tendril+zarcillo roles/profiles - https://phabricator.wikimedia.org/T258566 (10Kormat) 05Open→03Resolved Refactoring success \o/ [13:35:37] 10Operations, 10DBA, 10Epic, 10User-Kormat: Use zarcillo as an authoritative inventory of db instances/roles - https://phabricator.wikimedia.org/T257814 (10Kormat) [13:35:48] 10Operations, 10Puppet, 10Patch-For-Review: Audit sudo permissions in the puppet repo - https://phabricator.wikimedia.org/T258943 (10herron) p:05Triage→03Medium [13:36:56] (03CR) 10Muehlenhoff: "Per https://phabricator.wikimedia.org/T210993 it might be used in Tool Labs, but possibly that task is outdated?" 
[puppet] - 10https://gerrit.wikimedia.org/r/616518 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [13:38:06] 10Operations, 10Analytics, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10herron) p:05Triage→03Medium [13:38:59] 10Operations, 10vm-requests: codfw: 1 VM for kafkamon - https://phabricator.wikimedia.org/T257561 (10herron) p:05Triage→03Medium a:03herron [13:39:16] 10Operations, 10vm-requests: eqiad: 1 VM for kafkamon - https://phabricator.wikimedia.org/T257560 (10herron) p:05Triage→03Medium a:03herron [13:40:21] 10Operations, 10Wikimedia-General-or-Unknown: Periodically run purgeExpiredBlocks.php maintenance script - https://phabricator.wikimedia.org/T257473 (10herron) p:05Triage→03Medium [13:41:57] 10Operations, 10Analytics, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:42:00] 10Operations, 10SRE-tools: Improve sre.hosts.decommission (additionally find host yaml files) - https://phabricator.wikimedia.org/T257297 (10herron) p:05Triage→03Medium [13:44:07] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (10herron) p:05Triage→03Medium [13:44:39] 10Operations, 10Wikimedia-Logstash, 10observability: Buster elasticsearch-curator version not compatible with ELK7 - https://phabricator.wikimedia.org/T257024 (10herron) p:05Triage→03Medium a:03herron [13:45:48] 10Operations, 10serviceops, 10Patch-For-Review: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts: ` wtp1025.eqiad.wmnet ` The log can be found in `/var/log/... 
[13:50:43] (03CR) 10Giuseppe Lavagetto: helmfile: strawman refactoring (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:51:17] (03PS3) 10Giuseppe Lavagetto: helmfile: strawman refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) [13:57:08] !log upgrading idp2001 to CAS 6.1.7.1 [13:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:46] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616156 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [14:02:51] (03PS4) 10Andrew Bogott: Move cloudcephmon1002 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616156 (https://phabricator.wikimedia.org/T258826) [14:03:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={idp,swagger_check_cxserver_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:25] (03PS9) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [14:04:31] (03CR) 10Ottomata: Initial debian commit (037 comments) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [14:04:51] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [14:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:03] (03PS1) 10Herron: mx: add paniclog to exim logrotate [puppet] - 10https://gerrit.wikimedia.org/r/616524 (https://phabricator.wikimedia.org/T257016) [14:05:44] 10Operations, 10Mail, 10observability, 10Patch-For-Review, 
10User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (10herron) p:05Triage→03Medium [14:07:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] (03PS1) 10Mholloway: Proton: Launch Chromium with additional flags; enable host_blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/616526 [14:09:37] (03PS2) 10Mholloway: Proton: Launch Chromium with additional flags; enable host_blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/616526 [14:13:07] (03CR) 10Elukey: [C: 03+1] Turnilo: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/615461 (owner: 10Muehlenhoff) [14:14:11] (03CR) 10Elukey: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/616458 (owner: 10Muehlenhoff) [14:15:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] Proton: Launch Chromium with additional flags; enable host_blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/616526 (owner: 10Mholloway) [14:16:25] (03CR) 10MSantos: [C: 03+2] Proton: Launch Chromium with additional flags; enable host_blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/616526 (owner: 10Mholloway) [14:17:38] (03Merged) 10jenkins-bot: Proton: Launch Chromium with additional flags; enable host_blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/616526 (owner: 10Mholloway) [14:19:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] !log standardize cr1-codfw interfaces [14:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:39] (03PS1) 10DCausse: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) [14:21:30] !log andrew@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) [14:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:25:43] OTRS SMTP is about to alert [14:25:56] (03CR) 10JMeybohm: helmfile: strawman refactoring (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:26:02] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 562 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:31:22] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 48 probes of 562 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:32:57] (03PS8) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [14:32:59] (03PS1) 10Alexandros Kosiaris: otrs: Add OTRS 6.0.29 prereq packages [puppet] - 10https://gerrit.wikimedia.org/r/616531 (https://phabricator.wikimedia.org/T187984) [14:33:35] (03CR) 10jerkins-bot: [V: 04-1] otrs: Add OTRS 6.0.29 prereq packages [puppet] - 10https://gerrit.wikimedia.org/r/616531 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [14:33:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_wikifeeds_eqiad} site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:34:09] (03CR) 10RLazarus: [C: 03+1] admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [14:35:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:35:52] (03PS3) 10Andrew Bogott: Move cloudcephmon1001 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616157 (https://phabricator.wikimedia.org/T258826) [14:36:01] (03PS1) 10Alexandros Kosiaris: Temporarily add ticket-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/616532 (https://phabricator.wikimedia.org/T187984) [14:37:10] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/24150/" [puppet] - 10https://gerrit.wikimedia.org/r/616458 (owner: 10Muehlenhoff) [14:38:02] (03CR) 10Muehlenhoff: [C: 03+2] Fix typo in Hiera variable name [puppet] - 10https://gerrit.wikimedia.org/r/616458 (owner: 10Muehlenhoff) [14:38:54] (03CR) 10Andrew Bogott: [C: 03+2] Move cloudcephmon1001 from .wikimedia.org to .eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/616157 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [14:40:04] (03PS2) 10Esanders: Fix VE-RealTime CSP entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615728 [14:40:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1149', diff saved to https://phabricator.wikimedia.org/P12058 and previous config saved to /var/cache/conftool/dbconfig/20200727-144034-marostegui.json [14:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:14] (03CR) 10Muehlenhoff: "Does the cron job which sends out the mail run 
before the logrotate kicks in? Then sounds good to me" [puppet] - 10https://gerrit.wikimedia.org/r/616524 (https://phabricator.wikimedia.org/T257016) (owner: 10Herron) [14:44:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/cloudceph on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/cloudceph is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:44:22] PROBLEM - LVS cloudceph eqiad port 9283/tcp - CloudVPS Ceph Cluster Stats- cloudceph.svc.eqiad.wmnet IPv4 on cloudceph.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.51 and port 9283: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:44:48] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/cloudceph on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/cloudceph is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:45:37] andrewbogott: ^ [14:46:24] XioNoX: hm, I don't know what that is but let's see if it recovers when 1001 comes back up [14:46:40] (03CR) 10Nuria: [C: 03+1] profile::analytics::refinery::job::refine: disable monitor [puppet] - 10https://gerrit.wikimedia.org/r/616198 (owner: 10Elukey) [14:47:32] PROBLEM - OTRS SMTP on otrs1001 is CRITICAL: connect to address 10.64.16.39 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [14:47:50] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:9283]) https://wikitech.wikimedia.org/wiki/PyBal [14:48:04] ACKNOWLEDGEMENT - LVS cloudceph eqiad port 9283/tcp - CloudVPS Ceph Cluster Stats- cloudceph.svc.eqiad.wmnet IPv4 on cloudceph.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.51 and port 9283: Connection refused andrew bogott probably because T258826 is in progress https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:48:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool 
db1144:3314 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12059 and previous config saved to /var/cache/conftool/dbconfig/20200727-144807-marostegui.json [14:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:36] !log Deploy MCR change on db1144:3314 [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:18] (03CR) 10Nuria: [C: 03+1] "Makes sense to remove as this event hasn't had data for a while: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var" [puppet] - 10https://gerrit.wikimedia.org/r/616445 (owner: 10Elukey) [14:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db1146:3314 weight while db1144:3314 is depooled', diff saved to https://phabricator.wikimedia.org/P12060 and previous config saved to /var/cache/conftool/dbconfig/20200727-145010-marostegui.json [14:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:43] (03PS2) 10Alexandros Kosiaris: otrs: Add OTRS 6.0.29 prereq packages [puppet] - 10https://gerrit.wikimedia.org/r/616531 (https://phabricator.wikimedia.org/T187984) [14:50:45] (03PS9) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [14:51:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Add OTRS 6.0.29 prereq packages [puppet] - 10https://gerrit.wikimedia.org/r/616531 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [14:53:58] RECOVERY - OTRS SMTP on otrs1001 is OK: SMTP OK - 0.005 sec. 
response time https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [14:55:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [14:55:39] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:57] (03PS2) 10Muehlenhoff: Assign the Hue role to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/616507 (https://phabricator.wikimedia.org/T258768) [14:59:09] (03PS1) 10Papaul: DHCP: Change MAC address for restbase2009 [puppet] - 10https://gerrit.wikimedia.org/r/616535 (https://phabricator.wikimedia.org/T256863) [15:00:27] (03CR) 10Papaul: [C: 03+2] DHCP: Change MAC address for restbase2009 [puppet] - 10https://gerrit.wikimedia.org/r/616535 (https://phabricator.wikimedia.org/T256863) (owner: 10Papaul) [15:01:29] (03CR) 10Muehlenhoff: [C: 03+2] Assign the Hue role to an-tool1009 [puppet] - 10https://gerrit.wikimedia.org/r/616507 (https://phabricator.wikimedia.org/T258768) (owner: 10Muehlenhoff) [15:03:29] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:03:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:43] 10Operations, 10ops-codfw, 10DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (10Papaul) 05Open→03Resolved Before BIOS Version 2.4.3 Firmware Version 2.40.40.40 Lifecycle Controller Firmware 2.40.40.40 This is complete After BIOS Version 2.11.0 Firmware Version... 
[15:11:11] 10Operations, 10ops-codfw, 10DC-Ops: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 (10Marostegui) The alert cleared Thank you @papaul [15:11:28] (03PS1) 10CDanis: Revert "ATS: force cache revalidation on secure.wm.o" [puppet] - 10https://gerrit.wikimedia.org/r/616493 [15:11:52] 10Operations, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [15:12:33] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.51:9283]) https://wikitech.wikimedia.org/wiki/PyBal [15:15:18] (03CR) 10Andrew Bogott: [C: 03+2] Remove public IP addresses for cloudcephmons nodes [dns] - 10https://gerrit.wikimedia.org/r/616151 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:15:22] (03PS3) 10Andrew Bogott: Remove public IP addresses for cloudcephmons nodes [dns] - 10https://gerrit.wikimedia.org/r/616151 (https://phabricator.wikimedia.org/T258826) [15:15:45] (03CR) 10Thcipriani: "There are also a few of these used for beta in hieradata/cloud/eqiad1/deployment-prep/common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [15:18:18] 10Operations, 10serviceops: All wtp servers in eqiad have a bad partition scheme. - https://phabricator.wikimedia.org/T258775 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wtp1025.eqiad.wmnet'] ` and were **ALL** successful. [15:18:26] (03CR) 10Zfilipin: eslint: Update to eslint-config-wikimedia 0.16.0 and eslint 7.5.0 (031 comment) [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495) (owner: 10Jared Blumer) [15:19:46] (03CR) 10Zfilipin: [C: 03+1] "Looks good to me in general, with one minor comment posted previously." 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/616183 (https://phabricator.wikimedia.org/T254495) (owner: 10Jared Blumer) [15:23:59] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Papaul) Restbase2009 is ready for re-image Please open a separate task to decommission the old resetbase2009 https://netbox.wikimedia.org/dcim/devices/1099/ thanks [15:24:58] (03PS1) 10Mholloway: Proton: Remove unneeded APP_ENABLE_CANCELLABLE_PROMISES env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/616539 [15:27:20] (03CR) 10Cparle: MediaSearch A/B test on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [15:27:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:29:56] (03PS1) 10Mholloway: Update Proton to 2020-07-27-123712-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/616540 [15:30:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:30:58] (03PS1) 10Muehlenhoff: Add CAS support to Hue (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/616541 [15:32:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:33:15] (03CR) 10DCausse: MediaSearch A/B test on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [15:33:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:42:27] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/cloudceph on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/cloudceph is broken andrew bogott This is from T258826. I think it will resolve after dns catches up. https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:42:27] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/cloudceph on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/cloudceph is broken andrew bogott This is from T258826. I think it will resolve after dns catches up. 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:42:52] (03PS2) 10CDanis: Revert "ATS: force cache revalidation on secure.wm.o" [puppet] - 10https://gerrit.wikimedia.org/r/616493 (https://phabricator.wikimedia.org/T151977) [15:44:49] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/616518 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [15:45:15] 10Operations, 10observability, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10jbond) [15:46:34] 10Operations, 10observability, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10jbond) updated to mark the following as done: " toollabs::mailrelay: diamond::collector::extendedexim { 'extended_exim_collector': }; profile::toolforge::... [15:46:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/616518 (https://phabricator.wikimedia.org/T258943) (owner: 10Jbond) [15:50:24] 10Operations, 10serviceops: All wtp servers have a bad partition scheme. 
- https://phabricator.wikimedia.org/T258775 (10JMeybohm) [15:52:15] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10fdans) [15:53:34] (03CR) 10Cparle: MediaSearch A/B test on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [15:53:39] (03CR) 10Cparle: [C: 03+1] MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [15:59:40] (03CR) 10Andrew Bogott: [C: 03+2] Remove eth1 addresses for cloudcephmon hosts [dns] - 10https://gerrit.wikimedia.org/r/616152 (https://phabricator.wikimedia.org/T258826) (owner: 10Andrew Bogott) [15:59:44] (03PS3) 10Andrew Bogott: Remove eth1 addresses for cloudcephmon hosts [dns] - 10https://gerrit.wikimedia.org/r/616152 (https://phabricator.wikimedia.org/T258826) [16:02:39] (03CR) 10MSantos: [C: 03+2] Update Proton to 2020-07-27-123712-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/616540 (owner: 10Mholloway) [16:03:05] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10herron) Hello, friendly ping -- when should we expect alert1001 to be online? [16:03:36] 10Operations, 10ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (10Jclark-ctr) @Marostegui is host down now? 
i can remove in about 1 hour [16:04:31] !log Stop MySQL on db1082 for onsite maintenance - T258910 [16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:38] T258910: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 [16:05:17] !log Will show up on labsdb hosts for s5 [16:05:18] 10Operations, 10ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (10wiki_willy) a:03Jclark-ctr [16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:49] 10Operations, 10ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (10Marostegui) @Jclark-ctr the host is now off, you can proceed whenever you want Thank you! [16:14:13] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10sguebo_WMF) Hello, Thanks again for the patch. While he can access the stats server (stat1005.eqiad.wmnet... 
[16:18:55] (03PS1) 10Urbanecm: Move footer logos to wmg* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616544 (https://phabricator.wikimedia.org/T257732) [16:18:57] (03PS1) 10Urbanecm: Move footer logos to /static/images/footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616545 (https://phabricator.wikimedia.org/T257732) [16:19:10] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [16:19:48] (03CR) 10jerkins-bot: [V: 04-1] Move footer logos to wmg* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616544 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [16:23:20] (03PS2) 10Urbanecm: Move footer logos to wmg* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616544 (https://phabricator.wikimedia.org/T257732) [16:23:58] (03PS2) 10Urbanecm: Move footer logos to /static/images/footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616545 (https://phabricator.wikimedia.org/T257732) [16:29:46] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10RKemper) We discussed this during this week's SRE meeting and resolved to enable full root access for... [16:32:12] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Urbanecm) Indeed, nahidunlimited is only in analytics-privatedata-users group, not in restricted. For the... 
[16:32:32] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Urbanecm) @sguebo_wmf For the record, https://github.com/wikimedia/puppet/blob/production/modules/admin/da... [16:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2087:3316, db2087:3317 after on-site maintenance T258587', diff saved to https://phabricator.wikimedia.org/P12063 and previous config saved to /var/cache/conftool/dbconfig/20200727-163311-marostegui.json [16:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:17] T258587: db2087 internal IPMI error - https://phabricator.wikimedia.org/T258587 [16:34:08] (03CR) 10Dzahn: [C: 03+1] "was mentioned in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [16:44:35] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cumin1001.eqiad.wmnet [16:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:57] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephosd1003.eqiad.wmnet [16:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:05] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephosd1002.eqiad.wmnet [16:48:08] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephosd1001.eqiad.wmnet [16:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:45] !log andrew@cumin1001 conftool action : set/pooled=no; selector: name=cloudcephosd1001.wikimedia.org [16:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:51] !log andrew@cumin1001 conftool action : set/pooled=no; selector: 
name=cloudcephosd1002.wikimedia.org [16:48:55] !log andrew@cumin1001 conftool action : set/pooled=no; selector: name=cloudcephosd1003.wikimedia.org [16:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:58] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephmon1001.eqiad.wmnet [16:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:03] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephmon1002.eqiad.wmnet [16:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:13] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephmon1003.eqiad.wmnet [16:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:31] !log andrew@cumin1001 conftool action : set/pooled=inactive; selector: name=cloudcephosd1001.eqiad.wmnet [16:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:42] !log andrew@cumin1001 conftool action : set/pooled=inactive; selector: name=cloudcephosd1002.eqiad.wmnet [16:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:47] !log andrew@cumin1001 conftool action : set/pooled=inactive; selector: name=cloudcephosd1003.eqiad.wmnet [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:12] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/cloudceph on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:54:42] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - cloudceph_9283: Servers cloudcephmon1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:55:08] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: 
PYBAL CRITICAL - CRITICAL - cloudceph_9283: Servers cloudcephmon1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:55:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/cloudceph on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:57:58] !log andrew@cumin1001 conftool action : set/pooled=inactive; selector: name=cloudcephmon1002.eqiad.wmnet [16:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:16] !log andrew@cumin1001 conftool action : set/pooled=no; selector: name=cloudcephmon1002.eqiad.wmnet [16:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:03] !log andrew@cumin1001 conftool action : set/pooled=yes; selector: name=cloudcephmon1002.eqiad.wmnet [16:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:27] (03PS1) 10Hnowlan: aptrepo: add component for future envoy packages [puppet] - 10https://gerrit.wikimedia.org/r/616560 (https://phabricator.wikimedia.org/T254908) [17:00:05] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T1700). [17:00:29] (03CR) 10Elukey: [C: 03+2] eventlogging: remove MobileWebUIClickTracking event [puppet] - 10https://gerrit.wikimedia.org/r/616445 (owner: 10Elukey) [17:00:44] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Nuria) @sguebo_WMF since you are requesting access to data and no data is on that listed server I do not t... 
[17:02:39] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Urbanecm) >>! In T256971#6338145, @Nuria wrote: > @sguebo_WMF since you are requesting access to data and... [17:04:54] 10Operations, 10ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (10Jclark-ctr) @Marostegui removed bbu. host is booting up now [17:09:33] (03CR) 10Milimetric: [C: 03+1] profile::analytics::refinery::job::refine: disable monitor [puppet] - 10https://gerrit.wikimedia.org/r/616198 (owner: 10Elukey) [17:13:15] ACKNOWLEDGEMENT - HP RAID on db1082 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T258965 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:13:20] 10Operations, 10ops-eqiad: Degraded RAID on db1082 - https://phabricator.wikimedia.org/T258965 (10ops-monitoring-bot) [17:14:07] !log dpifke@deploy1001 Started deploy [performance/arc-lamp@f14888b]: Deploying arclamp-compress-logs (T235456) [17:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:12] !log dpifke@deploy1001 Finished deploy [performance/arc-lamp@f14888b]: Deploying arclamp-compress-logs (T235456) (duration: 00m 05s) [17:14:13] T235456: Let Arc-Lamp store its trace "log" files in compressed format - https://phabricator.wikimedia.org/T235456 [17:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:36] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission [17:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:40] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission 
(exit_code=99) [17:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:37] 10Operations, 10ops-eqiad: Degraded RAID on db1082 - https://phabricator.wikimedia.org/T258965 (10RhinosF1) T258910 / [17:21:32] 10Operations, 10ops-eqiad: Degraded RAID on db1082 - https://phabricator.wikimedia.org/T258965 (10Marostegui) 05Open→03Declined This is the BBU being removed onsite as part of T258910 [17:22:59] (03CR) 10Dzahn: [C: 03+2] airflow: Allow scap deploy user to set variables [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson) [17:23:22] ryankemper: ^ so that's that part [17:25:56] (03PS3) 10Dave Pifke: arclamp: run arclamp-compress-logs from cron [puppet] - 10https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T257931) [17:26:35] (03PS4) 10Dave Pifke: arclamp: run arclamp-compress-logs from cron [puppet] - 10https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T257931) [17:28:37] (03PS1) 10Dzahn: admins: turn all wdqs-admins into wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) [17:29:32] (03CR) 10Dzahn: [C: 04-1] "replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/616564" [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [17:29:37] (03CR) 10Dzahn: "replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/616564" [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [17:29:59] (03PS3) 10Dave Pifke: arclamp: Run & scrape Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/613359 (https://phabricator.wikimedia.org/T256035) [17:30:55] !log promoting train to group2 [17:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:26] (03PS2) 10Dzahn: admins: turn all wdqs-admins into wdqs-roots [puppet] - 
10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) [17:33:35] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Nuria) >Trust and Safety's job is to help to reset 2FA, passwords and similar, which is done via the maint... [17:34:32] (03PS3) 10Dzahn: admins: turn all wdqs-admins into wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) [17:35:38] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) [17:35:42] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Dzahn) 05Resolved→03Open [17:35:55] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Dzahn) a:05Nahid→03None [17:36:24] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) I've removed all the sites which are done, all that is left is some network cabling for codfw and the full update of eqiad. I've also removed myself as a... [17:37:48] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Dzahn) Hi, reopening the ticket to get this done. > Trust and Safety's job is to help to reset 2FA, pass... 
[17:38:39] (03PS1) 10Lars Wirzenius: all wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616565 [17:38:41] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616565 (owner: 10Lars Wirzenius) [17:39:36] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616565 (owner: 10Lars Wirzenius) [17:39:46] 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10Dzahn) Here's a new patch that moves all wdqs-admins to wdqs-roots and then also removes the wdqs-adm... [17:42:08] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.1 [17:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:25] (03CR) 10Krinkle: [C: 03+1] "Good to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T257931) (owner: 10Dave Pifke) [17:46:06] (03CR) 10Dzahn: [C: 03+2] arclamp: run arclamp-compress-logs from cron [puppet] - 10https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T257931) (owner: 10Dave Pifke) [17:46:43] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10jrbs) >>! In T256971#6338337, @Dzahn wrote: > Hi, reopening the ticket to get this done. > >> Trust and... 
[17:50:17] (03CR) 10Dzahn: "Notice: /Stage[main]/Arclamp/Cron[arclamp_compress_logs]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/616179 (https://phabricator.wikimedia.org/T257931) (owner: 10Dave Pifke) [17:54:13] (03PS1) 10Dzahn: admins: add nahidunlimited to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/616568 (https://phabricator.wikimedia.org/T256971) [17:54:18] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 4005 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [17:54:38] RECOVERY - Disk space on webperf1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=webperf1002&var-datasource=eqiad+prometheus/ops [17:58:39] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Urbanecm) Or perhaps implement resetUserEmail.php in MW interface, which would reduc... [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T1800). [18:00:05] Jdlrobson: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:44] present! 
[18:05:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:06:31] (03PS1) 10Ottomata: Remove now unused wgEventServiceStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616570 (https://phabricator.wikimedia.org/T229863) [18:07:16] RoanKattouw: Urbanecm are either of you around? [18:07:26] Oh sorry I'm here [18:07:28] Will do the deployment [18:07:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:09:06] (03CR) 10Catrope: [C: 03+2] Remove WPBSkinBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) (owner: 10Peter.ovchyn) [18:09:59] (03Merged) 10jenkins-bot: Remove WPBSkinBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609490 (https://phabricator.wikimedia.org/T254675) (owner: 10Peter.ovchyn) [18:11:02] PROBLEM - Host mw1411 is DOWN: PING CRITICAL - Packet loss = 100% [18:11:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:11:26] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (10wiki_willy) Hi @Vgutierrez - just wanted to follow up to see if you've seen any issues since....and if this can be closed now. Much appreciated. Thanks, Willy [18:11:29] Jdlrobson: That first patch probably doesn't need mwdebug testing right? 
[18:11:41] RoanKattouw: i can double check just in case but prob not [18:11:47] OK great [18:11:48] should be dead code [18:11:50] RECOVERY - Host mw1411 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:12:02] is mw1411 being reimaged? [18:12:51] (03CR) 10Ladsgroup: Initial configuration for avkwiki (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [18:13:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:13:28] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove WPBSkinBlacklist (T254675) (duration: 00m 57s) [18:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:34] T254675: Rename WPBSkinBlacklist - https://phabricator.wikimedia.org/T254675 [18:14:52] (03PS2) 10Catrope: Enable desktop click tracking instrumentation on (fr,he,fa)wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616125 (https://phabricator.wikimedia.org/T258058) (owner: 10Jdlrobson) [18:14:58] it looks like it crashed, but I can't tell why yet [18:15:13] (03CR) 10Catrope: [C: 03+2] Enable desktop click tracking instrumentation on (fr,he,fa)wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616125 (https://phabricator.wikimedia.org/T258058) (owner: 10Jdlrobson) [18:15:31] syslog file has the characteristic 'block of NULs' during the crash, [18:15:42] racadm getraclog shows [18:15:44] Timestamp = 2020-07-03 08:34:27 [18:15:46] Message = System CPU Resetting. [18:16:40] cdanis: it rebooted but was back before i got on mgmt [18:16:53] yeah, I haven't found anything obvious in terms of errors [18:16:54] ack, CPU resetting.. 
uhmm [18:17:02] (03PS3) 10Ayounsi: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641 [18:17:08] aside from that, which only shows in racadm getraclog, not in racadm getsel [18:17:24] (03CR) 10Urbanecm: Initial configuration for avkwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [18:17:31] (03PS4) 10Ayounsi: Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 [18:19:28] (03CR) 10Ladsgroup: Initial configuration for avkwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612897 (https://phabricator.wikimedia.org/T257943) (owner: 10Urbanecm) [18:20:19] (03Merged) 10jenkins-bot: Enable desktop click tracking instrumentation on (fr,he,fa)wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616125 (https://phabricator.wikimedia.org/T258058) (owner: 10Jdlrobson) [18:22:44] cdanis: line 9019 in /var/log/syslog "@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^.... " and then it just reboots [18:22:52] yeah, so that's an unclean reboot [18:23:22] and there's no mgmt login in the rac logs before the CPU reset event [18:25:40] RoanKattouw: shall i test? [18:25:50] One second [18:26:18] OK test now [18:27:13] cdanis: guess we can file it under "solar flare" or something unless it happens again [18:27:21] yah [18:27:31] also at least it was just an appserver and not a database or something 🙃 [18:27:34] RoanKattouw: lgtm [18:27:36] indeed :) [18:28:13] 10Operations, 10ops-eqiad: Remove db1082's BBU on-site - https://phabricator.wikimedia.org/T258910 (10Marostegui) 05Open→03Resolved Thanks - I can see it: ` root@db1082:~# hpssacli controller all show detail | grep -i battery No-Battery Write Cache: Disabled Battery/Capacitor Count: 0 ` Thank you gu... 
[18:28:16] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10Marostegui) [18:29:15] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable desktop web UI click tracking instrumentation on frwiki, hewiki, fawiki (T258058) (duration: 00m 56s) [18:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:21] T258058: Enable sidebar instrumentation on test wikis - https://phabricator.wikimedia.org/T258058 [18:29:59] OK all done [18:31:48] cool RoanKattouw . I'm going to monitor traffic to the endpoint so if you could keep yourself available for next 30 mins in the unlikely case there are any problems i'd appreciate it. Thank you so much! [18:32:55] 10Operations, 10ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (10RobH) I went ahead and powered it down and then back up via the PDU outlet control. It is now booted and waiting at login prompt, awaiting wipe: [18:34:02] (03PS1) 10Andrew Bogott: Ceph: temporarily hack the cluster network to include both old and new [puppet] - 10https://gerrit.wikimedia.org/r/616576 (https://phabricator.wikimedia.org/T258968) [18:35:47] 10Operations, 10ops-eqsin: Decommission cr1-eqsin - https://phabricator.wikimedia.org/T256947 (10RobH) a:03ayounsi I don't have network root, so reassigning to arzhel for wipe/factory defaults. [18:38:47] RoanKattouw: Jdlrobson: Hello, do you mind if I deploy something now? 
[18:39:19] Urbanecm: go for it [18:39:28] Jdlrobson: yeah I'll be around [18:39:35] thanks [18:42:06] (03PS3) 10Urbanecm: Move footer logos to wmg* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616544 (https://phabricator.wikimedia.org/T257732) [18:42:17] (03CR) 10Urbanecm: [C: 03+2] Move footer logos to wmg* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616544 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [18:43:05] (03Merged) 10jenkins-bot: Move footer logos to wmg* variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616544 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [18:47:46] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [18:48:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c6a9674366d9c8d273ce0e74dfb6a04c91d64307: Move footer logos to wmg* variables (T257732) (duration: 00m 57s) [18:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:06] T257732: Change the footer logos in Turkish Wikipedia - https://phabricator.wikimedia.org/T257732 [18:49:21] !log urbanecm@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 01s) [18:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:38] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [18:50:18] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 00m 57s) [18:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:50] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 
c6a9674366d9c8d273ce0e74dfb6a04c91d64307: Move footer logos to wmg* variables (T257732) (duration: 00m 56s) [18:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:53] (03PS3) 10Urbanecm: Move footer logos to /static/images/footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616545 (https://phabricator.wikimedia.org/T257732) [18:52:02] (03CR) 10Urbanecm: [C: 03+2] Move footer logos to /static/images/footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616545 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [18:52:53] (03Merged) 10jenkins-bot: Move footer logos to /static/images/footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616545 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [18:57:32] !log urbanecm@deploy1001 sync-file aborted: 3833b135caf4171daa0814eba81393b6c44db619: Move footer logos to /static/images/footer (T257732) (duration: 00m 04s) [18:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:38] T257732: Change the footer logos in Turkish Wikipedia - https://phabricator.wikimedia.org/T257732 [18:58:37] (03PS1) 10Urbanecm: Revert "Move footer logos to /static/images/footer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616502 (https://phabricator.wikimedia.org/T257732) [18:58:44] (03CR) 10Urbanecm: [C: 03+2] Revert "Move footer logos to /static/images/footer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616502 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [18:59:30] (03Merged) 10jenkins-bot: Revert "Move footer logos to /static/images/footer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616502 (https://phabricator.wikimedia.org/T257732) (owner: 10Urbanecm) [19:00:11] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[19:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:53] RoanKattouw: done for now [19:02:27] (03PS2) 10Ppchelko: Remove now unused wgEventServiceStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616570 (https://phabricator.wikimedia.org/T229863) (owner: 10Ottomata) [19:03:54] (03CR) 10Ppchelko: [C: 03+1] Remove now unused wgEventServiceStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616570 (https://phabricator.wikimedia.org/T229863) (owner: 10Ottomata) [19:05:20] RoanKattouw: I assume https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/616545 would actually require full sync, right? [19:06:38] I don't think it would [19:06:50] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . [19:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:04] 1) sync static/images ; 2) sync wmf-config/InitialiseSettings.php ; 3) purge those URLs with purgeList.php [19:07:09] (03PS1) 10Ladsgroup: labs: Enable itemquality model in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616580 [19:07:25] InitialiseSettings.php will be pointing to the wrong paths for a minute, but that's only for a short time and the Varnish cache should save us there [19:07:39] in that case, https://wikitech.wikimedia.org/wiki/How_to_deploy_code would then probably need an update, it seems [19:07:54] it says `However, scap sync-file is only capable of synchronizing files within directories that already exist on the cluster, so it won't work with newly added directories` [19:09:13] (03PS2) 10Andrew Bogott: Ceph: update network settings [puppet] - 10https://gerrit.wikimedia.org/r/616576 (https://phabricator.wikimedia.org/T258968) [19:09:34] (03CR) 10Andrew Bogott: [C: 04-2] "We don't want to merge this until the workload is moved off of the old osd nodes" [puppet] - 10https://gerrit.wikimedia.org/r/616576 
(https://phabricator.wikimedia.org/T258968) (owner: 10Andrew Bogott) [19:11:26] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:42] (03PS3) 10Andrew Bogott: Ceph: update network settings [puppet] - 10https://gerrit.wikimedia.org/r/616576 (https://phabricator.wikimedia.org/T258968) [19:11:44] (03PS1) 10Andrew Bogott: Ceph: make cloudcephosd1004, 5 and 6 osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616582 (https://phabricator.wikimedia.org/T258968) [19:11:46] (03PS1) 10SBassett: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) [19:11:46] I'll try that tomorrow anyway, let's see [19:13:00] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: make cloudcephosd1004, 5 and 6 osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/616582 (https://phabricator.wikimedia.org/T258968) (owner: 10Andrew Bogott) [19:14:42] (03CR) 10Ladsgroup: [C: 03+2] "noop for production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616580 (owner: 10Ladsgroup) [19:15:31] (03Merged) 10jenkins-bot: labs: Enable itemquality model in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616580 (owner: 10Ladsgroup) [19:16:00] (03PS4) 10Andrew Bogott: Ceph: update network settings [puppet] - 10https://gerrit.wikimedia.org/r/616576 (https://phabricator.wikimedia.org/T258968) [19:16:02] (03PS1) 10Andrew Bogott: Ceph: make cloudcephosd1004, 1005, 1006 into OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/616584 (https://phabricator.wikimedia.org/T258968) [19:16:10] (03CR) 10SBassett: [C: 04-1] "Hold for config deploy during today's security window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) (owner: 10SBassett) [19:16:48] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: make cloudcephosd1004, 1005, 1006 into OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/616584 (https://phabricator.wikimedia.org/T258968) (owner: 10Andrew Bogott) [19:19:33] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [19:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:01] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/616524 (https://phabricator.wikimedia.org/T257016) (owner: 10Herron) [19:23:55] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [19:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:32] (03CR) 10Jeena Huneidi: [C: 03+2] Update blubberoid to 2020-07-24-194337-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/616158 (https://phabricator.wikimedia.org/T254629) (owner: 10Ahmon Dancy) [19:30:35] (03Merged) 10jenkins-bot: Update blubberoid to 2020-07-24-194337-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/616158 (https://phabricator.wikimedia.org/T254629) (owner: 10Ahmon Dancy) [19:34:09] (03PS1) 10BryanDavis: dynamicproxy: update error pages [puppet] - 10https://gerrit.wikimedia.org/r/616585 [19:34:11] (03PS1) 10BryanDavis: dynamicproxy: Add custom response for missing proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/616586 (https://phabricator.wikimedia.org/T258730) [19:35:44] (03PS1) 10Ladsgroup: labs: Configure "itemquality" model in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616587 [19:35:50] 10Operations, 10wmf-sre-laptop: Split SRE-specific components into an SRE sub-package; create sub-packages for other teams as well - 
https://phabricator.wikimedia.org/T256872 (10herron) p:05Triage→03Medium [19:36:12] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [19:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:43] 10Operations, 10serviceops, 10Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762 (10herron) p:05Triage→03Medium [19:37:34] 10Operations, 10Wikimedia-Mailing-lists: Puppetize mailman3 web and hyperkitty (mailman archiver) - https://phabricator.wikimedia.org/T256542 (10herron) p:05Triage→03Low [19:38:35] 10Operations, 10Wikimedia-Mailing-lists: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (10herron) p:05Triage→03Low [19:44:46] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [19:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:14] (03PS2) 10BryanDavis: dynamicproxy: update error pages [puppet] - 10https://gerrit.wikimedia.org/r/616585 [19:45:16] (03PS2) 10BryanDavis: dynamicproxy: Add custom response for missing proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/616586 (https://phabricator.wikimedia.org/T258730) [19:50:54] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[19:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:06] (03PS3) 10BryanDavis: dynamicproxy: Add custom response for missing proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/616586 (https://phabricator.wikimedia.org/T258730) [19:55:27] (03CR) 10BryanDavis: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) (owner: 10SBassett) [19:56:54] (03CR) 10BryanDavis: "PCC for this patch and its related parent: https://puppet-compiler.wmflabs.org/compiler1002/24154/" [puppet] - 10https://gerrit.wikimedia.org/r/616586 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis) [19:57:56] (03CR) 10BryanDavis: "PCC for this patch and its related child: https://puppet-compiler.wmflabs.org/compiler1002/24154/" [puppet] - 10https://gerrit.wikimedia.org/r/616585 (owner: 10BryanDavis) [19:58:41] |] [20:00:05] halfak and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T2000). [20:04:09] 10Operations, 10Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (10herron) p:05Triage→03Low [20:07:27] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10Dzahn) >>! In T256971#6338371, @jrbs wrote: >>>! In T256971#6338337, @Dzahn wrote: >... 
[20:13:05] (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616587 (owner: 10Ladsgroup) [20:13:35] (03Merged) 10jenkins-bot: labs: Configure "itemquality" model in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616587 (owner: 10Ladsgroup) [20:14:08] rebased ^ [20:15:27] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Special:HideBanners is not really cacheable - https://phabricator.wikimedia.org/T256447 (10herron) p:05Triage→03Medium [20:15:38] (03CR) 10Bstorm: [C: 03+2] wiki-replicas: Add clouddb naming to regexes [puppet] - 10https://gerrit.wikimedia.org/r/615823 (https://phabricator.wikimedia.org/T257987) (owner: 10Bstorm) [20:18:27] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Puppet failure on deployment-cache-text06 - https://phabricator.wikimedia.org/T256064 (10herron) p:05Triage→03Medium [20:19:13] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10herron) 05Open→03Stalled p:05Triage→03Medium [20:20:31] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering: Replace Liberation 1 fonts with Liberation 2 for svg rendering - https://phabricator.wikimedia.org/T253600 (10herron) p:05Triage→03Medium [20:22:12] 10Operations, 10Wikimedia-Mailing-lists, 10Accessibility: Pipermail uses background color without foreground colors - https://phabricator.wikimedia.org/T190061 (10herron) p:05Triage→03Low [20:24:03] 10Operations, 10Puppet: Puppet CI should fail over CRLF line endings (sometimes) - https://phabricator.wikimedia.org/T182641 (10herron) p:05Triage→03Medium [20:24:20] 10Operations, 10Wikimedia-Mailing-lists, 10Accessibility: Pipermail uses background color without foreground colors - https://phabricator.wikimedia.org/T190061 (10Ladsgroup) Maybe we should just decline this. 
The whole thing is going to be replaced with the new version of mailman. I'm not sure if it's worth... [20:24:52] (03PS2) 10SBassett: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) [20:25:18] (03PS3) 10SBassett: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) [20:26:35] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10herron) p:05Triage→03Low [20:27:04] (03PS1) 10Urbanecm: Use LanguageLinksHook to sort interwiki links [extensions/InterwikiSorting] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616504 (https://phabricator.wikimedia.org/T257625) [20:28:02] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10Dzahn) a:03Dzahn ok, biting [20:28:04] 10Operations: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984 (10herron) p:05Triage→03Medium [20:33:41] (03CR) 10Urbanecm: [C: 03+2] Use LanguageLinksHook to sort interwiki links [extensions/InterwikiSorting] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616504 (https://phabricator.wikimedia.org/T257625) (owner: 10Urbanecm) [20:35:48] 10Operations, 10Mail, 10OTRS, 10Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (10Dzahn) gentle ping. any updates here from Privacy team? 
[20:37:11] (03Merged) 10jenkins-bot: Use LanguageLinksHook to sort interwiki links [extensions/InterwikiSorting] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/616504 (https://phabricator.wikimedia.org/T257625) (owner: 10Urbanecm) [20:40:58] (03PS1) 10Dwisehaupt: Remove fr-tech-ops contact where fr-tech is specified [puppet] - 10https://gerrit.wikimedia.org/r/616592 [20:41:23] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/InterwikiSorting/: c5f6c97856a5dbe673064afd2804bebb9b787580: Use LanguageLinksHook to sort interwiki links (T257625) (duration: 00m 59s) [20:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:30] T257625: Interwiki sorting broken - https://phabricator.wikimedia.org/T257625 [20:41:50] Amir1: it seems to work at cswiki! [20:42:11] did you check xhgui? [20:42:32] (03CR) 10Krinkle: [C: 03+1] "This LGTM, but I note that this might be the first time we use this on a "less important" apache server. I don't know if that would interf" [puppet] - 10https://gerrit.wikimedia.org/r/608973 (https://phabricator.wikimedia.org/T215740) (owner: 10Dave Pifke) [20:43:32] Amir1: no :(. 
Right now, I can't eve get it running [20:43:41] it shows an internal error https://usercontent.irccloud-cdn.com/file/wNthEM7l/image.png [20:44:07] you need to click on the timestamp :D [20:44:32] stupid interface :D [20:44:47] https://performance.wikimedia.org/xhgui/run/view?id=5f1f3ca03f3dfa107c2c4c6c, that's better :) [20:45:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:45:33] (03Abandoned) 10Dwisehaupt: Remove fr-tech-ops contact where fr-tech is specified [puppet] - 10https://gerrit.wikimedia.org/r/616592 (owner: 10Dwisehaupt) [20:45:52] 10Operations, 10Traffic: planet.wm.org missing from planet.discovery.wmnet Subject Alternative Name - https://phabricator.wikimedia.org/T257840 (10Dzahn) 05Resolved→03Open p:05Medium→03Low [20:46:06] (03CR) 10Herron: "Thanks for this! Please see comments inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:46:09] 10Operations, 10Traffic: fix planet.wm.org redirect nitpick (was: missing from planet.discovery.wmnet Subject Alternative Name) - https://phabricator.wikimedia.org/T257840 (10Dzahn) [20:46:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:47:28] Amir1: it threw an icinga alert, but that was expected, because it was syncing files at that time [20:47:59] Urbanecm: next time please don't sync the whole extension in one go in these cases, it causes a huge spike of errors. 
Sync it in a way that's backward compatible if possible (sometimes it's not) [20:48:49] e.g. first sync https://gerrit.wikimedia.org/r/c/mediawiki/extensions/InterwikiSorting/+/616504/1/src/LanguageLinksHandler.php [20:48:53] noted, thanks [20:49:09] (03CR) 10Herron: [C: 03+1] admins: add nahidunlimited to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/616568 (https://phabricator.wikimedia.org/T256971) (owner: 10Dzahn) [20:49:13] then extension.json, and then the rest, right? [20:49:24] extactly [20:49:30] *exactly [20:49:54] (03CR) 10Dzahn: admins: turn all wdqs-admins into wdqs-roots (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:50:05] now should we clean up the hacks? [20:50:45] * Urbanecm looking at the xhgui profiling, but I'm not 100% sure it looks normally [20:50:51] (03CR) 10Herron: [C: 03+2] "Moving forward with this as the access was requested in the description of the linked task, and approved there." 
[puppet] - 10https://gerrit.wikimedia.org/r/616568 (https://phabricator.wikimedia.org/T256971) (owner: 10Dzahn) [20:51:24] (03CR) 10Dzahn: "> Also, I see some a few other references to wdqs-admins in puppet, but afaict are independent from the shell user (monitoring and cron co" [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [20:51:53] onlanguagelinks seems to complete fast (4 µs), through the whole hook thing takes significantly longer [20:52:01] (at the very least, I don't think I affected that) [20:52:09] so, I think so, yes, it's time for the hacks [20:53:31] PROBLEM - puppet last run on otrs1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:53:36] (03PS1) 10Herron: Revert "admins: add nahidunlimited to restricted group" [puppet] - 10https://gerrit.wikimedia.org/r/616505 [20:54:04] (03CR) 10Herron: [C: 03+2] Revert "admins: add nahidunlimited to restricted group" [puppet] - 10https://gerrit.wikimedia.org/r/616505 (owner: 10Herron) [20:55:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:56:51] (03PS1) 10Herron: admins: add nahidunlimited to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/616606 (https://phabricator.wikimedia.org/T256971) [20:57:35] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:58:12] (03PS1) 10Dzahn: admins: add all members of wdqs-admins to wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616593 (https://phabricator.wikimedia.org/T258739) [20:58:45] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10herron) >>! In T256971#6338903, @gerritbot wrote: > Change 616568 **merged** by Herr... [20:59:07] (03CR) 10Dzahn: "lgtm but the same as the previous one you reverted?" [puppet] - 10https://gerrit.wikimedia.org/r/616606 (https://phabricator.wikimedia.org/T256971) (owner: 10Herron) [20:59:44] (03CR) 10Dzahn: [C: 03+1] "gotcha" [puppet] - 10https://gerrit.wikimedia.org/r/616606 (https://phabricator.wikimedia.org/T256971) (owner: 10Herron) [20:59:49] (03CR) 10Herron: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/616606 (https://phabricator.wikimedia.org/T256971) (owner: 10Herron) [21:00:04] Reedy and sbassett: May I have your attention please! Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T2100) [21:01:00] (03PS2) 10Dzahn: admins: add all members of wdqs-admins to wdqs-roots [puppet] - 10https://gerrit.wikimedia.org/r/616593 (https://phabricator.wikimedia.org/T258739) [21:01:12] (03CR) 10Dzahn: [C: 04-2] "let's start with just this https://gerrit.wikimedia.org/r/c/operations/puppet/+/616593" [puppet] - 10https://gerrit.wikimedia.org/r/616564 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn) [21:02:58] * sbassett has two sec patches for today: T238075 and https://gerrit.wikimedia.org/r/616583 [21:05:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:07:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:07:23] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:02] !log Deployed mitigations for T238075 [21:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:33] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:44] (03PS4) 10SBassett: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) [21:27:25] (03CR) 10SBassett: [C: 03+2] Adding blob: to CentralNoticeContentSecurityPolicy script-src directive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) (owner: 10SBassett) [21:28:15] (03Merged) 10jenkins-bot: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) (owner: 10SBassett) [21:28:24] (03CR) 10SBassett: Adding blob: to CentralNoticeContentSecurityPolicy script-src directive (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616583 (https://phabricator.wikimedia.org/T258459) (owner: 10SBassett) [21:28:34] (03CR) 10CRusnov: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [21:31:45] !log sbassett@deploy1001 Synchronized wmf-config/CommonSettings.php: Deployed CentralNotice CSP conifg change for T258459 (duration: 00m 57s) [21:31:49] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:56] T258459: Uncaught SecurityError: Failed to construct 'Worker': Access to the script at 'blob:https://ca.wikipedia.org/92522a15-8318-403f-bb45-8e554fc893c0' is denied by the document's Content Security Policy. 
- https://phabricator.wikimedia.org/T258459 [21:36:38] (03Abandoned) 10SBassett: Enable wgBreakFrames across all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606723 (https://phabricator.wikimedia.org/T255881) (owner: 10SBassett) [21:37:28] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:55] PROBLEM - Check whether ferm is active by checking the default input chain on cloudcephosd1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:45:37] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10herron) Hi @thcipriani, @greg, could you please review the portion of this request r... [21:50:41] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:57] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:11:09] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10thcipriani) >>! In T256971#6339049, @herron wrote: > Hi @thcipriani, @greg, could yo... 
[22:21:50] (03PS1) 10Ebernhardson: increment extra plugin to 6.5.4-wmf-11 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 [22:46:45] RECOVERY - Check whether ferm is active by checking the default input chain on cloudcephosd1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:51:05] (03PS4) 10Jdrewniak: Enable desktop improvements by default for testing group (round 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614890 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [23:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200727T2300). [23:00:04] Seddon: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:02:11] (03PS3) 10Dzahn: phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796 (https://phabricator.wikimedia.org/T238593) [23:05:16] (03PS4) 10Dzahn: phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796 (https://phabricator.wikimedia.org/T238593) [23:08:13] (03CR) 10Dzahn: [C: 03+2] phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [23:14:29] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:54] (03PS1) 10Dzahn: aphlict: do not attempt to remove log and run dirs when absent [puppet] - 10https://gerrit.wikimedia.org/r/616627 [23:20:40] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10Papaul) [23:21:47] ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn removing aphlict https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:53] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [23:23:11] (03CR) 10Dzahn: [C: 03+2] aphlict: do not attempt to remove log and run dirs when absent [puppet] - 10https://gerrit.wikimedia.org/r/616627 (owner: 10Dzahn) [23:23:39] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10Papaul) [23:23:45] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:41] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [23:26:44] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: update network settings [puppet] - 10https://gerrit.wikimedia.org/r/616576 (https://phabricator.wikimedia.org/T258968) (owner: 10Andrew Bogott) [23:28:53] !log otrs1001 - ran puppet (it was alerting in icinga that puppet failed, but it was neither disabled nor failing and changed nothing when it ran) [23:28:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:49] Ooooops I missed the backport [23:31:36] I am here and will be for the next half hour but I'll reschedule for the morning [23:32:13] RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:36:57] ACKNOWLEDGEMENT - exim queue on mx1001 is CRITICAL: CRITICAL: 6153 mails in exim queue. Herron The mail system on mx1001 itself is healthy. A single remote recipient has 5700 deferred messages in the queue due to temp fail on their host: SMTP error from remote mail server after end of data: 451 temporary failure for one or more recipients redacted@hrs.ilga.gov:deferred. should clear without action when the remote mail resumes del [23:36:57] kitech.wikimedia.org/wiki/Exim [23:38:47] ACKNOWLEDGEMENT - Long running screen/tmux on netbox1001 is CRITICAL: CRIT: Long running tmux process. (user: crusnov PID: 17784, 1807729s 1728000s).
daniel_zahn announced migration ongoing https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [23:40:19] (03PS1) 10Tim Starling: Re-enable LilyPond/Score in safe mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616628 [23:48:48] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ac8e5d0]: airflow: head queries report, managed variables, refinery-drop-hive-partitions support [23:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:13] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10herron) Thanks! Moving forward with this now [23:49:34] (03PS2) 10Herron: admins: add nahidunlimited to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/616606 (https://phabricator.wikimedia.org/T256971) [23:49:42] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ac8e5d0]: airflow: head queries report, managed variables, refinery-drop-hive-partitions support (duration: 00m 54s) [23:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:20] (03CR) 10Herron: [C: 03+2] admins: add nahidunlimited to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/616606 (https://phabricator.wikimedia.org/T256971) (owner: 10Herron) [23:54:30] (03PS3) 10Dzahn: ATS: add new backend for phabricator aphlict [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593) [23:59:19] 10Operations, 10SRE-Access-Requests, 10Trust-and-Safety, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10herron) 05Open→03Resolved a:03herron The patch adding `nahidunlimited` to grou...