[00:07:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [00:07:30] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [00:11:40] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [00:12:19] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:37:39] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [00:38:10] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [00:41:40] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [00:42:19] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:07:20] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [01:07:50] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [01:11:39] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:12:00] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [01:37:39] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [01:38:09] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [01:41:40] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:42:19] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [02:07:19] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [02:07:49] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [02:11:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [02:11:59] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:28:30] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 08m 38s) [02:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:30] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [02:38:00] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [02:38:46] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Aug 16 02:38:46 UTC 2018 (duration 10m 17s) [02:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:40] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [02:42:10] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:50:29] 10Operations: Feedback Appreciatted: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Legoktm) a:05Akondrahman>03None I'm not sure what kind of a useful answer you're going to get...I suspect each case has a different answer/reason. For ~/vagrant, it's used as a development tool on indi... 
[03:07:39] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [03:08:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [03:11:50] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:12:19] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:26:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 861.39 seconds [03:37:29] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [03:38:00] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [03:41:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 270.29 seconds [03:41:39] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:42:10] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:07:40] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [04:08:10] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [04:12:00] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:12:29] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:37:30] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [04:38:10] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [04:41:49] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:42:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:07:30] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [05:07:59] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [05:08:00] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:09:09] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:11:50] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:12:47] <_joe_> !log moving away corrupted commitlog file on aqs1007 cassandra-a instance, trying to restart it [05:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:59] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [05:15:49] RECOVERY - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is OK: TCP OK - 0.000 second response time on 10.64.0.213 port 9042 [06:28:59] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/jobrunner.svc.eqiad.wmnet.crt] [06:29:10] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R] [06:52:29] 10Operations, 10Cassandra: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 (10ema) 05Open>03Resolved a:03ema @Joe removed the log and restarted cassandra-a. The service seems now to be working fine. 
``` 05:12 _joe_: moving away corrupted commitlog file on aqs1... [06:54:41] <_joe_> heh sorry, but I opened the file [06:54:52] <_joe_> and it was all zeroes [06:55:03] <_joe_> so there was really nothing that could be done with it [06:59:19] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:20] 10Operations, 10Cassandra: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 (10Joe) For the record: I removed the file (still on disk at `/srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log.bak` once I noticed it was all zeroes. Since there was no real informat... [06:59:29] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:34] _joe_: thanks for taking care of that! [07:03:56] <_joe_> ema: yeah well, I just saw this system alarming all night long and tried to fix it, I can't say I'm sure what I did was 100% correct but an all-zeroes file can't really do anything meaningful IMHO [07:04:26] 10Operations: Feedback Appreciatted: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Aklapper) Proposing to close this task as invalid as it's vague and not actionable. Please also read and understand T201576#4490641. Dropping automatically created lists of http links without any further... 
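The observation above that the commitlog file "was all zeroes" can be verified without opening the file by hand. The following is a minimal Python sketch, not what _joe_ actually ran: the function name `is_all_zeroes` and the chunk size are illustrative assumptions, and only the backup path in the usage note comes from the log.

```python
def is_all_zeroes(path, chunk_size=1024 * 1024):
    """Return True if every byte of the file at `path` is 0x00.

    The file is read in chunks so that large files do not have to fit
    in memory. An empty file contains no non-zero byte, so it also
    returns True.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            # bytes.count(0) counts the zero bytes in this chunk.
            if chunk.count(0) != len(chunk):
                return False
```

For example, `is_all_zeroes("/srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log.bak")` would be the call for the backup file mentioned in T201986. Whether discarding such a file is safe for Cassandra is a separate question that this check does not answer.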
[07:05:00] 10Puppet: Suspicious Comments in Puppet Scripts - https://phabricator.wikimedia.org/T201576 (10Aklapper) [07:22:20] (03CR) 10Volans: [V: 032 C: 032] LDAP: allow to specify multiple search strings [software/debmonitor] - 10https://gerrit.wikimedia.org/r/452686 (owner: 10Volans) [07:24:39] !log rebooting install2002 for kernel security update [07:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:34] !log rebooting install1002 for kernel security update [07:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:39] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [07:33:46] (03PS1) 10Volans: Updated src to v0.1.8 and rebuilt wheels [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/453092 [07:37:40] (03CR) 10Volans: [V: 032 C: 032] Updated src to v0.1.8 and rebuilt wheels [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/453092 (owner: 10Volans) [07:40:12] (03PS2) 10Giuseppe Lavagetto: PHP: create module for modern Debian-based distributions [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) [07:40:14] (03PS1) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [07:41:11] (03CR) 10jerkins-bot: [V: 04-1] PHP: create module for modern Debian-based distributions [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [07:41:50] 10Operations, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Vgutierrez) @MoritzMuehlenhoff ack, 
thanks for pinging us [07:43:34] (03PS1) 10Gehel: elasticsearch: storage device name changed with new partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/453094 (https://phabricator.wikimedia.org/T198391) [07:45:28] !log volans@deploy1001 Started deploy [debmonitor/deploy@1f01fd1]: Release v0.1.8 [07:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:59] !log volans@deploy1001 Finished deploy [debmonitor/deploy@1f01fd1]: Release v0.1.8 (duration: 00m 31s) [07:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:30] (03PS1) 10Volans: debmonitor: allow access to WMF+NDA groups [puppet] - 10https://gerrit.wikimedia.org/r/453096 [07:49:14] !log reimaging elastic10(23|24) [07:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:43] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1023.eqiad.wmnet', 'elastic1024... 
[07:52:26] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/453096 (owner: 10Volans) [07:53:23] (03CR) 10Volans: [C: 032] debmonitor: allow access to WMF+NDA groups [puppet] - 10https://gerrit.wikimedia.org/r/453096 (owner: 10Volans) [08:03:04] 10Operations, 10ops-codfw, 10Traffic: Decommission baham - https://phabricator.wikimedia.org/T199247 (10Vgutierrez) [08:03:11] 10Operations, 10ops-codfw, 10Traffic, 10decommission: Decommission baham - https://phabricator.wikimedia.org/T199247 (10Vgutierrez) [08:03:49] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:05:23] 10Operations, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Vgutierrez) 05Open>03Resolved [08:05:36] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Vgutierrez) [08:06:49] (03CR) 10Jcrespo: [C: 04-1] "Public exposure of a credential- please change it and document it on the private repo only." [puppet] - 10https://gerrit.wikimedia.org/r/452997 (owner: 10Andrew Bogott) [08:08:55] (03PS1) 10Vgutierrez: authdns: Remove radon from the authdns host list [puppet] - 10https://gerrit.wikimedia.org/r/453099 (https://phabricator.wikimedia.org/T202040) [08:12:36] (03CR) 10Vgutierrez: [C: 032] authdns: Remove radon from the authdns host list [puppet] - 10https://gerrit.wikimedia.org/r/453099 (https://phabricator.wikimedia.org/T202040) (owner: 10Vgutierrez) [08:12:41] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1023.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['elastic1023.eqiad.wmnet... 
[08:12:44] (03PS2) 10Vgutierrez: authdns: Remove radon from the authdns host list [puppet] - 10https://gerrit.wikimedia.org/r/453099 (https://phabricator.wikimedia.org/T202040) [08:13:55] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1023.eqiad.wmnet'] ``` The log... [08:14:10] PROBLEM - Check systemd state on elastic1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:15:10] RECOVERY - Check systemd state on elastic1024 is OK: OK - running: The system is fully operational [08:16:39] !log upgrading wikidiff to 1.7.2 on mw1334-mw1338/mw1307/mw1318/ (HHVM bytecode cache is pruned during update) [08:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:26] (03PS1) 10Vgutierrez: site: Reimage radon as stretch spare system [puppet] - 10https://gerrit.wikimedia.org/r/453100 (https://phabricator.wikimedia.org/T202040) [08:32:55] !log uploaded jenkins 2.121.3 to apt.wikimedia.org (for jessie and stretch) [08:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:02] (03CR) 10Vgutierrez: [C: 032] site: Reimage radon as stretch spare system [puppet] - 10https://gerrit.wikimedia.org/r/453100 (https://phabricator.wikimedia.org/T202040) (owner: 10Vgutierrez) [08:35:03] !log Reimaging radon as spare system - T202040 [08:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:10] T202040: Decommission radon - https://phabricator.wikimedia.org/T202040 [08:37:18] !log reboot cp2009 for kernel upgrade [08:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:57] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) 
- https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1023.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['elastic1023.eqiad.wmnet... [08:39:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (10daniel) Is the GPG signature I added to the description sufficient? If not, I'll be in the WMDE office in a couple of hours, so I could do a quick hangout. [08:40:18] !log upgrading wikidiff to 1.7.2 on mw1319-mw1333 (HHVM bytecode cache is pruned during update) [08:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:04] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` radon.wikimedia.org ``` The log can be found in `/var/... 
[08:58:06] (03PS10) 10Jcrespo: db backup statistics: Initial implementation of the backup stats [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) [09:07:05] (03CR) 10Jcrespo: "> I don't know if it is planned but being able to specify a wiki to" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [09:14:34] (03CR) 10Jcrespo: "> I like the abstraction level of "section" so at restore time we can" [puppet] - 10https://gerrit.wikimedia.org/r/449681 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [09:16:35] !log reimaging elastic1022 [09:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:56] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1022.eqiad.wmnet'] ``` The log... [09:19:06] 10Operations, 10Traffic: cp3032 PS Redundancy Lost - https://phabricator.wikimedia.org/T202046 (10ema) [09:19:35] 10Operations, 10Traffic: cp3032 PS Redundancy Lost - https://phabricator.wikimedia.org/T202046 (10ema) p:05Triage>03Normal [09:19:55] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['radon.wikimedia.org'] ``` and were **ALL** successful. 
[09:20:08] 10Operations, 10ops-esams, 10Traffic: cp3032 PS Redundancy Lost - https://phabricator.wikimedia.org/T202046 (10ema) [09:20:30] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema https://phabricator.wikimedia.org/T202046 [09:23:15] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission radon - https://phabricator.wikimedia.org/T202040 (10Vgutierrez) [09:24:22] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3049.esams.wmnet', 'cp2001.codfw.wmnet'] ``` The log can be found in `/var/l... [09:26:08] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp4023.ulsfo.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808... [09:26:44] (03PS7) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [09:27:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [09:30:31] (03CR) 10MarcoAurelio: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/294679 (owner: 10Gehel) [09:31:51] (03CR) 10jerkins-bot: [V: 04-1] Fixed typo (cron instead of from). 
[puppet] - 10https://gerrit.wikimedia.org/r/294679 (owner: 10Gehel) [09:34:57] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 60 not-conn: cp3049_v4, cp3049_v6, cp4023_v4, cp4023_v6 [09:35:38] (03PS2) 10Muehlenhoff: Tweak fragmentation memory limits [puppet] - 10https://gerrit.wikimedia.org/r/452901 (https://phabricator.wikimedia.org/T201608) [09:36:32] (03CR) 10Ema: [C: 031] "We could mention the previous defaults for reference, LGTM otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452901 (https://phabricator.wikimedia.org/T201608) (owner: 10Muehlenhoff) [09:38:36] ACKNOWLEDGEMENT - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 60 not-conn: cp3049_v4, cp3049_v6, cp4023_v4, cp4023_v6 Ema reimaging [09:39:09] (03PS3) 10Muehlenhoff: Tweak fragmentation memory limits [puppet] - 10https://gerrit.wikimedia.org/r/452901 (https://phabricator.wikimedia.org/T201608) [09:40:25] RECOVERY - Device not healthy -SMART- on cp2009 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp2009&var-datasource=codfw%2520prometheus%252Fops [09:41:34] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1022.eqiad.wmnet'] ``` and were **ALL** successful. [09:42:13] (03CR) 10Muehlenhoff: [C: 032] Tweak fragmentation memory limits [puppet] - 10https://gerrit.wikimedia.org/r/452901 (https://phabricator.wikimedia.org/T201608) (owner: 10Muehlenhoff) [09:46:28] PROBLEM - Check systemd state on labvirt1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:46:34] !log all elasticsearch nodes reimaged (except elastic1029, waiting on memory issue) - T198391 / T193649 / T201991 [09:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:44] T201991: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 [09:46:45] T198391: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 [09:46:46] T193649: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 [09:53:31] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [09:53:44] ` systemd-sysctl[49685]: Couldn't write '262144' to 'net/ipv6/ip6frag_high_thresh', ignoring: Invalid argument` [09:53:47] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Volans) I've added Effie to the "wmf" LDAP group. [09:53:54] moritzm: could this be related to some kernel upgrade? [09:53:57] is on labvirt1016 [09:55:12] ACKNOWLEDGEMENT - Check systemd state on labvirt1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez Looking [09:55:34] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2001.codfw.wmnet', 'cp3049.esams.wmnet'] ``` and were **ALL** successful. [09:57:21] PROBLEM - Check systemd state on labtestmetal2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:58:14] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4023.ulsfo.wmnet'] ``` and were **ALL** successful. 
[09:58:20] arturo: looking [09:58:42] PROBLEM - Check systemd state on labtestvirt2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:59:42] PROBLEM - Check systemd state on cloudvirt1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:59:53] ACKNOWLEDGEMENT - Check systemd state on labtestmetal2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez looking [10:00:41] !log add jiji to the 'ops' LDAP group - T201849 [10:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:54] T201849: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 [10:01:06] arturo: seems to be limited to the new jessie-based labvirts, right? [10:01:33] on the trusty ones, the settings have been successfully applied [10:02:30] moritzm: ok, could we add an `if` switch? [10:02:42] on 1016 all the values have been set, but for some reason it failed to apply ip6frag_high_thresh [10:02:55] arturo: let's rather fix the bug and bring them in line [10:03:33] setting the sysctl value manually via "sysctl -w" also works fine [10:04:57] I'm running puppet on 1018 to see whether it also happens there [10:06:02] moritzm: I just did a simple `sudo systemctl restart systemd-sysctl.service` and now the unit is not failing -_- [10:06:04] RECOVERY - Check systemd state on labvirt1016 is OK: OK - running: The system is fully operational [10:06:17] worked fine on 1018 as well running puppet [10:06:23] PROBLEM - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
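The check discussed above (which sysctl values were actually applied on a host) could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure used in the channel: `sysctl_path`, `find_unapplied` and the `root` parameter are hypothetical names, and the only facts taken from the log are the key `net/ipv6/ip6frag_high_thresh` and the value `262144`. It relies on the standard Linux mapping of sysctl keys to files under `/proc/sys`.

```python
import os


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key to its file under `root`.

    Accepts the slash form seen in the systemd-sysctl message and the
    dotted form used with `sysctl -w`. Assumption: no key component
    itself contains a dot (some interface names do).
    """
    return os.path.join(root, *key.replace(".", "/").split("/"))


def find_unapplied(desired, root="/proc/sys"):
    """Return {key: (wanted, actual)} for settings whose live value differs.

    `actual` is None when the file cannot be read, for example because
    the key does not exist on this kernel.
    """
    mismatches = {}
    for key, wanted in desired.items():
        try:
            with open(sysctl_path(key, root)) as f:
                actual = f.read().strip()
        except OSError:
            actual = None
        if actual != str(wanted):
            mismatches[key] = (str(wanted), actual)
    return mismatches
```

Calling `find_unapplied({"net/ipv6/ip6frag_high_thresh": 262144})` on an affected host would return an empty dict once the value has been applied. The `root` parameter exists only so the function can be exercised against a scratch directory instead of the live `/proc/sys`.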
[10:09:32] arturo: it seems all the affected hosts run systemd from jessie-backports, that's what I mentioned in the ticket about too loose pinning, we should strictly only pull in the OpenStack packages from jessie-backports [10:10:48] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10faidon) @RobH, @cmjohnson, this has been open for two months now -- why is this taking such a long time to resolve? [10:10:56] I'll fix up the systemd-sysctl status where it failed, these settings won't be re-set again, as the new jessie kernel reduces the default value [10:11:37] the sysctl application via puppet does the same but avoids another round of reboots; it's effectively a one-time effort until the servers are rebooted again [10:11:59] arturo: we are getting labcontrol1001 cronspam [10:13:00] paravoid: ack I saw it this morning [10:14:03] RECOVERY - Check systemd state on cloudvirt1022 is OK: OK - running: The system is fully operational [10:15:12] RECOVERY - Check systemd state on labtestvirt2003 is OK: OK - running: The system is fully operational [10:15:33] RECOVERY - Check systemd state on labtestcontrol2003 is OK: OK - running: The system is fully operational [10:15:43] arturo: ^should be all sorted [10:18:56] systemd from jessie-backports? ouch! [10:27:04] ACKNOWLEDGEMENT - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Gehel data import in progress [10:27:15] <_joe_> !log restarting cpjobqueue on scb1002, not listening on its tcp port [10:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:55] (03PS1) 10Jijiki: admin: added user jiji to ops group [puppet] - 10https://gerrit.wikimedia.org/r/453107 (https://phabricator.wikimedia.org/T201849) [10:30:23] <_joe_> !log restarting changeprop on scb1002, not listening on its tcp port [10:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:31] 10Operations, 10ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T201761 (10jcrespo) 05Open>03Resolved Thanks, ``` root@db2039:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli Smart Array P420i in Slot 0 (Embedded) array A Logical Drive: 1 Size: 3.... [10:34:42] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:36:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:42:46] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12105/puppetmaster1001.eqiad.wmnet/ looks good, merging" [puppet] - 10https://gerrit.wikimedia.org/r/453107 (https://phabricator.wikimedia.org/T201849) (owner: 10Jijiki) [10:43:47] !log upgrading wikidiff to 1.7.2 on mw1285-mw1290 and mw1312-mw1317 (HHVM bytecode cache is pruned during update) [10:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:43] (03PS8) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [10:53:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Refactor 
certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [10:55:26] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10jcrespo) [11:00:06] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T1100). [11:00:06] No GERRIT patches in the queue for this window AFAICS. [11:00:57] Can I push a change for SWAT? [11:02:20] zeljkof: ^ [11:02:21] Amir1: go ahead, I'm on vacation :) [11:02:29] oh nice, enjoy! [11:03:36] !log manually delete glance rsync image cronjob from the glancesync user in labcontrol1001.wikimedia.org (leftover after glance merge in main/eqiad1) [11:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:05] !log upgrading wikidiff to 1.7.2 on mw1339-mw1348 (HHVM bytecode cache is pruned during update) [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:30] (03PS9) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [11:12:21] !log stopping db2042 for maintenance T202051 [11:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] T202051: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 [11:31:28] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.16/maintenance/populateChangeTagDef.php: SWAT: [[gerrit:452950|Add option to populateChangeTagDef not to update the count]] (duration: 00m 53s) [11:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:33] !log EU mid-day SWAT is done [11:34:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:41] !log T201473 copy `prometheus-pdns-exporter` from trusty-wikimedia to jessie-wikimedia in reprepro [11:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:48] T201473: prometheus-pdns-exporter for Jessie? - https://phabricator.wikimedia.org/T201473 [11:45:48] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10jcrespo) 05Open>03Resolved a:03jcrespo Solved with a reboot, let's reopen if it happens after some time CC @Marostegui @Papaul. [11:46:09] (03PS1) 10Ema: ATS: fix routing to Restbase [puppet] - 10https://gerrit.wikimedia.org/r/453111 (https://phabricator.wikimedia.org/T199720) [11:51:17] (03PS10) 10Vgutierrez: [WIP] Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [11:56:34] (03PS3) 10Giuseppe Lavagetto: PHP: create module for modern Debian-based distributions [puppet] - 10https://gerrit.wikimedia.org/r/452664 (https://phabricator.wikimedia.org/T201140) [11:56:35] (03PS2) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [11:57:23] (03PS11) 10Vgutierrez: Refactor certcentral.certificate_management() [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [12:00:17] !log installing ruby2.3 security updates [12:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:54] (03PS1) 10Arturo Borrero Gonzalez: d/rules: prevent dh_installinit from installing sysvinit files [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453112 (https://phabricator.wikimedia.org/T201473) [12:06:16] !log depooling wdqs[12]003 to catchup on updates [12:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:06] !log T201473 install a new version of 
`prometheus-pdns-exporter` (0.3) into jessie-wikimedia, due to errors in the postinst script [12:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:13] T201473: prometheus-pdns-exporter for Jessie? - https://phabricator.wikimedia.org/T201473 [12:08:15] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate new entry for v0.3 [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453115 (https://phabricator.wikimedia.org/T201473) [12:09:43] (03CR) 10Arturo Borrero Gonzalez: [C: 032] d/rules: prevent dh_installinit from installing sysvinit files [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453112 (https://phabricator.wikimedia.org/T201473) (owner: 10Arturo Borrero Gonzalez) [12:10:12] (03CR) 10Arturo Borrero Gonzalez: [C: 032] d/changelog: generate new entry for v0.3 [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453115 (https://phabricator.wikimedia.org/T201473) (owner: 10Arturo Borrero Gonzalez) [12:11:52] RECOVERY - puppet last run on cloudservices1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:12:24] \o/ [12:15:19] (03CR) 10Muehlenhoff: "That's not really needed? If a systemd unit is around,systemd will simply ignore the sysvinit script." [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453112 (https://phabricator.wikimedia.org/T201473) (owner: 10Arturo Borrero Gonzalez) [12:17:31] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "dh_installinit will put some code in postinst that will try to call invoke-rc.d for prometheus-pdns-exporter, which doesn't exists, and pa" [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453112 (https://phabricator.wikimedia.org/T201473) (owner: 10Arturo Borrero Gonzalez) [12:40:09] (03CR) 10Muehlenhoff: "invoke-rc.d is shipped in sysv-rc which is "priority: required", that should not happen. It's also installed on cloudservices1003?" 
[debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453112 (https://phabricator.wikimedia.org/T201473) (owner: 10Arturo Borrero Gonzalez) [12:44:08] !log upgrading wikidiff to 1.7.2 on labweb* (HHVM bytecode cache is pruned during update) [12:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:27] !log restarting blazegraph on wdqs[12]003 [13:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:39] !log rebooting serpens for kernel security update [13:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:03] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [13:21:15] (03PS1) 10Jcrespo: mariadb: Depool es1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453122 [13:31:26] !log rebooting seaborgium for kernel security update [13:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:53] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:35:05] (03PS16) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [13:35:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "> invoke-rc.d is shipped in sysv-rc which is "priority: required"," [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/453112 (https://phabricator.wikimedia.org/T201473) (owner: 10Arturo Borrero Gonzalez) [13:35:36] there is high criticals in the last half an hour [13:36:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:39:32] This basic grammar fix up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/452716 [13:39:36] that would be great [13:42:13] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:43:38] (03PS1) 10Vgutierrez: Implement different Certificate.save() modes [software/certcentral] - 10https://gerrit.wikimedia.org/r/453124 [13:44:41] (03CR) 10jerkins-bot: [V: 04-1] Implement different Certificate.save() modes [software/certcentral] - 10https://gerrit.wikimedia.org/r/453124 (owner: 10Vgutierrez) [13:46:14] !log rebooting labtestservices2002/2003 for kernel security update [13:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:40] (03CR) 10Andrew Bogott: "That hash seems to originally come from 44d4872620e30f47a3465f01b2f3e9f12e3634a4 -- I guess I assumed that it was a dummy :) I'll refres" [puppet] - 10https://gerrit.wikimedia.org/r/452997 (owner: 10Andrew Bogott) [13:47:51] (03Restored) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [13:48:21] (03PS2) 10Vgutierrez: Implement different Certificate.save() modes [software/certcentral] - 10https://gerrit.wikimedia.org/r/453124 [13:52:41] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, thanks for implementing the ; Only doubt I had is you stop getting spammed by its output, which looks like a net win, but it's a cha" [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [13:53:13] (03PS1) 10BBlack: puppetmaster: use strong ciphers only [puppet] - 10https://gerrit.wikimedia.org/r/453126 [13:53:15] (03PS1) 10BBlack: tlsproxy: no-op rename of params to tlsproxy namespace [puppet] - 10https://gerrit.wikimedia.org/r/453127 [13:53:17] (03PS1) 10BBlack: tlsproxy: parameterize 
ciphersuite level [puppet] - 10https://gerrit.wikimedia.org/r/453128 [13:53:19] (03PS1) 10BBlack: role::cache::*: explicit tlsproxy compat level [puppet] - 10https://gerrit.wikimedia.org/r/453129 [13:53:21] (03PS1) 10BBlack: tlsproxy: default ciphersuite_level strong [puppet] - 10https://gerrit.wikimedia.org/r/453130 [13:54:30] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy: parameterize ciphersuite level [puppet] - 10https://gerrit.wikimedia.org/r/453128 (owner: 10BBlack) [13:54:38] !log rebooting labtestvirt2003 for kernel security update [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] (03CR) 10Bstorm: [C: 032] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [13:58:33] bstorm_: oh wow, this is awesome, nice! [13:59:20] :) [14:01:15] Except it has a problem on the server that I didn't see in the compiler, lol. Shouldn't be hard to fix [14:01:33] ? [14:01:40] something from wikibugs? [14:01:42] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:02:22] bstorm_: I'm curious, why keep the sunday => Sun mapping and not just convert callers to pass Sun/Mon/Tue as $weekday? [14:03:12] I was keying off how it was done originally in that. I might change that in the next patch (needed to fix the dependency). I thought the systemd:unit would be enough. 
It wants a systemd:service :-p [14:03:45] The compiler didn't error, which is weird *shrugs* [14:04:09] Gonna revert it and fix it up quick [14:04:41] (03PS1) 10Bstorm: Revert "labstore: Change backup cron to a systemd timer" [puppet] - 10https://gerrit.wikimedia.org/r/453135 [14:04:43] PROBLEM - IPMI Sensor Status on elastic1022 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:06:00] (03CR) 10Bstorm: [C: 032] Revert "labstore: Change backup cron to a systemd timer" [puppet] - 10https://gerrit.wikimedia.org/r/453135 (owner: 10Bstorm) [14:07:29] (03CR) 10Alex Monk: [C: 04-1] "The TODO in this make it seem like this commit completely breaks our LE cert issuance?" (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [14:09:00] <_joe_> bstorm_: what was the problem? [14:09:35] It depends on systemd:service, but I had done a systemd:unit beforehand. It said there was no systemd:service with that name in the catalog [14:09:38] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Invalid relationship: Systemd::Service[block_sync] { require => Systemd::Service[block_sync.service] }, because Systemd::Service[block_sync.service] doesn't seem to be in the catalog [14:09:51] <_joe_> oh right [14:10:01] So I'll adjust that and put it back :) [14:10:01] <_joe_> dependencies are resolved by the agent [14:10:03] <_joe_> not the master [14:10:08] That makes sense [14:10:15] <_joe_> it's the biggest limitation of our compiler [14:10:39] (03CR) 10Alex Monk: [C: 032] Implement different Certificate.save() modes [software/certcentral] - 10https://gerrit.wikimedia.org/r/453124 (owner: 10Vgutierrez) [14:11:37] !log rebooting labtestweb2001 for kernel security update [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:43] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [14:12:02] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:13:43] Is anyone else missing Phab boards? Our backlog board seems to have disappeared today [14:13:45] (03PS2) 10Ema: ATS: fix routing to Restbase [puppet] - 10https://gerrit.wikimedia.org/r/453111 (https://phabricator.wikimedia.org/T199720) [14:13:47] https://phabricator.wikimedia.org/tag/readers-web-backlog/ https://phabricator.wikimedia.org/project/board/67/ [14:14:35] (03CR) 10Ema: [C: 032] ATS: fix routing to Restbase [puppet] - 10https://gerrit.wikimedia.org/r/453111 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [14:16:55] (03PS1) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) [14:22:26] !log repooling wdqs[12]003 [14:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:43] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp4024.ulsfo.wmnet', 'cp2002.codfw.wmnet'] ``` The log can be found in `/var/l... 
[14:25:18] (03PS2) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) [14:25:29] !log starting moving asw2-a-eqiad servers' uplinks for T201694 [14:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:35] T201694: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 [14:27:04] !log lvs1015 moving cross connect from asw2-a2 to asw2-a5 T201694 [14:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:41] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: move the other private wikis to the define [puppet] - 10https://gerrit.wikimedia.org/r/451255 (https://phabricator.wikimedia.org/T196968) [14:28:27] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: move the other private wikis to the define [puppet] - 10https://gerrit.wikimedia.org/r/451255 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:28:55] (03CR) 10Alex Monk: [C: 04-1] Refactor certcentral.certificate_management() (036 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [14:29:04] (03PS3) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) [14:33:44] !log cloudelastic1001 moving uplink from asw2-a eqiad to asw2-a2 [14:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453122 (owner: 10Jcrespo) [14:34:30] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10waldyrious) >>! 
In T199816#4444683, @fgiunchedi wrote: > I've setup a very bare deprecation page for status.wikimedia.org, we can sunset the DNS name in... [14:34:57] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@166eafa]: Update mobileapps to a808c9d (T201979) [14:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:03] T201979: Fix usage of deprecated API query pattern(s) - https://phabricator.wikimedia.org/T201979 [14:35:25] (03Merged) 10jenkins-bot: mariadb: Depool es1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453122 (owner: 10Jcrespo) [14:37:11] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:37:49] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1018 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453144 [14:38:32] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool es2018 (duration: 00m 55s) [14:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:40] (03PS4) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) [14:39:47] !log bblack@neodymium conftool action : set/pooled=no; selector: name=dns1001.wikimedia.org [14:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:30] !log dns1001 moving uplink from asw2-a eqiad to asw-a-eqiad [14:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:00] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@166eafa]: Update mobileapps to a808c9d (T201979) (duration: 06m 03s) [14:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:06] T201979: Fix usage of deprecated API query pattern(s) - https://phabricator.wikimedia.org/T201979 [14:43:10] !log dbproxy1012 moving uplink 
from asw2-a-eqiad to asw-a-eqiad [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:35] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=dns1001.wikimedia.org [14:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:52] !log labstore1008 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:30] !log db1116 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:21] (03PS3) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) [14:47:40] (03CR) 10Vgutierrez: Refactor certcentral.certificate_management() (033 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [14:49:24] !log db1118 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:42] (03CR) 10Alex Monk: [C: 04-1] Refactor certcentral.certificate_management() (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [14:49:54] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4024.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4024.ulsfo.wmnet'] ``` [14:49:59] (03CR) 10jenkins-bot: mariadb: Depool es1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453122 (owner: 10Jcrespo) [14:50:15] RECOVERY - Host mw2184 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [14:50:44] !log db1066 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:50:56] 10Operations, 10monitoring: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Imarlier) [14:51:40] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Sunset Watchmouse's status.wikimedia.org - https://phabricator.wikimedia.org/T199816 (10Imarlier) @waldyrious Good point. I added T202061 as a task to implement a replacement. [14:51:57] !log shutting down mw2184 for maintenance [14:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:18] (03PS4) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) [14:53:04] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [14:53:25] PROBLEM - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:42] !log labstore1009 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:15] !log dbproxy1013 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:06] !log ms-be1040 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:58] (03Abandoned) 10Jcrespo: mariadb: Set s2 in read only mode due to maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452620 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [14:58:18] (03Abandoned) 10Jcrespo: mariadb: Set s2 as read-write and promote db1122 as the new s2 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452632 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [14:58:39] (03Abandoned) 10Jcrespo: mariadb: Failover db1066 
(eqiad s2 master) to db1122 [puppet] - 10https://gerrit.wikimedia.org/r/452637 (https://phabricator.wikimedia.org/T197073) (owner: 10Jcrespo) [14:58:54] (03Abandoned) 10Jcrespo: mariadb: Point s2-master CNAME to db1122 [dns] - 10https://gerrit.wikimedia.org/r/452642 (https://phabricator.wikimedia.org/T201694) (owner: 10Jcrespo) [14:58:54] !log torrelay1001 moving uplink from asw2-a-eqiad to asw-a-eqiad [14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:56] PROBLEM - Host mw2184.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:00:30] !log stopping es2018 for upgrade [15:00:33] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp107[56]\.eqiad\.wmnet [15:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:46] (03PS2) 10Volans: Log: use local variable for dry_run [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) [15:00:52] (03PS2) 10Volans: dry-run: remove the module, inject the parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) [15:00:54] (03PS13) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [15:00:59] (03PS1) 10Volans: tests: enable pytest logging [software/spicerack] - 10https://gerrit.wikimedia.org/r/453145 (https://phabricator.wikimedia.org/T199079) [15:01:01] (03PS1) 10Volans: config: fix docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) [15:01:03] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp107[56]\.eqiad\.wmnet [15:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:16] !log bblack@neodymium conftool action : 
set/pooled=no; selector: name=cp107[56]\.eqiad\.wmnet [15:01:18] (03CR) 10Volans: "inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:24] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10RobH) I did not check this, just didn't notice it assigned to me. The Tech Direct doesn't work, was normal support attempted? I've emailed our team, & CCed Chris. > Dell Team, > > We're experiencin... [15:01:26] (03CR) 10jerkins-bot: [V: 04-1] Log: use local variable for dry_run [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:01:28] (03CR) 10jerkins-bot: [V: 04-1] dry-run: remove the module, inject the parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:01:30] (03CR) 10jerkins-bot: [V: 04-1] tests: enable pytest logging [software/spicerack] - 10https://gerrit.wikimedia.org/r/453145 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:01:32] (03CR) 10jerkins-bot: [V: 04-1] config: fix docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:01:38] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:02:04] why? 
[15:02:28] because it's a jerk [15:02:38] they all passed locally [15:02:44] !log cp1075 moving uplink from asw2-a-eqiad to asw-a-eqiad [15:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:49] _joe_: np [15:03:45] !log cp1076 moving uplink from asw2-a-eqiad to asw-a-eqiad [15:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:29] (03PS5) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) [15:04:50] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp107[56]\.eqiad\.wmnet [15:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:10] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp107[78]\.eqiad\.wmnet [15:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:20] damn bugged test-dependencies [15:05:52] !log cp107[78] moving uplink from asw2-a-eqiad to asw-a-eqiad [15:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:41] (03CR) 10Gehel: [C: 031] "LGTM, trivial" [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:07:19] gehel: the prospector issues are https://github.com/PyCQA/prospector/issues/276 :( [15:07:31] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp107[78]\.eqiad\.wmnet [15:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:05] I can guarantee that all was good with 2.3.1 and all tests were passing [15:08:41] RECOVERY - Host mw2184 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [15:08:44] (03CR) 10Gehel: "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/453145 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:08:48] (03CR) 10Gehel: [C: 031] 
tests: enable pytest logging [software/spicerack] - 10https://gerrit.wikimedia.org/r/453145 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:09:12] !log stopping pybal on lvs1016 to fail traffic to lvs1006 for T201694 [15:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:18] T201694: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 [15:09:39] (03CR) 10Gehel: [C: 031] "LGTM, trivial enough" [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:09:49] (03CR) 10Volans: [V: 032 C: 032] Log: use local variable for dry_run [software/spicerack] - 10https://gerrit.wikimedia.org/r/452379 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:10:00] (03CR) 10Bstorm: [C: 032] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [15:10:14] (03PS6) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/453137 (https://phabricator.wikimedia.org/T171394) [15:10:31] (03CR) 10Volans: [V: 032 C: 032] tests: enable pytest logging [software/spicerack] - 10https://gerrit.wikimedia.org/r/453145 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:10:38] (03CR) 10Gehel: [C: 031] "LGTM" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:11:22] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [15:11:24] (03CR) 10Volans: [V: 032 C: 032] dry-run: remove the module, inject the parameter (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/452378 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:11:42] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS 
CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:11:52] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1001.eqiad.wmnet:2379 (min=42) [15:11:55] ^ lvs1016 alerts expected, see logmsg earlier [15:12:07] (03CR) 10Vgutierrez: Refactor certcentral.certificate_management() (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [15:13:21] !log lvs1016 moving uplink from asw2-a-eqiad to asw-a-eqiad [15:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:41] 10Operations, 10Cassandra: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 (10Eevans) Just for posterity sake: I don't know why the log would have been corrupted like this (almost certainly a bug), but the commitlog only exists to append incoming writes until what wa... [15:16:23] !log restarting pybal on lvs1016 - T201694 [15:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] T201694: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 [15:16:42] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [15:16:52] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 42 connections established with conf1001.eqiad.wmnet:2379 (min=42) [15:16:52] (03CR) 10Vgutierrez: [C: 04-2] "sigh.. 
I've swapped CHALLENGES_SOLVED and CHALLENGES_PUSHED status implementation, back to WIP :(" [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [15:17:31] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [15:17:32] PROBLEM - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:46] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1018 for maintenance" and depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453144 [15:20:20] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10Cmjohnson) [15:21:17] (03PS3) 10Jcrespo: Revert "mariadb: Depool es1018 for maintenance" and depool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453144 [15:21:21] RECOVERY - Host mw2184 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [15:21:52] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Cmjohnson) [15:22:10] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10Cmjohnson) [15:23:21] PROBLEM - mediawiki-installation DSH group on mw2184 is CRITICAL: Host mw2184 is not in mediawiki-installation dsh group [15:23:32] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10Cmjohnson) [15:23:55] 10Operations, 10ops-eqiad, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Cmjohnson) [15:24:17] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10Cmjohnson) [15:24:49] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10Cmjohnson) 
[15:25:19] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10Cmjohnson) [15:25:27] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 2 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10Cmjohnson) 05Open>03Resolved [15:25:31] 10Operations, 10SRE-Access-Requests: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202063 (10Tim_WMDE) [15:26:38] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1018 for maintenance" and depool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453144 (owner: 10Jcrespo) [15:27:57] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es1018 for maintenance" and depool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453144 (owner: 10Jcrespo) [15:29:27] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool es2018, depool es2019 (duration: 00m 50s) [15:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:15] 10Operations: onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201855 (10jijiki) [15:31:17] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10jijiki) [15:31:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request production global root access for Effie Mouzeli - https://phabricator.wikimedia.org/T201849 (10jijiki) 05Open>03Resolved [15:34:31] !log stopping es2019 for upgrade [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:47] 10Operations, 10ops-codfw: mw2184 stuck after reboot - https://phabricator.wikimedia.org/T202006 (10Papaul) a:05Papaul>03MoritzMuehlenhoff This is what was showing {F25020319} - Drain the power - Upgrade BIOS from version 2.3.3 to 2.6.0 - Upgrade IDRAC from 1.4.2 to 2.60 server is back up [15:38:30] (03PS2) 10Volans: config: fix docstring [software/spicerack] - 
10https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) [15:38:32] (03PS14) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [15:38:39] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:39:24] (03CR) 10jerkins-bot: [V: 04-1] config: fix docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:39:26] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:42:47] 10Operations, 10ops-codfw: mw2184 stuck after reboot - https://phabricator.wikimedia.org/T202006 (10MoritzMuehlenhoff) 05Open>03Resolved Thanks! I've run "scap pull" and repooled the server. 
[15:44:04] (PS1) Cmjohnson: Removing second mgmt dns entry for scandium [dns] - https://gerrit.wikimedia.org/r/453150 (https://phabricator.wikimedia.org/T201366) [15:44:20] (CR) Cmjohnson: [C: 2] Removing second mgmt dns entry for scandium [dns] - https://gerrit.wikimedia.org/r/453150 (https://phabricator.wikimedia.org/T201366) (owner: Cmjohnson) [15:45:49] (PS1) Jcrespo: mariadb: Repool es2019 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/453151 [15:46:32] (PS3) Volans: config: rename parameter to avoid negation [software/spicerack] - https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) [15:46:34] (PS15) Volans: Add cookbook entry point script [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [15:46:39] sorry for the spam of -1 [15:47:15] Operations, netops, Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) [15:47:16] (CR) jerkins-bot: [V: -1] config: rename parameter to avoid negation [software/spicerack] - https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [15:47:17] Operations, netops, Patch-For-Review: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (ayounsi) Open>Resolved a:Cmjohnson [15:47:24] (CR) jerkins-bot: [V: -1] Add cookbook entry point script [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [15:47:32] (PS1) Cmjohnson: Removing mgmt dns for decom host labsdb1001-3 [dns] - https://gerrit.wikimedia.org/r/453152 (https://phabricator.wikimedia.org/T184832) [15:48:06] (CR) Cmjohnson: [C: 2] Removing mgmt dns for decom host labsdb1001-3 [dns] - https://gerrit.wikimedia.org/r/453152 (https://phabricator.wikimedia.org/T184832) (owner: Cmjohnson)
[15:50:07] (CR) Jcrespo: [C: 2] mariadb: Repool es2019 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/453151 (owner: Jcrespo) [15:51:22] (CR) jenkins-bot: Revert "mariadb: Depool es1018 for maintenance" and depool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/453144 (owner: Jcrespo) [15:51:24] (Merged) jenkins-bot: mariadb: Repool es2019 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/453151 (owner: Jcrespo) [15:51:41] (CR) jenkins-bot: mariadb: Repool es2019 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/453151 (owner: Jcrespo) [15:51:51] Operations, ops-eqiad, decommission, Patch-For-Review, cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832 (Cmjohnson) [15:52:57] herron, hey, just wanted to check in to see where we're at with https://gerrit.wikimedia.org/r/439774 and/or https://gerrit.wikimedia.org/r/439791 ?
[15:53:48] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool es2019 (duration: 00m 51s) [15:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:32] (CR) Gehel: "Minor comments inline, otherwise LGTM" (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [15:55:49] !log stopping db2034 for upgrade [15:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] (PS1) Cmjohnson: Removing mgmt dns for decom server zinc [dns] - https://gerrit.wikimedia.org/r/453153 (https://phabricator.wikimedia.org/T191352) [15:56:35] Operations, SRE-Access-Requests: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202069 (Tonina_Zhelyazkova_WMDE) [15:56:54] (CR) Cmjohnson: [C: 2] Removing mgmt dns for decom server zinc [dns] - https://gerrit.wikimedia.org/r/453153 (https://phabricator.wikimedia.org/T191352) (owner: Cmjohnson) [15:57:45] Operations, ops-codfw, Traffic, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (Papaul) [15:58:09] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom zinc/WMF3298 - https://phabricator.wikimedia.org/T191352 (Cmjohnson) [15:58:21] Operations, ops-eqiad, DC-Ops, decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (Cmjohnson) [15:58:25] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom zinc/WMF3298 - https://phabricator.wikimedia.org/T191352 (Cmjohnson) Open>Resolved a:Cmjohnson [15:58:43] Operations, ops-eqiad, DC-Ops: Decommission server zinc - https://phabricator.wikimedia.org/T182016 (Cmjohnson) Open>Resolved duplicate [16:00:04] godog, moritzm, and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy.
Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:00:34] (PS1) Cmjohnson: Removing mgmt dns for decom host vanadium [dns] - https://gerrit.wikimedia.org/r/453154 (https://phabricator.wikimedia.org/T191351) [16:01:08] Operations, DC-Ops, cloud-services-team, netops: Refresh switch ports descriptions for recently renamed cloud servers - https://phabricator.wikimedia.org/T201444 (RobH) p:Triage>Normal [16:04:06] (CR) Cmjohnson: [C: 2] Removing mgmt dns for decom host vanadium [dns] - https://gerrit.wikimedia.org/r/453154 (https://phabricator.wikimedia.org/T191351) (owner: Cmjohnson) [16:04:45] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom vanadium/WMF3291 - https://phabricator.wikimedia.org/T191351 (Cmjohnson) [16:04:53] Operations, ops-eqiad, DC-Ops, decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (Cmjohnson) [16:04:55] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom vanadium/WMF3291 - https://phabricator.wikimedia.org/T191351 (Cmjohnson) Open>Resolved a:Cmjohnson [16:12:32] !log banning, depooling and shutting down elastic1029 for memory replacement - T201991 [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:39] T201991: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 [16:14:54] PROBLEM - Host elastic1029 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:28] damn, elastic1029 is obviously me [16:17:40] (PS1) Bstorm: labstore: trying to make dependency issues work [puppet] - https://gerrit.wikimedia.org/r/453156 (https://phabricator.wikimedia.org/T171394) [16:18:54] PROBLEM - Host elastic1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:22:23] RECOVERY - Host
elastic1029 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:23:23] RECOVERY - mediawiki-installation DSH group on mw2184 is OK: OK [16:24:13] RECOVERY - Host elastic1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [16:25:29] Operations, ops-eqiad: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 (Cmjohnson) I reseated the DIMM and moved all on side A to side B. Powered on and server came back normally. [16:32:20] Operations, LDAP-Access-Requests, User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (RStallman-legalteam) NDA is fully signed and on file with legal. Thanks! [16:35:07] (PS1) Urbanecm: Throttle exeptions for Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/453160 (https://phabricator.wikimedia.org/T202038) [16:35:59] Operations, SRE-Access-Requests: Requesting Access to view EventLogging data - https://phabricator.wikimedia.org/T202072 (gabriel-wmde) [16:37:55] jouncebot, next [16:37:55] In 0 hour(s) and 22 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T1700) [16:42:31] (PS1) Cmjohnson: Adding mgmt dns for analyticsmaster1001-2 [dns] - https://gerrit.wikimedia.org/r/453161 (https://phabricator.wikimedia.org/T201939) [16:43:00] (CR) Cmjohnson: [C: 2] Adding mgmt dns for analyticsmaster1001-2 [dns] - https://gerrit.wikimedia.org/r/453161 (https://phabricator.wikimedia.org/T201939) (owner: Cmjohnson) [16:43:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:46:35] (CR) Dzahn: [C: 2] "GPG is a great solution, though for some reason i got bad signature (probably missing whitespace or something like that).
anyways, confirm" [puppet] - https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) (owner: Dzahn) [16:47:16] (PS2) Dzahn: admins: add new SSH key for Daniel Kinzler [puppet] - https://gerrit.wikimedia.org/r/452844 (https://phabricator.wikimedia.org/T201913) [16:47:53] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:49:02] (PS16) Volans: Add cookbook entry point script [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [16:49:04] (PS3) Volans: Add confctl module to interact with conftool [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) [16:49:25] (PS2) Bstorm: labstore: trying to make dependency issues work [puppet] - https://gerrit.wikimedia.org/r/453156 (https://phabricator.wikimedia.org/T171394) [16:49:35] (CR) Volans: "See inline, thanks a lot for the review!" (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [16:49:44] (CR) jerkins-bot: [V: -1] Add cookbook entry point script [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [16:49:46] (CR) jerkins-bot: [V: -1] Add confctl module to interact with conftool [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [16:50:07] Operations, SRE-Access-Requests, Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (Dzahn) Yes, GPG signature was a great solution, though for some reason i got 'bad signature', probably a missing whitespace during copy/paste or similar. Anyways, confirmed with a h...
[16:52:17] Operations, Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (Dzahn) [16:52:31] Operations, ops-eqiad, netops: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (ayounsi) p:Triage>High [16:57:00] Operations, SRE-Access-Requests, Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (daniel) [16:57:13] Critical Alert for device mr1-eqiad.wikimedia.org - Duplicate IP on mgmt network got acknowledged [16:58:32] Operations, SRE-Access-Requests, Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (daniel) > Yes, GPG signature was a great solution, though for some reason i got 'bad signature', probably a missing whitespace during copy/paste or similar. I guess I introduced a l... [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T1700). [17:00:46] Operations, SRE-Access-Requests, Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (Dzahn) Open>Resolved a:Dzahn ran puppet on bastion hosts and mwmaint1001. key has been updated there.
all other hosts will follow automatically [17:00:48] Operations, ops-eqiad, Analytics, Patch-For-Review: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (Cmjohnson) [17:01:18] (PS3) Bstorm: labstore: trying to make dependency issues work [puppet] - https://gerrit.wikimedia.org/r/453156 (https://phabricator.wikimedia.org/T171394) [17:02:39] (CR) Bstorm: [C: 2] labstore: trying to make dependency issues work [puppet] - https://gerrit.wikimedia.org/r/453156 (https://phabricator.wikimedia.org/T171394) (owner: Bstorm) [17:06:18] (PS1) Bstorm: Revert "labstore: trying to make dependency issues work" [puppet] - https://gerrit.wikimedia.org/r/453162 [17:07:21] (CR) Bstorm: [C: 2] Revert "labstore: trying to make dependency issues work" [puppet] - https://gerrit.wikimedia.org/r/453162 (owner: Bstorm) [17:09:00] Nothing for ORES today [17:09:47] (PS1) Ema: ATS: storage configuration [puppet] - https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [17:10:23] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[block_sync.service] [17:10:33] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): Service[block_sync.service] [17:21:59] (PS1) Catrope: Enable ORES filters for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453168 [17:22:17] (PS2) Ema: ATS: storage configuration [puppet] - https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [17:25:15] (PS3) Ema: ATS: storage configuration [puppet] - https://gerrit.wikimedia.org/r/453164 (https://phabricator.wikimedia.org/T199720) [17:29:30] Operations, ops-eqiad, DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (Cmjohnson) a:Cmjohnson>RobH dbproxy1015 had the same ip in the idrac. Fixed [17:30:23] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:34] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:36:19] Operations, SRE-Access-Requests, Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (Dzahn) I manually copied this key from here: https://pgp.mit.edu/pks/lookup?op=get&search=0x7DB725DFC506256E and imported it and then i could verify. ( i could not find it with --s... [17:36:39] !log reimaging elastic1029 [17:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:48] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1029.eqiad.wmnet'] ``` The log... [17:39:46] Operations, ops-eqiad: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 (Gehel) Open>Resolved a:Gehel Looking good!
[17:40:19] (PS1) RobH: dbproxy101[56] mac update [puppet] - https://gerrit.wikimedia.org/r/453171 [17:41:34] (CR) RobH: [C: 2] dbproxy101[56] mac update [puppet] - https://gerrit.wikimedia.org/r/453171 (owner: RobH) [17:45:32] Operations, ops-eqiad, Discovery, Discovery-Search, Elasticsearch: check elastic1022 power supply redundancy - https://phabricator.wikimedia.org/T177631 (Gehel) @Cmjohnson confirms that there is still nothing in the H/W logs and the PSU seem to work correctly. IPMI reporting a false positive... [17:55:45] Operations, ops-eqiad, Performance-Team: tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628 (Krinkle) [17:57:43] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@fec00bc]: Push updated transfer-to-es oozie job [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:52] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@fec00bc]: Push updated transfer-to-es oozie job (duration: 00m 08s) [17:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:22] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@122080c]: push new python dependency handling [17:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:41] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@122080c]: push new python dependency handling (duration: 00m 20s) [17:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T1800).
[18:00:04] Urbanecm and RoanKattouw: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] Present [18:00:19] Operations, ops-eqiad, DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (RobH) [18:01:17] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1029.eqiad.wmnet'] ``` and were **ALL** successful. [18:02:20] I'll do it [18:02:38] (PS1) Bstorm: labstore and systemd: change timer module to use simpler interface [puppet] - https://gerrit.wikimedia.org/r/453173 (https://phabricator.wikimedia.org/T171394) [18:03:14] (CR) jerkins-bot: [V: -1] labstore and systemd: change timer module to use simpler interface [puppet] - https://gerrit.wikimedia.org/r/453173 (https://phabricator.wikimedia.org/T171394) (owner: Bstorm) [18:05:33] (PS2) Catrope: Throttle exeptions for Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/453160 (https://phabricator.wikimedia.org/T202038) (owner: Urbanecm) [18:06:17] (CR) Catrope: [C: 2] Throttle exeptions for Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/453160 (https://phabricator.wikimedia.org/T202038) (owner: Urbanecm) [18:06:29] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d994cb9]: push new python dependency handling [18:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:34] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d994cb9]: push new python dependency handling (duration: 00m 05s) [18:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:52] Hi RoanKattouw
:) [18:07:47] (Merged) jenkins-bot: Throttle exeptions for Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/453160 (https://phabricator.wikimedia.org/T202038) (owner: Urbanecm) [18:08:02] (PS3) Ori.livneh: Declare /var/cache/coal_web [puppet] - https://gerrit.wikimedia.org/r/452953 [18:08:14] (PS2) Ori.livneh: Ensure coal-web caches are warm via a bi-hourly cron job [puppet] - https://gerrit.wikimedia.org/r/452984 [18:08:41] (PS2) Bstorm: labstore and systemd: change timer module to use simpler interface [puppet] - https://gerrit.wikimedia.org/r/453173 (https://phabricator.wikimedia.org/T171394) [18:10:28] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@0a704b6]: push new python dependency handling [18:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:18] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@0a704b6]: push new python dependency handling (duration: 03m 49s) [18:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:06] Sorry, missed the message about that patch being merged [18:16:34] !log catrope@deploy1001 Synchronized wmf-config/throttle.php: Throttle exemptions for cswiki (T202038) (duration: 00m 53s) [18:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:42] T202038: Account creation throttling exception request for Friday 17 and 24 August 2018 - https://phabricator.wikimedia.org/T202038 [18:18:05] (PS2) Catrope: Enable ORES filters for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453168 [18:18:12] (CR) Catrope: [C: 2] Enable ORES filters for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453168 (owner: Catrope) [18:19:44] (Merged) jenkins-bot: Enable ORES filters for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453168 (owner:
Catrope) [18:19:57] (CR) jenkins-bot: Throttle exeptions for Czech Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/453160 (https://phabricator.wikimedia.org/T202038) (owner: Urbanecm) [18:19:59] (CR) jenkins-bot: Enable ORES filters for PageTriage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/453168 (owner: Catrope) [18:21:01] (CR) Dzahn: [C: 2] Declare /var/cache/coal_web [puppet] - https://gerrit.wikimedia.org/r/452953 (owner: Ori.livneh) [18:22:32] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ORES filters in PageTriage on testwiki (duration: 00m 50s) [18:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:57] (PS1) Ankry: Allow bureaucrats to remove the 'interface-admin' right in plwikisource (T202085) [mediawiki-config] - https://gerrit.wikimedia.org/r/453177 [18:22:59] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia!
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/453177 (owner: Ankry) [18:24:39] (Abandoned) Bstorm: labstore and systemd: change timer module to use simpler interface [puppet] - https://gerrit.wikimedia.org/r/453173 (https://phabricator.wikimedia.org/T171394) (owner: Bstorm) [18:24:45] (CR) Dzahn: [C: 2] Ensure coal-web caches are warm via a bi-hourly cron job [puppet] - https://gerrit.wikimedia.org/r/452984 (owner: Ori.livneh) [18:25:29] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@5d87cc0]: now without the shebang [18:25:33] (Abandoned) Ankry: Allow bureaucrats to remove the 'interface-admin' right in plwikisource (T202085) [mediawiki-config] - https://gerrit.wikimedia.org/r/453177 (owner: Ankry) [18:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:47] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@5d87cc0]: now without the shebang (duration: 00m 17s) [18:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:11] (CR) Gehel: [C: 1] "LGTM, trivial enough" [software/spicerack] - https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [18:29:12] (CR) Gehel: [C: 1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [18:30:20] (PS1) Ankry: Allow bureaucrats to remove 'interface-admin' right in plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/453179 (https://phabricator.wikimedia.org/T202085) [18:30:47] (CR) Ori.livneh: "thank you, dzahn :)" [puppet] - https://gerrit.wikimedia.org/r/452984 (owner: Ori.livneh) [18:30:54] (CR) Dzahn: [C: 2] "cron has been added and i manually ran the resulting command user user nobody on webperf2001.
it showed no errors" [puppet] - https://gerrit.wikimedia.org/r/452984 (owner: Ori.livneh) [18:32:33] (PS1) Bstorm: labstore and systemd: Change timer dependency to unit instead of service [puppet] - https://gerrit.wikimedia.org/r/453180 [18:33:23] (CR) jerkins-bot: [V: -1] labstore and systemd: Change timer dependency to unit instead of service [puppet] - https://gerrit.wikimedia.org/r/453180 (owner: Bstorm) [18:34:23] (PS2) Bstorm: labstore and systemd: Change timer dependency to unit instead of service [puppet] - https://gerrit.wikimedia.org/r/453180 (https://phabricator.wikimedia.org/T171394) [18:35:31] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic1022 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Gehel tracked in https://phabricator.wikimedia.org/T177631 [18:37:24] !log arlolra@deploy1001 Started deploy [parsoid/deploy@59f6585]: Updating Parsoid to dbbad6a [18:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:20] (CR) Volans: [V: 2 C: 2] config: rename parameter to avoid negation [software/spicerack] - https://gerrit.wikimedia.org/r/453146 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [18:40:12] (CR) Volans: [V: 2 C: 2] Add cookbook entry point script [software/spicerack] - https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [18:45:05] !log reimage of elasticsearch eqiad completed - T198391 / T193649 [18:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:14] T198391: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 [18:45:15] T193649: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 [18:47:15] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@59f6585]: Updating Parsoid to dbbad6a (duration: 09m 51s) [18:47:18] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:26] (PS3) Bstorm: labstore and systemd: Change timer dependency to unit instead of service [puppet] - https://gerrit.wikimedia.org/r/453180 (https://phabricator.wikimedia.org/T171394) [18:50:27] (PS2) Ankry: Allow bureaucrats to remove 'interface-admin' right in plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/453179 (https://phabricator.wikimedia.org/T202085) [18:52:39] !log Updated Parsoid to dbbad6a (T201115) [18:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:46] T201115: MediaWiki API deprecation warnings - https://phabricator.wikimedia.org/T201115 [18:53:38] Operations, ops-eqiad, DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (RobH) a:RobH>Cmjohnson Dell fixed the ownership info for us, you can put in requests for support and parts now. [18:55:55] (CR) Bstorm: [C: 2] labstore and systemd: Change timer dependency to unit instead of service [puppet] - https://gerrit.wikimedia.org/r/453180 (https://phabricator.wikimedia.org/T171394) (owner: Bstorm) [18:56:57] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@040690e]: coordinator.properties should reference a coordinator, not bundle [18:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:15] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@040690e]: coordinator.properties should reference a coordinator, not bundle (duration: 00m 18s) [18:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] thcipriani and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Gerrit All-Users/Cache change. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T1900).
[19:00:23] * thcipriani on it [19:05:45] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@9fb53c4]: (no justification provided) [19:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@9fb53c4]: (no justification provided) (duration: 00m 17s) [19:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:16] !log T201314 mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'EricEnfermero' 'Larry Hockett' --ignorestatus [19:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:27] T201314: Please unblock stuck global rename: EricEnfermero to Larry Hockett - https://phabricator.wikimedia.org/T201314 [19:11:22] !log twentyafterfour and thcipriani performing online maintenance on gerrit All-Users repo [19:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:48] !log clearing gerrit accounts cache [19:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:05] Hi! I've lost track of the hhvm/php7 discussions... Are we running php7 in prod now? Is it ok to use the ?? syntax? [19:20:26] no, we're not [19:20:28] But yes it is [19:20:51] Because hhvm supports it? [19:21:07] yup [19:21:23] we're in a weird limbo [19:22:00] limbo is not cool, but the ?? syntax is. Thanks Reedy [19:22:00] https://docs.hhvm.com/hhvm/configuration/INI-settings#php-7-settings are related to the things that you can't do [19:25:41] twentyafterfour thcipriani did you reindex too? 
[19:28:18] paladox: we determined it was likely not necessary for what we changed [19:28:29] oh [19:28:31] ok [19:33:15] (PS1) Andrew Bogott: nfs-exportd: add exports for neutron IPs [puppet] - https://gerrit.wikimedia.org/r/453192 (https://phabricator.wikimedia.org/T202088) [19:33:52] (CR) jerkins-bot: [V: -1] nfs-exportd: add exports for neutron IPs [puppet] - https://gerrit.wikimedia.org/r/453192 (https://phabricator.wikimedia.org/T202088) (owner: Andrew Bogott) [19:34:41] (PS2) Andrew Bogott: nfs-exportd: add exports for neutron IPs [puppet] - https://gerrit.wikimedia.org/r/453192 (https://phabricator.wikimedia.org/T202088) [19:36:09] (PS3) Andrew Bogott: nfs-exportd: add exports for neutron IPs [puppet] - https://gerrit.wikimedia.org/r/453192 (https://phabricator.wikimedia.org/T202088) [19:37:22] (CR) Andrew Bogott: [C: 2] nfs-exportd: add exports for neutron IPs [puppet] - https://gerrit.wikimedia.org/r/453192 (https://phabricator.wikimedia.org/T202088) (owner: Andrew Bogott) [19:42:54] (CR) Dzahn: [V: 1 C: 2] "was once used for performance.wm website, that (and nothing else i can see) is using it anymore, reduces apache module, which is what we w" [puppet] - https://gerrit.wikimedia.org/r/452687 (owner: Krinkle) [19:43:08] (PS2) Dzahn: apache: Remove unused apache::static_site type [puppet] - https://gerrit.wikimedia.org/r/452687 (owner: Krinkle) [19:44:28] (PS4) Volans: Add confctl module to interact with conftool [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) [19:45:12] (CR) Volans: "removed logging when raising exceptions, addressed 1 comment, 1 pending, see inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [19:45:14] (CR) jerkins-bot: [V: -1] Add confctl module to interact with conftool [software/spicerack] -
10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [19:54:22] !log phab1002 closing idle root screen that was used for rsyncing repos [19:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:17] (03PS2) 10Dzahn: memcached: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450319 [19:56:00] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/12118/mc1020.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/450319 (owner: 10Dzahn) [20:07:54] 10Operations, 10Puppet, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: convert cloud VPS projects from apache to httpd module (wikidata-query/ldfclient) - https://phabricator.wikimedia.org/T202092 (10Dzahn) [20:08:16] 10Operations, 10Puppet, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: convert cloud VPS projects from apache to httpd module (wikidata-query/ldfclient) - https://phabricator.wikimedia.org/T202092 (10Dzahn) p:05Triage>03Low [20:11:31] 10Operations, 10Puppet, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: convert cloud VPS projects from apache to httpd module (wikidata-query/ldfclient) - https://phabricator.wikimedia.org/T202092 (10Dzahn) related: convert the "role(simplelamp)" which is used by more things: https://gerr... [20:16:41] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) [20:18:01] 10Operations, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10Dzahn) checked that "ops" LDAP group was also done. Icinga and mail we will do Monday, talked about it on Service Ops meeting for the pwstore part we will need a GPG key from you @jijiki but it has time [20:19:06] @seen thcipriani [20:19:06] mutante: thcipriani is in here, right now [20:19:17] o/ [20:19:44] heh:) hi Tyler. 
i would like to schedule a reboot of gerrit servers [20:19:51] though gerrit2001 i would jfdi [20:23:12] (03CR) 10Paladox: "I doin't know if we want to do this or keep it in puppet? Seeing as these files are basically deprecated except that they may be kept for " [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890 (owner: 10Paladox) [20:26:43] !log gerrit2001 - scheduled downtime, rebooting [20:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:10] 10Operations: `sql centralauth` is broken on mwmaint1001 - https://phabricator.wikimedia.org/T202096 (10Legoktm) [20:33:13] !log releases1001/2001: upgrading apache2 packages [20:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:08] !log releases2001: upgrading openjdk, systemd, jenkins [20:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:25] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@76dddd2]: point transfer_to_es at spark 2.x [20:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:40] 10Operations, 10DBA: `sql centralauth` is broken on mwmaint1001 - https://phabricator.wikimedia.org/T202096 (10Reedy) [20:47:59] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@76dddd2]: point transfer_to_es at spark 2.x (duration: 01m 33s) [20:48:00] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@76dddd2]: point transfer_to_es at spark 2.x [20:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:13] 10Operations, 10DBA: `sql centralauth` is broken on mwmaint1001 - https://phabricator.wikimedia.org/T202096 (10Reedy) Looks just broken for you? 
``` reedy@mwmaint1001:~$ sql centralauth Reading table information for completion of table and column names You can turn off this feature to get a quicker startup wi... [20:48:46] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@76dddd2]: point transfer_to_es at spark 2.x (duration: 00m 46s) [20:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:31] !log releases2001 - rebooting [20:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:21] 10Operations, 10DBA: `sql centralauth` is broken on mwmaint1001 - https://phabricator.wikimedia.org/T202096 (10Legoktm) :'( [20:53:49] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@76dddd2]: (no justification provided) [20:53:52] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@76dddd2]: (no justification provided) (duration: 00m 02s) [20:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:19] 10Operations, 10DBA: `sql centralauth` is broken on mwmaint1001 - https://phabricator.wikimedia.org/T202096 (10Legoktm) 05Open>03Invalid I had an old ~/.my.cnf that was apparently getting in the way. 
My bad :( [21:00:10] !log releases1001 - installing package upgrade like on releases2001 before, scheduling downtime, reboot [21:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:12] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@76dddd2]: debug git-fat initialization fail [21:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:18] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@76dddd2]: debug git-fat initialization fail (duration: 00m 06s) [21:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:00] ebernhardson: you might need --force, if scap thinks a revision is already deployed (i.e. it sees the directory /srv/deployment/[repo]-cache/revs/[76dddd2]) it's going to assume it's deployed already and not try again [21:04:28] thcipriani: ahha, yea it went awfully fast and didn't run promote :) [21:04:45] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@76dddd2]: debug git-fat initialization fail [21:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:01] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@76dddd2]: debug git-fat initialization fail (duration: 00m 16s) [21:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:36] thcipriani: i'll file a ticket, but from what i can tell there might be a race between git-fat init and the promote script? [21:06:01] or git-fat pull i suppose [21:06:14] hrm, I think git fat pull should block [21:06:26] but if you file a task I can dig deeper on it [21:06:55] sure, it could have been any number of errors. The symptom is pip failing to install a .whl because it's not a zip file. 
I added a 5 second pause and it works, but who knows what happened [21:07:33] IIRC: git fat init/pull should run as part of fetch, then fetch scripts run, then promote stage/symlink swap, then promote scripts run [21:07:52] that is, I think git fat stuff happens as part of a different stage [21:08:00] hmm, ok then that's certainly odd [21:08:50] !log contint2001 - installing jenkins upgrade [21:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:25] RECOVERY - Memory correctable errors -EDAC- on wtp2011 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [21:12:07] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202 (10Jdforrester-WMF) [21:12:10] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (10Jdforrester-WMF) 05Open>03Resolved a:03Vgutierrez Please re-open if I'm wrong. [21:12:22] 10Operations, 10Traffic, 10Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181 (10Jdforrester-WMF) [21:12:24] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202 (10Jdforrester-WMF) 05Open>03Resolved a:03Vgutierrez Please re-open if I'm wrong. [21:13:08] 10Operations, 10Traffic: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181 (10Jdforrester-WMF) I believe that the planning and execution of the work is all now complete? 
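Note on the git-fat symptom discussed above (pip failing to install a .whl "because it's not a zip file"): a wheel is a zip archive, so one way to confirm that some wheels were not fully fetched before the promote scripts ran is to check each file before calling pip. This is a minimal sketch and not part of scap or git-fat. The function name `find_bad_wheels` and the directory argument are hypothetical and do not appear in the log.

```python
import zipfile
from pathlib import Path


def find_bad_wheels(directory):
    """Return the .whl files under `directory` that are not valid zip archives.

    Hypothetical helper for illustration only: a wheel that fails this
    check would also fail to install with pip.
    """
    return sorted(
        path
        for path in Path(directory).rglob("*.whl")
        if not zipfile.is_zipfile(str(path))
    )
```

A promote-stage script could call this helper and abort with a clear message when the returned list is non-empty, instead of sleeping for a fixed number of seconds and hoping the files have arrived.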
[21:17:48] (03CR) 10Gehel: [C: 031] "Good enough, minor comments inline, but feel free to merge as-is" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451254 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [21:18:14] (03Abandoned) 10Bstorm: labstore: set up an icinga plugin to check cron exit codes [puppet] - 10https://gerrit.wikimedia.org/r/451181 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [21:21:24] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030 (10RobH) Link-level type: Flexible-Ethernet, MTU: 9192, MRU: 9200, Speed: 40Gbps, BPDU Error: None, Loop Detect PDU Error: None, Loopback: Disabled, Source filtering: Disabled, Flow contr... [21:23:18] !log contint1001 - installing jenkins upgrade [21:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:55] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@5731563]: point oozie sharelib at spark2.3.1 [21:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:13] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@5731563]: point oozie sharelib at spark2.3.1 (duration: 00m 18s) [21:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:42] jouncebot: now [21:30:42] No deployments scheduled for the next 1 hour(s) and 29 minute(s) [21:30:45] (03PS1) 10Dzahn: releases/mediawiki: proper Icinga monitoring for both Apache vhosts [puppet] - 10https://gerrit.wikimedia.org/r/453267 [21:32:38] (03CR) 10Paladox: releases/mediawiki: proper Icinga monitoring for both Apache vhosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/453267 (owner: 10Dzahn) [21:34:00] hello, we have to reboot the gerrit prod server. sorry for logging you out soon. 
[21:34:28] you can stop me if you are currently doing that one-hour inline edit mega patchset [21:35:05] quick question: the second interface admin config change (removing the ability to edit MW space from sysops) will go in the European SWAT on the 27th, right? [21:36:11] (03CR) 10Dzahn: releases/mediawiki: proper Icinga monitoring for both Apache vhosts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/453267 (owner: 10Dzahn) [21:38:30] (03PS2) 10Dzahn: releases/mediawiki: proper Icinga monitoring for both Apache vhosts [puppet] - 10https://gerrit.wikimedia.org/r/453267 [21:39:33] enterprisey: this? [config] 453179 Allow 'crats to remove IA right in pl.ws ? [21:39:43] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@5731563]: point oozie sharelib at spark2.3.0 [21:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:49] that's the only thing i see scheduled on that day in the calendar [21:39:57] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@5731563]: point oozie sharelib at spark2.3.0 (duration: 00m 14s) [21:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:31] mutante: something related ot https://phabricator.wikimedia.org/T190015 [21:40:53] I don't think the patch has even appeared in the thread yet, so this may be a silly time to ask the question [21:41:05] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@5731563]: point oozie sharelib at spark2.3.0 [21:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:15] enterprisey: i think you'll have to get it on https://wikitech.wikimedia.org/wiki/Deployments#Monday,_August_27 or it might be no [21:41:26] I see [21:41:31] it's all about the calendar page [21:41:36] that's also what jouncebot reads [21:41:51] yeah I'm not involved in the dev work for this patch, enwp is just very interested in our deadline to start 
granting the perm to people [21:42:25] i see.. yea, but you can stil ask for it to be deployed by somebody or in SWAT depending on the nature of the change [21:42:33] alright solid thanks [21:42:38] quite welcome [21:44:47] !log rebooting cobalt (gerrit.wikimedia.org) for kernel upgrade [21:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:48] Ah, right, that's why gerrit's dead. :-) [21:46:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@5731563]: point oozie sharelib at spark2.3.0 (duration: 04m 56s) [21:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:42] James_F: yes, i made a mini announcement on -dev and here. coming back shortly [21:47:34] server is back.. service should start in a few [21:48:30] James_F: back [21:48:36] Thanks! [21:50:15] did Jenkins go down or is still processing the queue? [21:50:46] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@5731563]: try again after git-fat init fail [21:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:04] davidwbarratt: it had to be restarted but is separate from gerrit, it happened earlier [21:51:04] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@5731563]: try again after git-fat init fail (duration: 00m 18s) [21:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:16] mutante oh ok [21:53:43] davidwbarratt: are you asking because zuul looks backed up? Or something else? 
[21:54:08] thcipriani oh no, it was just taking awhile, but jenkinsbot finally got back to me [21:54:18] ah, cool [21:55:35] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting Access to view EventLogging data - https://phabricator.wikimedia.org/T202072 (10Addshore) [21:55:54] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202069 (10Addshore) [21:55:57] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@5731563]: try again after git-fat init fail [21:55:59] !log ebernhardson@deploy1001 deploy aborted: try again after git-fat init fail (duration: 00m 01s) [21:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:01] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data - https://phabricator.wikimedia.org/T202063 (10Addshore) [21:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:35] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@651904b]: use spark 2.3.0, oozie still doesnt like 2.3.1 [21:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:51] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@651904b]: use spark 2.3.0, oozie still doesnt like 2.3.1 (duration: 00m 16s) [21:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [22:09:54] (03PS6) 10Ayounsi: [WIP] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [22:09:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [22:20:36] (03CR) 10Ayounsi: 
"https://puppet-compiler.wmflabs.org/compiler02/12120/dns1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [22:24:25] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@e70f9d5]: dont override the spark sharelib [22:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:36] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@e70f9d5]: dont override the spark sharelib (duration: 00m 11s) [22:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:48] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@e70f9d5]: retry git fat init fail [22:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@e70f9d5]: retry git fat init fail (duration: 00m 14s) [22:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:38] 10Operations, 10Scap: Intermittent git-fat failure during deploy - https://phabricator.wikimedia.org/T202100 (10thcipriani) It is that same problem! The current version of git-fat doesn't have my commit in it: https://github.com/wikimedia/operations-debs-git-fat/commit/0e3abb0c5e8b1e4d81470397ec17138c6d24d9e8... [22:32:01] !log re-activating BGP sessions between cr1/2-ulsfo and the office's router2 [22:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:30] (03CR) 10Dzahn: "leaving inline comments for stuff found out while debugging" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [22:56:14] RECOVERY - Long running screen/tmux on phab1002 is OK: OK: No SCREEN or tmux processes detected. 
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180816T2300). [23:00:05] Jdlrobson and nray: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:02:23] \o [23:02:32] \o [23:03:35] mark: twentyafterfour RoanKattouw thcipriani Niharika any of you able to swat right now? [23:03:48] I can do it [23:04:48] oh man, mistaken for a spammer! [23:05:06] yeah looks like @jdlrobson got booted for spam? Surely a mistake [23:05:13] all it takes is mentioning 5 nicks I guess? [23:05:25] Sigyn is a bot [23:06:45] Can we bring him back? [23:06:59] He says he got a message saying "please email this person if you think this is a mistaek" [23:07:19] Thankfully he and I are in the same office :) [23:07:33] haha [23:23:26] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10Dzahn) a:05Aleksey_WMDE>03Dzahn [23:25:59] i can tell #freenode it was a mistake [23:26:48] he should also auth with freenode [23:26:52] if he gets a wikimedia cloak or a wikipedia cloak the bot will not kick him. [23:28:07] 10Operations, 10Analytics, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Dzahn) a:03Dzahn [23:30:14] <@Unit193> mutante: Might want to have him be more careful around Sigyn. 
[23:30:17] :p [23:34:39] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.16/skins/MinervaNeue/resources/skins.minerva.content.styles.images/magnifying-glass.svg: Correct MinervaNeue search icon (T199000) (duration: 00m 51s) [23:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:47] T199000: Remove redundant/non-critical styling rules in Minerva - https://phabricator.wikimedia.org/T199000 [23:35:44] RoanKattouw: fixed it for Jon 19:34 <@Unit193> mutante: I've removed it at this point. [23:48:21] thanks ... (looks around before [23:48:37] thanks! (looks around before daring to @) @mutante [23:50:17] jdlrobson you should register your nick and login :) [23:51:05] paladox: i have... [23:51:08] that's what makes it so strange [23:51:10] oh [23:51:15] unless something broke? [23:52:33] we don't see you with your cloak at least [23:52:38] so maybe something broke, yea [23:52:42] hmm [23:53:16] NickServ says I'm identified.. unless you are talking about something else? [23:54:44] i can confirm that. nickserv says you are logged in (status: 3) [23:54:59] we meant (additionally) that you have the /wikimedia as address [23:56:24] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [23:57:07] it might help even more, i dont know about the rules of that bot of course [23:58:34] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.