[00:09:21] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:39:24] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) [00:39:26] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10BBlack) 05Open>03Resolved Seems to be working fine now, thanks! [00:40:40] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) [00:41:02] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962 (10BBlack) [00:41:04] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install cp1075-cp1090 - https://phabricator.wikimedia.org/T195923 (10BBlack) 05Open>03Resolved These are fully in-service. Will file separate ticket(s) about decomming various older cp10xx machines. [01:47:23] (03CR) 10Dzahn: [C: 032] "same thing as https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450317/ where i already got +1 from ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/450318 (owner: 10Dzahn) [01:47:39] (03PS2) 10Dzahn: eventlogging/kafka::analytics: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450318 [01:53:23] (03CR) 10Dzahn: [C: 032] "nothing on kafka1023/kafka1002, eventlog1002.. 
as expected" [puppet] - 10https://gerrit.wikimedia.org/r/450318 (owner: 10Dzahn) [01:59:24] (03PS1) 10Dzahn: labs: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/451818 [01:59:26] (03PS1) 10Dzahn: restbase: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/451819 [02:00:32] (03PS1) 10Dzahn: mail::mx: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/451820 [02:07:21] (03PS2) 10Prtksxna: Remove obsolete $wgPopupsBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450906 [02:07:23] (03PS5) 10Prtksxna: Remove obsolete $wgPopupsBetaFeature from InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444574 [02:18:47] (03PS1) 10Dzahn: puppetmaster: convert from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/451821 [02:21:00] (03CR) 10Dzahn: [C: 04-1] puppetmaster: convert from apache to httpd module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451821 (owner: 10Dzahn) [03:10:51] PROBLEM - Disk space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 57307 MB (3% inode=99%) [03:25:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 840.01 seconds [03:33:13] ACKNOWLEDGEMENT - Disk space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 53805 MB (3% inode=99%): eevans Investigating [03:41:12] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 168.35 seconds [03:45:45] (03PS1) 10Tulsi Bhagat: Enable Rollbacker User Group at ru.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451823 [03:56:44] (03PS2) 10Tulsi Bhagat: Enable Rollbacker User Group at ru.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) [04:03:27] (03PS3) 10Tulsi Bhagat: Enable Rollbacker User Group at ru.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451823 
(https://phabricator.wikimedia.org/T200201) [05:06:03] (03PS6) 10Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392) [05:51:18] (03CR) 10Jayprakash12345: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451823 (https://phabricator.wikimedia.org/T200201) (owner: 10Tulsi Bhagat) [06:03:05] !log test [06:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:15] yikes. [06:05:12] I hope that this only worked because of the cloak identification. I hope this can't be done anonymously. I actually didn't expect this to do anything for me. :) log entry undone [06:05:19] didn't think about my cloak [06:09:48] !log This time, it should not work [06:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:57] :< [06:10:47] I *think* this is a potential issue. [06:13:25] NickServ's protection won't work: This channel can be joined by any "registered" user on freenode, who can rename after joining, to any nick of their choice. They would not even lose their "registered" status on the IRCd. [06:13:59] if the target is not connected 24/7, messages can be sent in their name before NickServ's 30-second renaming timer kicks in. [06:14:03] * ToBeFree shrugs [06:14:14] probably not too dangerous I guess. [06:31:22] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:32:51] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:53:33] (03PS1) 10Ema: caches: set numa_networking by default [puppet] - 10https://gerrit.wikimedia.org/r/451826 (https://phabricator.wikimedia.org/T193865) [06:56:22] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:03:19] !log installing mutt security updates [07:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:42] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10ema) [07:16:45] (03PS1) 10Muehlenhoff: Also enable microcode for ATS hosts [puppet] - 10https://gerrit.wikimedia.org/r/451828 [07:17:37] (03CR) 10Ema: [C: 031] Also enable microcode for ATS hosts [puppet] - 10https://gerrit.wikimedia.org/r/451828 (owner: 10Muehlenhoff) [07:23:02] (03PS10) 10Volans: Add cookbook entry point script [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) [07:23:59] (03CR) 10Muehlenhoff: [C: 032] Also enable microcode for ATS hosts [puppet] - 10https://gerrit.wikimedia.org/r/451828 (owner: 10Muehlenhoff) [07:43:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:43:33] (03PS2) 10Ema: ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) [07:44:31] RECOVERY - Disk space on restbase1016 is OK: DISK OK [07:45:02] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:46:22] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3040_v4, cp3040_v6 [07:46:22] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3040_v4, cp3040_v6 [07:46:22] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3040_v4, cp3040_v6 [07:46:43] that's me, all good ^ [07:54:32] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100% [07:55:38] mmh, kernel issues on cp3040 apparently (depooled) [07:57:21] !log powercycle cp3040, kernel crash [07:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:21] RECOVERY - Host cp3040 is UP: PING OK - Packet loss = 0%, RTA = 83.83 ms [08:00:32] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 52 ESP OK [08:00:41] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [08:00:41] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 52 ESP OK [08:03:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [08:03:59] !log upgrade and restart db2094 [08:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:44] (03PS3) 10Ema: ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) [08:11:31] (03CR) 10jerkins-bot: [V: 04-1] ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [08:12:44] (03PS4) 10Ema: ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) [08:13:59] (03PS1) 10Gehel: wmf-auto-reimage: fix type mismatch when no puppet certs for host [puppet] - 
10https://gerrit.wikimedia.org/r/451829 [08:15:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [08:19:03] 10Operations, 10Traffic: cp3040: kernel crash in ipsec code shortly after reboot - https://phabricator.wikimedia.org/T201666 (10ema) [08:19:41] 10Operations, 10Traffic: cp3040: kernel crash in ipsec code shortly after reboot - https://phabricator.wikimedia.org/T201666 (10ema) p:05Triage>03Normal [08:21:42] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/12039/cp2009.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [08:27:38] (03PS1) 10Muehlenhoff: Enable microcode for WMCS puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/451830 (https://phabricator.wikimedia.org/T127825) [08:28:09] !log rebuild python-pykube for stretch-wikimedia and add it to apt.wikimedia.org T200660 [08:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:16] T200660: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 [08:28:39] 10Operations, 10Packaging, 10Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (10aborrero) a:03aborrero [08:29:13] !log upgrade and restart db2095 [08:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:35] 10Operations, 10Packaging, 10Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (10aborrero) It should be done. Please, let me know any issue. For the record (specially myself in the future) I used these commands: ``` (in my laptop): sbuild -d stretch-wikim... 
[08:36:10] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for WMCS puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/451830 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [08:42:11] (03CR) 10Volans: [C: 031] "LGTM, thanks for fixing this edge case bug" [puppet] - 10https://gerrit.wikimedia.org/r/451829 (owner: 10Gehel) [08:42:56] (03PS2) 10Gehel: wmf-auto-reimage: fix type mismatch when no puppet certs for host [puppet] - 10https://gerrit.wikimedia.org/r/451829 [08:43:48] (03CR) 10Gehel: [C: 032] wmf-auto-reimage: fix type mismatch when no puppet certs for host [puppet] - 10https://gerrit.wikimedia.org/r/451829 (owner: 10Gehel) [08:52:52] (03PS5) 10Ema: ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) [08:53:05] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Jalexander) [08:54:40] (03CR) 10Vgutierrez: [C: 031] "nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [08:56:09] (03PS6) 10Ema: ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) [08:57:03] (03CR) 10Ema: [C: 032] ATS: allow to specify outbound TLS connection settings [puppet] - 10https://gerrit.wikimedia.org/r/451654 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [08:57:10] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) [08:58:18] !log rebooting labpuppetmaster1001 for kernel security update/microcode update [08:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:39] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10JanWMF) approved [09:00:28] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10JanWMF) approved [09:03:26] !log upgrade and restart db1124 [09:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:22] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [09:12:59] !log rebooting labpuppetmaster1002 for kernel security update/microcode update [09:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:33] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on 
sarin.codfw.wmnet for hosts: ``` ['elastic2017.codfw.wmnet'] ``` The log can... [09:17:28] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2017.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2017.codfw.wmnet... [09:18:35] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2017.codfw.wmnet'] ``` The log can... [09:19:15] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2017.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2017.codfw.wmnet... 
[09:19:34] !log rearmed keyholder on labpuppetmaster* [09:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:58] 10Operations, 10Operations-Software-Development: wmf-auto-reimage should retry on ipmi failures - https://phabricator.wikimedia.org/T201669 (10Gehel) [09:25:03] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia Community User Group Albania mailing list request - https://phabricator.wikimedia.org/T201670 (10Sidorela) [09:27:15] 10Operations, 10DC-Ops, 10Discovery-Search (Current work): Transient failures of IPMI commands to elastic2017 - https://phabricator.wikimedia.org/T201671 (10Gehel) [09:27:58] !log upgrade and restart db1125 [09:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:45] !log mobrovac@deploy1001 Started deploy [citoid/deploy@983d80c]: Remove the bibtex spec.yaml x-ample - T197242 [09:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:52] T197242: Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 [09:35:25] Morning [09:35:31] Having slow performance with image uploads [09:35:49] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@983d80c]: Remove the bibtex spec.yaml x-ample - T197242 (duration: 03m 04s) [09:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:00] Any upgrades in progress? 
[09:41:31] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Mvolz) [09:52:36] (03CR) 10Jcrespo: [C: 032] mariadb-package: Package MariaDB 10.1.35 for stretch [software] - 10https://gerrit.wikimedia.org/r/451280 (owner: 10Jcrespo) [09:53:15] (03PS7) 10Jcrespo: mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392) [09:55:48] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Start backing up s2-5 from the new eqiad backup hosts [puppet] - 10https://gerrit.wikimedia.org/r/450929 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [10:16:12] (03PS1) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [10:20:23] (03PS1) 10Muehlenhoff: Fix user name of Pats Pena in ldap user table [puppet] - 10https://gerrit.wikimedia.org/r/451839 (https://phabricator.wikimedia.org/T199557) [10:23:39] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) [10:52:55] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [10:53:39] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [10:53:48] (03PS1) 10Muehlenhoff: Enable microcode on gerrit servers [puppet] - 10https://gerrit.wikimedia.org/r/451842 [10:54:28] (03CR) 10Muehlenhoff: [C: 032] Enable microcode on gerrit servers [puppet] - 
10https://gerrit.wikimedia.org/r/451842 (owner: 10Muehlenhoff) [10:55:17] (03PS2) 10Muehlenhoff: Fix user name of Pats Pena in ldap user table [puppet] - 10https://gerrit.wikimedia.org/r/451839 (https://phabricator.wikimedia.org/T199557) [10:58:08] (03CR) 10Muehlenhoff: [C: 032] Fix user name of Pats Pena in ldap user table [puppet] - 10https://gerrit.wikimedia.org/r/451839 (https://phabricator.wikimedia.org/T199557) (owner: 10Muehlenhoff) [11:02:18] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) >>! In T197242#4492958, @Krenair wrote: > Hopefully this is the right place for my questions (... [11:02:49] (03CR) 10Arturo Borrero Gonzalez: "New catalog test:" [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [11:06:18] !log installing libxcursor security updates [11:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:08] !log installing ant security updates [11:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:22] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2069 due to crash" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451844 [11:25:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: Depool db2069 due to crash" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451844 (owner: 10Jcrespo) [11:26:12] (03Abandoned) 10Muehlenhoff: Enable microcode for a few more misc roles [puppet] - 10https://gerrit.wikimedia.org/r/451608 (owner: 10Muehlenhoff) [11:38:51] !log installing jansson security updates for trusty [11:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:14] !log installing busybox security updates [11:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:53] 
PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs] [11:49:06] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2069 due to crash" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451844 [11:51:08] (03CR) 10Gehel: Add cookbook entry point script (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:54:53] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [11:55:25] (03CR) 10Ema: [C: 031] Revert TTLs back to 600 for misc->text moves [dns] - 10https://gerrit.wikimedia.org/r/451695 (https://phabricator.wikimedia.org/T164609) (owner: 10BBlack) [11:59:22] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp2024 is CRITICAL: connect to address 10.192.48.28 and port 3126: Connection refused [11:59:22] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp2024 is CRITICAL: connect to address 10.192.48.28 and port 80: Connection refused [11:59:53] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp2024 is CRITICAL: connect to address 10.192.48.28 and port 3121: Connection refused [11:59:54] looking, that's one of my reboots ^ [12:00:13] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp3038 is CRITICAL: connect to address 10.20.0.173 and port 3125: Connection refused [12:00:13] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp2024 is CRITICAL: connect to address 10.192.48.28 and port 3125: Connection refused [12:00:14] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp2024 is CRITICAL: connect to address 10.192.48.28 and port 3122: Connection refused [12:00:32] mmh the icinga downtime part failed apparently, sorry for the noise [12:00:43] PROBLEM - Varnish HTTP upload-frontend - port 
3121 on cp3038 is CRITICAL: connect to address 10.20.0.173 and port 3121: Connection refused [12:00:43] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp3038 is CRITICAL: connect to address 10.20.0.173 and port 3124: Connection refused [12:00:43] PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp3038 is CRITICAL: connect to address 10.20.0.173 and port 3123: Connection refused [12:00:53] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp3038 is CRITICAL: connect to address 10.20.0.173 and port 3122: Connection refused [12:01:03] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp3038 is CRITICAL: connect to address 10.20.0.173 and port 80: Connection refused [12:01:38] (03CR) 10BBlack: [C: 031] caches: set numa_networking by default [puppet] - 10https://gerrit.wikimedia.org/r/451826 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [12:02:43] PROBLEM - puppet last run on cp3038 is CRITICAL: Return code of 255 is out of bounds [12:03:22] PROBLEM - Host cp2024 is DOWN: PING CRITICAL - Packet loss = 100% [12:04:13] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100% [12:04:42] RECOVERY - Host cp2024 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [12:05:23] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp2024 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.072 second response time [12:05:23] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp2024 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.072 second response time [12:05:32] RECOVERY - Host cp3038 is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [12:05:53] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp2024 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.072 second response time [12:06:12] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp3038 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time [12:06:13] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp2024 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 
0.073 second response time [12:06:13] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp2024 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.072 second response time [12:06:52] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp3038 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time [12:06:52] RECOVERY - Varnish HTTP upload-frontend - port 3123 on cp3038 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time [12:06:52] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp3038 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.168 second response time [12:07:02] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp3038 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time [12:07:12] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp3038 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.167 second response time [12:07:43] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:08:09] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2017.codfw.wmnet'] ``` The log can... [12:08:46] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2017.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2017.codfw.wmnet... 
[12:09:12] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2031.codfw.wmnet', 'elastic2032.cod... [12:11:13] PROBLEM - Host elastic2031 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:13] PROBLEM - Host elastic2032 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:45] ^ those 2 shoudl have been downtimed, checking [12:11:51] gehel: I think downtiming must be broken somehow [12:12:03] ema: so it seems... [12:12:31] gehel: it failed for cp3038/cp2024 too [12:12:42] I've had some issues try to reimage those 2 previously, so that might be a bad state just for them [12:12:56] Oh, if cp* have the same issue, then it is probably more generic [12:13:42] RECOVERY - Host elastic2031 is UP: PING OK - Packet loss = 0%, RTA = 36.35 ms [12:14:02] RECOVERY - Host elastic2032 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms [12:14:28] I also see warnings for all passive checks so maybe there's something wrong with icinga [12:15:41] https://phabricator.wikimedia.org/T196336 [12:15:52] PROBLEM - MD RAID on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:02] PROBLEM - Elasticsearch HTTPS on elastic2031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:16:13] PROBLEM - Check size of conntrack table on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:16:13] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:22] PROBLEM - dhclient process on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:22] PROBLEM - dhclient process on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:16:23] PROBLEM - DPKG on elastic2032 is CRITICAL: Return code of 255 is out of bounds 
[12:16:23] PROBLEM - Check systemd state on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:31] I just tried to downtime those checks through the icinga web ui, and it does not seem to work either [12:16:40] yeah I wouldn't trust reimager's downtimes, at least not for 2+ hosts in parallel [12:16:42] PROBLEM - Check size of conntrack table on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:42] PROBLEM - MD RAID on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:16:42] PROBLEM - Elasticsearch HTTPS on elastic2032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:16:42] PROBLEM - DPKG on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:43] PROBLEM - Disk space on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:43] PROBLEM - Disk space on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:16:47] gehel: yeah I think we just ran into T196336 again [12:16:48] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [12:16:52] PROBLEM - configured eth on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:16:52] PROBLEM - configured eth on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:16:52] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:16:58] even without the passive checks problems [12:17:03] PROBLEM - Check systemd state on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:17:20] reimager parallel -> downtime races with itself and fails, so I switched to using manual downtimes before doing parallel reimage runs [12:17:32] PROBLEM - puppet last run on elastic2032 is CRITICAL: Return code of 255 is out of bounds [12:17:52] PROBLEM - puppet last run on elastic2031 is CRITICAL: Return code of 255 is out of bounds [12:18:03] according to 
T196336, a restart of icinga **fixed** the issue last time, let's try [12:18:12] for some value of "fixed" [12:18:19] !log restarting icinga on einsteinium - T196336 [12:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:03] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:20:11] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:21:05] bblack: I've done a bunch of reimages this week. I've had a few other issues so far, but icinga completely failing to downtime is a first :/ [12:26:07] (03PS1) 10BBlack: sitemaps.wikimedia.org: new generic microsite [puppet] - 10https://gerrit.wikimedia.org/r/451847 [12:27:34] !log resetting management card on elastic2017 - T201671 [12:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:43] T201671: Transient failures of IPMI commands to elastic2017 - https://phabricator.wikimedia.org/T201671 [12:28:13] RECOVERY - Disk space on elastic2031 is OK: DISK OK [12:28:45] 10Operations, 10DC-Ops, 10Discovery-Search (Current work): Transient failures of IPMI commands to elastic2017 - https://phabricator.wikimedia.org/T201671 (10Gehel) Resetting the mgmt card might help, according to https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card Note: updat... 
[12:28:54] RECOVERY - Check size of conntrack table on elastic2031 is OK: OK: nf_conntrack is 0 % full [12:29:03] RECOVERY - MD RAID on elastic2031 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:29:23] RECOVERY - dhclient process on elastic2031 is OK: PROCS OK: 0 processes with command name dhclient [12:29:42] RECOVERY - dhclient process on elastic2032 is OK: PROCS OK: 0 processes with command name dhclient [12:29:42] RECOVERY - DPKG on elastic2032 is OK: All packages OK [12:30:03] RECOVERY - configured eth on elastic2032 is OK: OK - interfaces up [12:31:52] (03PS1) 10BBlack: cache::text: add sitemaps support [puppet] - 10https://gerrit.wikimedia.org/r/451848 [12:31:54] (03PS1) 10BBlack: wikimedia.org: add sitemaps [dns] - 10https://gerrit.wikimedia.org/r/451849 [12:32:03] (03PS2) 10Ema: caches: set numa_networking by default [puppet] - 10https://gerrit.wikimedia.org/r/451826 (https://phabricator.wikimedia.org/T193865) [12:32:33] RECOVERY - configured eth on elastic2031 is OK: OK - interfaces up [12:32:34] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: keystone: allow IPv6 connections from foreign services [puppet] - 10https://gerrit.wikimedia.org/r/451850 (https://phabricator.wikimedia.org/T201504) [12:32:43] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2017.codfw.wmnet'] ``` The log can... 
[12:32:53] (03CR) 10Ema: [C: 032] caches: set numa_networking by default [puppet] - 10https://gerrit.wikimedia.org/r/451826 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [12:33:07] (03PS2) 10BBlack: Revert TTLs back to 600 for misc->text moves [dns] - 10https://gerrit.wikimedia.org/r/451695 (https://phabricator.wikimedia.org/T164609) [12:33:09] (03PS2) 10BBlack: wikimedia.org: add sitemaps [dns] - 10https://gerrit.wikimedia.org/r/451849 [12:33:13] RECOVERY - Check size of conntrack table on elastic2032 is OK: OK: nf_conntrack is 0 % full [12:33:20] 10Operations, 10DC-Ops, 10Discovery-Search (Current work): Transient failures of IPMI commands to elastic2017 - https://phabricator.wikimedia.org/T201671 (10Gehel) 05Open>03Resolved a:03Gehel Looks like a reset of the mgmt card fixed the issue. [12:33:22] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2032 is OK: OK ferm input default policy is set [12:33:44] (03CR) 10BBlack: [C: 032] Revert TTLs back to 600 for misc->text moves [dns] - 10https://gerrit.wikimedia.org/r/451695 (https://phabricator.wikimedia.org/T164609) (owner: 10BBlack) [12:34:02] (03CR) 10BBlack: [C: 032] sitemaps.wikimedia.org: new generic microsite [puppet] - 10https://gerrit.wikimedia.org/r/451847 (owner: 10BBlack) [12:34:23] (03PS2) 10BBlack: sitemaps.wikimedia.org: new generic microsite [puppet] - 10https://gerrit.wikimedia.org/r/451847 [12:34:27] (03CR) 10BBlack: [V: 032 C: 032] sitemaps.wikimedia.org: new generic microsite [puppet] - 10https://gerrit.wikimedia.org/r/451847 (owner: 10BBlack) [12:34:47] (03PS2) 10BBlack: cache::text: add sitemaps support [puppet] - 10https://gerrit.wikimedia.org/r/451848 [12:35:03] RECOVERY - Disk space on elastic2032 is OK: DISK OK [12:35:06] (03CR) 10BBlack: [C: 032] cache::text: add sitemaps support [puppet] - 10https://gerrit.wikimedia.org/r/451848 (owner: 10BBlack) [12:35:27] cmon jerkins [12:36:04] 10Operations, 10Discovery-Search (Current work), 
10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2031.codfw.wmnet', 'elastic2032.codfw.wmnet'] ``` and were **ALL** successful. [12:37:04] bblack: I imagine jenkins being British, "you what mate" is probably a better way to start a fight with it [12:37:10] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10mark) >>! In T200297#4493122, @Halfak wrote: > I talked to @mark today. Here's what I understood from th... [12:37:32] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:37:53] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:38:06] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler is happy:" [puppet] - 10https://gerrit.wikimedia.org/r/451850 (https://phabricator.wikimedia.org/T201504) (owner: 10Arturo Borrero Gonzalez) [12:38:15] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: keystone: allow IPv6 connections from foreign services [puppet] - 10https://gerrit.wikimedia.org/r/451850 (https://phabricator.wikimedia.org/T201504) [12:38:33] RECOVERY - MD RAID on elastic2032 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:43:02] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational [12:43:22] (03PS5) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [12:44:23] RECOVERY - Elasticsearch HTTPS on elastic2031 is OK: SSL OK - Certificate elastic2031.codfw.wmnet valid until 2023-08-09 12:42:59 +0000 (expires in 1824 days) [12:45:22] RECOVERY - Check whether ferm is active by checking 
the default input chain on elastic2031 is OK: OK ferm input default policy is set [12:45:32] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational [12:46:53] RECOVERY - Elasticsearch HTTPS on elastic2032 is OK: SSL OK - Certificate elastic2032.codfw.wmnet valid until 2023-08-09 12:44:43 +0000 (expires in 1824 days) [12:47:12] RECOVERY - DPKG on elastic2031 is OK: All packages OK [12:50:49] (03CR) 10BBlack: [C: 032] wikimedia.org: add sitemaps [dns] - 10https://gerrit.wikimedia.org/r/451849 (owner: 10BBlack) [12:55:53] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2033.codfw.wmnet'] ``` The log can... [12:58:54] (03PS1) 10Muehlenhoff: Enable microcode for more roles [puppet] - 10https://gerrit.wikimedia.org/r/451853 [13:03:24] (03CR) 10Muehlenhoff: [C: 032] Enable microcode for more roles [puppet] - 10https://gerrit.wikimedia.org/r/451853 (owner: 10Muehlenhoff) [13:06:05] PROBLEM - Check Varnish expiry mailbox lag on cp1088 is CRITICAL: CRITICAL: expiry mailbox lag is 2178725 [13:06:57] (03PS6) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [13:13:16] (03PS7) 10Arturo Borrero Gonzalez: cloudvps: merge main/eqiad1 keystone services [puppet] - 10https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) [13:19:20] (03PS4) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 [13:22:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [13:22:47] 10Operations, 10Discovery-Search (Current work), 
10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2033.codfw.wmnet'] ``` and were **ALL** successful. [13:30:07] (03PS2) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [13:33:07] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2017.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2017.codfw.wmnet... [13:36:19] (03PS8) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [13:37:05] (03CR) 10jerkins-bot: [V: 04-1] labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [13:41:22] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2019.codfw.wmnet', 'elastic2020.cod... [13:41:37] (03PS9) 10Bstorm: labstore: Change backup cron to a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) [13:45:46] (03PS3) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [13:49:40] (03CR) 10Bstorm: "Now it seems to work. I'd love a review, though." 
[puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [13:50:42] (03PS1) 10BBlack: turn off backend_warming for eqiad caches [puppet] - 10https://gerrit.wikimedia.org/r/451856 [13:51:02] (03CR) 10BBlack: [V: 032 C: 032] turn off backend_warming for eqiad caches [puppet] - 10https://gerrit.wikimedia.org/r/451856 (owner: 10BBlack) [13:52:13] (03CR) 10Bstorm: "Compiler seems to show the right thing:" [puppet] - 10https://gerrit.wikimedia.org/r/451657 (https://phabricator.wikimedia.org/T171394) (owner: 10Bstorm) [14:01:15] (03PS1) 10Ottomata: Use Oozie REST API to update sharelib for spark2 instead of CLI [puppet] - 10https://gerrit.wikimedia.org/r/451857 (https://phabricator.wikimedia.org/T200732) [14:09:38] (03CR) 10Gehel: [C: 04-1] "See comments inline" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451538 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:10:09] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2020.codfw.wmnet', 'elastic2019.codfw.wmnet', 'elastic2021.codfw.wmnet'] ``` an... [14:10:51] (03CR) 10Ottomata: [C: 032] Use Oozie REST API to update sharelib for spark2 instead of CLI [puppet] - 10https://gerrit.wikimedia.org/r/451857 (https://phabricator.wikimedia.org/T200732) (owner: 10Ottomata) [14:13:42] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10BBlack) Status update: * `dumps.wikimedia.org` really didn't work out well. The... 
[14:18:37] (03PS11) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:19:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:20:10] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp2005 is CRITICAL: connect to address 10.192.0.126 and port 3128: Connection refused [14:20:39] (03PS1) 10Muehlenhoff: Enable microcode for LVS load balancers [puppet] - 10https://gerrit.wikimedia.org/r/451858 (https://phabricator.wikimedia.org/T127825) [14:20:56] cp2005 is me, host depooled ^ [14:23:10] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.072 second response time [14:24:28] (03CR) 10Volans: "replies inline" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:42:08] (03PS1) 10BBlack: Sitemap rewrite for itwiki, before VCL switching [puppet] - 10https://gerrit.wikimedia.org/r/451863 (https://phabricator.wikimedia.org/T199252) [14:44:23] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:44:24] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:44:33] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:44:34] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:44:37] oh you gotta be kidding me [14:44:44] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:44:44] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:44:44] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 
not-conn: cp5005_v4, cp5005_v6 [14:44:47] /o\ [14:44:53] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:44:54] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:45:03] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:45:04] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:45:05] that's still me, sorry for the spam [14:45:13] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:45:13] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:45:14] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:45:23] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:45:23] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6 [14:45:23] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6 [14:45:32] (03PS1) 10Ottomata: Use Kafka jumbo-eqiad cluster for eventlogging consumer mysql eventbus [puppet] - 10https://gerrit.wikimedia.org/r/451864 (https://phabricator.wikimedia.org/T201420) [14:46:02] the varnishes are expressing their displeasure with you, because they know you've been working on ATS patches [14:46:05] (03PS12) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:47:03] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @BBlack to confirm on the third bullet, the current itwiki map sho... 
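[Editor's note: every IPsec alert above reports the same two absent peers, which is how it was clear the problem was cp5005 itself rather than the many hosts alerting about it. A minimal sketch of that triage — the alert strings are copied from the log, but the parsing helper is hypothetical and not part of any WMF tooling:]

```python
import re

def not_conn_peers(alert: str) -> set:
    """Pull the peer list after 'not-conn:' out of a Strongswan check message."""
    m = re.search(r"not-conn:\s*(.+)$", alert)
    return {p.strip() for p in m.group(1).split(",")} if m else set()

alerts = [
    "Strongswan CRITICAL - ok: 62 not-conn: cp5005_v4, cp5005_v6",
    "Strongswan CRITICAL - ok: 66 not-conn: cp5005_v4, cp5005_v6",
]

# Peers missing from *every* reporting host implicate the remote end, not the
# reporters; stripping the _v4/_v6 suffix recovers the host name.
common = set.intersection(*(not_conn_peers(a) for a in alerts))
hosts = {p.rsplit("_", 1)[0] for p in common}
print(hosts)  # {'cp5005'}
```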
[14:47:06] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:47:34] (03CR) 10Imarlier: [C: 031] Sitemap rewrite for itwiki, before VCL switching [puppet] - 10https://gerrit.wikimedia.org/r/451863 (https://phabricator.wikimedia.org/T199252) (owner: 10BBlack) [14:48:48] !log powercycle cp5005, stuck rebooting [14:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:17] (03PS2) 10Ottomata: Use Kafka jumbo-eqiad cluster for eventlogging consumer mysql eventbus [puppet] - 10https://gerrit.wikimedia.org/r/451864 (https://phabricator.wikimedia.org/T201420) [14:49:38] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10BBlack) @Imarlier - Ok thanks! Before I push the rewrite buttons in https:/... 
[14:50:54] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12047/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/451864 (https://phabricator.wikimedia.org/T201420) (owner: 10Ottomata) [14:51:00] (03CR) 10Ottomata: [C: 032] Use Kafka jumbo-eqiad cluster for eventlogging consumer mysql eventbus [puppet] - 10https://gerrit.wikimedia.org/r/451864 (https://phabricator.wikimedia.org/T201420) (owner: 10Ottomata) [14:51:53] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 68 ESP OK [14:51:53] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [14:51:54] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [14:52:03] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 68 ESP OK [14:52:04] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [14:52:10] look who's back [14:52:13] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [14:52:14] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [14:52:14] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 68 ESP OK [14:52:24] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 68 ESP OK [14:52:24] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 68 ESP OK [14:52:24] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 68 ESP OK [14:52:25] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [14:52:33] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [14:52:34] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [14:52:43] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 68 ESP OK [14:52:44] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [14:52:53] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 68 ESP OK [14:56:00] (03CR) 10Gehel: Add cookbook entry point script (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:56:22] (03PS1) 10Vgutierrez: Replace acme_tiny with acme_requests [software/certcentral] - 
10https://gerrit.wikimedia.org/r/451866 [14:56:24] (03PS1) 10Vgutierrez: [WIP] CertCentral tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [14:57:23] (03CR) 10jerkins-bot: [V: 04-1] Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 (owner: 10Vgutierrez) [14:57:32] (03CR) 10jerkins-bot: [V: 04-1] [WIP] CertCentral tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [14:57:37] such a jerk [14:58:16] :D [14:58:22] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, 10WMDE-QWERTY-Sprint-2018-07-31: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10WMDE-Fisch) >>! In T199801#4492477, @MoritzMuehlenhoff wrote: >>>! In T199801#4471991, @... [15:00:30] (03CR) 10Volans: Add cookbook entry point script (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/450937 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:01:05] 10Operations, 10TCB-Team, 10wikidiff2, 10WMDE-QWERTY-Sprint-2018-07-17, 10WMDE-QWERTY-Sprint-2018-07-31: Update wikidiff2 library on the WMF production cluster to v1.7.2 - https://phabricator.wikimedia.org/T199801 (10MoritzMuehlenhoff) >>! In T199801#4494465, @WMDE-Fisch wrote: > No this bug is older so... 
[15:01:42] (03PS2) 10BBlack: Sitemap rewrite for itwiki, before VCL switching [puppet] - 10https://gerrit.wikimedia.org/r/451863 (https://phabricator.wikimedia.org/T199252) [15:04:14] (03PS2) 10Vgutierrez: Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 [15:04:16] (03PS2) 10Vgutierrez: [WIP] CertCentral tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [15:05:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP] CertCentral tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [15:05:41] (03PS3) 10Vgutierrez: Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866 [15:05:43] (03PS3) 10Vgutierrez: [WIP] CertCentral tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 [15:06:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] CertCentral tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451867 (owner: 10Vgutierrez) [15:08:42] (03PS13) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [15:09:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:19:15] vgutierrez, one thing I've been wondering about is certificate chains [15:19:28] certcentral doesn't export anything other than the public and private parts [15:19:40] i.e. it doesn't add /usr/local/share/ca-certificates/Lets_Encrypt_Authority_X3.crt in [15:19:55] right now I have puppet doing that [15:19:56] Krenair: acme_requests does it [15:20:13] in a way that does not please jenkins [15:20:22] hm [15:20:28] save(full_chain=True) will get you the certificate + the intermediate CA [15:21:10] should we be allowing hosts to download just the public part without the intermediate bit? 
[15:21:12] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Looks good to me: ``` imarlier@WMF2024 ~/dev/src/mediawiki-docker-... [15:21:56] not sure if maybe some software needs them in separate files? [15:22:20] right now the library lets you choose [15:22:32] the cert, or full chain [15:22:36] yeah [15:22:51] what it lacks is fetching the full chain minus the certificate itself [15:22:54] but it's easily implementable [15:23:25] but does certcentral need to get both public and intermediate and store in separate files? get both and call it the public part? get public and let the client figure out the intermediate? [15:25:11] oops [15:25:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:25:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:25:13] 503 [15:25:34] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:25:53] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:25:54] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:25:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:26:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:26:14] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:26:18] ema: ^^ [15:26:33] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:26:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:26:53] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:27:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:27:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:27:33] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:28:26] XioNoX: network issues? [15:28:36] what [15:29:29] XioNoX: 503 spike in all sites https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?orgId=1&panelId=2&fullscreen&from=1533914068428&to=1533914944122&var-site=All&var-cache_type=All&var-status_type=5 [15:29:46] "sites" as in DCs [15:30:07] yeah looking [15:30:16] thanks [15:30:20] Krenair: nginx expects the full-chain along with the certificate in one file https://nginx.org/en/docs/http/configuring_https_servers.html#chains - Apache adopted the same behavior in 2.4.8 deprecating SSLCertificateChainFile [15:30:24] it was a healthy 503 spike [15:30:54] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:30:54] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:30:56] Krenair: so I'd say that certcentral should deliver the full-chain [15:31:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:31:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:31:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:31:14] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:31:20] seems to have all been cp1089 [15:31:34] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:31:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:32:04] (as in, cp1089 was at the bottom of the x-cache on all the 503s during the spike) [15:32:08] it's asw2-d [15:32:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:32:16] didn't lose link at the host level though [15:32:40] Okay folks someone in en was saying they'd seen an error? [15:33:05] Was there a 'blip'? [15:33:22] yes [15:33:23] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-1h&to=now [15:33:36] no logs on asw2-d [15:33:47] ~300/sec rate of 503s for a couple of minutes there [15:34:33] I can't see anything that points to the network so far [15:35:04] vgutierrez, alright but are we causing problems for non-web server software? 
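[Editor's note: bblack pinpointed cp1089 above by noting it sat "at the bottom of the x-cache on all the 503s". That tally can be sketched as follows; the sample header values are invented, and the sketch assumes the backend-most cache appears first in the X-Cache header, which is what "bottom" refers to here:]

```python
from collections import Counter

def backend_most_host(x_cache: str) -> str:
    """First-listed host in an X-Cache header (backend-most, per the assumption above)."""
    return x_cache.split(",")[0].strip().split()[0]

def likely_culprit(x_cache_headers) -> str:
    """Most common backend-most host across a sample of 503 responses."""
    counts = Counter(backend_most_host(h) for h in x_cache_headers)
    return counts.most_common(1)[0][0]

# invented sample of X-Cache values taken from 503 responses during a spike
samples = [
    "cp1089 miss, cp1075 pass, cp4028 pass",
    "cp1089 miss, cp1083 pass, cp4027 miss",
    "cp1087 hit/2, cp1075 pass, cp4028 pass",
    "cp1089 miss, cp1075 pass, cp4026 pass",
]
print(likely_culprit(samples))  # cp1089
```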
[15:35:36] it might be simpler to just offer the cert itself and the intermediate chain as separate file offerings [15:35:45] and let the consumer decide to pull both and/or cat them together? [15:36:00] sure, no problem [15:36:01] or offer all 3 for maximum flexibility (both separate, and a catted variant) [15:36:19] just different names [15:36:53] foo.pem (just the cert) foo-chain.pem (just the intermediate) foo-chained.pem (both catted) [15:37:16] foo.public.pem, foo.chain.pem, and foo.fullchain.pem ? [15:37:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:37:24] just public, just intermediate, both combined [15:37:43] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:37:53] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:38:04] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:38:25] if it's just bikeshedding, our existing sslcert puppetization for manual certs has some convention that's similar [15:38:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:39:04] alright [15:39:13] vgutierrez, so let's do that ^ [15:39:21] Krenair: ack, I'll implement it on Monday :) [15:39:23] cool [15:40:19] yeah, our existing sslcert module uses the convention: foo.pem (cert) foo.chain.pem (intermediate(s)) foo.chained.pem (cert+intermediates) [15:40:29] probably I was randomly remembering that when I said something similar above [15:41:10] err, I guess it uses ".crt" rather than ".pem", but either way [15:41:47] yup, I'll check the existing codebase [15:42:49] (03PS5) 10Gehel: [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 [15:44:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:44:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:44:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:44:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:44:53] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin 
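[Editor's note: the sslcert naming convention bblack describes at [15:40:19] (with the `.crt` correction at [15:41:10]) amounts to three outputs derived from the same two PEM inputs. The sketch below illustrates that convention only; it is not certcentral's actual implementation, and the PEM bodies are placeholders:]

```python
import tempfile
from pathlib import Path

def write_cert_files(name: str, cert_pem: str, intermediates_pem: str, outdir: Path) -> None:
    """Write the three variants: <name>.crt (leaf only), <name>.chain.crt
    (intermediates only), <name>.chained.crt (leaf + intermediates, the
    single-file form nginx's ssl_certificate directive expects)."""
    outdir.mkdir(parents=True, exist_ok=True)
    (outdir / f"{name}.crt").write_text(cert_pem)
    (outdir / f"{name}.chain.crt").write_text(intermediates_pem)
    (outdir / f"{name}.chained.crt").write_text(cert_pem + intermediates_pem)

# placeholder PEM bodies, for illustration only
leaf = "-----BEGIN CERTIFICATE-----\n...leaf...\n-----END CERTIFICATE-----\n"
inter = "-----BEGIN CERTIFICATE-----\n...intermediate...\n-----END CERTIFICATE-----\n"

outdir = Path(tempfile.mkdtemp())
write_cert_files("foo", leaf, inter, outdir)
chained = (outdir / "foo.chained.crt").read_text()
print(chained == leaf + inter)  # True
```

[A consumer that wants only the leaf pulls `foo.crt`; one feeding nginx pulls `foo.chained.crt` — the "maximum flexibility" option discussed above.]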
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:45:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:45:14] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:45:14] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:45:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:45:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:45:34] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:45:37] (03CR) 10jerkins-bot: [V: 04-1] [WIP] extract reporting from BaseEventHandler [software/cumin] - 10https://gerrit.wikimedia.org/r/451080 (owner: 10Gehel) [15:45:54] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 
[15:46:03] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:46:14] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:49:03] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:49:31] 10Operations, 10ops-eqsin, 10Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (10RobH) Engineers (Wong Kee Heng & Kelvin Goh Keng Yew) from Unisys (sub-contracted by Dell for Pro support) will be onsite on Monday, August 13th between 1500 and 1700 Singapore lo... [15:49:44] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:49:44] bblack, ema, ^ is it the same issue as before? [15:49:55] 10Operations, 10Wikimedia-Mailing-lists: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10herron) a:03herron [15:50:00] XioNoX: it looks similar, yes [15:50:05] but different server [15:50:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:50:14] saw the chat on -traffic [15:50:24] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:50:24] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [15:50:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:50:29] the second server is on asw2-c [15:50:36] so it's probably not a particular switch stack failing [15:50:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:51:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:51:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:51:52] 10Operations, 10Wikimedia-Mailing-lists, 10User-herron: Growth Team Mailing List - https://phabricator.wikimedia.org/T201467 (10herron) @JTannerWMF am I understanding correctly that all email addresses listed in the description should be added as (secondary) list administrators? 
[15:53:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:54:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:57:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:57:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:02:53] !log reindexing Malay wikis on elastic@eqiad and elastic@codfw (T200204)
[16:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:01] T200204: Re-index Malay and Indonesian Wikis to use new unpacked analysis chain - https://phabricator.wikimedia.org/T200204
[16:03:24] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[16:04:44] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[16:05:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:05:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:05:34] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[16:08:22] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2022.codfw.wmnet', 'elastic2023.cod...
[16:12:25] !log reindexing Malay wikis on elastic@eqiad and elastic@codfw abandoned (T200204)
[16:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:31] T200204: Re-index Malay and Indonesian Wikis to use new unpacked analysis chain - https://phabricator.wikimedia.org/T200204
[16:15:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:21:14] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[16:21:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:21:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:21:56] (03PS4) 10Vgutierrez: Replace acme_tiny with acme_requests [software/certcentral] - 10https://gerrit.wikimedia.org/r/451866
[16:22:03] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[16:22:14] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[16:22:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:22:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:22:34] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:24:03] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[16:25:24] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[16:25:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:25:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:25:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:25:53] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi) p:05Triage>03High
[16:26:10] 10Operations, 10netops, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi)
[16:26:12] 10Operations, 10netops: Move servers off asw2-a-eqiad - https://phabricator.wikimedia.org/T201694 (10ayounsi)
[16:26:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:31:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[16:32:21] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[16:33:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[16:33:11] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[16:35:47] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2022.codfw.wmnet', 'elastic2023.codfw.wmnet', 'elastic2024.codfw.wmnet'] ``` an...
[16:36:29] !log depool cp1089, cp1085
[16:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:50] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1085.eqiad.wmnet
[16:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:57] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1089.eqiad.wmnet
[16:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:41:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[16:41:14] memcached, that's new
[16:43:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[16:55:20] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:56:11] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 78673 bytes in 0.111 second response time
[16:56:30] Error: 503, Backend fetch failed at Fri, 10 Aug 2018 16:56:15 GMT
[16:57:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:57:21] 503 503 503
[16:57:55] hoo, revi: working on it, thanks!
[16:58:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:58:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[16:58:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:58:20] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:58:31] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[16:58:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:58:41] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[16:59:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[16:59:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:00:10] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:00:21] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:00:41] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:01:01] I'm sure you already know, but yes, errors going on when editing pages on mediawiki
[17:01:49] ...and...now gone.
[17:04:19] !log restart varnish-be on cp1083
[17:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:41] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:05:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:06:11] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:06:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:06:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:06:21] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:06:30] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:06:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:12:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[17:13:50] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[17:14:00] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[17:14:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[17:14:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[17:16:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:18:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
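(Editor's note: most of the PROBLEM/RECOVERY noise above comes from graphite-based checks that alert on what share of recent datapoints sits above a threshold. A minimal sketch of that percentage logic, under the assumption that the check simply counts non-null samples over the limit; the real check_graphite script may differ in details, and the sample series below is made up:)

```python
def percent_over(datapoints, threshold):
    """Percentage of non-null datapoints above `threshold`, mirroring alert
    text like '40.00% of data above the critical threshold [1000.0]'.
    Assumed logic, not the actual check_graphite implementation."""
    valid = [v for v in datapoints if v is not None]
    over = sum(1 for v in valid if v > threshold)
    return 100.0 * over / len(valid)

# Ten one-minute samples of 5xx reqs/min; graphite may report gaps as None.
series = [120, 300, 1500, 2200, 800, None, 90, 1100, 250, 310]
print(f"CRITICAL: {percent_over(series, 1000.0):.2f}% of data "
      f"above the critical threshold [1000.0]")
```

With three of nine valid samples over 1000.0, this prints a 33.33% figure, matching the shape of the bot messages in the log.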
[17:19:39] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2034.codfw.wmnet', 'elastic2035.cod...
[17:21:12] o/ is there anyone that can help me with this ticket: https://phabricator.wikimedia.org/T186748
[17:25:32] debt: in what way?
[17:26:51] bblack: there seems to be several blockers that might / might not have been taken care of? but not entirely sure and wanted to check to see what is needed still to be done and by who
[17:29:07] bblack: it looks like the grafana dashboard was created in https://phabricator.wikimedia.org/T201158 (which was requested in T186748) and that *should* mean it's ready to go into production?
[17:29:08] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748
[17:30:50] debt: I have no idea really, but it seems there are a few open child tasks. Seems like all the most-likely people to know things on that ticket are probably gone for the week though. It's ~beer:30 or later in the EU.
[17:34:11] !log cp1085 + cp1089: restart varnish with wiped storage, repool
[17:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:35] PROBLEM - Elasticsearch HTTPS on elastic2036 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2036.codfw.wmnet
[17:46:37] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2035.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['elastic2035.codfw.wmnet...
[17:48:16] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:51:34] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10herron) p:05Triage>03Normal
[17:51:47] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10herron) p:05Triage>03Normal
[17:52:21] (03CR) 10BBlack: [C: 032] Sitemap rewrite for itwiki, before VCL switching [puppet] - 10https://gerrit.wikimedia.org/r/451863 (https://phabricator.wikimedia.org/T199252) (owner: 10BBlack)
[17:58:56] PROBLEM - Elasticsearch HTTPS on elastic2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[18:01:16] ebernhardson: are those expected?
[18:01:44] herron: they look related to the reimage gehel is currently doing
[18:02:23] herron: i don't think expected 2035 to fail reimage, but cluster is fine without it
[18:02:40] cool thanks
[18:03:51] herron: yep, that's me, those are depooled and should be good anyways from the cluster point of view
[18:04:41] (03PS1) 10BBlack: Sitemap rewrite for itwiki inside misc-frontend as well [puppet] - 10https://gerrit.wikimedia.org/r/451889 (https://phabricator.wikimedia.org/T199252)
[18:05:00] herron: thanks for reporting! That failure was unexpected!
[18:05:23] (03CR) 10BBlack: [C: 032] Sitemap rewrite for itwiki inside misc-frontend as well [puppet] - 10https://gerrit.wikimedia.org/r/451889 (https://phabricator.wikimedia.org/T199252) (owner: 10BBlack)
[18:05:26] hehe np, yeah I saw the failure in there and thought it might be unexpectedly down
[18:06:53] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on sarin.codfw.wmnet for hosts: ``` ['elastic2035.codfw.wmnet'] ``` The log can...
[18:08:52] RECOVERY - Elasticsearch HTTPS on elastic2036 is OK: SSL OK - Certificate elastic2036.codfw.wmnet valid until 2023-08-09 18:07:19 +0000 (expires in 1824 days)
[18:11:33] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10BBlack) Seems to be working now after the fixup above, will need to cleanup...
[18:20:04] thanks, bblack ...that was kind of the response I was expecting, I'll check back on Monday :)
[18:22:32] PROBLEM - Disk space on elastic1041 is CRITICAL: DISK CRITICAL - free space: /srv 73041 MB (10% inode=99%)
[18:24:32] Here we go again
[18:24:42] nlwiki down
[18:25:56] Request from 84.81.160.164 via cp1081 cp1081, Varnish XID 105480345
[18:25:56] Error: 503, Backend fetch failed at Fri, 10 Aug 2018 18:25:51 GMT
[18:26:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:26:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:26:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:27:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:27:28] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:27:28] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:27:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:27:38] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:27:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:27:57] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[18:28:08] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[18:28:28] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:29:18] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[18:29:36] 10Operations, 10Electron-PDFs, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815 (10Jdlrobson)
[18:31:14] any sysadmin online?
[18:31:24] ah, I see it's in progress
[18:31:38] nvm
[18:31:40] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2035.codfw.wmnet'] ``` and were **ALL** successful.
[18:31:47] RECOVERY - Disk space on elastic1041 is OK: DISK OK
[18:33:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:34:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:34:28] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:34:28] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:34:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:34:38] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:34:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:35:28] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[18:35:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:36:48] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational
[18:37:17] RECOVERY - Elasticsearch HTTPS on elastic2035 is OK: SSL OK - Certificate elastic2035.codfw.wmnet valid until 2023-08-09 18:36:00 +0000 (expires in 1824 days)
[18:40:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[18:41:17] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[18:41:33] cp1081 theory confirmed (:P e.ma)
[18:41:54] !log restart varnish-be on cp1081
[18:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:27] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[18:43:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[18:44:12] !log restart varnish-be on cp1075
[18:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:49] !log restart varnish-be on cp1079
[18:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:58] !log restart varnish-be on cp1087
[18:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:27] RECOVERY - Check Varnish expiry mailbox lag on cp1088 is OK: OK: expiry mailbox lag is 0
[19:01:51] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10herron) Looping in @nuria for review/approval of `analytics-privatedata-users` membership request
[19:02:10] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10herron) Looping in @nuria for review/approval of `analytics-privatedata-users` membership request
[19:26:40] 10Operations, 10Operations-Software-Development: wmf-auto-reimage should retry on ipmi failures - https://phabricator.wikimedia.org/T201669 (10herron) p:05Triage>03Normal
[19:27:13] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10herron) p:05Triage>03High
[19:28:46] 10Operations, 10TimedMediaHandler-Transcode: Increase job runners on video scalers to maximize load efficiency - https://phabricator.wikimedia.org/T201358 (10herron) p:05Triage>03Normal
[19:33:40] (03PS15) 10Bstorm: WIP toolforge: write a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557)
[19:34:27] (03CR)
10jerkins-bot: [V: 04-1] WIP toolforge: write a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:38:45] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 2 others: RFC: Modern Event Platform - Choose Schema Tech - https://phabricator.wikimedia.org/T198256 (10kchapman) This is being placed on Last Call closing August 22nd ending at 2pm PST(22:00 UTC, 23:00 CET) [19:47:30] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10kchapman) TechCom is scheduling a discussion of this RFC on 22 August at 2pm PST(21:00 UTC, 23:00 CET) in... [20:07:46] 10Operations, 10cloud-services-team, 10decommission, 10hardware-requests: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10RobH) p:05Triage>03Normal [20:13:59] (03PS1) 10RobH: decom labtestnet2001 prod dns [dns] - 10https://gerrit.wikimedia.org/r/451902 (https://phabricator.wikimedia.org/T201440) [20:15:20] (03PS1) 10RobH: decom labtestnet2001 [puppet] - 10https://gerrit.wikimedia.org/r/451903 (https://phabricator.wikimedia.org/T201440) [20:15:24] (03CR) 10RobH: [C: 032] decom labtestnet2001 prod dns [dns] - 10https://gerrit.wikimedia.org/r/451902 (https://phabricator.wikimedia.org/T201440) (owner: 10RobH) [20:16:23] (03CR) 10RobH: [C: 032] decom labtestnet2001 [puppet] - 10https://gerrit.wikimedia.org/r/451903 (https://phabricator.wikimedia.org/T201440) (owner: 10RobH) [20:18:26] 10Operations, 10ops-codfw, 10cloud-services-team, 10decommission: Decommission labtestnet2001.codfw.wmnet - https://phabricator.wikimedia.org/T201440 (10RobH) a:05RobH>03Papaul [20:43:43] greg-g/whomever: Is it OK for me to do an emergency deploy for an UploadWizard fatal with campaigns? 
T201708 [20:43:44] T201708: UploadWizard campaigns don't go beyond Release Rights phase - https://phabricator.wikimedia.org/T201708 [20:52:36] James_F: I'm not "official", but I would personally push a fatal fix. That one looks pretty safe too. [20:57:14] OK, I have the conch. (When CI comes back up from zombieland.) [20:59:13] James_F: in the meanwhile maybe https://gerrit.wikimedia.org/r/#/c/labs/tools/wikibugs2/+/452000/ ? [20:59:22] I have access, but never used 'fab' before [21:00:13] Hauskatze: I've not touched wikibugs_ "prod" for over a year, I'd probably break it. :-( [21:00:26] James_F: fair enough :) [21:00:41] I'll leave MvD handle it [21:00:55] Yeah. [21:07:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:07:46] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:10:25] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [21:10:35] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [21:12:11] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [21:12:35] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers 
thumbor1003.eqiad.wmnet, thumbor1001.eqiad.wmnet are marked down but pooled [21:13:01] does uploadwizard change impact thumbor? [21:14:09] it could potentially I guess, uploads -> some sort of thumb is rendered [21:14:19] <_joe_> load average: 110.91, 106.76, 67.50 on thumbor1001 [21:14:45] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [21:14:46] <_joe_> yes, revert whatever might be remotely related, then look at swift / cache_upload maybe [21:14:51] 1002 is bad too [21:15:02] <_joe_> bblack: I assume they are all over capacity [21:15:22] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.607 second response time [21:15:24] upload isn't showing a spike in reqs or 5xxs [21:15:33] <_joe_> (I'm still on vacations, btw ) [21:15:35] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:16:03] thumbor does seem to be legitimately maxing out doing thumbnailing jobs [21:17:03] <_joe_> https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?panelId=87&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=thumbor&var-instance=All [21:17:32] ah I had the wrong graph, cache_upload is showing some 5xx as well [21:17:40] 500s [21:18:08] not too huge a rate, though [21:18:13] curl is getting stuck mostly for new thumbs [21:18:19] e.g. new uploads [21:18:27] https://commons.wikimedia.org/wiki/Special:Contributions/Rodrigo.Argenton there's a lot but I don't know what's standard [21:18:47] awesome [21:18:53] but my experience may be anecdotal? [21:19:26] is there no knob to ratelimit how fast thumbor will work on new thumbs? [21:19:38] funnily, I get a 429? [21:19:43] with a single request [21:19:46] Hmm. The top line in fatalmonitor is blank? 
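The alert bot's "N% of data above the critical threshold" messages in this log come from a Graphite-backed check that counts how many recent datapoints exceed a threshold. A minimal sketch of that logic, with illustrative names and thresholds (not the actual check_graphite code):

```python
def check_5xx(samples, critical=1000.0, pct=30.0):
    """Return an Icinga-style status line: CRITICAL when more than
    `pct` percent of the recent datapoints exceed `critical`.
    Illustrative only; parameter names and defaults are assumptions."""
    if not samples:
        return "UNKNOWN: no data received"
    above = sum(1 for v in samples if v > critical)
    share = 100.0 * above / len(samples)
    if share > pct:
        return f"CRITICAL: {share:.2f}% of data above the critical threshold [{critical}]"
    return f"OK: Less than {pct:.2f}% above the threshold [{critical}]"
```

For example, two of five datapoints above 1000 yields the "40.00% of data above the critical threshold [1000.0]" shape seen in the alerts above.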
[21:19:52] <_joe_> bblack: don't think so [21:20:08] jynus: thumbor has its own 429 emissions, separate from varnish ratelimiting [21:20:16] I'm not sure what drives thumbor's 429s [21:20:27] ok [21:20:38] <_joe_> overload i guess [21:20:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:20:45] <_joe_> too many items in queue [21:21:04] so kill/restart or wait for analysis? [21:21:08] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) Confirmed that https://it.wikipedia.org/sitemap.xml is returning.... [21:21:19] https://wikitech.wikimedia.org/wiki/Thumbor#Throttling such as it is [21:21:21] <_joe_> apergos: you might be onto something [21:21:42] <_joe_> the user, I mean [21:21:45] Unlike Mediawiki, Thumbor doesn't implement a per-user Poolcounter throttle. unfortunately [21:21:46] <_joe_> the timing corresponds [21:21:51] where's our elastic autoprovisioning? we must handle this critical load or this user won't get all his photographs uploaded before he has to run to starbucks in a few [21:22:10] <_joe_> bblack: if thumbor was on kubernetes... [21:22:20] but seriously, that upload rate is insane [21:22:41] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [21:22:42] <_joe_> well, 30 images/minute is hardly insane [21:22:55] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:22:55] <_joe_> we could ask to block the user [21:23:02] <_joe_> revi: around by any chance? 
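The channel notes above that, unlike MediaWiki, Thumbor has no per-user PoolCounter throttle, only its own 429 emissions. A per-client token bucket is the usual shape of the missing knob; this is a sketch under that assumption, not Thumbor's actual code:

```python
import time

class TokenBucket:
    """Per-client token bucket: allow `rate` requests/second with bursts
    up to `burst`. Illustrative sketch of the per-user throttle the
    channel says Thumbor lacks; all names here are hypothetical."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.buckets = {}  # client id -> (tokens, last-seen timestamp)

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(client, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.buckets[client] = (tokens, now)
            return False  # caller should answer 429 Too Many Requests
        self.buckets[client] = (tokens - 1.0, now)
        return True
```

With `rate=1, burst=2`, a client gets two immediate requests, then one per second; everything past that would be answered with a 429.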
[21:23:19] <_joe_> it's hammering memcached too, I guess [21:23:41] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 1.089 second response time [21:24:02] <_joe_> apergos: can you search for a commons admin? [21:24:09] we had another mystery temporary spike of memcached issues earlier, without the thumbor/upload parts [21:24:13] may have been related [21:24:18] already on it [21:24:23] <_joe_> thanks :) [21:28:01] <_joe_> that user stopped uploading images a couple minutes ago [21:28:26] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers thumbor1004.eqiad.wmnet are marked down but pooled [21:28:34] I see you're already on it, if needed I'm around (sent from mobile) [21:28:55] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:29:17] <_joe_> the load is not going down [21:29:26] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:29:33] <_joe_> volans|off: we are far from understanding the root cause [21:30:15] <_joe_> it's alexawiki-bot [21:30:35] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [21:31:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:31:48] what's alexawiki-bot [21:32:05] is there some easy way to see what thumb requests are still in the queue (not yet processed by thumbor)? [21:32:05] bblack: The load on the Thumbor boxes. 
[21:32:06] <_joe_> bblack: some bot for amazon alexa [21:32:14] <_joe_> and it's creating the load on thumbor [21:32:19] how? [21:32:25] or maybe "why?" [21:32:32] <_joe_> requesting tons of thumbs at hi-res [21:32:49] should we block it on UA maybe? [21:32:56] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:32:59] <_joe_> bblack: yes [21:33:03] <_joe_> that was my proposal [21:33:11] <_joe_> in the other channel :P [21:33:35] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:33:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:35:05] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:35:11] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host [21:35:42] <_joe_> bblack: let's ban them [21:35:45] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:35:50] <_joe_> they're not listening to 429s either [21:35:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:36:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on 
einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:36:49] (03PS1) 10BBlack: block alexawikibot for now [puppet] - 10https://gerrit.wikimedia.org/r/452007 [21:37:16] (03CR) 10Giuseppe Lavagetto: [C: 031] block alexawikibot for now [puppet] - 10https://gerrit.wikimedia.org/r/452007 (owner: 10BBlack) [21:37:24] <_joe_> ops@wikimedia is moderated [21:37:31] security@ isn't. [21:37:31] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 12.532 second response time [21:37:32] <_joe_> but it's an ok point of contact [21:37:32] (03CR) 10BBlack: [V: 032 C: 032] block alexawikibot for now [puppet] - 10https://gerrit.wikimedia.org/r/452007 (owner: 10BBlack) [21:38:11] oh bleh [21:38:25] does "moderated" mean I won't see it and we have to wake up someone else to get to see it? [21:38:42] anyone with the list master password can see it right? and that's in pwstore [21:38:46] <_joe_> means you need to log into the mailing list interface [21:38:48] Only the list moderators will see it and will have to manually unblock. [21:38:50] ok [21:38:50] Or that. [21:38:51] <_joe_> as admin :) [21:39:07] security@ seems inappropriate messaging [21:39:11] well whatever [21:39:19] (03PS16) 10Bstorm: WIP toolforge: write a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) [21:39:25] <_joe_> yes, it's ok [21:39:46] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:39:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:40:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:40:05] it's pushed everywhere now [21:40:06] (03CR) 10jerkins-bot: [V: 04-1] WIP toolforge: write a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:40:30] aren't you on vacation? quit opening your laptop :P [21:41:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:41:45] (03PS17) 10Bstorm: WIP toolforge: write a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) [21:42:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:43:06] <_joe_> thumbor is back [21:43:16] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:43:32] <_joe_> bblack: I figured others with thumbor knowledge were on vacation too, and I kinda promised to show up in such cases :) [21:43:36] load good: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=thumbor&cluster=thumbor&orgId=1 [21:43:54] <_joe_> jynus: yes, the ginormous load was coming from that bot [21:43:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:44:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:44:57] <_joe_> https://grafana.wikimedia.org/dashboard/db/thumbor?panelId=9&fullscreen&orgId=1&from=1533933336139&to=1533937462068 [21:45:05] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:50:48] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.16/extensions/UploadWizard: T201708 UBN fix (e498c7d) (duration: 00m 56s) [21:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:56] T201708: UploadWizard campaigns don't go beyond Release Rights phase - https://phabricator.wikimedia.org/T201708 [21:51:46] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [21:51:55] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [21:54:43] (All done with the UW fix.) 
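The ban merged earlier ("block alexawikibot for now", https://gerrit.wikimedia.org/r/452007) was a Varnish VCL rule deployed via puppet. As an illustration of the same idea in Python rather than VCL, here is a minimal WSGI middleware that refuses requests by User-Agent substring; the match string and all names are assumptions, not the contents of the actual change:

```python
# Assumed match string, inferred from the commit subject; the real
# rule lives in Varnish VCL, not in application code.
BLOCKED_UA_SUBSTRINGS = ("alexawikibot",)

def ua_block_middleware(app):
    """Wrap a WSGI app and answer 403 for blocked User-Agents.
    Illustrative sketch only."""
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Blocked: abusive client\n"]
        return app(environ, start_response)
    return wrapped
```

Doing this at the cache edge instead (as the VCL change did) has the advantage of shedding the load before it ever reaches Thumbor, which matters when, as noted above, the client ignores 429s.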
[22:14:54] 10Operations, 10ops-eqiad, 10decommission, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420 (10RobH) a:03RobH [22:17:56] 10Operations, 10ops-eqiad, 10decommission, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420 (10RobH) [22:19:18] (03PS1) 10RobH: decom hafnium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/452014 (https://phabricator.wikimedia.org/T193420) [22:20:23] (03PS1) 10RobH: hafnium decom [puppet] - 10https://gerrit.wikimedia.org/r/452016 (https://phabricator.wikimedia.org/T193420) [22:20:32] (03CR) 10RobH: [C: 032] decom hafnium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/452014 (https://phabricator.wikimedia.org/T193420) (owner: 10RobH) [22:21:12] (03CR) 10RobH: [C: 032] hafnium decom [puppet] - 10https://gerrit.wikimedia.org/r/452016 (https://phabricator.wikimedia.org/T193420) (owner: 10RobH) [22:38:24] 10Operations, 10ops-eqiad, 10decommission, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420 (10RobH) [22:38:39] 10Operations, 10ops-eqiad, 10decommission, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420 (10RobH) a:05RobH>03Cmjohnson [23:36:34] Can someone tell me what version of librsvg we are now using on the scaling servers? (now that they've been migrated to Debian Stretch)