[01:34:38] (PS17) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[01:49:20] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.014 second response time
[01:50:20] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.040 second response time
[01:51:59] (PS18) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[01:52:58] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[01:53:40] PROBLEM - thumbor@8805 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8805 is inactive
[01:54:36] (PS19) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[01:58:06] (PS20) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[01:58:50] (PS21) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:06:24] (PS22) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:07:50] RECOVERY - thumbor@8805 service on thumbor1001 is OK: OK - thumbor@8805 is active
[02:08:06] (PS23) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:10:21] (PS24) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - 
https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:11:56] (PS25) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:14:56] (PS26) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:16:05] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 05m 12s)
[02:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:45] (PS27) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:20:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 31 02:20:21 UTC 2016 (duration 4m 16s)
[02:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:22] (PS28) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:26:21] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:31:48] (PS29) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:42:53] (PS30) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:53:15] (CR) Madhuvishy: [C: 1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[02:54:57] Operations: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543#2755618 (yuvipanda)
[02:55:51] (PS31) Madhuvishy: jupyterhub: 
Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (https://phabricator.wikimedia.org/T149543)
[02:55:58] (PS32) Yuvipanda: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (https://phabricator.wikimedia.org/T149543) (owner: Madhuvishy)
[02:56:12] (CR) Yuvipanda: [C: 2 V: 2] "Let's gooooooo!" [puppet] - https://gerrit.wikimedia.org/r/288086 (https://phabricator.wikimedia.org/T149543) (owner: Madhuvishy)
[02:56:35] yuvipanda: did you hit submit :D
[02:56:45] obligatory fuck you gerrit
[02:56:55] madhuvishy: I was trying to puppet merge and realized
[02:57:08] ha ha D:
[02:59:17] Operations, Patch-For-Review: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543#2755618 (ops-monitoring-bot) Script wmf_auto_reimage was launched by yuvipanda on neodymium.eqiad.wmnet for hosts: ``` ['notebook1001.eqiad.wmnet'] ``` The log can be found in `...
[02:59:39] !log start reimaging notebook1001 for T149543
[02:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:45] T149543: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543
[03:00:00] madhuvishy: nice, the script automatically sets downtime also
[03:00:15] yuvipanda: ohh i just did that anyway
[03:11:10] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[03:11:28] looking
[03:12:00] madhuvishy: thanks!
[03:13:20] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[03:15:07] looks like transient
[03:21:55] (PS1) Yuvipanda: paws_internal: Rename hiera file to match [puppet] - https://gerrit.wikimedia.org/r/318875
[03:21:57] (PS1) Yuvipanda: jupyterhub: Safer defaults for authenticator [puppet] - https://gerrit.wikimedia.org/r/318876 (https://phabricator.wikimedia.org/T149543)
[03:22:35] madhuvishy: ^
[03:23:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 725.36 seconds
[03:23:53] (PS2) Yuvipanda: paws_internal: Rename hiera file to match [puppet] - https://gerrit.wikimedia.org/r/318875
[03:23:59] (CR) Yuvipanda: [C: 2 V: 2] paws_internal: Rename hiera file to match [puppet] - https://gerrit.wikimedia.org/r/318875 (owner: Yuvipanda)
[03:24:10] (PS2) Yuvipanda: jupyterhub: Safer defaults for authenticator [puppet] - https://gerrit.wikimedia.org/r/318876 (https://phabricator.wikimedia.org/T149543)
[03:24:14] (CR) Yuvipanda: [C: 2 V: 2] jupyterhub: Safer defaults for authenticator [puppet] - https://gerrit.wikimedia.org/r/318876 (https://phabricator.wikimedia.org/T149543) (owner: Yuvipanda)
[03:25:01] Operations, Patch-For-Review: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543#2755650 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['notebook1001.eqiad.wmnet'] ``` and were **ALL** successful.
[03:35:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 279.45 seconds
[03:52:36] Operations, ChangeProp, MediaWiki-JobQueue, Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2751310 (Tgr) The job queue can discard duplicates. Not sure if there is any job that relies on that for correc...
[04:00:52] (PS1) Yuvipanda: jupyterhub: Specify proper class for config keys [puppet] - https://gerrit.wikimedia.org/r/318877
[04:01:33] (CR) jenkins-bot: [V: -1] jupyterhub: Specify proper class for config keys [puppet] - https://gerrit.wikimedia.org/r/318877 (owner: Yuvipanda)
[04:02:41] (PS2) Yuvipanda: jupyterhub: Specify proper class for config keys [puppet] - https://gerrit.wikimedia.org/r/318877
[04:04:27] (CR) Yuvipanda: [C: 2] jupyterhub: Specify proper class for config keys [puppet] - https://gerrit.wikimedia.org/r/318877 (owner: Yuvipanda)
[04:14:57] (PS1) Yuvipanda: jupyterhub: Add some hardening for the notebooks [puppet] - https://gerrit.wikimedia.org/r/318878 (https://phabricator.wikimedia.org/T149543)
[04:15:46] (CR) jenkins-bot: [V: -1] jupyterhub: Add some hardening for the notebooks [puppet] - https://gerrit.wikimedia.org/r/318878 (https://phabricator.wikimedia.org/T149543) (owner: Yuvipanda)
[04:15:58] (PS2) Yuvipanda: jupyterhub: Add some hardening for the notebooks [puppet] - https://gerrit.wikimedia.org/r/318878 (https://phabricator.wikimedia.org/T149543)
[04:16:10] (PS1) Madhuvishy: jupyterhub: Remove venv creation from deploy script [puppet] - https://gerrit.wikimedia.org/r/318879
[04:16:32] (PS2) Yuvipanda: jupyterhub: Remove venv creation from deploy script [puppet] - https://gerrit.wikimedia.org/r/318879 (owner: Madhuvishy)
[04:16:37] (CR) Yuvipanda: [C: 2 V: 2] jupyterhub: Remove venv creation from deploy script [puppet] - https://gerrit.wikimedia.org/r/318879 (owner: Madhuvishy)
[04:20:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:23:04] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[04:24:07] (PS3) Yuvipanda: jupyterhub: Add some hardening for the notebooks [puppet] - 
https://gerrit.wikimedia.org/r/318878 (https://phabricator.wikimedia.org/T149543)
[04:28:40] (CR) Alex Monk: [C: 1] icinga: remove ServerAlias with hardcoded hostname [puppet] - https://gerrit.wikimedia.org/r/318439 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
[04:36:04] (CR) Madhuvishy: [C: 2] jupyterhub: Add some hardening for the notebooks [puppet] - https://gerrit.wikimedia.org/r/318878 (https://phabricator.wikimedia.org/T149543) (owner: Yuvipanda)
[04:37:03] PROBLEM - thumbor@8809 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8809 is inactive
[04:44:43] RECOVERY - thumbor@8809 service on thumbor1002 is OK: OK - thumbor@8809 is active
[04:48:06] !log Upgraded systemd notebook1001 to 230-7~bpo8+2 from backports
[04:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:23] PROBLEM - thumbor@8820 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8820 is inactive
[05:04:44] !log Upgraded systemd on notebook1002 to 230-7~bpo8+2 from backports
[05:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:13] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1808.357432 Seconds
[05:06:13] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 46.674165 Seconds
[05:07:53] RECOVERY - thumbor@8820 service on thumbor1001 is OK: OK - thumbor@8820 is active
[05:29:02] (PS1) Yurik: LABS: Enable tabular remote access to tabular data [mediawiki-config] - https://gerrit.wikimedia.org/r/318883 (https://phabricator.wikimedia.org/T148745)
[05:31:03] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[05:35:13] PROBLEM - thumbor@8812 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8812 is inactive
[05:44:33] RECOVERY - thumbor@8812 service on thumbor1002 is OK: OK - thumbor@8812 is active
[06:00:43] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:33] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[xfsprogs]
[06:55:53] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:05:42] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:06:21] PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.017 second response time
[07:06:31] PROBLEM - Apache HTTP on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time
[07:07:21] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 75434 bytes in 0.258 second response time
[07:07:31] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.043 second response time
[07:10:39] !log Deploying schema change s1 enwiki codfw (db2016 - master) - T147166
[07:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:46] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[07:13:59] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2755141 (Marostegui) No trace of HW logs I assume? 
:(
[07:14:41] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:15:41] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[07:16:01] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:17:31] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:18:41] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[07:24:21] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:25:01] PROBLEM - MariaDB Slave Lag: s1 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.41 seconds
[07:25:11] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 634.55 seconds
[07:25:18] :(
[07:25:21] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.14 seconds
[07:25:21] PROBLEM - MariaDB Slave Lag: s1 on db2069 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 646.90 seconds
[07:25:35] I thought silencing the master would replicate and silence the slaves
[07:25:38] :_(
[07:25:42] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 663.96 seconds
[07:26:01] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 683.77 seconds
[07:26:11] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 692.23 seconds
[07:32:59] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:33:39] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:35:29] PROBLEM - Apache HTTP on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.008 second response time
[07:35:49] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[07:36:29] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.024 second response time
[07:43:28] !log powercycled cp2010 (not reachable via ssh, com2 console showed a frozen screen)
[07:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:39] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK
[07:44:49] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK
[07:44:49] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK
[07:44:49] RECOVERY - Host cp2010 is UP: PING OK - Packet loss = 0%, RTA = 37.69 ms
[07:44:49] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK
[07:44:59] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK
[07:45:09] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK
[07:45:19] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK
[07:45:19] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK
[07:45:19] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK
[07:45:29] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK
[07:45:29] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK
[07:45:39] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK
[07:45:39] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK
[07:47:09] PROBLEM - Freshness of OCSP Stapling files on cp2010 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old!
[07:49:22] cp2010 seems working fine, pooled and varnishlog shows traffic
[07:49:38] not sure what happened though
[07:56:37] !log stopping replication on db1057 (s1-master) from codfw for codfw maintenance
[07:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:59] RECOVERY - Freshness of OCSP Stapling files on cp2010 is OK: OK
[08:07:59] PROBLEM - thumbor@8829 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8829 is inactive
[08:14:20] RECOVERY - thumbor@8829 service on thumbor1002 is OK: OK - thumbor@8829 is active
[08:17:25] !log rebooting rdb2* for kernel update
[08:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:50] (CR) Jcrespo: "Ok with the idea, but this needs careful testing or we could stop all production servers." [puppet/mariadb] - https://gerrit.wikimedia.org/r/318859 (owner: Andrew Bogott)
[08:19:25] (CR) Giuseppe Lavagetto: [C: -1] "Besides the obvious need for tests, I don't like the "--host" switch name much." [software/conftool] - https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: Volans)
[08:29:19] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR
[08:30:09] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR
[08:31:44] Telia maintenance --^
[08:32:09] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2755837 (jcrespo) ``` Fri May 27 2016 17:31:34 Correctable memory error rate exceeded for DIMM_A1. 
Fri May 27 2016 18:42:52 Correctable memory error rate exceeded for DIMM_A1. Tue Jun 07 2016 17:21:45 Correctable memo...
[08:33:25] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2755838 (jcrespo) ``` MEM0001: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2. 2016-10-30T15:14:10-0500 Log Sequence Number: 362 Detailed Description: The memory has encountered a uncorrec...
[08:54:10] (PS2) Alexandros Kosiaris: icinga: remove ServerAlias with hardcoded hostname [puppet] - https://gerrit.wikimedia.org/r/318439 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
[08:54:16] (CR) Alexandros Kosiaris: [C: 2 V: 2] icinga: remove ServerAlias with hardcoded hostname [puppet] - https://gerrit.wikimedia.org/r/318439 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
[08:59:23] (CR) Alexandros Kosiaris: [C: -1] "why move the file under the osm module when the reference to it is from the role module. I 'd rather we moved it into the role module." [puppet] - https://gerrit.wikimedia.org/r/318453 (owner: Dzahn)
[09:01:29] PROBLEM - HHVM rendering on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.012 second response time
[09:02:49] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 75450 bytes in 0.874 second response time
[09:03:31] (CR) Volans: "@Giuseppe, what is your proposal for the switch name?" 
[software/conftool] - https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: Volans)
[09:20:30] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.98 seconds
[09:22:09] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.35 seconds
[09:22:40] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 0.70 seconds
[09:23:09] RECOVERY - MariaDB Slave Lag: s1 on db2042 is OK: OK slave_sql_lag Replication lag: 0.58 seconds
[09:23:19] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 0.53 seconds
[09:23:39] RECOVERY - MariaDB Slave Lag: s1 on db2069 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[09:26:49] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[09:26:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[09:27:09] PROBLEM - MariaDB Slave SQL: s1 on db2034 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1594, Errmsg: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the masters binary log is corrupted (you can check this by running mysqlbinlog on the binary log), the slaves relay log is corrupted (you can check this by running mysqlbinlog on the relay log), a network problem, or a
[09:30:07] marostegui, jynus it's you? 
^^^
[09:30:16] volans: yeah
[09:30:19] I just silenced it
[09:30:23] ok then :)
[09:30:24] it needs to be reimaged anyways
[09:30:30] thanks
[09:31:12] volans, it is not really us
[09:31:15] it crashed
[09:31:22] probably is corrupted
[09:31:26] needs reimage
[09:31:45] ack
[09:35:22] Operations, Dumps-Generation: Reboot dataset1001 - https://phabricator.wikimedia.org/T148737#2755891 (ArielGlenn) Open>Resolved This was done on Oct 29 after the second dump run of the month completed.
[09:38:10] (CR) Muehlenhoff: [C: 2] Update to 4.4.28 [debs/linux44] - https://gerrit.wikimedia.org/r/318549 (owner: Muehlenhoff)
[09:44:57] (Abandoned) Gehel: Externalize Postgresql user creation from role::osm::master [puppet] - https://gerrit.wikimedia.org/r/297786 (owner: Gehel)
[09:46:52] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/316810 (https://phabricator.wikimedia.org/T121789) (owner: Filippo Giunchedi)
[09:55:01] (CR) Alexandros Kosiaris: [C: -1] "A comment inline, plus the fact that this probably needs a sudo rule since it executes iptables." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986) (owner: Muehlenhoff)
[09:59:42] (CR) Alexandros Kosiaris: [C: -1] icinga: move files/icinga/ into module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[10:05:06] Operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#2755923 (akosiaris)
[10:05:15] Operations, hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#2755927 (akosiaris)
[10:05:17] Operations, Icinga, Shinken, Patch-For-Review: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#2755926 (akosiaris)
[10:05:19] Operations: setup/deploy einsteinium as monitoring host - https://phabricator.wikimedia.org/T121582#1882881 (akosiaris) Open>Resolved Yes :-)
[10:06:00] (PS1) Marostegui: db-codfw.php: Depool db2034 [mediawiki-config] - https://gerrit.wikimedia.org/r/318892 (https://phabricator.wikimedia.org/T149553)
[10:06:49] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:08:54] <_joe_> !log uploaded mcrouter 0.24.0-1 to jessie-wikimedia T132317
[10:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:00] T132317: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317
[10:09:36] _joe_ \o/
[10:09:46] Operations, Patch-For-Review, Performance, User-Joe, and 2 others: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#2755937 (Joe) @aaron the package has been prepared and uploaded to jessie-wikimedia; feel free to experiment with it somewhere...
[10:09:53] <_joe_> elukey: It's butt-ugly but mostly works
[10:11:24] (PS1) ArielGlenn: clean up indentation, formatting and comments [puppet] - https://gerrit.wikimedia.org/r/318893
[10:13:31] (CR) ArielGlenn: "Great idea :-D" [puppet] - https://gerrit.wikimedia.org/r/318654 (owner: Dzahn)
[10:15:23] (PS2) Muehlenhoff: Temporarily disable poolcounter1001 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318509
[10:17:30] (CR) Matthias Mullie: [C: 1] Verify license tags for custom license in Commons' UploadWizard [mediawiki-config] - https://gerrit.wikimedia.org/r/318518 (https://phabricator.wikimedia.org/T140903) (owner: Bartosz Dziewoński)
[10:21:58] Operations, Patch-For-Review, Performance, User-Joe, and 2 others: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#2755953 (Joe) Open>Resolved a:Joe
[10:23:35] (PS3) Alexandros Kosiaris: base::service_unit: enable/disable the service if managed here [puppet] - https://gerrit.wikimedia.org/r/318315 (owner: Giuseppe Lavagetto)
[10:23:53] (CR) Alexandros Kosiaris: [C: 2 V: 2] "merging per IRC conversation with _joe_" [puppet] - https://gerrit.wikimedia.org/r/318315 (owner: Giuseppe Lavagetto)
[10:27:19] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.010 second response time
[10:28:29] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 75434 bytes in 0.215 second response time
[10:31:31] (CR) Jcrespo: "Now I am unsure if we should handle this on the module, give that base::service_unit includes some (but not all) of the functionality. 
How" [puppet] - https://gerrit.wikimedia.org/r/318572 (owner: Andrew Bogott)
[10:35:39] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[10:36:28] (PS3) Muehlenhoff: Temporarily disable poolcounter1001 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318509
[10:37:51] (CR) Muehlenhoff: [C: 2] Temporarily disable poolcounter1001 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318509 (owner: Muehlenhoff)
[10:38:19] (Merged) jenkins-bot: Temporarily disable poolcounter1001 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318509 (owner: Muehlenhoff)
[10:40:13] !log jmm@tin Synchronized wmf-config/ProductionServices.php: disabled poolcounter1001 for maintenance (duration: 00m 47s)
[10:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:42] !log temporarily disabled poolcounter1001 for maintenance
[10:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:29] PROBLEM - thumbor@8818 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8818 is inactive
[10:43:58] (PS1) Alexandros Kosiaris: tendril: Supply a robots.txt disallow all robots [puppet] - https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340)
[10:44:39] RECOVERY - thumbor@8818 service on thumbor1002 is OK: OK - thumbor@8818 is active
[10:45:36] (PS2) Alexandros Kosiaris: tendril: Supply a robots.txt disallow all robots [puppet] - https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340)
[10:48:40] !log rebooting poolcounter1001 for kernel update
[10:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:15] (PS1) ArielGlenn: dump namespace along with page title for allpagetitle dump [dumps] - https://gerrit.wikimedia.org/r/318901 
(https://phabricator.wikimedia.org/T59739)
[10:57:02] (PS1) Muehlenhoff: Reenable poolcounter1001 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318902
[10:57:15] (CR) Jcrespo: "Is the train of thought to supply a robots.txt on puppet to override a potential mistake on tendril, as infrastructure protection?" [puppet] - https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340) (owner: Alexandros Kosiaris)
[10:57:50] Operations, vm-requests: Site: (1) VM request for tendril - https://phabricator.wikimedia.org/T149557#2756016 (akosiaris)
[10:58:05] Operations, vm-requests: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#2756028 (akosiaris)
[10:58:35] (CR) ArielGlenn: [C: 2] dump namespace along with page title for allpagetitle dump [dumps] - https://gerrit.wikimedia.org/r/318901 (https://phabricator.wikimedia.org/T59739) (owner: ArielGlenn)
[10:59:11] Operations, ChangeProp, MediaWiki-JobQueue, Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2756036 (Joe) @Tgr I hope we can implement this switch in a way that would just make it a new jobqueue driver...
[11:00:10] !log restarting cassandra on aqs100[456] for OpenJDK upgrades
[11:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:32] (CR) Muehlenhoff: [C: 2] Reenable poolcounter1001 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318902 (owner: Muehlenhoff)
[11:01:53] (CR) Alexandros Kosiaris: [C: 1] Introduce mtail module [puppet] - https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) (owner: Filippo Giunchedi)
[11:02:01] (Merged) jenkins-bot: Reenable poolcounter1001 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/318902 (owner: Muehlenhoff)
[11:03:29] PROBLEM - thumbor@8820 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8820 is inactive
[11:03:39] PROBLEM - thumbor@8840 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8840 is inactive
[11:03:53] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Reenabled poolcounter1001 after maintenance (duration: 00m 45s)
[11:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:18] (CR) Giuseppe Lavagetto: [C: 2] Release 0.0.2 [software/service-checker] - https://gerrit.wikimedia.org/r/318517 (owner: Giuseppe Lavagetto)
[11:09:10] Question what the hell is up with thumbor
[11:10:02] <_joe_> Zppix: it's crashing, but it's still an experimental service, so I'm not surprised that can happen
[11:11:03] Shouldnt we be ack those errors from icinga-wm
[11:11:04] 
?
[11:11:31] <_joe_> Zppix: nope, we should resolve the issue that causes them now
[11:12:01] <_joe_> which I am doing just now
[11:12:40] Ack
[11:14:09] RECOVERY - thumbor@8820 service on thumbor1002 is OK: OK - thumbor@8820 is active
[11:14:19] RECOVERY - thumbor@8840 service on thumbor1002 is OK: OK - thumbor@8840 is active
[11:14:33] <_joe_> uhm looks like someone just restarted those
[11:14:41] <_joe_> heh, puppet did, ofc
[11:16:01] Lol
[11:18:21] (PS1) Giuseppe Lavagetto: thumbor: use restart:always instead of on-failure [puppet] - https://gerrit.wikimedia.org/r/318903
[11:18:27] <_joe_> Zppix: ^^
[11:18:39] <_joe_> gilles: ^^
[11:19:52] Operations, ops-codfw, DC-Ops, Parsoid: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2756059 (mobrovac) >>! In T148710#2746052, @Arlolra wrote: >> But, better in what way ? > > From the command line. ``` mobrovac@wtp1001:~$ confctl select dc=.*,cluste...
[11:20:03] (PS1) BBlack: non-crit for client handshake SSL_R_VERSION_TOO_LOW [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/318904 (https://phabricator.wikimedia.org/T148893)
[11:20:05] (PS1) BBlack: nginx (1.11.4-1+wmf13) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/318905
[11:21:00] Thanks joe
[11:21:40] (PS1) ArielGlenn: add snapshot1001 to deploy targets [dumps/scap] - https://gerrit.wikimedia.org/r/318906
[11:21:59] I have critical on my notification list therefore when icinga-wm complains i get pinged
[11:22:02] <_joe_> Zppix: I'm not done :P
[11:22:15] <_joe_> Zppix: you should not, that will drive you insane
[11:22:16] I know
[11:22:26] <_joe_> there is definitely a lot of noise going on here
[11:22:29] Im already insane
[11:22:37] <_joe_> ahah ok fair point :P
[11:23:19] Its ok im auto subbed to new tasks
[11:23:24] :)
[11:23:45] :)
[11:27:51] (CR) ArielGlenn: [C: 2 V: 2] add snapshot1001 to deploy targets 
[dumps/scap] - 10https://gerrit.wikimedia.org/r/318906 (owner: 10ArielGlenn) [11:32:18] !log updating parsoid in codfw to nodejs 4.6.0 [11:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:49] 06Operations, 06Performance-Team, 10Thumbor: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2756090 (10Joe) [11:39:30] ^ no bueno _joe_ [11:45:18] <_joe_> Zppix: that's way more common than you'd expect [11:45:19] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf] [11:46:12] I wish that was why my bot was exiting [11:48:25] (03CR) 10BBlack: [C: 032 V: 032] non-crit for client handshake SSL_R_VERSION_TOO_LOW [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318904 (https://phabricator.wikimedia.org/T148893) (owner: 10BBlack) [11:48:33] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.11.4-1+wmf13) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318905 (owner: 10BBlack) [11:49:35] !log uploaded nginx-1.11.4-1+wmf13 to carbon jessie-wikimedia (logfile spam fixup) [11:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:10] (03PS1) 10ArielGlenn: add mirror list to dumps download page, clean up formatting [puppet] - 10https://gerrit.wikimedia.org/r/318909 [12:01:18] 06Operations, 10Traffic, 13Patch-For-Review: nginx SSL_do_handshake spam filling disks - https://phabricator.wikimedia.org/T148893#2756139 (10BBlack) 05Open>03Resolved a:03BBlack wmf13 nginx package fixes this [12:03:05] (03CR) 10ArielGlenn: [C: 032] add mirror list to dumps download page, clean up formatting [puppet] - 10https://gerrit.wikimedia.org/r/318909 (owner: 10ArielGlenn) [12:03:58] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, 
es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#2756149 (10jcrespo) [12:04:12] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#2213475 (10jcrespo) 05Resolved>03Open [12:07:14] !log upgrading nginx to 1.11.4-1+wmf13 on cache_upload [12:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:07] !log upgrading nginx to 1.11.4-1+wmf13 on cache_upload - T148917 [12:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:12] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917 [12:12:45] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:22:44] PROBLEM - puppet last run on bohrium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:23:14] PROBLEM - HHVM rendering on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [12:24:14] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 75464 bytes in 0.109 second response time [12:27:09] !log failover ganeti1002 as new master in eqiad [12:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:54] !log migrating nodes from ganeti1001 for kernel reboot [12:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:11] (03CR) 10Zppix: [C: 031] thumbor: use restart:always instead of on-failure [puppet] - 10https://gerrit.wikimedia.org/r/318903 (owner: 10Giuseppe Lavagetto) [12:32:14] !log upgrading nginx to 1.11.4-1+wmf13 on cache_misc - T148917 [12:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:19] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917 [12:39:30] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T149509#2756223 (10Marostegui) [12:45:11] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2756230 (10Marostegui) [12:46:39] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T149509#2754729 (10Marostegui) This is indeed correct. Disk in slot 3 is broken ``` Enclosure Device ID: 32 Slot Number: 3 Drive's position: DiskGroup: 0, Span: 1, Arm: 1 Enclosure position: N/A Device Id: 3 WWN: 5... 
[12:50:15] RECOVERY - puppet last run on bohrium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:53:06] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is failed [13:04:10] (03PS1) 10Urbanecm: Enable NewUserMessage extension on kkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318914 (https://phabricator.wikimedia.org/T149563) [13:06:51] !log rebooting ganeti1001 for kernel update [13:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:14] (03PS3) 10Alexandros Kosiaris: tendril: Supply a robots.txt disallow all robots [puppet] - 10https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340) [13:30:53] (03PS6) 10Alexandros Kosiaris: icinga: Specify mode for nagios_host, nagios_service [puppet] - 10https://gerrit.wikimedia.org/r/317791 [13:30:55] (03PS7) 10Alexandros Kosiaris: icinga: Increase max_concurrent_checks [puppet] - 10https://gerrit.wikimedia.org/r/317763 [13:30:57] (03PS1) 10Alexandros Kosiaris: Kill monitoring::decommission_monitor_host [puppet] - 10https://gerrit.wikimedia.org/r/318918 [13:30:59] (03PS1) 10Alexandros Kosiaris: icinga: Purge unmanaged local resources [puppet] - 10https://gerrit.wikimedia.org/r/318919 (https://phabricator.wikimedia.org/T149376) [13:31:19] !log rebooting labstore1002 for kernel update [13:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:42] thcipriani|afk: Could you deploy 318914 for me? ;) Now it's EU SWAT but no change is scheduled for it... [13:32:22] 06Operations, 06Multimedia, 10Traffic, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2756346 (10BBlack) >>! In T148917#2739171, @BBlack wrote: > Can anyone still repro this is... 
[13:32:52] or anomie aude twentyafterfour RoanKattouw Dereckson ^ [13:34:07] I can [13:34:10] Is it swat? [13:34:34] We can do it for EU SWAT Reedy [13:34:34] yup [13:34:42] Should I add it to the calendar? [13:34:52] Can do [13:35:33] (03PS2) 10Alexandros Kosiaris: Kill monitoring::decommission_monitor_host [puppet] - 10https://gerrit.wikimedia.org/r/318918 [13:35:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Kill monitoring::decommission_monitor_host [puppet] - 10https://gerrit.wikimedia.org/r/318918 (owner: 10Alexandros Kosiaris) [13:35:41] !log rebooting labstore2001 for kernel update [13:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] (03PS2) 10Alexandros Kosiaris: icinga: Purge unmanaged local resources [puppet] - 10https://gerrit.wikimedia.org/r/318919 (https://phabricator.wikimedia.org/T149376) [13:35:54] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Purge unmanaged local resources [puppet] - 10https://gerrit.wikimedia.org/r/318919 (https://phabricator.wikimedia.org/T149376) (owner: 10Alexandros Kosiaris) [13:36:10] (03CR) 10Reedy: [C: 032] "swat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318914 (https://phabricator.wikimedia.org/T149563) (owner: 10Urbanecm) [13:36:37] jouncebot: next [13:36:37] (03Merged) 10jenkins-bot: Enable NewUserMessage extension on kkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318914 (https://phabricator.wikimedia.org/T149563) (owner: 10Urbanecm) [13:36:37] In 3 hour(s) and 23 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T1700) [13:36:42] jouncebot: now [13:36:42] For the next 0 hour(s) and 23 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T1300) [13:36:52] 13:01:14 -!- jouncebot [tools.joun@instance-tools-exec-1402.tools.wmflabs.org] has joined #wikimedia-operations [13:37:01] the bot wasn't 
here at 13:00 [13:37:08] That explains the lack of notification. [13:37:19] At 13:00 what timezone? UTC? [13:37:23] Yes. [13:37:32] Thanks Reedy to take care of this window. [13:37:33] I'm in UTC now the blocks went back [13:37:38] *clocks [13:37:53] Reedy: Added to the calendar. [13:39:51] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable newusermessage on kkwiki T149563 (duration: 00m 55s) [13:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:57] T149563: Enable NewUserMessage extension on Kazakh Wikipedia - https://phabricator.wikimedia.org/T149563 [13:40:03] Urbanecm: you've other changes in the To deploy column of the site request workboard: [13:40:06] * [config] {{Gerrit|316295}} Show changes from last 14 days in watchlist in cswiki ({{phabT|148327}}) [13:40:09] * [config] {{Gerrit|318645}} Working Class Movement Library (Salford) throttle rule ({{phabT|149443}}) [13:40:32] (throttle rule is odder's) [13:41:11] 06Operations, 10Monitoring, 13Patch-For-Review: Icinga stale resources, possible artifact of the Icinga upgrade - https://phabricator.wikimedia.org/T149376#2756387 (10akosiaris) 05Open>03Resolved Fixed in https://gerrit.wikimedia.org/r/#/c/318919/ ``` Info: Applying configuration version '1477921036' No... [13:41:33] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2756391 (10Gehel) [13:41:36] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2756389 (10Gehel) 05Open>03Resolved Re-image is complete, initial tile generation is in progress and working fine, but we are going to switch it to Cassa... [13:41:50] Okay, in this way 316295 can be deployed too. I didn't create 318645 but it seems it is ready for deployment. 
[13:42:22] (03CR) 10Ottomata: [C: 031] add mapped IPv6 address for eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/317192 (owner: 10Dzahn) [13:42:45] Reedy: Seems it works, thanks for the deployment! [13:43:59] (03CR) 10Ottomata: [C: 031] add mapped IPv6 address for krypton [puppet] - 10https://gerrit.wikimedia.org/r/316041 (owner: 10Dzahn) [13:44:30] (03PS4) 10Reedy: Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318645 (https://phabricator.wikimedia.org/T149443) (owner: 10Odder) [13:44:39] (03CR) 10Reedy: [C: 032] Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318645 (https://phabricator.wikimedia.org/T149443) (owner: 10Odder) [13:44:47] (03PS1) 10BBlack: stream.wm.o: remove old monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/318921 [13:45:04] (03Merged) 10jenkins-bot: Working Class Movement Library (Salford) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318645 (https://phabricator.wikimedia.org/T149443) (owner: 10Odder) [13:46:49] !log reedy@tin Synchronized wmf-config/throttle.php: Throttle rule for T149443 (duration: 00m 46s) [13:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:55] T149443: Account creation throttle exemption for WCML edit-a-thon on 2016-11-20 - https://phabricator.wikimedia.org/T149443 [13:47:19] (03CR) 10BBlack: [C: 032] stream.wm.o: remove old monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/318921 (owner: 10BBlack) [13:48:54] (03PS4) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) [13:49:10] (03CR) 10jenkins-bot: [V: 04-1] Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) 
[13:49:12] (03CR) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) [13:49:50] Urbanecm: Needs a manual rebase [13:49:59] Reedy: I know, working on it. [13:51:06] (03PS2) 10Reedy: Move Aff|LegalContactPages to MetaContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315838 (owner: 10Chad) [13:51:11] (03CR) 10Reedy: [C: 032] Move Aff|LegalContactPages to MetaContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315838 (owner: 10Chad) [13:51:39] (03Merged) 10jenkins-bot: Move Aff|LegalContactPages to MetaContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315838 (owner: 10Chad) [13:51:42] (03PS5) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) [13:51:50] Reedy: Done in PS5 [13:52:26] PROBLEM - thumbor@8816 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8816 is inactive [13:52:51] !log reedy@tin Synchronized wmf-config/MetaContactPages.php: Stage new file (duration: 00m 46s) [13:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] !log powercycling labcontrol1002 [13:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:04] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Use MetaContactPages (duration: 00m 48s) [13:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:21] (03PS1) 10Reedy: Update noc links for ContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318922 [13:56:46] RECOVERY - Host labcontrol1002 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [13:56:51] (03CR) 10Reedy: [C: 032] Update noc links for ContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318922 
(owner: 10Reedy) [13:57:02] !log reedy@tin Synchronized wmf-config/: Remove old ContactPage files (duration: 00m 47s) [13:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:27] (03Merged) 10jenkins-bot: Update noc links for ContactPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318922 (owner: 10Reedy) [13:57:45] (03PS1) 10ArielGlenn: be more conservative in initial guess for xmlstream chunk size [dumps] - 10https://gerrit.wikimedia.org/r/318923 (https://phabricator.wikimedia.org/T145380) [13:58:20] (03PS6) 10Reedy: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) [13:58:34] !log reedy@tin Synchronized docroot/noc/: nocnocnoc (duration: 00m 45s) [13:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:46] (03CR) 10Reedy: [C: 032] Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) [13:59:14] (03Merged) 10jenkins-bot: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) [13:59:45] !log powercycling labnet1002 [13:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:07] (03CR) 10ArielGlenn: [C: 032] be more conservative in initial guess for xmlstream chunk size [dumps] - 10https://gerrit.wikimedia.org/r/318923 (https://phabricator.wikimedia.org/T145380) (owner: 10ArielGlenn) [14:00:20] !log reedy@tin Synchronized wmf-config/: (no message) (duration: 00m 50s) [14:00:22] !log that deploy was was "Show changes from last 14 days in watchlist in cswiki T148327 " [14:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:25] PROBLEM - puppet 
last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:29] T148327: Show changes which was made in last 14 days in watchlist in cswiki by default (for new users) - https://phabricator.wikimedia.org/T148327 [14:02:55] RECOVERY - Host labnet1002 is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [14:07:07] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2756488 (10MoritzMuehlenhoff) [14:07:15] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2756500 (10MoritzMuehlenhoff) a:03Papaul [14:08:35] RECOVERY - thumbor@8816 service on thumbor1001 is OK: OK - thumbor@8816 is active [14:09:16] 06Operations, 06Labs, 07Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/2) [tracking] - https://phabricator.wikimedia.org/T146154#2756506 (10chasemp) We may need to reschedule as {T149567} is a an issue [14:12:09] (03PS2) 10Ottomata: Add kafka1003 to main-eqiad Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/318570 (https://phabricator.wikimedia.org/T148849) [14:12:33] (03CR) 10Ottomata: [C: 032 V: 032] Add kafka1003 to main-eqiad Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/318570 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata) [14:12:48] !log adding kafka1003 as kafka broker in main-eqiad cluster [14:12:50] !log rebooting labstore2003 for kernel update [14:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:55] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:15:35] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [14:16:45] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3068792 keys, up 5 hours 54 minutes - replication_delay is 0 [14:17:28] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[eventlogging/eventbus] [14:17:30] (03PS4) 10Giuseppe Lavagetto: docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) [14:24:36] (03PS5) 10Giuseppe Lavagetto: docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) [14:25:27] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: puppetization for production [puppet] - 10https://gerrit.wikimedia.org/r/318050 (https://phabricator.wikimedia.org/T148966) (owner: 10Giuseppe Lavagetto) [14:25:32] (03CR) 10Urbanecm: [C: 031] Remove patrol from autoconfirmed and reviewer for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [14:26:11] ^^^ on that [14:26:55] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [14:28:03] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [14:28:53] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [14:29:41] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2756656 (10MoritzMuehlenhoff) Please also have a look at labstore2003, on system 
boot it shows the message " All of the disks from your previous configuration are gone. If this is an unexpected message, then please... [14:31:13] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:31:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [14:31:58] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2756672 (10Zareenf) @Dzahn here is a new SSH key for production access: ssh-rsa AAAAB3NzaC1yc2EAAAADAQA... [14:32:54] !log rebooting labstore2004 for kernel update [14:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:06] (03Abandoned) 10Ori.livneh: varnish: add prototype cookie-based backend selection [puppet] - 10https://gerrit.wikimedia.org/r/247970 (https://phabricator.wikimedia.org/T91820) (owner: 10Ori.livneh) [14:39:33] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:39:37] (03Abandoned) 10Ori.livneh: WIP: Add mwgrep-web [puppet] - 10https://gerrit.wikimedia.org/r/232668 (https://phabricator.wikimedia.org/T71489) (owner: 10Ori.livneh) [14:39:45] (03PS1) 10Giuseppe Lavagetto: docker::registry: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/318929 [14:39:57] (03Abandoned) 10Ori.livneh: Make pybal accept 30[12] for ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/233054 (https://phabricator.wikimedia.org/T102393) (owner: 10Ori.livneh) [14:40:03] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2756713 (10MoritzMuehlenhoff) And the same on labstore2004. 
[14:40:16] (03Abandoned) 10Ori.livneh: include mediawiki::multimedia on all application servers [puppet] - 10https://gerrit.wikimedia.org/r/250291 (https://phabricator.wikimedia.org/T35186) (owner: 10Ori.livneh) [14:40:27] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/318929 (owner: 10Giuseppe Lavagetto) [14:40:39] (03CR) 10Giuseppe Lavagetto: [V: 032] docker::registry: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/318929 (owner: 10Giuseppe Lavagetto) [14:43:05] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:44:05] (03CR) 10Ori.livneh: "@godog, I don't agree. The two checks (continuing to receive updates, and reported values are within a certain range) are logically indepe" [puppet] - 10https://gerrit.wikimedia.org/r/251675 (owner: 10Ori.livneh) [14:44:19] <_joe_> sigh, big patch rebase fail [14:45:08] 06Operations, 06Labs: cronspam from labstores, labcontrol, labstestservices - https://phabricator.wikimedia.org/T149574#2756726 (10faidon) [14:45:14] 06Operations, 06Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2756741 (10faidon) Ping! [14:49:45] !log adding kafka1003 in as replicas for active main-eqiad topics [14:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:05] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2756757 (10Papaul) @MoritzMuehlenhoff please check https://phabricator.wikimedia.org/T102626 [14:53:04] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2756785 (10chasemp) We need to schedule a downtime to do this move from labsdb1005 to labsdb1004. This should be a very short window of actual outage.... 
[14:55:15] (03CR) 10Reedy: [C: 031] "Yes per T135888" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) (owner: 10CSteipp) [14:55:54] (03PS1) 10BBlack: check_ssl: support OCSP Stapling and related [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [14:55:56] (03PS1) 10BBlack: check_sslxNN: require OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/318932 (https://phabricator.wikimedia.org/T148490) [14:56:58] (03PS1) 10Giuseppe Lavagetto: docker::registry: fix parameter passing for swift [puppet] - 10https://gerrit.wikimedia.org/r/318933 [14:57:20] (03PS2) 10BBlack: check_sslxNN: require OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/318932 (https://phabricator.wikimedia.org/T148490) [14:57:21] (03PS2) 10BBlack: check_ssl: support OCSP Stapling and related [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [14:59:17] (03PS3) 10BBlack: check_sslxNN: require OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/318932 (https://phabricator.wikimedia.org/T148490) [14:59:19] (03PS3) 10BBlack: check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [15:00:16] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: fix parameter passing for swift [puppet] - 10https://gerrit.wikimedia.org/r/318933 (owner: 10Giuseppe Lavagetto) [15:01:39] <_joe_> jeez, I'm on fire... [15:03:05] * Nikerabbit puts a blanket over the fire [15:03:17] (03PS1) 10Giuseppe Lavagetto: profile::docker::registry: remove duplicate declaration [puppet] - 10https://gerrit.wikimedia.org/r/318935 [15:03:23] <_joe_> I was being sarcastic, I managed to screw one more up... 
[15:03:42] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::registry: remove duplicate declaration [puppet] - 10https://gerrit.wikimedia.org/r/318935 (owner: 10Giuseppe Lavagetto) [15:03:46] (03CR) 10Giuseppe Lavagetto: [V: 032] profile::docker::registry: remove duplicate declaration [puppet] - 10https://gerrit.wikimedia.org/r/318935 (owner: 10Giuseppe Lavagetto) [15:05:36] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:14:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1003 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [10000.0] [15:17:03] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2756898 (10Papaul) The last time we replaced the memory on DIMM A1 on this system. What i will do is to swap DIMM B2 with DIMM B1 and clean the logs. If the message show on DIMM B2 then I will request new memory. [15:17:17] ^^ is ok [15:17:21] silencing [15:17:43] !log reedy@tin Synchronized php-1.28.0-wmf.23/extensions/WikimediaMaintenance/createExtensionTables.php: Add OATHAuth (duration: 00m 46s) [15:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:53] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318892 (https://phabricator.wikimedia.org/T149553) (owner: 10Marostegui) [15:21:08] !log created oathauth_users table on officewiki T135889 [15:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1003 is OK: OK: Less than 50.00% above the threshold [1000.0] [15:22:24] (03PS2) 10Marostegui: db-codfw.php: Depool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318892 (https://phabricator.wikimedia.org/T149553) [15:23:26] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] 
[15:24:24] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2756912 (10jcrespo) a:03jcrespo So, 'maintainviews' will be the user used to create the view (you will connect to mysql using that user). viewmaster wi... [15:25:52] (03PS1) 10Giuseppe Lavagetto: profile::docker::registry: define correctly the swift password [puppet] - 10https://gerrit.wikimedia.org/r/318940 [15:26:09] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] profile::docker::registry: define correctly the swift password [puppet] - 10https://gerrit.wikimedia.org/r/318940 (owner: 10Giuseppe Lavagetto) [15:26:54] (03PS1) 10Volans: keyholder: be systemd compatible [puppet] - 10https://gerrit.wikimedia.org/r/318941 (https://phabricator.wikimedia.org/T148273) [15:26:56] (03PS1) 10Volans: keyholder: fix flake8 [puppet] - 10https://gerrit.wikimedia.org/r/318942 (https://phabricator.wikimedia.org/T148273) [15:26:58] (03PS1) 10Volans: keyholder: add support for SHA256 key fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/318943 (https://phabricator.wikimedia.org/T148273) [15:28:30] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:31] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2034 for maintenance - T149553 (duration: 00m 46s) [15:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:36] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [15:28:40] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:40] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:51] 06Operations, 10hardware-requests: codfw/eqiad: 12x swift backend refresh - https://phabricator.wikimedia.org/T149336#2756954 (10mark) @Robh: please request quotes for this, we'll likely lease these. 
[15:29:13] (03PS1) 10Giuseppe Lavagetto: profile::docker::registry: fixup for I4383ea6e [puppet] - 10https://gerrit.wikimedia.org/r/318945 [15:30:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] profile::docker::registry: fixup for I4383ea6e [puppet] - 10https://gerrit.wikimedia.org/r/318945 (owner: 10Giuseppe Lavagetto) [15:30:13] (03PS1) 10Ema: cache_text varnishtest: beacon and CP [puppet] - 10https://gerrit.wikimedia.org/r/318946 (https://phabricator.wikimedia.org/T131503) [15:35:09] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936600 (10yuvipanda) If we settle on a date and announce on labs-announce... [15:35:37] !log Disabled ports cr2-eqiad:xe-5/1/[0-3] (row A-D uplinks) [15:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:33] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T149509#2754729 (10Cmjohnson) Disks swapped...waiting on rebuild. 
[15:44:31] !log Chris moved cr2-eqiad:xe-5/1/[0-3] to xe-3/1/[0-3] [15:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:51] (03PS1) 10Reedy: Enforce same password policy for ombudsman as for checkuser et al [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318948 (https://phabricator.wikimedia.org/T104372) [15:45:29] (03CR) 10Reedy: Enforce same password policy for ombudsman as for checkuser et al (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318948 (https://phabricator.wikimedia.org/T104372) (owner: 10Reedy) [15:45:59] (03PS1) 10Faidon Liambotis: nagios: do both RSA/ECDSA checks in check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/318949 [15:47:03] (03Abandoned) 10Reedy: Enforce same password policy for ombudsman as for checkuser et al [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318948 (https://phabricator.wikimedia.org/T104372) (owner: 10Reedy) [15:47:36] (03CR) 10Faidon Liambotis: "Do you know how will that fail in case of OCSP issues? I think it might just fall back on the "failed to connect" error case. Will the out" [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) (owner: 10BBlack) [15:51:21] (03PS1) 10Reedy: Update minimum bot password length to 8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318951 (https://phabricator.wikimedia.org/T104145) [15:55:25] 06Operations, 10ops-codfw, 06DC-Ops, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2757117 (10chasemp) Note this seems to have hit us today w/ a needed human intervention in codfw. This has be next on the agenda for storage fixup. 
[15:58:40] PROBLEM - Host kafka1018 is DOWN: PING CRITICAL - Packet loss = 100% [15:59:15] wow [15:59:18] ottomata: --^ [15:59:20] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:59:50] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [16:00:12] <_joe_> uh is that expected? [16:00:26] kafka1018 [16:00:27] not expected [16:00:31] looking [16:00:54] could be network related? asw-d-eqiad.mgmt.eqiad.wmnet is suspicious [16:00:55] cmjohnson1: hey [16:01:02] hey [16:01:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [16:01:39] one of the PDUs of rack D2 is down [16:01:40] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2 [16:01:44] kafka1018 is there [16:01:49] ok [16:02:11] as are a few other servers, but they are probably connected redundantly? 
[16:02:21] https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1562 is the rack, fwiw [16:02:21] <_joe_> seems like it's the case [16:02:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [16:02:53] ottomata: let us know if Kafka is in trouble [16:02:57] paravoid: i think its ok [16:03:00] looking though [16:03:30] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [16:03:33] ottomata: from racktables it seems that 1020 is also in the rack [16:03:43] but the recovery looks good [16:03:45] mmmm [16:03:53] yeah i can reach 1020 too [16:03:57] it seems ok [16:04:41] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:04:41] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:04:41] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:04:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1013 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [16:04:41] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:04:41] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:04:41] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:04:42] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:04:44] hehe [16:04:50] PROBLEM - 
IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:04:50] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:04:51] paravoid: yeah, 1018 is unreachable, but the cluster seems fine [16:05:00] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:00] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:05:00] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:05:00] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:00] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:10] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:10] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:10] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:10] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:10] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:11] ottomata: k, thanks [16:05:11] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:20] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:20] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:20] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:05:20] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - 
ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:20] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:21] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:05:21] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:22] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:30] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:30] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:30] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:05:30] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:30] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:31] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:05:31] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:32] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:32] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:33] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:33] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:34] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:34] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 
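Editor's note: every IPsec alert in the flood above names the same unreachable peer, which is how the channel can tell it is one failure and not many. A minimal sketch of that grouping follows, assuming the alert text keeps the exact `PROBLEM - IPsec on <host> is CRITICAL: Strongswan CRITICAL - ok: N not-conn: <peers>` shape seen in this log. The regular expression and the function name `group_by_peer` are illustrative only and are not part of icinga-wm or any Wikimedia tooling.

```python
import re
from collections import defaultdict

# Matches the alert shape seen in this log; the pattern is an assumption
# drawn from these lines only, not a documented Icinga format.
ALERT_RE = re.compile(
    r"PROBLEM - IPsec on (?P<host>\S+) is CRITICAL: "
    r"Strongswan CRITICAL - ok: (?P<ok>\d+) not-conn: (?P<peers>\S+)"
)


def group_by_peer(text):
    """Map each unreachable peer to the set of hosts reporting it."""
    hosts_by_peer = defaultdict(set)
    for match in ALERT_RE.finditer(text):
        for peer in match.group("peers").split(","):
            # "kafka1018_v4" and "kafka1018_v6" both collapse to "kafka1018".
            name = peer.rsplit("_", 1)[0]
            hosts_by_peer[name].add(match.group("host"))
    return dict(hosts_by_peer)


SAMPLE = (
    "[16:04:41] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - "
    "ok: 26 not-conn: kafka1018_v4,kafka1018_v6 "
    "[16:04:41] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - "
    "ok: 52 not-conn: kafka1018_v4,kafka1018_v6"
)

if __name__ == "__main__":
    for peer, hosts in group_by_peer(SAMPLE).items():
        print(peer, sorted(hosts))
```

Run over the whole flood, a grouping like this would be expected to yield a single key, kafka1018, with all the cache hosts attached to it.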
[16:05:40] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:40] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:40] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:40] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:40] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:41] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:41] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:42] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1018_v4,kafka1018_v6 [16:05:42] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1018_v4,kafka1018_v6 [16:05:43] ohshutup [16:05:47] (03PS1) 10Rush: labs: tc-setup param thresholds applied [puppet] - 10https://gerrit.wikimedia.org/r/318952 [16:05:47] lol [16:05:50] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1018_v4,kafka1018_v6 [16:05:50] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1018_v4,kafka1018_v6 [16:06:47] (03CR) 10jenkins-bot: [V: 04-1] labs: tc-setup param thresholds applied [puppet] - 10https://gerrit.wikimedia.org/r/318952 (owner: 10Rush) [16:08:30] RECOVERY - Host ps1-d2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.99 ms [16:08:50] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [16:09:24] (03CR) 10Yuvipanda: [C: 031] labs: tc-setup param thresholds applied [puppet] - 10https://gerrit.wikimedia.org/r/318952 (owner: 10Rush) [16:09:28] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create 
maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2757157 (10chasemp) thank you @jcrespo! fyi this is maintained here atm (both user and pass are set in private) https://phabricator.wikimedia.org/diffus... [16:11:53] (03PS2) 10Rush: labs: tc-setup param thresholds applied [puppet] - 10https://gerrit.wikimedia.org/r/318952 [16:12:15] (03PS1) 10Gehel: maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) [16:13:16] (03CR) 10jenkins-bot: [V: 04-1] maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [16:13:26] moritzm: ^ it listed you as on clinic duty this week on the etherpad (so ive updated the topic to reflect you are on duty ;) [16:13:51] (03PS4) 10BBlack: check_sslxNN: require OCSP stapling [puppet] - 10https://gerrit.wikimedia.org/r/318932 (https://phabricator.wikimedia.org/T148490) [16:13:53] (03PS4) 10BBlack: check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [16:14:01] robh: ack [16:15:20] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [10.0] [16:15:33] 06Operations, 06Discovery-Search (Current work): Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2757176 (10Gehel) [16:15:35] 06Operations, 06Discovery-Search (Current work), 13Patch-For-Review, 07Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2757174 (10Gehel) 05Resolved>03Open The patch *should* resolve the issue, but it is not yet deployed. So at thi... 
[16:15:40] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [10.0] [16:16:40] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [10.0] [16:18:16] these ones are kafka1018 related [16:18:30] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [16:19:40] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [16:20:30] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [10.0] [16:20:33] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2757223 (10jcrespo) I feel there is another missunderstanding, there is $::passwords::mysql::maintain_views and $::passwords::labsdb::maintainviews. I wil... [16:20:40] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:21:40] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [16:21:43] Question are dbs automatic (creatable via software/exes) for tools labs or do i have to create one [16:22:07] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2757233 (10chasemp) >>! In T123731#2757001, @yuvipanda wrote: > If we settle on a date and announce on labs-announce... @yuvipanda I think the asks h... 
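Editor's note: the "Under Replicated Partitions" alerts above report a percentage of data above a threshold, not a single reading. A minimal sketch of that style of check follows, assuming it counts the share of datapoints in a time window that exceed the threshold and goes critical once that share reaches 50%. The 50% cut-off is inferred from the "Less than 50.00% above the threshold" recovery text; the window contents and the names `percent_above` and `classify` are hypothetical, and the real check may differ.

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)


def classify(datapoints, critical_threshold, critical_percent=50.0):
    """Return (status, percent); critical_percent is an assumed cut-off."""
    pct = percent_above(datapoints, critical_threshold)
    status = "CRITICAL" if pct >= critical_percent else "OK"
    return status, pct


if __name__ == "__main__":
    # Hypothetical window: 29 samples, 15 of them above the threshold of 10.
    window = [0] * 14 + [12] * 15
    status, pct = classify(window, critical_threshold=10.0)
    # prints "CRITICAL: 51.72% of data above the critical threshold [10.0]"
    print(f"{status}: {pct:.2f}% of data above the critical threshold [10.0]")
```

Under this reading, a check that flaps between PROBLEM and RECOVERY, as kafka1012 and kafka1022 do here, simply means the share of bad samples in the window is hovering around the cut-off.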
[16:22:50] Zppix, you have to create one; its name has to start with a specific prefix, it is somewhere in the docs [16:22:50] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [16:23:08] Zppix, you will get better answers on #wikimedia-labs [16:23:09] jynus: damn ok [16:23:10] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [16:23:29] jynus: nah i think they are bored of me xD [16:24:13] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2757251 (10Marostegui) I am fine with that. What I want to do: - Move the snapshot from dbstore2001 to dbstore2002 and labsdb1008 (needs coordination with Chase). - Build dbstore2002 from there (... [16:24:16] (03PS1) 10Alexandros Kosiaris: icinga: Add comments about paging infrastructure update [puppet] - 10https://gerrit.wikimedia.org/r/318955 [16:25:32] (03CR) 10Rush: [C: 032] labs: tc-setup param thresholds applied [puppet] - 10https://gerrit.wikimedia.org/r/318952 (owner: 10Rush) [16:25:40] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [16:26:00] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 86.67% of data above the critical threshold [10.0] [16:26:21] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 89.66% of data above the critical threshold [10.0] [16:27:30] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [16:28:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [10.0] [16:30:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [16:31:30] 
RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [16:34:40] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exportd is active [16:34:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [16:36:22] kart_: the /etc/init/ garbage for cxserver and apertium-apy have been cleaned on scbXXXX [16:37:37] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#2213575 (10Marostegui) ** Number of crashes es2019: 23rd March & 22nd April & 30th Oct ** Number of crashes es2017: 26th May 30th May, ** Number... [16:39:00] (03PS1) 10Yuvipanda: nfs: Wait 10s between nfs-exportsd restarts [puppet] - 10https://gerrit.wikimedia.org/r/318959 [16:39:10] chasemp: madhuvishy ^ [16:41:35] (03Abandoned) 10Zppix: Adds translations to the user's lang in the links within the readme in the ROOT dir. [puppet] - 10https://gerrit.wikimedia.org/r/315728 (owner: 10Zppix) [16:41:53] (03CR) 10Zppix: "Per i mean" [puppet] - 10https://gerrit.wikimedia.org/r/315728 (owner: 10Zppix) [16:42:18] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#2225024 (10RobH) I'll review all the past and linked ticket histories. We'll need to generate a list of each system, and the overall errors and m... [16:43:03] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757307 (10GWicke) [16:44:16] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2748980 (10Zppix) What's the current node version? 
[16:46:50] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757321 (10GWicke) @Zppix, we currently use the latest 4.x in production. [16:48:19] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757331 (10Zppix) Maybe upgrade to node 5 then see what that does? I feel like the benefits are outweighed by the issues that will come [16:48:28] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2757344 (10chasemp) ok thanks, `$::passwords::labsdb::maintainviews` works for me [16:48:30] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:49:29] 06Operations, 06ELiSo, 10RESTBase, 10VisualEditor, 07Esperanto-Sites: RESTBase thinks beta.wikiversity pages don't exist - https://phabricator.wikimedia.org/T148861#2757349 (10Psychoslave) Seems like it works now, thank you to those who helped. :) [16:50:50] RECOVERY - MegaRAID on db1050 is OK: OK: optimal, 1 logical, 2 physical [16:54:26] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T149509#2757360 (10Marostegui) It got rebuilt ``` ˜/icinga-wm 17:50> RECOVERY - MegaRAID on db1050 is OK: OK: optimal, 1 logical, 2 physical Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... [16:54:46] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T149509#2757361 (10Marostegui) 05Open>03Resolved [16:59:30] PROBLEM - Check size of conntrack table on kafka1014 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T1700). Please do the needful. 
[17:00:23] !log kafka preferred replica election on main-eqiad kafka cluster to promote kafka1003 as leader for its preferred partitions [17:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:55] nothing planned for deployment in WDQS this week (unless SMalyshev has a last minute addition) [17:01:00] ottomata: kafka1014 worries me a bit [17:01:31] elukey: what's up? [17:01:40] nf_conntrack is 90 % full [17:03:15] looking [17:04:26] hm [17:04:42] so /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait is ok [17:04:50] (not 120 but 65) [17:05:02] i would think 1014 and 1020 would be the two highest [17:05:08] since they are leader for a few more partitions [17:05:09] i think it is just more load to the other brokers due to kafka1018's failure [17:05:11] than the others [17:05:18] !log reboot labstore1004 [17:05:20] but 1013 and 1014 are the highest [17:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:22] aye [17:05:44] paravoid: what's up with that pdu? [17:05:56] do we know? [17:06:02] ottomata: according to cmjohnson1 the phase tripped [17:06:11] kafka1018 was the only one affected because it has a bad power supply [17:06:23] cmjohnson1 said he'll file a task about that one [17:06:25] Yay it worked [17:06:28] mutante ^^ [17:06:36] ah ok [17:06:50] ottomata: so maybe it is only a matter of preferred replica election? [17:07:06] hmm, [17:07:07] mayyyybe [17:07:12] i doubt it though [17:07:31] since, because 1018 went down, its partition leaders would be moved elsewhere [17:07:38] i don't think an election would change distribution [17:07:44] let's try [17:07:50] can I ? 
[17:07:50] grrrit-wm: !grrrit-wm-die [17:07:54] !grrrit-wm-die [17:08:01] (03PS1) 10Ottomata: Add kafka1003 into conftool for eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/318961 (https://phabricator.wikimedia.org/T148849) [17:08:13] elukey: ja sure [17:08:51] ottomata: ah snap sorry kafka1018 is still down, I was on kafka1014 and mistakenly thought that it was up again [17:08:54] my bad [17:09:06] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757409 (10GWicke) @Zppix, we generally move between LTS versions. It's clear that we'll move to Node 6 next, the question is more about the timing of the upgrade. [17:09:14] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757410 (10MoritzMuehlenhoff) I'd expect nodejs to be uploaded to unstable soon and since it's available in experimental it's not a big deal anyway, from the internal maintenance perspective providing packages... [17:09:49] (03PS3) 10Muehlenhoff: Also provide imagemagick wrapper in openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811) [17:09:53] ahh [17:09:55] yeah still down [17:10:08] (03CR) 10Ottomata: [C: 032] Add kafka1003 into conftool for eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/318961 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata) [17:10:35] cmjohnson1: ^^ [17:10:40] ottomata: probably raising the max conntrack a bit could help avoid reaching 100% [17:10:46] moritzm: ---^ [17:10:58] kafka1014 is up to 90% nf_conntrack [17:11:15] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757412 (10Zppix) @GWicke I say we work out what needs to be changed/fixed in entirety before doing this update. 
[17:11:17] I'd like to raise the max value a bit across the cluster while kafka1018 is down [17:11:20] having a look [17:11:37] probably most of them are timewaits [17:12:10] +1 [17:12:14] fixed, I'll make the same change to the other kafka brokers [17:12:30] RECOVERY - Check size of conntrack table on kafka1014 is OK: OK: nf_conntrack is 42 % full [17:12:56] Oh i really thought doing !grrrit-wm-die for ^^ would have at least said a quit message [17:13:01] mutante ^^ yay it worked [17:13:14] elukey, ottomata: it's not the time_wait, but the same bug; the kafka brokers have an increased table size compared to the rest of services; 512k [17:13:22] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757441 (10ssastry) The parsoid regression is actually based on a transient parser tests run. Later test runs showed that the perf. was identical to node v4 and v5. I am yet to do real benchmark runs on full p... [17:13:25] ottomata: coming up now [17:13:29] (03PS2) 10Dzahn: admin: add niedzielski to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/318775 (https://phabricator.wikimedia.org/T149233) [17:13:29] paladox: what exactly was it now? [17:13:32] i had a spare psu so i replaced it [17:13:34] paladox i think you have to specify within the code [17:13:34] and that sysctl race set the default value into effect [17:13:37] paladox: SSL ? [17:13:38] mutante !grrrit-wm-die [17:13:48] Yeh it has ssl [17:13:50] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2757445 (10ssastry) [17:13:58] but i mean issuing !grrrit-wm-die in the pm restarts the bot [17:14:10] paladox: ah, that's cool :) [17:14:17] Yeh :) [17:14:25] moritzm: we will have to increase the size on 1018 when it comes up too, right? [17:14:26] paladox you set up the permissions right? 
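Editor's note: the "nf_conntrack is 90 % full" alert discussed above compares the number of tracked connections with the table limit. A minimal sketch of that calculation follows, assuming the standard Linux files `/proc/sys/net/netfilter/nf_conntrack_count` and `nf_conntrack_max` are present (they only exist when the conntrack module is loaded). The function names and the 80/90 thresholds are hypothetical; the log only shows that 90% was CRITICAL and 42% was OK.

```python
from pathlib import Path

# Standard Linux location; absent if the nf_conntrack module is not loaded.
PROC_DIR = Path("/proc/sys/net/netfilter")


def percent_full(count, maximum):
    """Share of the conntrack table in use, as a percentage."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return 100.0 * count / maximum


def read_conntrack(proc_dir=PROC_DIR):
    """Read the current entry count and the table limit from /proc."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


def status(pct, warn=80.0, crit=90.0):
    """Map a percentage to a status; both thresholds are assumptions."""
    if pct >= crit:
        return "CRITICAL"
    if pct >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    if (PROC_DIR / "nf_conntrack_max").exists():
        pct = percent_full(*read_conntrack())
        print(f"{status(pct)}: nf_conntrack is {pct:.0f} % full")
    else:
        print("conntrack is not loaded on this host")
```

Raising `nf_conntrack_max`, as done in the channel, lowers the percentage without changing the number of tracked connections, which is consistent with the drop from 90% to 42% right after the fix.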
[17:14:30] (03PS1) 10Madhuvishy: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 [17:14:48] elukey, ottomata: but I've fixed this in principle on Friday, only needs to be puppetised (and applied to trusty, but irrelevant to kafka): https://phabricator.wikimedia.org/T136094 [17:14:51] Zppix not yet, figuring out how to support these commands, then will have to look at how to do that [17:14:53] (03CR) 10Madhuvishy: [C: 031] nfs: Wait 10s between nfs-exportsd restarts [puppet] - 10https://gerrit.wikimedia.org/r/318959 (owner: 10Yuvipanda) [17:14:54] moritzm: oh sorry I thought it was 256k :( I checked only the /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait [17:15:12] paladox if i had the access i would totally help you out :P [17:15:21] Oh :) [17:15:30] (03CR) 10jenkins-bot: [V: 04-1] nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 (owner: 10Madhuvishy) [17:15:32] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2757461 (10Ottomata) [17:15:32] moritzm: thanks a lot! [17:15:53] ottomata: yes, once it's up run: [17:15:55] sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 [17:15:56] and [17:16:02] sysctl -w net.netfilter.nf_conntrack_max=524288 [17:16:35] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2734746 (10Ottomata) Looking good! 
https://config-master.wikimedia.org/conftool/eqiad/eventbus [17:18:10] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [17:18:10] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [17:18:10] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [17:18:11] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [17:18:11] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [17:18:11] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [17:18:22] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [17:18:22] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [17:18:22] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [17:18:22] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [17:18:22] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [17:18:22] RECOVERY - Host kafka1018 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [17:18:22] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [17:18:22] moritzm: For some reason I was convinced that we moved the net.netfilter.nf_conntrack_max to 256k [17:18:22] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [17:18:30] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [17:18:30] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [17:18:30] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [17:18:30] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [17:18:30] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [17:18:40] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [17:18:40] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [17:18:40] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [17:18:40] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [17:18:45] hello kafka1018 [17:18:48] welcome back [17:18:50] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [17:18:50] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 
54 ESP OK [17:18:50] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [17:18:50] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [17:18:50] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [17:18:51] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [17:18:51] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [17:18:52] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [17:18:52] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [17:18:53] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [17:18:53] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [17:18:54] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 36 ESP OK [17:19:00] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [17:19:00] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [17:19:00] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [17:19:00] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [17:19:00] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [17:19:01] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [17:19:01] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [17:19:02] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [17:19:02] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [17:19:03] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [17:19:08] it's certainly effusive in its return [17:19:20] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [17:19:20] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [17:19:50] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [17:20:57] elukey, ottomata: corrected the conntrack values on all kafka/analytics (including 1018) [17:23:41] What the actual hell icinga-wm [17:24:27] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - 
https://phabricator.wikimedia.org/T148843#2757492 (10ellery) [[ http://www.geforce.com/hardware/10series/titan-x-pascal | This ]] is the GPU we would like to order. [17:24:51] !grrrit-wm-die [17:24:55] Yay [17:25:03] mutante ^^ it actually work [17:25:13] mutante try !grrrit-wm-die when the bot comes back [17:27:32] !log Chris moved cr2-eqiad:xe-5/0/[0-2] and xe-5/1/2 to xe-3/1/[0-3] [17:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:34] (03CR) 10Dzahn: [C: 032] "approved in ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/318775 (https://phabricator.wikimedia.org/T149233) (owner: 10Dzahn) [17:35:00] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2757566 (10mark) Row A-D uplinks to cr2-eqiad have all been moved from fpc 5 to fpc 3. Remaining: - pfw1 uplinks (xe-5/0/3) - Zayo wavelength to codfw (xe-5/2/3) - Equinix Ashburn port (x... [17:36:21] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [17:36:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 86.67% of data above the critical threshold [5000000.0] [17:36:55] (03PS2) 10Madhuvishy: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 [17:37:30] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [17:37:30] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [17:37:50] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [17:39:20] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:39:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 76.47% of data above the critical threshold [5000000.0] [17:40:01] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2757599 (10Dzahn) a:03Dzahn [17:40:40] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [17:40:49] !grrrit-wm-die [17:40:59] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2746004 (10Dzahn) @Niedzielski @Sniedzielski @hashar on gallium: [gallium:~] $ id niedzi... [17:41:26] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2757607 (10Dzahn) [17:41:40] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2757609 (10greg) [17:41:48] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2746004 (10Dzahn) 05Open>03Resolved [17:49:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:49:53] 06Operations, 10netops: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2757634 (10mark) Reverse DNS (interface names) should also be updated for all moved ports... 
[17:51:50] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2757635 (10RobH) [17:52:05] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (10RobH) [17:59:08] !log kafka preferred-prelica-election for analytics-eqiad to promote kafka1018 as leader [17:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T1800). Please do the needful. [18:03:29] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2757698 (10yuvipanda) [18:04:21] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (10yuvipanda) [18:05:07] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (10RobH) Please note we need to have some additional rationale on why these systems will be needed, since they are high cost syst... 
[18:05:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:06:04] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2757722 (10yuvipanda) I edited the task to have some more info on rationale. [18:07:30] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:10:20] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 406 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:13:55] (03PS5) 10BBlack: check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [18:13:57] (03PS1) 10BBlack: check_ssl: clean up ssl_verify/_subject_matches [puppet] - 10https://gerrit.wikimedia.org/r/318968 [18:13:59] (03PS1) 10BBlack: check_ssl: add --sans argument [puppet] - 10https://gerrit.wikimedia.org/r/318969 [18:14:01] (03PS1) 10BBlack: check_ssl: append (RSA|ECDSA) to name if authalg specified [puppet] - 10https://gerrit.wikimedia.org/r/318970 [18:14:03] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2757739 (10yuvipanda) [18:14:03] (03PS1) 10BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - 10https://gerrit.wikimedia.org/r/318971 [18:14:52] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Requesting access to contint for niedzielski - https://phabricator.wikimedia.org/T149233#2757740 (10Niedzielski) Thanks @Dzahn, @hashar, @Legoktm, @greg! 
[18:16:31] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2757742 (10yuvipanda) [18:16:46] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (10yuvipanda) [18:17:20] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 634 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3080188 keys, up 9 hours 55 minutes - replication_delay is 634 [18:19:50] PROBLEM - Varnishkafka Delivery Errors per minute on cp3045 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [20000.0] [18:19:57] since noone is doing the swat, and since i forgot to add my patches to the list, I will swat them myself (labs config change). [18:21:21] (03PS2) 10Yurik: LABS: Enable tabular remote access to tabular data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318883 (https://phabricator.wikimedia.org/T148745) [18:21:39] (03PS2) 10Dzahn: admin: create shell account for Zareen Farooqui [puppet] - 10https://gerrit.wikimedia.org/r/318688 (https://phabricator.wikimedia.org/T149211) [18:22:40] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3069969 keys, up 10 hours - replication_delay is 0 [18:22:57] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2649413 (10jcrespo) > match existing labsdbs ordered on T131363 For that, we bought HDs, not full servers, but the plan was to buy HDs a... 
[18:23:24] (03CR) 10MaxSem: [C: 031] LABS: Enable tabular remote access to tabular data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318883 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [18:24:08] 06Operations, 10Gerrit, 10grrrit-wm: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757787 (10Paladox) [18:27:41] (03PS1) 10Rush: labstore: secondary cluster setup eth1 using interface::manual [puppet] - 10https://gerrit.wikimedia.org/r/318973 [18:28:02] (03CR) 10Faidon Liambotis: "Looks good, but I'm wondering if the semantic of --sans should be the exhaustive list of SANs, rather than the minimum set of SANs present" [puppet] - 10https://gerrit.wikimedia.org/r/318969 (owner: 10BBlack) [18:28:06] (03PS2) 10Rush: labstore: secondary cluster setup eth1 using interface::manual [puppet] - 10https://gerrit.wikimedia.org/r/318973 [18:28:36] (03PS3) 10Dzahn: admin: create shell account for Zareen Farooqui [puppet] - 10https://gerrit.wikimedia.org/r/318688 (https://phabricator.wikimedia.org/T149211) [18:28:59] 06Operations, 10Gerrit, 10grrrit-wm: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757805 (10Zppix) Easy fix have lolrrrit send a message in -labs or saying !log grrrit-wm restarting for maintenance (or whatever) then having it wait... [18:29:02] (03CR) 10Faidon Liambotis: [C: 032] "Nice catch." 
[puppet] - 10https://gerrit.wikimedia.org/r/318968 (owner: 10BBlack) [18:29:53] (03CR) 10Dzahn: [C: 032] admin: create shell account for Zareen Farooqui [puppet] - 10https://gerrit.wikimedia.org/r/318688 (https://phabricator.wikimedia.org/T149211) (owner: 10Dzahn) [18:29:57] (03PS4) 10Dzahn: admin: create shell account for Zareen Farooqui [puppet] - 10https://gerrit.wikimedia.org/r/318688 (https://phabricator.wikimedia.org/T149211) [18:30:20] (03CR) 10Faidon Liambotis: [C: 04-1] Replace check_sslxNN with check_ssl_unified (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318971 (owner: 10BBlack) [18:31:17] (03CR) 10Faidon Liambotis: [C: 032] check_ssl: append (RSA|ECDSA) to name if authalg specified [puppet] - 10https://gerrit.wikimedia.org/r/318970 (owner: 10BBlack) [18:31:42] (03CR) 10jenkins-bot: [V: 04-1] labstore: secondary cluster setup eth1 using interface::manual [puppet] - 10https://gerrit.wikimedia.org/r/318973 (owner: 10Rush) [18:32:14] (03CR) 10Faidon Liambotis: [C: 032] check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) (owner: 10BBlack) [18:32:39] 06Operations, 10Gerrit, 10grrrit-wm: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757835 (10Dzahn) Yes, i asked for this ticket to implement exactly that. The fix isn't as trivial as you make it sound though. First of all there need... 
[18:33:30] 06Operations, 10Gerrit, 10grrrit-wm: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757837 (10Zppix) dz [18:33:33] (03PS2) 10BBlack: check_ssl: add --sans argument [puppet] - 10https://gerrit.wikimedia.org/r/318969 [18:33:35] (03CR) 10Faidon Liambotis: [C: 031] Replace check_sslxNN with check_ssl_unified (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318971 (owner: 10BBlack) [18:33:37] (03PS2) 10BBlack: check_ssl: append (RSA|ECDSA) to name if authalg specified [puppet] - 10https://gerrit.wikimedia.org/r/318970 [18:33:39] (03PS6) 10BBlack: check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [18:34:07] 06Operations, 10Gerrit, 10grrrit-wm: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757838 (10Zppix) Dzahn couldn't we just do what i suggested it's doing the same thing no? 
[18:42:15] (03CR) 10Rush: nfs: Add script to manage NFS server on labstore secondary cluster (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318963 (owner: 10Madhuvishy) [18:42:59] (03PS3) 10Rush: labstore: secondary cluster setup eth1 using interface::manual [puppet] - 10https://gerrit.wikimedia.org/r/318973 [18:47:11] (03CR) 10Yurik: [C: 032] LABS: Enable tabular remote access to tabular data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318883 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [18:47:40] (03Merged) 10jenkins-bot: LABS: Enable tabular remote access to tabular data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318883 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [18:48:42] (03PS4) 10Rush: labstore: secondary cluster setup eth1 using interface::manual [puppet] - 10https://gerrit.wikimedia.org/r/318973 [18:49:09] (03PS1) 10Dzahn: fix permissions on changepw script, let all users run it [puppet] - 10https://gerrit.wikimedia.org/r/318974 [18:50:12] (03PS2) 10Dzahn: fix permissions on changepw script, let all users run it [puppet] - 10https://gerrit.wikimedia.org/r/318974 [18:52:02] (03CR) 10Rush: [C: 032] labstore: secondary cluster setup eth1 using interface::manual [puppet] - 10https://gerrit.wikimedia.org/r/318973 (owner: 10Rush) [18:52:19] (03CR) 10Dzahn: [C: 032] fix permissions on changepw script, let all users run it [puppet] - 10https://gerrit.wikimedia.org/r/318974 (owner: 10Dzahn) [18:52:24] (03PS3) 10Dzahn: fix permissions on changepw script, let all users run it [puppet] - 10https://gerrit.wikimedia.org/r/318974 [18:55:09] !log yurik@tin Synchronized wmf-config: labs syncup https://gerrit.wikimedia.org/r/#/c/318883 (duration: 00m 49s) [18:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:39] (03PS4) 10Yuvipanda: tools: Grant clush user complete sudo rights for everything [puppet] - 10https://gerrit.wikimedia.org/r/315736 [18:56:50] (03CR) 
10Yuvipanda: [C: 032 V: 032] "Here we go!" [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:59:22] hmm, there seems to be a flood of logs for wmgWatchlistNumberOfDaysShow [19:01:06] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2757960 (10yuvipanda) @jcrespo my understanding of what was communicated to you (both at the offsite and other non-phabricator venues) wa... [19:03:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 618 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3073598 keys, up 10 hours 42 minutes - replication_delay is 618 [19:07:28] PROBLEM - configured eth on wmf4750 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:07:39] Reedy, did you deploy https://gerrit.wikimedia.org/r/#/c/316295/ ??? [19:07:54] yurik: Yeah [19:08:09] Reedy, i just did a sync-dir on wmf-config, and it flooded logs [19:08:15] and i only deployed -labs files [19:08:27] investigating ... [19:08:35] its complaining about $wgDefaultUserOptions['watchlistdays'] = $wmgWatchlistNumberOfDaysShow; [19:08:42] 14:00 Reedy: that deploy was was "Show changes from last 14 days in watchlist in cswiki T148327 " [19:08:42] 14:00 reedy@tin: Synchronized wmf-config/: (no message) (duration: 00m 50s) [19:08:43] T148327: Show changes which was made in last 14 days in watchlist in cswiki by default (for new users) - https://phabricator.wikimedia.org/T148327 [19:09:47] yurik: I can only presume there's some issue with InitialiseSettings not being reparsed, and just loaded from cache? [19:09:58] touch and sync it maybe? [19:10:04] trying... [19:10:10] Reedy maybe recache the initsettings [19:10:26] Zppix: I just said that [19:10:42] And this shouldn't be needed. 
scap touchs IS on the appservers as part of every deploy [19:10:45] (03PS2) 10Yuvipanda: nfs: Wait 10s between nfs-exportsd restarts [puppet] - 10https://gerrit.wikimedia.org/r/318959 [19:11:25] (03CR) 10Yuvipanda: [C: 032 V: 032] nfs: Wait 10s between nfs-exportsd restarts [puppet] - 10https://gerrit.wikimedia.org/r/318959 (owner: 10Yuvipanda) [19:12:20] ok, i just did `touch InitialiseSettings.php` , scaping it... [19:12:25] !log yurik@tin Synchronized wmf-config/InitialiseSettings.php: touch and sync - logs are flooded (duration: 00m 46s) [19:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:05] If that works... I think we should file a task to check on the appserver touching that is supposed to happen to invalidate the file cache of IS [19:13:15] Orrr... It's hhvm sucking [19:13:37] yep, seems like ti worked [19:13:46] thanks yurik [19:14:03] thank you, i didn't even think of cache, was trying to figure out what's wrong with the code [19:14:46] Usually, it's something like that for starters [19:14:56] Seemingly sync-dir wmf-config isn't so useful [19:14:58] Reedy maybe have an auto init settings touch and sync when certain things are done ? [19:15:07] Zppix: It does [19:15:17] Reedy, should i file a bug, or do you want to do it? 
[19:15:22] Zppix: https://phabricator.wikimedia.org/T60618 [19:15:22] ah i see [19:15:23] ok [19:15:37] Been fixed for 2.5 years [19:17:31] (03CR) 10BBlack: Replace check_sslxNN with check_ssl_unified (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318971 (owner: 10BBlack) [19:17:48] !log restarted varnishkafka-webrequest on cp2018 and cp3045 (CRITICALs in icinga, librdkafka errors logged for kafka1018.eqiad.wmnet) [19:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:08] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3070547 keys, up 10 hours 56 minutes - replication_delay is 0 [19:18:51] grrrit-wm-230500525-h7118 [19:18:54] Wops [19:19:04] !grrrit-wm-die [19:19:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp3045 is OK: OK: Less than 80.00% above the threshold [0.0] [19:21:40] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2758017 (10jcrespo) a:05jcrespo>03None So this are the privileges created on all labsdbs (not yet on 9/10/11), but on 8 and the existing labs dbs: {P... [19:23:08] PROBLEM - puppet last run on aluminium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:25:08] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [19:28:05] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2758037 (10Krinkle) >>! In T147845#2749511, @BBlack wrote: > Ok, I was only considering the websockets case. Still, since the python code is unaware of X-Client-IP..... 
[19:30:29] (03PS1) 10Yurik: LABS: Enable tabular data lua support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318977 (https://phabricator.wikimedia.org/T148745) [19:32:30] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2758046 (10BBlack) I just don't really want to support that at the end of the day - the complexity cost is too high for all the rest of our stack (not that websockets... [19:33:09] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2758047 (10jcrespo) Arbitrary access clarified, I still see as new serving extra datasets that was not part of the original communication... [19:35:21] 06Operations, 10MediaWiki-Configuration, 10Wikimedia-Developer-Summit (2017): Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2758050 (10Joe) [19:36:46] (03Abandoned) 10Dzahn: admin: let datacenter-ops run script to change mgmt passwords [puppet] - 10https://gerrit.wikimedia.org/r/318654 (owner: 10Dzahn) [19:41:10] (03PS5) 10BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - 10https://gerrit.wikimedia.org/r/318971 [19:42:33] (03PS3) 10BBlack: check_ssl: add --sans argument [puppet] - 10https://gerrit.wikimedia.org/r/318969 [19:42:35] (03PS2) 10BBlack: check_ssl: clean up ssl_verify/_subject_matches [puppet] - 10https://gerrit.wikimedia.org/r/318968 [19:42:37] (03PS6) 10BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - 10https://gerrit.wikimedia.org/r/318971 [19:42:39] (03PS3) 10BBlack: check_ssl: append (RSA|ECDSA) to name if authalg specified [puppet] - 10https://gerrit.wikimedia.org/r/318970 [19:42:41] (03PS7) 10BBlack: check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [19:44:55] (03CR) 
10BBlack: [C: 032] check_ssl: add --sans argument [puppet] - 10https://gerrit.wikimedia.org/r/318969 (owner: 10BBlack) [19:45:58] !grrrit-wm-die [19:47:47] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758118 (10GWicke) [19:51:21] RECOVERY - puppet last run on aluminium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:52:57] !grrrit-wm-die [19:59:14] (03PS7) 10BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - 10https://gerrit.wikimedia.org/r/318971 [19:59:17] (03PS4) 10BBlack: check_ssl: append (RSA|ECDSA) to name if authalg specified [puppet] - 10https://gerrit.wikimedia.org/r/318970 [19:59:18] (03PS8) 10BBlack: check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T2000). [20:01:11] (03CR) 10BBlack: [C: 032 V: 032] check_ssl: support OCSP Stapling [puppet] - 10https://gerrit.wikimedia.org/r/318931 (https://phabricator.wikimedia.org/T148490) (owner: 10BBlack) [20:08:58] (03PS1) 10Yuvipanda: ssh: Disable 2fa for labs [puppet] - 10https://gerrit.wikimedia.org/r/318981 (https://phabricator.wikimedia.org/T147998) [20:11:11] no mobileapps deploy today [20:11:19] bblack: hola [20:12:18] bblack: if you are there can you give me some feedback as to whether my varnish pseudo code in the A/B testing doc makes sense? 
https://docs.google.com/document/d/1jRGjVAthJXoCovxyvXWyg07R1POb8zvD_n8IlJXrPVM/edit# [20:12:55] nuria: yes, eventually :) [20:13:04] jaja, that works [20:14:00] bblack: I think we have clear how to properly select users into bucket so we get statistically sound samples , we are going to work whether these buckets allow us to analyze the data [20:14:35] bblack: also any privacy implications that you see would be great to flag. i think partition our user base in >=1000 buckets should be fine [20:14:39] (03PS8) 10BBlack: Replace check_sslxNN with check_ssl_unified [puppet] - 10https://gerrit.wikimedia.org/r/318971 [20:14:53] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2758199 (10RobH) 05Open>03stalled I received an IRC notice from Mark to start working on this, from an out of band coversat... [20:15:06] !log starting Parsoid deploy [20:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:22] 06Operations, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758217 (10GWicke) Here is a quick test of Parsoid's performance with 4.6 vs. 6.9: - Test setup: - Enable WMF wikis in Parsoid config - Start with single worker (`node_modules/.bin/service-runner -n 1`)... 
[20:16:03] !log upgrading cache_maps to nginx-1.11.4-1+wmf13 [20:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:04] 06Operations, 06Parsing-Team, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758240 (10GWicke) [20:22:26] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2758245 (10BBlack) [20:22:28] 06Operations, 10Traffic, 13Patch-For-Review: Extend check_sslxnn to check OCSP Stapling - https://phabricator.wikimedia.org/T148490#2758242 (10BBlack) 05Open>03Resolved a:03BBlack Fixed now in check_ssl itself. [20:24:10] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1149975 (10BBlack) [20:24:11] 06Operations, 10Traffic, 07Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2758247 (10BBlack) [20:24:16] 06Operations, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2758249 (10yuvipanda) This continues to cause issues. Clush doesn't work from tools-puppetmaster-02, at least partially because: ``` Oct 31... [20:25:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:26:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:27:27] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2758256 (10BBlack) This task has continued to evolve. Basically, the remaining steps on the current path to resolution are: 1. Deploy... 
[20:34:22] !log updated Parsoid to version e503e801 (T149504) [20:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:24] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2758264 (10BBlack) @EdErhart-WMF are you the person now working on this? Can we get a status update fixing the remaining issue (correct HSTS header)? [20:39:31] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 37.28 ms [20:41:31] RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 37.42 ms [20:43:24] papaul: Hi! it looks like labstore2001 won't come online and needs manual intervention, could you see if you can bring it back up? [20:44:02] madhuvishy: working on ti [20:44:04] it [20:44:09] papaul: thank you! [20:44:12] yw [20:44:41] 06Operations, 06Parsing-Team, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758280 (10GWicke) So, here is a proposal: - Double-check Node 6 support for all production services. We have been testing major services against Node 6 for a while now without finding any i... [20:53:21] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:54:41] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:55:45] (03CR) 10Dzahn: "yes, i'd have to agree with bd808, since i already saw some discussion about what the default value is on different distros." [puppet] - 10https://gerrit.wikimedia.org/r/318981 (https://phabricator.wikimedia.org/T147998) (owner: 10Yuvipanda) [21:00:05] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T2100). 
[21:00:05] (03CR) 10Dzahn: "since it only changes that one thing it could even be just "enable_challenge_response_auth" yes/no rather than having to explain what "ena" [puppet] - 10https://gerrit.wikimedia.org/r/318981 (https://phabricator.wikimedia.org/T147998) (owner: 10Yuvipanda) [21:00:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [21:00:41] (03PS2) 10Reedy: Enable Ex:OATHAuth on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) (owner: 10CSteipp) [21:01:14] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2758336 (10Dzahn) a:03Dzahn [21:01:21] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3071383 keys, up 12 hours 39 minutes - replication_delay is 0 [21:03:06] (03PS3) 10CSteipp: Enable Ex:OATHAuth on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) [21:05:08] (03CR) 10Reedy: [C: 032] Enable Ex:OATHAuth on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) (owner: 10CSteipp) [21:05:36] (03Merged) 10jenkins-bot: Enable Ex:OATHAuth on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) (owner: 10CSteipp) [21:07:28] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable OATHAuth on officewiki (duration: 00m 47s) [21:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:35] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - 
https://phabricator.wikimedia.org/T142226#2758360 (10atgo) [21:08:58] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Enable OATHAuth on officewiki (duration: 00m 48s) [21:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:27] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2527912 (10atgo) [21:12:11] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2758393 (10Dzahn) >>! In T149211#2747507, @Ottomata wrote: > researchers, statistics-privatedata-users,... [21:17:18] (03PS1) 10Rush: labstore: nfs-manage-binds add option to list bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/318991 [21:17:30] (03PS2) 10Rush: labstore: nfs-manage-binds add option to list bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/318991 [21:18:21] 06Operations, 06Parsing-Team, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758422 (10GWicke) [21:20:36] (03CR) 10MaxSem: [C: 031] Removed unused wmgUseGraphWithNamespace support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318862 (owner: 10Yurik) [21:23:01] (03PS4) 10Madhuvishy: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 [21:24:45] (03PS3) 10Rush: labstore: nfs-manage-binds add option to list bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/318991 [21:27:19] (03PS1) 10Dzahn: admin: add zareen to *-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/318992 (https://phabricator.wikimedia.org/T149211) [21:27:44] (03PS5) 10Madhuvishy: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 
10https://gerrit.wikimedia.org/r/318963 [21:28:35] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:25] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:20] PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:35] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [21:30:35] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [21:30:54] * volans looking [21:31:11] RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes [21:31:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2758464 (10Dzahn) Ok, I see now. It was about "analytics-users" being redundant. [21:32:05] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:32:18] (03PS6) 10Rush: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 (owner: 10Madhuvishy) [21:33:05] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:35:26] (03CR) 10Rush: [C: 031] "small notes, looks good thanks!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/318963 (owner: 10Madhuvishy) [21:36:49] (03PS2) 10Dzahn: admin: add zareen to *-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/318992 (https://phabricator.wikimedia.org/T149211) [21:36:57] (03PS3) 10Dzahn: admin: add zareen to *-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/318992 (https://phabricator.wikimedia.org/T149211) [21:37:38] (03CR) 10Dzahn: [C: 032] admin: add zareen to *-privatedata-users, researchers [puppet] - 10https://gerrit.wikimedia.org/r/318992 (https://phabricator.wikimedia.org/T149211) (owner: 10Dzahn) [21:40:49] (03PS4) 10Rush: labstore: nfs-manage-binds add option to list bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/318991 [21:42:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2758491 (10Dzahn) Hi @Zareenf your user has now been created and added to the groups: researchers, st... 
[21:42:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users for Zareen - https://phabricator.wikimedia.org/T149211#2758492 (10Dzahn) 05Open>03Resolved [21:42:55] (03PS5) 10Rush: labstore: nfs-manage-binds add option to list bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/318991 [21:50:47] 06Operations, 10DBA: db1065 paged for NRPE timeout - https://phabricator.wikimedia.org/T149633#2758508 (10Volans) [21:51:20] * volans opened the above task for db1065 for further investigation, all looks good for now ^^^ [21:52:21] (03CR) 10Rush: [C: 032] labstore: nfs-manage-binds add option to list bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/318991 (owner: 10Rush) [21:54:01] (03CR) 10Rush: "Should be added to 'down' post nfsd stop:" [puppet] - 10https://gerrit.wikimedia.org/r/318963 (owner: 10Madhuvishy) [21:58:39] (03PS7) 10Madhuvishy: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 [22:00:22] papaul: any luck with labstore2001? [22:00:47] (03CR) 10Madhuvishy: [C: 032] nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 (owner: 10Madhuvishy) [22:00:54] (03PS8) 10Madhuvishy: nfs: Add script to manage NFS server on labstore secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/318963 [22:00:56] madhuvishy: stay working on it [22:01:08] papaul: okay :) [22:01:12] (03PS1) 10Yurik: Turn off revision number in graph img srv [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318994 [22:10:12] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2758571 (10Dzahn) >>! In T149609#2757838, @Zppix wrote: > Easy fix have lolrrrit send a message in -labs or saying !log grrrit-wm... 
[22:12:27] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757787 (10Luke081515) grrrit-wm can post a !log message on his own, like wikibugs does. That would not require firewall changes... [22:12:32] 06Operations, 06Parsing-Team, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758575 (10GWicke) [22:12:45] mutante: you have an reply ;)+ [22:14:22] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2758578 (10Dzahn) And how would that be triggered from gerrit, when the whole point of needing the restart of the bot is. that it... [22:16:06] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 37.77 ms [22:17:41] 06Operations, 06Parsing-Team, 06Services (later): Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2758582 (10GWicke) [22:18:26] PROBLEM - Disk space on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:18:26] PROBLEM - SSH on labstore2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:38] !grrrit-wm-die [22:18:46] !grrrit-wm-die [22:18:46] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:18:56] PROBLEM - MD RAID on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:18:56] PROBLEM - puppet last run on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:18:59] mutante ^^ ive figured out a whitelist [22:19:04] Terminator ^^ [22:19:06] PROBLEM - DPKG on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:06] PROBLEM - configured eth on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:19:17] PROBLEM - dhclient process on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:22] now i need to to figure out a regex based one [22:20:26] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:20:36] paladox: can you make the number of "r"s optional?:) j/k [22:20:51] madhuvishy: please check [22:20:58] mutante not sure what you mean? [22:21:06] madhuvishy: updating the ticket now [22:21:42] paladox: is this possible? instead of "!grrrit-wm die" this "grrrit-wm: die" [22:21:58] nice about the whitelist [22:22:46] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [22:23:27] mutante i think a known problem is we should wait at least 1 - 2 minutes after the bot rejoins before issuing !grrrit-wm-die otherwise it takes a while to rejoin [22:23:41] Oh [22:23:45] i doint think so [22:23:51] grrrit-wm: die [22:26:00] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2758617 (10Papaul) labstore2001 I don't know how this system was designed in the first place but here is the problem in the RAID controller settings the option “Enable BIOS stop on Error” was checked that is the r... [22:28:11] grrrit-wm: die [22:28:15] mutante ^^ [22:28:16] it is now [22:28:48] papaul: hmmm is it supposed to be back? [22:29:10] i can't ssh in, also icinga claims it's down [22:30:09] grrrit-wm: restart [22:33:05] grrrit-wm: die restart [22:34:46] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 57 failures. Last run 2 minutes ago with 57 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [22:38:40] grrrit-wm: die restart [22:41:00] madhuvishy: when i ssh in to mgmt and do console com2 here is what i get [22:41:03] Login incorrect. 
[22:41:06] Give root password for maintenance [22:41:08] (or type Control-D to continue): [22:41:29] madhuvishy: i get ^ [22:41:36] papaul: ow [22:49:26] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:50:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [22:53:18] grrrit-wm: die restart [22:53:36] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:54:25] paladox, what are you doing to grrrit-wm? [22:54:48] Krenair i am creating a command to restart it from irc based on a whitelist. [22:55:00] This is so when restarting production gerrit we can automate things [22:55:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [22:55:36] can't you use a test bot somewhere? [22:55:57] We doint have a test bot, but that was approved today [22:56:10] Krenair could you setup a project in labs for this please? [22:56:17] no [22:56:32] https://phabricator.wikimedia.org/T149529 [22:57:06] papaul: do you know if there's anything we can do to get it back up? [22:59:16] Krenair i am going to do one more restart then im stopping, just adding more users to the white list [22:59:24] madhuvishy: i think we need to ask Moritz [23:00:00] the whitelist is per-user instead of cloak-wildcarding? [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161031T2300). Please do the needful. [23:00:05] kaldari, yurik, and bmansurov: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:16] here [23:00:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [23:00:27] Krenair yes, i would do cloaks but i doint know how to do that [23:00:28] here [23:01:31] moritzm: if/when you are around, we can't seem to bring labstore2001 and up, and when papaul tries to ssh into mgmt he gets a Login incorrect, Give root password for maintenance (or type Control-D to continue): prompt. The updated ticket is here - https://phabricator.wikimedia.org/T149567#2758617. [23:02:21] Krenair https://gerrit.wikimedia.org/r/#/c/318976/ [23:02:47] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [23:03:51] 06Operations, 10ops-codfw: labstore2001 doesn't boot - https://phabricator.wikimedia.org/T149567#2758698 (10madhuvishy) Labstore2001 is still down. When @papaul tries to ssh into management console and does console com2 it says: Login incorrect. Give root password for maintenance (or type Control-D to continu... [23:05:06] here [23:05:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [23:05:18] madhuvishy: i need to leave now if not a 40 minutes drive home will be 1:30 [23:05:47] papaul: yeah alright thanks for your help today [23:06:58] anybody swat deploying today? 
[23:10:26] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [23:11:07] (03PS1) 10Madhuvishy: nfs: Move drbd resource config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/319004 [23:12:21] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2758735 (10BBlack) [23:13:20] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2684468 (10BBlack) After some IRC discussion, it seemed better to host the content pages on metawiki. It will look more-official, and it will also be easier to develop an... [23:13:56] (03PS2) 10Madhuvishy: nfs: Move drbd resource config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/319004 [23:15:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [23:15:47] (03CR) 10Madhuvishy: [C: 032] nfs: Move drbd resource config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/319004 (owner: 10Madhuvishy) [23:16:44] Dereckson, thcipriani, ostriches, twentyafterfour: anybody SWATing today? [23:18:36] hrm. No one else available? It's a bad time. I can get it done though. Give me a minute. [23:18:59] (03PS4) 10Thcipriani: Create patroller usergroup for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [23:19:15] thcipriani: Yeah, bad time for everyone I think. Middle of WMF halloween party :) [23:19:24] heh, oh [23:19:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [23:19:48] kaldari: just do it live! 
a real scary for the halloween party [23:19:54] ^ [23:19:56] *scare [23:20:10] (03Merged) 10jenkins-bot: Create patroller usergroup for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [23:20:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [23:20:46] kaldari: patch is live on mw1099, check please [23:20:57] checking.... [23:21:52] !log disabled puppet on bromine temp. issue with reprepo config for releases [23:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:19] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (10Volans) [23:22:26] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:22:34] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758752 (10Volans) [23:24:12] thcipriani: OK, looks good from mw1099, feel free to sync [23:25:07] kaldari: ok, going live [23:25:16] ACKNOWLEDGEMENT - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] Jeff_Green RAID fail, ticketing in phabricator [23:26:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:317824|Create patroller usergroup for enwiki (T149019)]] (duration: 00m 46s) [23:26:02] ^ kaldari live everywhere [23:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:06] T149019: Add the patroller group to the English Wikipedia - https://phabricator.wikimedia.org/T149019 [23:26:17] thcipriani: looks good, checking error logs... [23:26:23] thanks :) [23:26:26] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:26:27] (03PS2) 10Thcipriani: Removed unused wmgUseGraphWithNamespace support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318862 (owner: 10Yurik) [23:26:34] ^ yurik still around? [23:26:37] yep [23:26:41] okie doke [23:26:49] * yurik is always around... except when he breaks things [23:27:02] yurik i agree [23:27:10] thcipriani: looks like we're all good, thanks! [23:27:22] kaldari: awesome, thanks for checking :) [23:27:23] musikanimal: It's live now: https://en.wikipedia.org/wiki/Special:ListGroupRights , wanna help me test [23:27:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318862 (owner: 10Yurik) [23:27:51] (03Merged) 10jenkins-bot: Removed unused wmgUseGraphWithNamespace support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318862 (owner: 10Yurik) [23:28:01] musikanimal: the new group is called "New page reviewers" [23:28:42] yurik: first change https://gerrit.wikimedia.org/r/#/c/318862/2 live on mw1099, check please [23:28:47] how should I test? [23:28:48] (if there is anything to check) [23:29:06] musikanimal: I think admins should be able to assign this group [23:29:07] testing... 
[23:29:32] musikanimal: see if you can add me to it [23:30:01] thcipriani, looks good [23:30:08] yurik: ok, going live everywhere [23:30:25] done [23:31:30] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:318862|Removed unused wmgUseGraphWithNamespace support]] PART I (duration: 00m 47s) [23:31:33] musikanimal: looks like it's filling up: https://en.wikipedia.org/w/index.php?title=Special:ListUsers&group=patroller [23:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:53] I've added some people [23:32:18] (03PS2) 10Thcipriani: LABS: Enable tabular data lua support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318977 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [23:32:27] (03PS1) 10Madhuvishy: nfs: Fix hiera variable access for drbd config [puppet] - 10https://gerrit.wikimedia.org/r/319006 [23:32:28] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:318862|Removed unused wmgUseGraphWithNamespace support]] PART II (duration: 00m 45s) [23:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:37] ^ yurik first change done syncing [23:32:56] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318977 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [23:33:14] ^ that should go out on beta next time beta-scap-eqiad runs [23:33:23] (well, once it merges :)) [23:33:24] (03Merged) 10jenkins-bot: LABS: Enable tabular data lua support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318977 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik) [23:34:09] (03CR) 10Madhuvishy: [C: 032] nfs: Fix hiera variable access for drbd config [puppet] - 10https://gerrit.wikimedia.org/r/319006 (owner: 10Madhuvishy) [23:35:14] (03PS2) 10Thcipriani: Turn off revision number in graph img srv [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318994 (owner: 10Yurik) 
[23:35:37] !log thcipriani@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:318977|LABS: Enable tabular data lua support (T148745)]] (housekeeping sync) (duration: 00m 46s) [23:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:43] T148745: Epic: Enable data namespace with tabular support on Commons - https://phabricator.wikimedia.org/T148745 [23:35:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318994 (owner: 10Yurik) [23:36:27] (03Merged) 10jenkins-bot: Turn off revision number in graph img srv [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318994 (owner: 10Yurik) [23:37:10] yurik: ^ is live on mw1099, check please [23:37:18] testing... [23:38:01] (03CR) 10Andy M. Wang: [C: 031] Remove patrol from autoconfirmed and reviewer for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium) [23:38:29] thcipriani, seems to be ok [23:38:45] hopefully we will get a much higher cache hit rate after this [23:38:54] yurik: ok, going live everywhere [23:39:15] thcipriani, do you have that phrase as a hotkey or a keyboard macro somewhere? :) [23:39:34] only in my mind :) [23:39:58] (but I probably should, would save me time—metascap) [23:40:03] hehe [23:40:14] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:318994|Turn off revision number in graph img srv]] (duration: 00m 46s) [23:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:21] ^ yurik live everywhere [23:40:48] bmansurov: still around for SWAT? 
[23:40:56] yes [23:41:02] ok [23:41:21] (03PS2) 10Thcipriani: MF Beta: Don't move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318978 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [23:41:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318978 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [23:42:26] (03Merged) 10jenkins-bot: MF Beta: Don't move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318978 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [23:43:22] bmansurov: change is live on mw1099, check please [23:43:35] thcipriani: ok [23:44:09] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2758792 (10Jgreen) [23:44:09] thcipriani: it's working as expected [23:44:12] thcipriani: thank you [23:44:24] bmansurov: ok, going live everywhere [23:44:56] thcipriani: great [23:46:52] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:318978|MF Beta: Do not move first paragraph before infobox (T145216) (T149389)]] (duration: 00m 46s) [23:46:58] ^ bmansurov live everywhere [23:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:58] T145216: MobileFormatter should relocate first paragraph ahead of infobox - https://phabricator.wikimedia.org/T145216 [23:46:59] T149389: Not Found Error in MobileFormatter - https://phabricator.wikimedia.org/T149389 [23:47:09] thcipriani, all my patches done? [23:47:26] yurik: all the ones I saw... [23:47:35] thcipriani: awesome thanks [23:47:59] cool, thx! [23:48:02] yurik: yeap. The 2nd one was labs only, just sync'd everywhere without having you test. beta-scap-eqiad will deploy automatically [23:48:41] cool. Declaring evening SWAT complete. 
[23:54:26] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 93.33% of data above the critical threshold [1800.0] [23:54:46] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures