[00:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:00:08] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:421194|guwiki: fix rollback -> rollbacker (group) (T190370)]] (duration: 01m 16s) [00:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:14] T190370: guwiki sysops cannot add or remove 'rollback' but they're allowed on AddGroups/RemoveGroups - https://phabricator.wikimedia.org/T190370 [00:00:19] Hauskatze: ^ live everywhere [00:00:27] purrfect [00:00:31] !log Evening SWAT is done [00:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:48] Hey, you missed my patch, Amir1 [00:01:15] Hauskatze: I've added eswikibooks to the profiling dashboard. Data keys are there even tho there is no data yet [00:01:16] twkozlowski: you didn't show up [00:01:40] I joined this channel at 23:00 UTC precisely [00:01:57] Hauskatze: so I think everything is fine. 
We just need to wait a bit [00:02:02] !log restarted nova-network on labnet1001 and nova-compute on labvirt1015 as part of debugging T190367 [00:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:08] T190367: deployment-maps01 has no security groups, none can be added - https://phabricator.wikimedia.org/T190367 [00:02:14] dmaza: right on, thanks for doing that and we'll of course wait until the system gathers the data [00:02:26] you joined under name of odder, that confused me [00:02:36] I shall speak to musikanimal though, as we've got filters wastin 11 conditions :/ [00:02:37] np ;) [00:02:43] https://grafana.wikimedia.org/dashboard/db/mediawiki-abusefilter-profiling?orgId=1&from=now-3h&to=now [00:03:34] twkozlowski: ^ [00:03:43] dmaza: neat, thanks again [00:04:10] I beg your permission but I'm going to sleep now. [00:08:55] (03CR) 10Imarlier: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [00:10:18] (03PS20) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [00:28:36] (03PS1) 10Dzahn: mediawiki::deployment: don't use mysql module, fix stretch suppport [puppet] - 10https://gerrit.wikimedia.org/r/421197 (https://phabricator.wikimedia.org/T175288) [00:30:01] (03PS2) 10Dzahn: mediawiki::deployment: don't use mysql module, fix stretch suppport [puppet] - 10https://gerrit.wikimedia.org/r/421197 (https://phabricator.wikimedia.org/T175288) [00:30:18] (03PS1) 10Papaul: Netboot: Add ms-be204[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/421198 (https://phabricator.wikimedia.org/T189633) [00:36:56] (03CR) 10Dzahn: [C: 032] mediawiki::deployment: don't use mysql module, fix stretch suppport [puppet] - 10https://gerrit.wikimedia.org/r/421197 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [00:37:47] (03PS2) 10Dzahn: Netboot: Add 
ms-be204[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/421198 (https://phabricator.wikimedia.org/T189633) (owner: 10Papaul) [00:38:31] (03CR) 10Dzahn: [C: 032] Netboot: Add ms-be204[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/421198 (https://phabricator.wikimedia.org/T189633) (owner: 10Papaul) [00:41:45] dmaza: what wiki was Haus talking about? [00:49:58] (03CR) 10Dzahn: [C: 032] "also see https://phabricator.wikimedia.org/T165625" [puppet] - 10https://gerrit.wikimedia.org/r/421197 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [00:50:05] musikanimal: es.wikibooks [00:50:48] 10Operations, 10Cloud-Services, 10Community-Wikimetrics, 10DBA, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#4071489 (10Dzahn) stopped using on in mediawiki_deployment_server role: https://gerrit.wikimedia.org/r/#/c/421197/ [00:50:59] (03PS1) 10Dzahn: mediawiki::deployment: stretch support for php-readline package [puppet] - 10https://gerrit.wikimedia.org/r/421201 (https://phabricator.wikimedia.org/T175288) [00:51:34] (03PS2) 10Dzahn: mediawiki::deployment: stretch support for php-readline package [puppet] - 10https://gerrit.wikimedia.org/r/421201 (https://phabricator.wikimedia.org/T175288) [00:52:26] (03CR) 10Dzahn: [C: 032] mediawiki::deployment: stretch support for php-readline package [puppet] - 10https://gerrit.wikimedia.org/r/421201 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [00:56:42] (03CR) 10Krinkle: [C: 031] "LGTM. I'm sure the continuation on restart works fine. 
Still somewhat unsure as to why it appeared to persistently restart from 24h ago wh" [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier) [00:57:52] thanks [01:10:19] !log started in-place reindex of all wikis on both elasticsearch clusters [01:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:24] (PS1) Dzahn: mediawik_deployment: stretch support for PHP packages [puppet] - https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) [01:25:03] (CR) jerkins-bot: [V: -1] mediawik_deployment: stretch support for PHP packages [puppet] - https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: Dzahn) [01:26:55] (CR) Dzahn: "it's a copy of the existing php5 class that is then adjusted. the debug symbol package is missing and only in sid." [puppet] - https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: Dzahn) [01:30:01] PROBLEM - Long running screen/tmux on labstore2003 is CRITICAL: CRIT: Long running SCREEN process. (PID: 24347, 1737766s 1728000s).
[01:52:24] !curl increase cluster.routing.allocation.disk.watermark.low to 80% on eqiad elasticsearch due to shards not allocating during reindex [01:52:33] !log increase cluster.routing.allocation.disk.watermark.low to 80% on eqiad elasticsearch due to shards not allocating during reindex [01:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:21] PROBLEM - Nginx local proxy to apache on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:12] RECOVERY - Nginx local proxy to apache on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.235 second response time [02:30:21] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.25) (duration: 07m 46s) [02:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 798.06 seconds [04:07:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 260.07 seconds [05:03:45] PROBLEM - configured eth on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:05:26] PROBLEM - dhclient process on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:07:15] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:07:15] PROBLEM - puppet last run on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:08:55] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:10:36] PROBLEM - DPKG on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:12:25] PROBLEM - Disk space on ms-be2043 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[05:13:45] RECOVERY - DPKG on ms-be2043 is OK: All packages OK [05:13:46] RECOVERY - configured eth on ms-be2043 is OK: OK - interfaces up [05:14:15] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational [05:14:25] RECOVERY - Disk space on ms-be2043 is OK: DISK OK [05:14:35] RECOVERY - dhclient process on ms-be2043 is OK: PROCS OK: 0 processes with command name dhclient [05:17:15] RECOVERY - puppet last run on ms-be2043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:29:12] Operations, ops-codfw, Patch-For-Review: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4071747 (Papaul) [05:34:17] PROBLEM - MD RAID on ms-be2042 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:38:57] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2043 is OK: OK: synced at Thu 2018-03-22 05:38:51 UTC. [05:39:27] PROBLEM - configured eth on ms-be2042 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:40:18] RECOVERY - MD RAID on ms-be2042 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:40:27] RECOVERY - configured eth on ms-be2042 is OK: OK - interfaces up [05:47:08] Operations, ops-codfw, Patch-For-Review: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4071801 (Papaul) [05:51:33] <_joe_> Amir1: ping :D [05:55:21] Operations, ops-codfw, Patch-For-Review: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4071808 (Papaul) [05:56:03] Operations, ops-codfw, Patch-For-Review: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4048384 (Papaul) a:Papaul>fgiunchedi @fgiunchedi All yours.
[06:03:13] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [06:08:34] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3010_v4, cp3010_v6 [06:08:43] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3010_v4, cp3010_v6 [06:08:43] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp3010_v4, cp3010_v6 [06:08:43] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3010_v4, cp3010_v6 [06:08:44] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:08:53] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: cp3010_v4, cp3010_v6 [06:08:54] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:03] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:04] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:13] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:23] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:23] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp3010_v4, cp3010_v6 [06:09:33] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:33] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp3010_v4, cp3010_v6 [06:09:33] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp3010_v4, cp3010_v6 [06:09:33] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:09:33] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 
connecting: cp3010_v4, cp3010_v6 [06:09:34] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:12:20] (03PS2) 10Marostegui: dbproxy100{1,6}: Change standby host [puppet] - 10https://gerrit.wikimedia.org/r/420991 (https://phabricator.wikimedia.org/T183469) [06:13:59] (03CR) 10Marostegui: [C: 032] dbproxy100{1,6}: Change standby host [puppet] - 10https://gerrit.wikimedia.org/r/420991 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [06:14:53] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:15:59] !log Reload dbproxy1001 to pick up the new standby host - T183469 [06:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:06] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [06:16:31] !log Reload dbproxy1006 to pick up the new standby host - T183469 [06:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:14] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3010_v4, cp3010_v6 [06:20:04] (03PS1) 10Marostegui: m1.hosts: Remove db1016 [software] - 10https://gerrit.wikimedia.org/r/421214 (https://phabricator.wikimedia.org/T190179) [06:21:10] (03CR) 10Marostegui: [C: 032] m1.hosts: Remove db1016 [software] - 10https://gerrit.wikimedia.org/r/421214 (https://phabricator.wikimedia.org/T190179) (owner: 10Marostegui) [06:21:53] (03Merged) 10jenkins-bot: m1.hosts: Remove db1016 [software] - 10https://gerrit.wikimedia.org/r/421214 (https://phabricator.wikimedia.org/T190179) (owner: 10Marostegui) [06:22:07] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4071834 (10Marostegui) a:03RobH This host is now ready for DC Ops steps. 
Assigning it to @RobH [06:23:00] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4071840 (10Marostegui) [06:25:06] !log Stop MySQL on db1001 to get ready to decommission it - T190262 [06:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:11] T190262: Decommission db1001 - https://phabricator.wikimedia.org/T190262 [06:25:53] !log Remove db1001 from tendril - T190262 [06:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:13] (03PS1) 10Marostegui: mariadb: Remove db1001 [puppet] - 10https://gerrit.wikimedia.org/r/421216 (https://phabricator.wikimedia.org/T190262) [06:31:30] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt] [06:32:38] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler03/10575/" [puppet] - 10https://gerrit.wikimedia.org/r/421216 (https://phabricator.wikimedia.org/T190262) (owner: 10Marostegui) [06:33:43] (03PS1) 10Marostegui: m1.hosts: Remove db1001 [software] - 10https://gerrit.wikimedia.org/r/421217 (https://phabricator.wikimedia.org/T190262) [06:34:47] (03CR) 10Marostegui: [C: 032] m1.hosts: Remove db1001 [software] - 10https://gerrit.wikimedia.org/r/421217 (https://phabricator.wikimedia.org/T190262) (owner: 10Marostegui) [06:35:14] (03CR) 10Marostegui: [C: 032] mariadb: Remove db1001 [puppet] - 10https://gerrit.wikimedia.org/r/421216 (https://phabricator.wikimedia.org/T190262) (owner: 10Marostegui) [06:35:32] (03Merged) 10jenkins-bot: m1.hosts: Remove db1001 [software] - 10https://gerrit.wikimedia.org/r/421217 (https://phabricator.wikimedia.org/T190262) (owner: 10Marostegui) [06:38:17] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission 
db1001 - https://phabricator.wikimedia.org/T190262#4071893 (10Marostegui) a:03RobH This host is now ready for DC Ops steps. Assigning it to @RobH [06:38:38] (03PS1) 10Urbanecm: Make eswikibooks logo normal size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421219 (https://phabricator.wikimedia.org/T190366) [06:39:19] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4071898 (10Marostegui) [06:41:38] 10Operations, 10DBA, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#4071905 (10Marostegui) [06:41:41] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4071906 (10Marostegui) [06:41:44] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4071903 (10Marostegui) 05Open>03Resolved All the hosts have been replaced. The old hosts are now ready for DC Ops to finish the decommissioned and... [06:43:45] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3153495 (10Marostegui) All the hosts <=db1050 have now been retired from service and are just pending to be decommissioned by DC Ops - they have their own individual deco... 
[06:56:31] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:32] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421222 [07:18:46] (03PS4) 10Marostegui: mariadb: Move parsercaches socket location to the default one [puppet] - 10https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [07:26:16] (03PS1) 10Elukey: eventlogging_cleaner: add a new parameter to ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/421233 (https://phabricator.wikimedia.org/T171203) [07:28:01] (03PS2) 10Elukey: eventlogging_cleaner: add a new parameter to ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/421233 (https://phabricator.wikimedia.org/T171203) [07:30:19] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421222 (owner: 10Marostegui) [07:30:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421222 (owner: 10Marostegui) [07:32:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421222 (owner: 10Marostegui) [07:32:23] (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421222 (owner: 10Marostegui) [07:33:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool pc1006 for kernel, mariadb and socket location upgrade (duration: 01m 16s) [07:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:29] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#4071962 (10jcrespo) [07:44:24] 10Operations, 10DBA, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4071966 
(jcrespo) [07:54:53] cp3010 down ? [07:55:18] checking console [07:56:19] (PS3) Muehlenhoff: Enable base::service_auto_restart for systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/421045 (https://phabricator.wikimedia.org/T135991) [07:56:38] console not reachable, depool just to be sure + powercycle [07:58:25] !log depool cp3010 + powercycle (no ssh access, mgmt console frozen) [07:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:20] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for systemd-timesyncd [puppet] - https://gerrit.wikimedia.org/r/421045 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [07:59:55] !log Stop MySQL on pc1006 for kernel, mariadb and socket path upgrade [08:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:34] * elukey sees marostegui as a F1 mechanic [08:00:40] hahahaha [08:00:42] (PS5) Marostegui: mariadb: Move parsercaches socket location to the default one [puppet] - https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) (owner: Jcrespo) [08:01:08] RECOVERY - Host cp3010 is UP: PING WARNING - Packet loss = 58%, RTA = 83.79 ms [08:01:08] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [08:01:08] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [08:01:09] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [08:01:17] hello cp3010, welcome back [08:01:18] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 38 ESP OK [08:01:28] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 38 ESP OK [08:01:28] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [08:01:29] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 38 ESP OK [08:01:31] elukey: https://media.giphy.com/media/10RFAsWPP8atkA/source.gif ?
[08:01:38] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [08:01:38] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [08:01:38] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [08:01:39] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [08:01:39] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [08:01:48] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [08:01:48] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [08:01:49] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [08:01:49] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [08:01:58] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [08:01:58] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [08:01:59] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [08:01:59] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 38 ESP OK [08:02:00] (CR) Marostegui: [C: 2] mariadb: Move parsercaches socket location to the default one [puppet] - https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) (owner: Jcrespo) [08:02:03] marostegui: yes! [08:02:27] moritzm: ok to merge your changes? [08:03:16] yep [08:03:24] done! [08:03:40] next set of alarms that worries me are the elastic servers [08:03:47] 4 of them with disk warning alarms [08:04:14] dcausse: ^ when you're around [08:04:20] moritzm: yes [08:04:31] elukey: eqiad or codfw?
[08:04:47] !log Restart pt-heartbeat on pc1004 and pc1005 [08:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:55] eqiad, on elastic1027, elastic1030 and elastic1044 [08:05:00] (PS1) Jcrespo: mariadb: Depool db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/421235 (https://phabricator.wikimedia.org/T181777) [08:05:07] 1028 just cleared out by itself [08:05:12] DISK WARNING - free space: /srv 79165 MB (16% inode=99%): (1027) [08:05:20] ah [08:05:21] DISK WARNING - free space: /srv 81845 MB (17% inode=99%): (1030) [08:05:30] DISK WARNING - free space: /srv 119480 MB (17% inode=99%) (1044) [08:05:36] !log Restart pt-heartbeat on pc2004 and pc2005 [08:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:42] on 1027 it is /dev/md2 493G 391G 78G 84% /srv [08:05:45] I did not see the icinga ping? [08:05:57] still warnings [08:06:04] checking, I think Erik ran a full reindex yesterday so that might be the cause [08:06:23] super, it might just be a temporary usage of disks but better be sure :) [08:06:23] !log Restart pt-heartbeat on pc2006 [08:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:29] (CR) Marostegui: [C: 2] "> Change has been successfully merged by Marostegui" [puppet] - https://gerrit.wikimedia.org/r/420647 (https://phabricator.wikimedia.org/T148507) (owner: Jcrespo) [08:07:43] so annoying that the tmux/screen alerts are always in NRPE: Unable to read output [08:08:02] is there a task to track it? It confuses me a lot, especially when I am not well caffeinated :D [08:08:18] yeah: https://phabricator.wikimedia.org/T187528 [08:08:38] elukey, moritzm, checked on terbium and yes a full reindex is in progress, so this alert will be flapping a bit during this process [08:08:51] dcausse: ack! [08:09:32] ack [08:12:12] do I wait for you to repool the pc?
[08:12:38] no no [08:12:41] go ahead [08:12:52] actually, I have to amend [08:12:57] <_joe_> elukey: I gave up on those alerts, apparently the rest of the team thinks they're useful [08:13:04] <_joe_> I think they're a waste of time [08:13:12] <_joe_> and that not many people care [08:13:17] <_joe_> I do regularly check [08:13:43] <_joe_> and thus I am annoyed by them [08:13:51] (PS2) Jcrespo: mariadb: Depool db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/421235 (https://phabricator.wikimedia.org/T181777) [08:14:25] (CR) Marostegui: [C: 1] mariadb: Depool db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/421235 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo) [08:14:36] _joe_ I don't have a strong opinion on the alert itself but only on the icinga spam :) [08:15:00] (unable to read etc..) [08:15:21] (CR) Jcrespo: [C: 2] mariadb: Depool db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/421235 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo) [08:17:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 (duration: 01m 15s) [08:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:29] (CR) jenkins-bot: mariadb: Depool db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/421235 (https://phabricator.wikimedia.org/T181777) (owner: Jcrespo) [08:19:33] (PS1) Filippo Giunchedi: puppetmaster: blacklist pdb v4 mbeans [puppet] - https://gerrit.wikimedia.org/r/421236 (https://phabricator.wikimedia.org/T190252) [08:20:28] (CR) Filippo Giunchedi: [C: 2] puppetmaster: blacklist pdb v4 mbeans [puppet] - https://gerrit.wikimedia.org/r/421236 (https://phabricator.wikimedia.org/T190252) (owner: Filippo Giunchedi) [08:21:25] !log upgrade and restart db1060 [08:21:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1006" [mediawiki-config] -
10https://gerrit.wikimedia.org/r/421237 [08:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:32] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool pc1006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421237 [08:25:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1060 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421239 [08:25:27] jynus: I can wait for db1060 if you want [08:25:45] no, go ahead [08:25:53] ok! [08:25:54] It is rebooting now, it will take 5 minutes [08:25:56] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool pc1006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421237 (owner: 10Marostegui) [08:27:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421237 (owner: 10Marostegui) [08:28:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421237 (owner: 10Marostegui) [08:29:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool pc1006 after kernel, mariadb and socket location upgrade (duration: 01m 11s) [08:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:35] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4072048 (10elukey) [08:30:52] !log Truncate updatelog on s4,s5,s6,s8 - T174804 [08:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:58] T174804: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804 [08:34:37] (03PS1) 10Elukey: cassandra: upgrade version 2.2 package settings [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [08:38:22] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1060 for maintenance" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/421239 [08:39:20] (03PS1) 10Filippo Giunchedi: install_server: use dell recipe for new ms-be systems [puppet] - 10https://gerrit.wikimedia.org/r/421242 (https://phabricator.wikimedia.org/T189633) [08:39:22] (03PS1) 10Filippo Giunchedi: install_server: delete unused ms-be-esams recipe [puppet] - 10https://gerrit.wikimedia.org/r/421243 [08:40:34] (03CR) 10Filippo Giunchedi: [C: 032] install_server: use dell recipe for new ms-be systems [puppet] - 10https://gerrit.wikimedia.org/r/421242 (https://phabricator.wikimedia.org/T189633) (owner: 10Filippo Giunchedi) [08:40:36] (03CR) 10Filippo Giunchedi: [C: 032] install_server: delete unused ms-be-esams recipe [puppet] - 10https://gerrit.wikimedia.org/r/421243 (owner: 10Filippo Giunchedi) [08:41:37] 10Operations, 10ops-eqiad: install conf1004-6 ssd upgrades - https://phabricator.wikimedia.org/T190230#4072064 (10Joe) These hosts are not in production and the swap can be done at any time it suits @Cmjohnson best. I'm definitely not needed for the HD swaps. [08:41:51] (03CR) 10Muehlenhoff: [C: 04-1] mediawik_deployment: stretch support for PHP packages (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [08:41:56] 10Operations, 10ops-eqiad: install conf1004-6 ssd upgrades - https://phabricator.wikimedia.org/T190230#4072065 (10Joe) a:05Joe>03Cmjohnson [08:42:44] (03CR) 10Muehlenhoff: [C: 04-1] mediawik_deployment: stretch support for PHP packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [08:42:48] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "There is a work in progress by apergos on this, we should not duplicate effort." 
[puppet] - 10https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [08:43:24] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1060 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421239 (owner: 10Jcrespo) [08:43:54] (03CR) 10Elukey: "Assumption 1) does not hold, maps-test uses 2.2 from pcc:" [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [08:44:35] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1060 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421239 (owner: 10Jcrespo) [08:45:10] !log Truncate updatelog on s2 - T174804 [08:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:16] T174804: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804 [08:45:19] (03CR) 10Elukey: "Adding more people to review code + discuss what's best.." [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [08:48:23] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1060 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421239 (owner: 10Jcrespo) [08:51:11] (03CR) 10Muehlenhoff: [C: 04-1] [WIP] php7 manifests for mediawiki on stretch (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [08:51:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 (duration: 01m 15s) [08:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:48] (03CR) 10Giuseppe Lavagetto: "I have a few small comments, but once Moritz's comments have been addressed, this overall LGTM; I want us to unblock the dumps transition," (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [08:59:52] (03CR) 10Muehlenhoff: "If maps also uses 2.2, seems still unproblematic to me? 
Or is the change done in wmf3 known to cause a regression if not using JMX? This o" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [09:03:34] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4072088 (10fgiunchedi) So I was initially confused because the boot disks (i.e. ssds) on these dell machines are sda/sdb (as opposed to sdm/sdn) like we do on hp machines and the partma... [09:03:58] (03CR) 10Elukey: "In the code change I also force the 2.2 cassandra version to wmf3, so IIUC (might be wrong) this would cause the following once puppet run" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [09:04:27] !log Truncate updatelog on s7 - T174804 [09:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:33] T174804: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804 [09:05:36] (03PS2) 10Elukey: cassandra: upgrade version 2.2 package settings [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) [09:19:02] !log Truncate updatelog on s1 - T174804 [09:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:08] T174804: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804 [09:21:09] !log Truncate updatelog on s3 - T174804 [09:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:36] (03PS2) 10Matthias Mullie: [WIP] Add 3d2png scap targets [puppet] - 10https://gerrit.wikimedia.org/r/406997 [09:22:45] (03CR) 10Matthias Mullie: [WIP] Add 3d2png scap targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406997 (owner: 10Matthias Mullie) [09:30:09] (03PS31) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 
10https://gerrit.wikimedia.org/r/394977 [09:30:38] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [09:34:50] (03PS2) 10KartikMistry: apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) [09:35:25] (03CR) 10jerkins-bot: [V: 04-1] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [09:37:18] (03PS32) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [09:37:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [09:41:25] (03PS1) 10Filippo Giunchedi: Deprecate ms-be Dell-specific partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/421248 (https://phabricator.wikimedia.org/T189633) [09:41:27] (03PS1) 10Filippo Giunchedi: site: add ms-be204[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/421249 (https://phabricator.wikimedia.org/T189633) [09:43:15] (03PS33) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [09:45:17] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [09:45:40] (03CR) 10Filippo Giunchedi: "PCC noop https://puppet-compiler.wmflabs.org/compiler03/10577/" [puppet] - 10https://gerrit.wikimedia.org/r/421248 (https://phabricator.wikimedia.org/T189633) (owner: 10Filippo Giunchedi) [09:55:14] (03CR) 10Filippo Giunchedi: [C: 032] Deprecate ms-be Dell-specific partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/421248 
(https://phabricator.wikimedia.org/T189633) (owner: 10Filippo Giunchedi) [09:55:48] !log rolling restart of yarn nodemanagers on the analytics hadoop workers for openjdk-8 upgrade [09:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:10] (03PS2) 10Filippo Giunchedi: site: add ms-be204[0-3] [puppet] - 10https://gerrit.wikimedia.org/r/421249 (https://phabricator.wikimedia.org/T189633) [10:10:51] !log repool cp3010 [10:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:56] (03CR) 10Vgutierrez: "check inline comment" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/420120 (owner: 10Mark Bergsma) [10:18:04] (03CR) 10Ema: [C: 031] Enable base::service_auto_restart for systemd-journald [puppet] - 10https://gerrit.wikimedia.org/r/421058 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:22:24] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for systemd-journald [puppet] - 10https://gerrit.wikimedia.org/r/421058 (https://phabricator.wikimedia.org/T135991) [10:24:47] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for systemd-journald [puppet] - 10https://gerrit.wikimedia.org/r/421058 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:28:49] !log cp-upload_esams: carry on with reboots for retpoline kernel updates T188092 [10:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:50] (03CR) 10Ema: [C: 031] pybal: Prometheus based icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [10:36:22] (03PS7) 10Vgutierrez: pybal: Prometheus based icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [10:38:49] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - 
kubelet_operational_latencies is 22404 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:39:43] (03CR) 10Vgutierrez: [C: 032] pybal: Prometheus based icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [10:39:49] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - kubelet_operational_latencies is 6440 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:42:52] !log update puppet compiler's fact [10:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:23] * elukey re-reads Filippo email before proceeding [10:45:25] godog: o/ - just to be sure, to update the pcc compiler's fact I can run the script unmodified right? [10:47:32] elukey: yeah should work just fine [10:47:37] super thanks [10:47:58] 10Operations: Review changes to /etc/java-8-openjdk/security/java.security in Kafka from u162 update - https://phabricator.wikimedia.org/T190400#4072255 (10MoritzMuehlenhoff) [10:49:25] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4072268 (10Joe) 05Open>03Resolved [10:53:39] !log due to miscommunication, second update of puppet compiler facts happening now. 
oh well [10:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:11] (03PS1) 10Alexandros Kosiaris: Annotate pods for prometheus statsd scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/421262 [10:59:12] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#4072289 (10fgiunchedi) [10:59:56] 10Operations, 10DBA, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4072290 (10Marostegui) I talked to @mark and we'll leave this ticket open until they have been fully decommissioned by @RobH and @Cmjohnson [11:01:16] !log T189722 reboot labtestvirt2001 to downgrade kernel [11:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:39] (03PS2) 10Alexandros Kosiaris: Annotate pods for prometheus statsd scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/421262 (https://phabricator.wikimedia.org/T184923) [11:01:41] (03PS1) 10Alexandros Kosiaris: Switch mathoid to use the local statsd_prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/421263 (https://phabricator.wikimedia.org/T184923) [11:02:29] !log installing plexus-utils security updates [11:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:05] (03PS3) 10Tpt: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) [11:09:19] !log T189722 reboot labtestvirt2002 to downgrade kernel [11:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:02] 10Operations, 10Puppet, 10Patch-For-Review: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4072311 (10Volans) [11:17:32] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [11:18:24] !log and a third time to try updating the puppet compiler facts, this time using puppetmaster2001 [11:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Annotate pods for prometheus statsd scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/421262 (https://phabricator.wikimedia.org/T184923) (owner: 10Alexandros Kosiaris) [11:20:44] !log rolling restart of the hadoop hdfs datanode daemons on all the analytics hadoop workers for openjdk-8 upgrade [11:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:51] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Switch mathoid to use the local statsd_prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/421263 (https://phabricator.wikimedia.org/T184923) (owner: 10Alexandros Kosiaris) [11:28:14] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - kubelet_operational_latencies is 28356 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:29:23] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - kubelet_operational_latencies is 1834 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:36:54] (03PS1) 10BBlack: VCL: Do not request AE:gzip from applications [puppet] - 10https://gerrit.wikimedia.org/r/421267 (https://phabricator.wikimedia.org/T125938) [11:37:53] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: CRITICAL - kubelet_operational_latencies is 29773 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:38:39] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#4072379 (10akosiaris) The metrics part has been validated. 
In fact https://grafana.wikimedia.org/dashboard... [11:38:53] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: OK - kubelet_operational_latencies is 1923 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:43:38] (03PS1) 10Muehlenhoff: Various updates for Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/421268 [11:45:03] (03CR) 10Muehlenhoff: [C: 032] Various updates for Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/421268 (owner: 10Muehlenhoff) [11:45:07] (03PS1) 10Ladsgroup: Remove forceWriteTermsTableSearchFields from testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421269 (https://phabricator.wikimedia.org/T189776) [11:48:19] (03CR) 10Mforns: [C: 031] eventlogging_cleaner: add a new parameter to ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/421233 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [11:50:35] (03CR) 10Ema: [C: 031] VCL: Do not request AE:gzip from applications [puppet] - 10https://gerrit.wikimedia.org/r/421267 (https://phabricator.wikimedia.org/T125938) (owner: 10BBlack) [11:51:10] (03PS1) 10Vgutierrez: pybal: Fix prometheus bgp sessions query [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) [11:52:23] (03CR) 10Vgutierrez: "Example check output with the fixed query:" [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [11:53:01] (03PS2) 10BBlack: VCL: Do not request AE:gzip from applications [puppet] - 10https://gerrit.wikimedia.org/r/421267 (https://phabricator.wikimedia.org/T125938) [11:55:02] (03CR) 10BBlack: [C: 032] VCL: Do not request AE:gzip from applications [puppet] - 10https://gerrit.wikimedia.org/r/421267 (https://phabricator.wikimedia.org/T125938) (owner: 10BBlack) [11:58:09] I'm stopping puppet fleetwide for 15/20 min for T189891 [11:58:09] T189891: Failover puppet ca service from eqiad to codfw - 
https://phabricator.wikimedia.org/T189891 [11:58:22] godog: hang on [11:58:34] bblack: ok [11:58:41] did you already disable? [11:58:46] no I didn't [11:59:08] give me a min and I'll get things out of your way, I have all the cp's disabled too for a short period... [11:59:10] (03PS2) 10Vgutierrez: pybal: Fix prometheus bgp sessions query [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) [11:59:19] bblack: sounds good, LMK when done [12:01:32] <_joe_> oh that's why warnings are exploding on icinga [12:01:33] <_joe_> :P [12:02:17] maybe it shouldn't warn then :P [12:02:56] (03PS1) 10Muehlenhoff: Fix syntax error in alias [puppet] - 10https://gerrit.wikimedia.org/r/421274 [12:04:04] (03CR) 10Muehlenhoff: [C: 032] Fix syntax error in alias [puppet] - 10https://gerrit.wikimedia.org/r/421274 (owner: 10Muehlenhoff) [12:05:42] (03CR) 10Vgutierrez: "pcc is happy as well: https://puppet-compiler.wmflabs.org/compiler03/10582/" [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [12:10:15] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10583/ looks as expected" [puppet] - 10https://gerrit.wikimedia.org/r/421233 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [12:10:24] (03PS3) 10Elukey: eventlogging_cleaner: add a new parameter to ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/421233 (https://phabricator.wikimedia.org/T171203) [12:11:39] godog: I'm done / non-blocking [12:12:03] bblack: kk, thanks [12:12:56] !log stopping puppet fleetwide for ca migration - T189891 [12:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:03] T189891: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891 [12:14:57] (03PS3) 10Filippo Giunchedi: hieradata: use puppetmaster2001 as ca_server [puppet] - 10https://gerrit.wikimedia.org/r/420721 
(https://phabricator.wikimedia.org/T189891) [12:17:11] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: use puppetmaster2001 as ca_server [puppet] - 10https://gerrit.wikimedia.org/r/420721 (https://phabricator.wikimedia.org/T189891) (owner: 10Filippo Giunchedi) [12:20:39] !log running puppet on puppetmaster[21]001 - T189891 [12:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:45] T189891: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891 [12:24:15] (03CR) 10Ema: [C: 031] pybal: Fix prometheus bgp sessions query [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [12:24:24] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [12:26:05] known ^ on it [12:26:37] <_joe_> godog: it's the ca key which has weird permissions, I guess beacause of the UID change between distros? [12:27:43] _joe_: possible, where did you see the error? [12:27:50] do you have the error at hand? or can I re-run puppet [12:27:54] <_joe_> syslog [12:28:00] <_joe_> Mar 22 12:25:24 puppetmaster2001 puppet-master[13144]: Could not prepare for execution: Permission denied @ rb_sysopen - /var/lib/puppet/server/ssl/ca/ca_key.pem [12:28:16] <_joe_> and an ls -la shows the key is owned by nagios [12:28:59] nagios?!?! [12:29:13] <_joe_> and yes, puppet is UID 109:114 on jessie (1001) [12:29:29] <_joe_> and nagios is UID 109:114 on stretch (2001) [12:29:29] nagios:x:109:114 [12:29:37] <_joe_> so it's expected [12:29:44] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 25 failures. Last run 4 minutes ago with 25 failures. 
Failed resources (up to 3 shown) [12:29:47] <_joe_> our rsync assumed the UIDs to be the same [12:29:57] indeed, I'll fix the ownership [12:30:03] * volans checking neodymium, whas doesn't have puppet disabled? [12:30:11] <_joe_> if you chown the ownership of the whole tree, it should be ok [12:30:29] volans: it was my test host [12:30:34] ah ok [12:31:10] !log chown puppet:puppet /var/lib/puppet/server/ssl/ca on puppetmaster2001 [12:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:43] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Puppet has 25 failures. Last run 6 minutes ago with 25 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/etc/profile.d/mysql-ps1.sh],File[/usr/local/bin/phaste] [12:32:27] ok progress! now cert failure [12:32:42] <_joe_> godog: pro tip: the puppetmasters log their errors to syslog, look there [12:33:48] _joe_: indeed, thanks! [12:35:20] (03CR) 10Vgutierrez: [C: 032] pybal: Fix prometheus bgp sessions query [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [12:35:32] (03PS3) 10Vgutierrez: pybal: Fix prometheus bgp sessions query [puppet] - 10https://gerrit.wikimedia.org/r/421273 (https://phabricator.wikimedia.org/T188085) [12:36:06] (03PS1) 10BBlack: varnishslowlog: no Resp filter on backends, either [puppet] - 10https://gerrit.wikimedia.org/r/421275 (https://phabricator.wikimedia.org/T181315) [12:39:24] still trying to debug why we're getting certs errors from puppet-master and a 500 on agents [12:39:40] ack, any clue so far? 
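The ownership mix-up discussed above (numeric IDs 109:114 belong to puppet on the jessie host and to nagios on the stretch host, so files copied with their numeric owner intact show up as owned by nagios) can be illustrated with a small Python sketch. It assumes passwd-format text has already been collected from both hosts by some means. The function names and the sample entries are hypothetical; only the 109:114 values and the two account names come from the log.

```python
def parse_passwd(text):
    """Map numeric UID -> account name from passwd-format text."""
    uid_to_name = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # A passwd entry is name:password:UID:GID:...; skip anything shorter.
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        uid_to_name[int(fields[2])] = fields[0]
    return uid_to_name


def uid_name_mismatches(source_passwd, dest_passwd):
    """Return {uid: (source_name, dest_name)} for UIDs whose names differ."""
    src = parse_passwd(source_passwd)
    dst = parse_passwd(dest_passwd)
    shared = sorted(src.keys() & dst.keys())
    return {uid: (src[uid], dst[uid]) for uid in shared if src[uid] != dst[uid]}


# Hypothetical sample data; only the 109:114 entries mirror the log.
JESSIE_SAMPLE = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "puppet:x:109:114::/var/lib/puppet:/bin/false\n"
)
STRETCH_SAMPLE = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "nagios:x:109:114::/var/lib/nagios:/bin/false\n"
)

if __name__ == "__main__":
    for uid, (old, new) in uid_name_mismatches(JESSIE_SAMPLE, STRETCH_SAMPLE).items():
        print(f"UID {uid}: {old} on source, {new} on destination")
```

A non-empty result suggests that files copied with numeric ownership preserved would end up belonging to a different account on the destination, which is the situation the logged `chown puppet:puppet` corrected.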
[12:40:12] Could not run: The certificate retrieved from the master does not match the agent's private key [12:41:30] (03CR) 10Alexandros Kosiaris: [C: 032] Increase cache size for osm2pgsql import [puppet] - 10https://gerrit.wikimedia.org/r/421074 (https://phabricator.wikimedia.org/T190110) (owner: 10Catrope) [12:41:37] godog: and it is indeed differente [12:41:40] (03PS2) 10Alexandros Kosiaris: Increase cache size for osm2pgsql import [puppet] - 10https://gerrit.wikimedia.org/r/421074 (https://phabricator.wikimedia.org/T190110) (owner: 10Catrope) [12:41:42] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Increase cache size for osm2pgsql import [puppet] - 10https://gerrit.wikimedia.org/r/421074 (https://phabricator.wikimedia.org/T190110) (owner: 10Catrope) [12:41:47] from the one in /var/lib/puppet/ssl/certs/ca.pem [12:41:52] on neodymijm [12:42:22] <_joe_> so something has not synced correctly [12:43:07] <_joe_> what is different, volans ? [12:43:09] volans: /var/lib/puppet/ssl/certs/ca.pem on neodymium different than which file on puppetmaster2001 ? [12:43:19] mmmh _joe_ but neodymium cert is the same of /var/lib/puppet/server/ssl/ca/ca_crt.pem [12:43:28] was puppet master restarted? [12:43:43] could be cached in memory? [12:43:44] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:43:51] (03CR) 10Muehlenhoff: [WIP] php7 manifests for mediawiki on stretch (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:44:33] volans: yeah I restarted it at least once [12:45:02] (03CR) 10Vgutierrez: [C: 031] Fix Attribute.__eq__ and .__ne__ (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/420119 (owner: 10Mark Bergsma) [12:45:11] I checked the logs and after https://gerrit.wikimedia.org/r/c/420721 nothing besides the expected proxypass in apache config has changed btw [12:46:45] godog: the fingerprint in the log is the certificate for 'puppet' [12:47:18] <_joe_> yes [12:48:57] _joe_: and is used for what? is it a client? [12:49:13] <_joe_> volans: let me check something [12:52:31] also as a data point, from the agent's perspective this is a 500 afaict and the errors are from puppet-master itself, not sure doing what tho [12:52:43] <_joe_> yes [12:55:15] (03PS3) 10ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) [12:55:50] (03CR) 10jerkins-bot: [V: 04-1] Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: 10ArielGlenn) [12:56:52] <_joe_> ok so [12:57:19] <_joe_> for some reason I don't really get, the private key for 'puppet' and the public cert don't match [12:57:28] but that's the same on 1001 [12:57:32] <_joe_> let me go on puppetmaster1001 and check what's going on [12:57:52] <_joe_> volans: I doubt that's the case, let me check something [12:57:53] (03PS4) 10ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) [12:57:58] or better, the puppet fingerprint shown on cert list 
is the same [12:58:00] on both hosts [12:58:27] (03CR) 10jerkins-bot: [V: 04-1] Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: 10ArielGlenn) [12:58:55] <_joe_> ok found the problem [12:59:05] <_joe_> the cert that is registered with puppet [12:59:30] <_joe_> has the private key in puppetmaster1001:/var/lib/puppet/ssl/private_keys/puppet.pem [12:59:48] I'm guessing puppet-master has two different behaviours when running in ca mode vs not (?) [12:59:50] <_joe_> and that's not synced [13:00:04] <_joe_> no it's an error on our part AFAICT [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T1300). [13:00:05] Urbanecm, Tpt, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] _joe_: same md5 [13:00:14] o/ [13:00:19] Mine is not teestable [13:00:19] <_joe_> uhm [13:00:29] <_joe_> volans: please be more explicit [13:00:33] I can SWAT today [13:00:38] here [13:00:47] _joe_: /var/lib/puppet/ssl/private_keys/puppet.pem has same md5 on 1001 and 2001 [13:00:47] <_joe_> no actually the files are the same in /var/lib/puppet/ssl too [13:00:51] Tpt: around for SWAT? [13:00:57] o/ [13:01:03] Amir1: want to deploy your change? 
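The check volans reports above (the same md5 for /var/lib/puppet/ssl/private_keys/puppet.pem on 1001 and 2001) amounts to comparing a digest of the file as read on each host. A minimal Python sketch of that comparison follows, assuming the file contents have already been fetched as bytes. The helper names and the placeholder contents are hypothetical, and the log does not say which tool was actually used.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def same_content(copies: dict) -> bool:
    """True if every host's copy hashes to the same digest.

    `copies` maps a host label to the raw bytes read on that host.
    """
    digests = {md5_hex(data) for data in copies.values()}
    return len(digests) <= 1


if __name__ == "__main__":
    # Placeholder bytes standing in for the real key file contents.
    copies = {
        "puppetmaster1001": b"placeholder key material\n",
        "puppetmaster2001": b"placeholder key material\n",
    }
    print("in sync" if same_content(copies) else "copies differ")
```

Matching digests only show that the two copies are byte-identical. They say nothing about whether the private key matches the certificate the master serves, which is the separate question being debugged in the surrounding messages.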
[13:01:24] yeah sure [13:01:34] I can go last [13:01:34] Amir1: go ahead while I review other patches [13:01:39] okay [13:01:42] let me start [13:02:14] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421269 (https://phabricator.wikimedia.org/T189776) (owner: 10Ladsgroup) [13:03:41] (03Merged) 10jenkins-bot: Remove forceWriteTermsTableSearchFields from testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421269 (https://phabricator.wikimedia.org/T189776) (owner: 10Ladsgroup) [13:05:14] !log stop rsync of ca/volatile on puppetmaster1001 [13:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:29] (03PS2) 10Zfilipin: Change bewikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421093 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [13:08:19] (03CR) 10jenkins-bot: Remove forceWriteTermsTableSearchFields from testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421269 (https://phabricator.wikimedia.org/T189776) (owner: 10Ladsgroup) [13:09:31] (03CR) 10Zfilipin: [C: 031] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421093 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [13:09:39] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: [[gerrit:421269|Remove forceWriteTermsTableSearchFields from testwikidatawiki, part I (T189776)]] (duration: 01m 16s) [13:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:44] T189776: Disable reading from term_search_key from wb_terms table in beta cluster and testwikidata - https://phabricator.wikimedia.org/T189776 [13:11:03] RECOVERY - puppetmaster backend https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 0.201 second response time [13:11:12] !log ladsgroup@tin Synchronized wmf-config/Wikibase.php: [[gerrit:421269|Remove forceWriteTermsTableSearchFields from 
testwikidatawiki, part II (T189776)]] (duration: 01m 15s) [13:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:37] I'm done, the floor is yours [13:11:39] zeljkof: ^ [13:11:47] Amir1: thanks! taking over swat [13:12:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421093 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [13:12:37] zeljkof, guess you're deploying my patches? [13:12:57] Urbanecm: correct, sorry, forgot to let you know [13:13:28] (03CR) 10Zfilipin: [C: 031] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421219 (https://phabricator.wikimedia.org/T190366) (owner: 10Urbanecm) [13:13:32] (03Merged) 10jenkins-bot: Change bewikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421093 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [13:13:44] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:13:46] (03CR) 10jenkins-bot: Change bewikibooks logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421093 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [13:14:02] (03PS1) 10Arturo Borrero Gonzalez: labtestvirt2002: boot into d-i rescue mode [puppet] - 10https://gerrit.wikimedia.org/r/421279 (https://phabricator.wikimedia.org/T189722) [13:14:44] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:15:35] Urbanecm: the first patch, 421093, is at mwdebug1002 [13:15:43] zeljkof, testing [13:17:03] zeljkof, working, please deploy to the whole universe [13:17:27] Urbanecm: ok, will do, just a minute, trying something new with scap, reading the docs [13:17:34] ack [13:19:15] (03PS34) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [13:20:57] Urbanecm: I'm trying to deploy 421093 in one go [13:21:07] 
(03CR) 10Ema: [C: 031] varnishslowlog: no Resp filter on backends, either [puppet] - 10https://gerrit.wikimedia.org/r/421275 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [13:21:22] ack [13:21:44] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Puppet has 25 failures. Last run 3 minutes ago with 25 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled],File[/etc/profile.d/mysql-ps1.sh],File[/usr/local/bin/phaste],File[/usr/local/bin/apt-upgrade-activity] [13:22:04] should be recovering ^ [13:23:02] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [13:23:27] !log reenabling puppet fleetwide to enable CA switch - T189891 [13:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:33] T189891: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891 [13:23:42] Urbanecm: ok, looks like that will not work, will deploy logos, then config file [13:23:55] zeljkof, may I know what you tried? [13:24:17] scap sync-file . 
'message' [13:24:48] I'm sure scap sync 'message' will work but it will take some time to complete [13:25:49] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:421093|Change bewikibooks logo (T189218)]] (duration: 01m 16s) [13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:55] T189218: Change bewikibooks logo - https://phabricator.wikimedia.org/T189218 [13:26:43] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:26:43] (03CR) 10Alexandros Kosiaris: [C: 032] Add blubber to docker integration agents [puppet] - 10https://gerrit.wikimedia.org/r/421108 (https://phabricator.wikimedia.org/T186548) (owner: 10Thcipriani) [13:26:50] (03PS2) 10Alexandros Kosiaris: Add blubber to docker integration agents [puppet] - 10https://gerrit.wikimedia.org/r/421108 (https://phabricator.wikimedia.org/T186548) (owner: 10Thcipriani) [13:26:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add blubber to docker integration agents [puppet] - 10https://gerrit.wikimedia.org/r/421108 (https://phabricator.wikimedia.org/T186548) (owner: 10Thcipriani) [13:27:31] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:421093|Change bewikibooks logo (T189218)]] (duration: 01m 15s) [13:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:30] Urbanecm: 421093 deployed, please check [13:29:04] will do [13:29:04] !log mobrovac@tin Started deploy [zotero/translators@1c30955]: Update translators - T188893 [13:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:10] T188893: Add translator for sverigesradio.se to MW translator repo - https://phabricator.wikimedia.org/T188893 [13:29:13] !log mobrovac@tin Finished deploy [zotero/translators@1c30955]: Update translators - T188893 (duration: 00m 08s) [13:29:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/421219 (https://phabricator.wikimedia.org/T190366) (owner: 10Urbanecm) [13:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:44] zeljkof, can you purge static/images/project-logos/bewikibooks.png please? [13:29:57] (https://en.wikipedia.org/static/images/project-logos/bewikibooks.png is the full URL) [13:30:16] Urbanecm: sure, sorry, forgot [13:30:28] (03Merged) 10jenkins-bot: Make eswikibooks logo normal size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421219 (https://phabricator.wikimedia.org/T190366) (owner: 10Urbanecm) [13:30:42] (03CR) 10BBlack: [C: 032] varnishslowlog: no Resp filter on backends, either [puppet] - 10https://gerrit.wikimedia.org/r/421275 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [13:30:43] It wasn't meant as a complaint :) [13:30:54] (03CR) 10jenkins-bot: Make eswikibooks logo normal size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421219 (https://phabricator.wikimedia.org/T190366) (owner: 10Urbanecm) [13:30:56] (03PS2) 10BBlack: varnishslowlog: no Resp filter on backends, either [puppet] - 10https://gerrit.wikimedia.org/r/421275 (https://phabricator.wikimedia.org/T181315) [13:30:58] (03CR) 10BBlack: [V: 032 C: 032] varnishslowlog: no Resp filter on backends, either [puppet] - 10https://gerrit.wikimedia.org/r/421275 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [13:31:13] Urbanecm: did not get it as complaint, I should have done it without the reminder ;) [13:34:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labtestvirt2002: boot into d-i rescue mode [puppet] - 10https://gerrit.wikimedia.org/r/421279 (https://phabricator.wikimedia.org/T189722) (owner: 10Arturo Borrero Gonzalez) [13:34:13] (03CR) 10Zfilipin: [C: 032] "T189218#4072651" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421093 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [13:34:30] Urbanecm: purged 
https://phabricator.wikimedia.org/T189218#4072651 [13:34:34] thx [13:34:46] Working, can go to other logo patch I think [13:35:22] Urbanecm: 421219 is at mwdebug [13:35:25] (03PS2) 10Arturo Borrero Gonzalez: labtestvirt2002: remove d-i partman preseed config, temporal [puppet] - 10https://gerrit.wikimedia.org/r/421279 (https://phabricator.wikimedia.org/T189722) [13:35:28] checking [13:35:58] still completing a puppet run fleetwide btw [13:36:06] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labtestvirt2002: remove d-i partman preseed config, temporal [puppet] - 10https://gerrit.wikimedia.org/r/421279 (https://phabricator.wikimedia.org/T189722) (owner: 10Arturo Borrero Gonzalez) [13:36:15] zeljkof, ready to deploy [13:37:32] Urbanecm: deploying [13:37:48] Tpt[m]: please stand by, you are next [13:38:32] Hello! [13:38:37] Thank you :) [13:38:44] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:421219|Make eswikibooks logo normal size (T190366)]] (duration: 01m 16s) [13:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:49] T190366: eswikibooks logo looks a bit bigger than in other projects - https://phabricator.wikimedia.org/T190366 [13:39:08] Urbanecm: deployed, please check and thanks for deploying with #releng ;) [13:39:09] (03CR) 10EddieGP: "This is ready, it just needs some opsen to merge & puppet-merge it." [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP) [13:39:15] Tpt[m]: reviewing your commit [13:39:19] zeljkof, is it purged as well? [13:39:31] Urbanecm: argh, forgot again :( will do [13:39:44] Please do it for all URLs. Thanks! 
[13:40:08] Urbanecm: ok, purging all three [13:41:52] (03CR) 10Zfilipin: [C: 032] "T190366#4072682" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421219 (https://phabricator.wikimedia.org/T190366) (owner: 10Urbanecm) [13:42:04] Urbanecm: done https://phabricator.wikimedia.org/T190366#4072682 [13:43:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:43:12] (03CR) 10Zfilipin: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:43:15] (03PS4) 10Zfilipin: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:43:26] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:43:51] thx [13:44:06] It would be nice to run namespaceDupes.php on cywikisource after the deployment of the change [13:44:51] Tpt[m]: could you please add the exact command that needs to run as a gerrit comment? [13:44:55] (03Merged) 10jenkins-bot: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:45:16] zeljkof: sure [13:45:59] Tpt[m]: the commit is at mwdebug1002, please test and let me know if I can deploy [13:46:14] do I have to run the script before, or after deployment? [13:47:06] after [13:47:36] Tpt[m]: this is the script to run? 
`mwscript namespaceDupes.php cywikisource --fix` [13:47:41] it's to clean the state of pages in the pages in the removed namespaced [13:47:46] * namespaces [13:48:04] zeljkof: yes [13:48:13] (03CR) 10jenkins-bot: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:48:30] Tpt[m]: ok, let me know when you are done testing; by the way, do you know how to test at mwdebug? [13:49:09] I did it a few years ago. Has anything changed in the past few months? [13:49:28] Tpt[m]: not as far as I know, there are docs, I'll find the link [13:49:43] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug ? [13:49:43] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4072704 (10elukey) Thanks! When I try to push the .changes I get that I am missing orig.tar.gz: ``` root@install1002:/srv/wikimedia# rep... [13:49:58] Tpt[m]: that's it [13:50:13] (03PS4) 10Filippo Giunchedi: Move config-master to codfw [dns] - 10https://gerrit.wikimedia.org/r/420734 (https://phabricator.wikimedia.org/T184562) [13:51:03] (03CR) 10Filippo Giunchedi: [C: 032] Move config-master to codfw [dns] - 10https://gerrit.wikimedia.org/r/420734 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [13:52:50] (03PS3) 10Filippo Giunchedi: cache: depool puppetmaster1001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/420744 (https://phabricator.wikimedia.org/T184562) [13:53:04] Tpt[m]: do you need more time to test? (just checking, no rush) [13:53:43] zeljkof: I just finished checking.
it should be fine [13:53:54] (03CR) 10Filippo Giunchedi: [C: 032] cache: depool puppetmaster1001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/420744 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [13:53:55] Tpt[m]: ok, deploying and running scripts [13:54:22] thank you! [13:55:18] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:394189|Properly setup ProofreadPage namespaces for cywikisource (T181406)]] (duration: 01m 16s) [13:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:24] T181406: Welsh Wikisource issues with namespaces and indexes - https://phabricator.wikimedia.org/T181406 [13:55:26] Tpt[m]: deployed, running script [13:56:57] (03CR) 10Zfilipin: [C: 032] "T181406#4072725" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) (owner: 10Tpt) [13:57:20] Tpt[m]: scripts done https://phabricator.wikimedia.org/T181406#4072725, please check and thanks for deploying with #releng ;) [13:57:39] !log EU SWAT finished [13:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:41] zeljkof: Thank you very much! [14:00:09] !log reimage puppetmaster1001 - T184562 [14:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:14] T184562: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562 [14:01:27] (03CR) 10Muehlenhoff: [C: 031] "This looks good to merge to me! PCC on existing uses on trusty/jessie/stretch also seems sane: https://puppet-compiler.wmflabs.org/compile" [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [14:02:58] (03CR) 10ArielGlenn: "I need to test it on the snapshot instance in beta again before it gets merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [14:06:43] (03PS3) 10Herron: puppet-facts-export.py: support puppetdb version 4 [puppet] - 10https://gerrit.wikimedia.org/r/409443 [14:07:00] (03PS4) 10Herron: puppet-facts-export.py: support puppetdb version 4 [puppet] - 10https://gerrit.wikimedia.org/r/409443 [14:07:42] (03CR) 10Herron: [C: 032] puppet-facts-export.py: support puppetdb version 4 [puppet] - 10https://gerrit.wikimedia.org/r/409443 (owner: 10Herron) [14:12:09] thanks herron ! [14:12:35] 👍 [14:14:17] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) [14:14:22] there seem to be load issues on s3 since 9:21 [14:16:09] (03CR) 10Dzahn: "I think that would mean that we can't replace the deployment server with new hardware _and_ upgrade it to stretch before end of quarter. s" [puppet] - 10https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [14:16:52] !log rolling restart of the three hadoop hdfs journal nodes (an1028/35/52) for openjdk-8 upgrades [14:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:12] 10Operations, 10MediaWiki-API, 10Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072792 (10BBlack) [14:18:15] It doesn't seem single database specific, it is happening on all s2 hosts [14:18:29] we have 20x the latency [14:18:36] 20 errors per second [14:18:58] on connection, which is normally a symptom of overload [14:19:18] io pressure is 3x [14:20:14] could be linter [14:23:19] (03CR) 10Muehlenhoff: "This will be fixed once 394977 is merged."
[puppet] - 10https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [14:24:21] !log killing ongoing truncate to investigate s3 issues [14:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:21] not sure it is that, but I want to remove variables [14:25:33] it could be just a consequence, not a cause [14:28:35] 10Operations, 10MediaWiki-API, 10Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072782 (10Anomie) >>! In T190410, @BBlack wrote: > 1. Relatively-minor issue: It times out internally: there's a ~60s pause before any output is... [14:32:39] (03PS1) 10Alexandros Kosiaris: Introduce backup1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/421295 (https://phabricator.wikimedia.org/T189801) [14:33:16] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce backup1001.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/421295 (https://phabricator.wikimedia.org/T189801) (owner: 10Alexandros Kosiaris) [14:42:20] (03PS1) 10Muehlenhoff: Add library hint for cups [puppet] - 10https://gerrit.wikimedia.org/r/421300 [14:43:41] (03CR) 10Muehlenhoff: [C: 032] Add library hint for cups [puppet] - 10https://gerrit.wikimedia.org/r/421300 (owner: 10Muehlenhoff) [14:44:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup backup1001.eqiad.wmnet - https://phabricator.wikimedia.org/T189801#4072871 (10akosiaris) Unfortunately wmf4750 will not do after all. After we powered off and unracked helium we figured out the raid card was too big for the space available in the R430. We... 
[14:45:04] (03PS1) 10Alexandros Kosiaris: Revert "Introduce backup1001.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/421301 [14:46:41] (03PS2) 10Alexandros Kosiaris: Revert "Introduce backup1001.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/421301 (https://phabricator.wikimedia.org/T189801) [14:46:50] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Introduce backup1001.eqiad.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/421301 (https://phabricator.wikimedia.org/T189801) (owner: 10Alexandros Kosiaris) [14:48:09] 10Operations, 10ops-eqiad: Non-redundant power supply on helium - https://phabricator.wikimedia.org/T186808#4072877 (10akosiaris) [14:50:56] 10Operations, 10MediaWiki-API, 10Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072879 (10BBlack) >>! In T190410#4072815, @Anomie wrote: > Define "times out internally". I mean this from the naive point of view of: when I... [14:55:13] (03PS1) 10Ottomata: Install python3 statistics packages; configure user venvs with packages in puppet [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) [14:55:43] (03CR) 10jerkins-bot: [V: 04-1] Install python3 statistics packages; configure user venvs with packages in puppet [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [14:56:11] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4072887 (10Eevans) [14:57:47] !log installing cups update from stretch point release (we only install the client libs) [14:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:09] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - 
https://phabricator.wikimedia.org/T189529#4044780 (10Eevans) >>! In T189529#4072704, @elukey wrote: > Thanks! When I try to push the .changes I get that I am missing orig.tar.gz:... [14:58:51] (03PS2) 10Ottomata: Install python3 statistics packages; configure user venvs from frozen requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) [14:59:20] (03CR) 10jerkins-bot: [V: 04-1] Install python3 statistics packages; configure user venvs from frozen requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [14:59:44] (03PS3) 10Ottomata: Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) [15:00:16] (03CR) 10jerkins-bot: [V: 04-1] Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [15:01:27] (03PS4) 10Ottomata: Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) [15:01:42] (03CR) 10jerkins-bot: [V: 04-1] Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [15:02:41] (03PS5) 10Ottomata: Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) [15:04:59] (03PS35) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [15:07:14] !log sbisson@tin Started deploy [kartotherian/deploy@8f3a903]: Deploying weekly progress to maps-test* [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:27] 
(03PS36) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [15:07:55] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190303#4072904 (10Cmjohnson) Replaced disk and it's rebuilding Enclosure Device ID: 32 Slot Number: 6 Drive's position: DiskGroup: 0, Span: 3, Arm: 0 Enclosure position: 1 De... [15:08:55] !log installing java-atk-wrapper updates from stretch point release [15:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:13] !log sbisson@tin Finished deploy [kartotherian/deploy@8f3a903]: Deploying weekly progress to maps-test* (duration: 01m 59s) [15:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:19] (03PS6) 10Ottomata: Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) [15:09:26] RECOVERY - kartotherian endpoints health on maps-test2004 is OK: All endpoints are healthy [15:09:48] (03CR) 10Vgutierrez: [C: 032] Release 1.13.9-1+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/415835 (owner: 10Vgutierrez) [15:10:20] !log replacing disk slot 11 db1061 [15:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:11] 10Operations, 10ops-eqiad, 10DBA: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4072907 (10Cmjohnson) Replaced the disk at slot 11 cmjohnson@db1061:~$ sudo megacli -PDList -aALL | grep "Firmware state" Firmware state: Online, Spun Up Firmware stat... 
[15:13:42] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10589/notebook1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [15:13:48] (03CR) 10Ottomata: [C: 032] Install python3 packages; configure user venvs from requirements [puppet] - 10https://gerrit.wikimedia.org/r/421306 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [15:14:04] !log db1054 replacing disk at slot 1 [15:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:40] cmjohnson1: thanks for logging, one cannot trust nowadays a good'ol RAID [15:16:22] jynus: you're welcome, all disks have been replaced and are rebuilding [15:16:44] let me ack the rebuilding so a task is not created from the monitoring [15:17:07] !log installing openssh updates from stretch point release [15:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:30] 10Operations, 10ops-eqiad, 10DBA: db1054 (s2 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190302#4072920 (10Cmjohnson) Disk has been replaced and raid is rebuilding cmjohnson@db1054:~$ sudo megacli -PDList -aALL | grep "Firmware state" Firmware state: Online, Spun... [15:18:00] jynus sorry missed db1052...did you guys decide to break a bunch of disks yesterday? ;-) [15:18:17] actually we did [15:18:23] !Log db1052 replacing disk slot 2 [15:18:36] but better us than them [15:18:37] db1052 has two disks....going to replace one wait and replace the other [15:18:41] godog: what is the proper place to run puppet-merge atm? puppetmaster2001?
[15:18:58] yes, start with the worse one, I think there was one in a much worse state [15:19:58] cmjohnson1: slot 2 first (I would do) [15:20:30] oh yes, i think it is :) [15:20:30] oh, you chose the same as I suggested already :-) [15:20:44] 10Operations, 10MediaWiki-API, 10Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072934 (10Anomie) >>! In T190410#4072879, @BBlack wrote: >>>! In T190410#4072815, @Anomie wrote: >> Define "times out internally". > > I mean... [15:22:21] 10Operations, 10ops-eqiad, 10DBA: db1052 (s1 master) disks with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4072949 (10Cmjohnson) Replaced the disk at slot 2....I will wait for the rebuild to complete before swapping slot 8 Firmware state: Online, Spun Up Firmware state: On... [15:22:39] hmm, puppet-merge seems to be hanging on puppetmaster2001 [15:22:52] ah ssh: connect to host puppetmaster1001.eqiad.wmnet port 22: Connection timed out [15:23:02] !log sbisson@tin Started deploy [tilerator/deploy@e259530]: Deploying weekly progress to maps-test* [15:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:26] !log ran puppet-merge on puppetmaster2001, got ssh: connect to host puppetmaster1001.eqiad.wmnet port 22: Connection timed out, hope all is ok. T189891 [15:23:28] !log sbisson@tin Finished deploy [tilerator/deploy@e259530]: Deploying weekly progress to maps-test* (duration: 00m 26s) [15:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:31] T189891: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891 [15:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:43] (03CR) 10ArielGlenn: "I've now applied this change on snapshot01 in beta and tested it, looks fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [15:23:59] (03PS1) 10Alexandros Kosiaris: tmux/screen: retry_internal set to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/421314 (https://phabricator.wikimedia.org/T187528) [15:24:15] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (10Lucas_Werkmeister_WMDE) [15:26:15] (03CR) 10Alexandros Kosiaris: [C: 031] base/icinga: add Hiera override to skip systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [15:27:15] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 42174891 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:27:56] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 24791010 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:29:16] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6120 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:29:31] ottomata: yes it is, though indeed puppetmaster1001 is being reinstalled [15:29:38] ottomata: I'll pull it out of puppet-merge [15:29:45] Oh wow cmjohnson1 - thanks for taking care of all those disks! 
:) [15:29:56] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 5813 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:31:16] ACKNOWLEDGEMENT - MegaRAID on db1061 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T190418 [15:31:22] 10Operations, 10ops-eqiad: Degraded RAID on db1061 - https://phabricator.wikimedia.org/T190418#4072997 (10ops-monitoring-bot) [15:32:35] (03PS1) 10Filippo Giunchedi: hieradata: take out puppetmaster1001 as frontend [puppet] - 10https://gerrit.wikimedia.org/r/421317 (https://phabricator.wikimedia.org/T184562) [15:33:46] ACKNOWLEDGEMENT - MegaRAID on db1062 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T190419 [15:33:50] 10Operations, 10ops-eqiad: Degraded RAID on db1062 - https://phabricator.wikimedia.org/T190419#4073005 (10ops-monitoring-bot) [15:34:08] 10Operations, 10ops-eqiad: Degraded RAID on db1061 - https://phabricator.wikimedia.org/T190418#4073008 (10Marostegui) 05Open>03declined This is part of: T190418 [15:34:09] I will handle those tasks [15:34:14] we are working on that [15:34:56] 10Operations, 10ops-eqiad: Degraded RAID on db1062 - https://phabricator.wikimedia.org/T190419#4073010 (10Marostegui) 05Open>03declined This is part of: T190303 [15:34:58] 10Operations, 10netops: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4073014 (10faidon) I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it g...
[15:35:04] !log sbisson@tin Started deploy [kartotherian/deploy@8f3a903]: Weekly progress to production [15:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:20] 10Operations, 10ops-eqiad: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4073015 (10Cmjohnson) 05Open>03Resolved Disks are wiped -- resolving [15:36:55] PROBLEM - Request latencies on acrab is CRITICAL: CRITICAL - apiserver_request_latencies is 41990252 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:37:06] ACKNOWLEDGEMENT - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T190420 [15:37:15] 10Operations, 10ops-eqiad: Degraded RAID on db1054 - https://phabricator.wikimedia.org/T190420#4073020 (10ops-monitoring-bot) [15:37:30] !log sbisson@tin Finished deploy [kartotherian/deploy@8f3a903]: Weekly progress to production (duration: 02m 27s) [15:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:55] RECOVERY - Request latencies on acrab is OK: OK - apiserver_request_latencies is 5313 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:37:58] 10Operations, 10netops: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071230 (10Ottomata) I'd love to see stream processing from Kafka running in Kubernetes one day (pipe dream!), and that could be highish traffic. 
[15:38:08] 10Operations, 10ops-eqiad: Degraded RAID on db1054 - https://phabricator.wikimedia.org/T190420#4073025 (10jcrespo) 05Open>03Invalid [15:38:17] 10Operations, 10ops-eqiad: Degraded RAID on db1054 - https://phabricator.wikimedia.org/T190420#4073026 (10Marostegui) 05Invalid>03declined This is part of: T190302 [15:38:22] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/10590/" [puppet] - 10https://gerrit.wikimedia.org/r/421317 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:39:56] ACKNOWLEDGEMENT - MegaRAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T190421 [15:40:00] 10Operations, 10ops-eqiad: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T190421#4073030 (10ops-monitoring-bot) [15:40:10] (03PS1) 10Ottomata: Fix typo in jupyterhub config [puppet] - 10https://gerrit.wikimedia.org/r/421320 (https://phabricator.wikimedia.org/T183145) [15:40:52] (03CR) 10Ottomata: [C: 032] Fix typo in jupyterhub config [puppet] - 10https://gerrit.wikimedia.org/r/421320 (https://phabricator.wikimedia.org/T183145) (owner: 10Ottomata) [15:41:04] 10Operations, 10ops-eqiad: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T190421#4073035 (10Marostegui) 05Open>03declined This is part of: T190301 [15:41:08] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/421317 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:42:10] (03PS2) 10Filippo Giunchedi: hieradata: take out puppetmaster1001 as frontend [puppet] - 10https://gerrit.wikimedia.org/r/421317 (https://phabricator.wikimedia.org/T184562) [15:42:35] !log sbisson@tin Started deploy [tilerator/deploy@e259530]: Weekly progress to production [15:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:49] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: take out puppetmaster1001 as frontend 
[puppet] - 10https://gerrit.wikimedia.org/r/421317 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:43:18] !log sbisson@tin Finished deploy [tilerator/deploy@e259530]: Weekly progress to production (duration: 00m 43s) [15:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:51] (03PS1) 10Bstorm: dynamicproxy: Restrict log size to 2GB [puppet] - 10https://gerrit.wikimedia.org/r/421321 (https://phabricator.wikimedia.org/T190218) [15:50:31] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#4073085 (10Cmjohnson) mw1259 was named in the switch as tmh1001 ge-4/0/28 (old name). deleted the port. mw1260 was named in the switch as tmh1002 ge-4/0/29 deleted the port The servers... [15:50:53] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#4073090 (10Cmjohnson) [15:54:27] (03PS1) 10Cmjohnson: removing dns entries mw1259-1260 [dns] - 10https://gerrit.wikimedia.org/r/421325 (https://phabricator.wikimedia.org/T187466) [15:55:15] (03CR) 10Cmjohnson: [C: 032] removing dns entries mw1259-1260 [dns] - 10https://gerrit.wikimedia.org/r/421325 (https://phabricator.wikimedia.org/T187466) (owner: 10Cmjohnson) [15:55:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4073114 (10fgiunchedi) Reimaging puppetmaster1001 isn't going according to plan, namely `eno1` is seemingly brought up and gets a dhcp lease, the... [15:56:05] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#4073115 (10Cmjohnson) [15:56:52] (03CR) 10Eevans: "Hrmm, so maps (maps200[1-4]) runs 2.1.13 (though 2.2.6-wmf1 is the candidate), and maps-test (maps-test200[1-3]) is running 2.2.6-wmf1." 
[puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [16:00:05] godog, moritzm, and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T1600). [16:00:05] pnorman and marlier: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:28] I'm here [16:00:53] (03CR) 10Elukey: "> Hrmm, so maps (maps200[1-4]) runs 2.1.13 (though 2.2.6-wmf1 is the" [puppet] - 10https://gerrit.wikimedia.org/r/421241 (https://phabricator.wikimedia.org/T184795) (owner: 10Elukey) [16:02:59] * pnorman is here [16:03:45] apologies I can't SWAT today, busy with other puppet [16:04:33] !log starting the asw-a/b/c-eqiad switches uplink work [16:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:45] maps-test2004 is the only one that actually needs that patch - it's for stuff the others aren't running yet [16:04:45] PROBLEM - Host db1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:05:39] XioNoX: I'm trying to reinstall a problematic host, will that impact what I'm doing ? pxe/dhcp/etc ? 
^ that host is going to be decommed, so that is expected [16:05:44] (03PS3) 10Sbisson: Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) [16:05:55] (03PS4) 10Sbisson: Configure maps source for localized labels [puppet] - 10https://gerrit.wikimedia.org/r/420315 (https://phabricator.wikimedia.org/T112948) [16:06:19] db1011 mgmt was a loose connection...replaced the cable [16:07:30] cmjohnson1: that host is going to be decommissioned, so… :) [16:07:40] cmjohnson1: https://phabricator.wikimedia.org/T184703 [16:07:41] okay [16:07:53] godog: if you aren't is _joe_ or moritzm doing puppet swat? [16:07:54] godog, marlier: is puppet swat canceled? looks like there is an urgent config change to revert/deploy (see _security) [16:08:05] wiping db1009 now...removing all the cable mgmt shit behind it and the cable got caught....good to know [16:08:05] or should we just move the patches to the next one? [16:08:08] cmjohnson1: it is on the DC Ops side of things [16:08:15] godog: no impact expected [16:08:20] (03PS3) 10Mforns: Modify eventlogging purging script to read from YAML whitelist [puppet] - 10https://gerrit.wikimedia.org/r/420685 (https://phabricator.wikimedia.org/T189692) [16:08:48] 10Operations, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4073147 (10awight) This has been reverted in https://gerrit.wikimedia.org/r/#/c/421316/, since I... [16:09:01] zeljkof: I haven't heard anything... [16:09:37] marlier: on puppet swat happening? or it being canceled?
[16:09:45] either :-) [16:09:55] RECOVERY - Host db1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms [16:10:29] zeljkof: I'd say go ahead with your config change [16:10:41] so I think that all the ops are a bit busy at the moment, but probably we'll be able to merge them separately later on [16:10:42] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4073157 (10Cmjohnson) [16:11:00] marlier: me or ottomata can take care of your patch in ~1h or so (we are in meetings now) [16:11:03] would it work for you? [16:11:10] Paul's patch from puppet swat is merged already [16:11:11] Sure, that's fine [16:11:14] super [16:11:16] ok then, reverting 420947 and deploying it [16:11:25] and the ZMQ patch isn't really puppet swat material to begin with [16:11:47] and Luca said he'd take care of it :-) [16:11:54] (03PS1) 10Zfilipin: Revert "Redeploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421331 [16:11:57] yeah I can take care of it with Andrew, is the change that will finally free us from eventlog1001 \o/ [16:12:16] (03PS2) 10Zfilipin: Revert "Redeploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421331 [16:12:43] (03PS3) 10Zfilipin: Revert "Redeploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421331 (https://phabricator.wikimedia.org/T189806) [16:13:31] (03PS1) 10Lucas Werkmeister (WMDE): Disable reading wb_terms search fields on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421333 (https://phabricator.wikimedia.org/T189777) [16:13:32] k, off to breakfast [16:16:20] (03CR) 10Zfilipin: [C: 032] "Requested by marostegui (Manuel Arostegui) in #mediawiki_security at 16:55 utc+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421331 (https://phabricator.wikimedia.org/T189806) (owner: 10Zfilipin) [16:16:51] (03CR) 
Marostegui: "https://phabricator.wikimedia.org/T189806#4073107" [mediawiki-config] - https://gerrit.wikimedia.org/r/421331 (https://phabricator.wikimedia.org/T189806) (owner: Zfilipin)
[16:17:00] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler03/10591/graphite1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[16:19:47] marostegui: a couple of jobs for 421331 are "queued" :|
[16:19:55] see https://integration.wikimedia.org/zuul/
[16:20:42] let's hope it doesn't take long :)
[16:20:51] Operations, Puppet, Patch-For-Review: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4073194 (fgiunchedi) Problems discovered today during the ca switchover: # permissions during rsync for ca/volatile are set based on uids apparently, not user names, thus...
[16:22:47] marlier: do you think that we could also take the opportunity to split coal's puppet code away from the performance site? Ideally in two separate profiles
[16:23:10] if you guys are busy I can try to help and take over the code review
[16:23:20] (probably not today but tomorrow)
[16:24:00] marostegui: Re https://phabricator.wikimedia.org/T189806#4073107 - when you say increase on s3, do you mean increase in writes?
[16:24:10] Yeah, I think we can do that. If you're willing to code review it, I can have a patch to do that in an hour or so... (you can CR tomorrow, that's fine)
[16:24:31] yep sure!
[16:24:31] Niharika: no, in reads
[16:24:50] I'll ping you when it's done
[16:24:54] marostegui: Is it possible to see what the read queries looked like in grafana?
[16:25:37] marlier: so I am thinking something like: profile::performance::site and profile::performance::coal (or something along those lines), then include those in role::graphite::primary
[16:25:59] bblack, no_justification: https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/mcrouter,access and https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/nutcracker,access look...odd. I would have though rights would inherit from operations/debs.
[16:26:00] (Merged) jenkins-bot: Revert "Redeploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/421331 (https://phabricator.wikimedia.org/T189806) (owner: Zfilipin)
[16:26:20] * AaronSchulz wonders what the correct way is
[16:26:34] marostegui: 421331 is finally merged, deploying
[16:27:05] zeljkof: thanks
[16:27:28] Niharika: No, not there. I could do a traffic capture if you give me some hints of how these queries look like
[16:27:33] which table, for instance
[16:28:10] (CR) jenkins-bot: Revert "Redeploy GlobalPreferences to test wikis and mw.org" [mediawiki-config] - https://gerrit.wikimedia.org/r/421331 (https://phabricator.wikimedia.org/T189806) (owner: Zfilipin)
[16:28:13] !log restarting graphite on labmon1001 to pick up uwsgi security update
[16:28:15] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:421331|Revert "Redeploy GlobalPreferences to test wikis and mw.org" (T189806)]] (duration: 01m 14s)
[16:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:23] Operations, cloud-services-team, netops: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424#4073206 (RobH) p:Triage>Normal
[16:28:26] T189806: Deploy GlobalPrefs on production - https://phabricator.wikimedia.org/T189806
[16:28:39] (PS1) Arturo Borrero
Gonzalez: labtestvirtXXX: update lease file for ubuntu servers [puppet] - https://gerrit.wikimedia.org/r/421335 (https://phabricator.wikimedia.org/T189722)
[16:28:53] Niharika: I can see the decrease already after zeljkof deployed
[16:28:54] marostegui: deployed
[16:29:05] zeljkof: thanks a lot
[16:29:06] Niharika: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=now-3h&to=now
[16:29:09] great, I'll go to my meeting then :)
[16:29:13] I will comment on the ticket, easier there :)
[16:29:41] marostegui: Not questioning that. If I can see the reads it'll be helpful to debug.
[16:30:21] Niharika: Sure :-) Can you give me some hints about how the query would look like?
[16:31:24] marostegui: It'll be looking at the user_properties tables.
[16:32:00] marostegui: I can dig through myself if you can tell me where to find them. I find grafana very confusing.
[16:32:07] (CR) RobH: [C: 1] labtestvirtXXX: update lease file for ubuntu servers [puppet] - https://gerrit.wikimedia.org/r/421335 (https://phabricator.wikimedia.org/T189722) (owner: Arturo Borrero Gonzalez)
[16:32:07] I will try to find something, but now that have been reverted will be a bit harder
[16:32:14] Niharika: No, it won't be in grafana :(
[16:32:29] Oh.
[16:32:56] (CR) Arturo Borrero Gonzalez: [C: 2] labtestvirtXXX: update lease file for ubuntu servers [puppet] - https://gerrit.wikimedia.org/r/421335 (https://phabricator.wikimedia.org/T189722) (owner: Arturo Borrero Gonzalez)
[16:33:01] marostegui: Where can I find them then? I have shell access and can look myself.
[16:33:57] Niharika: I will try to dig within mysql tendril and traffic captures first - I will ping you if I get something :)
[16:34:10] Cool, thanks!
[16:34:29] Niharika: Thank you, and sorry for reverting it :(
[16:34:59] That's okay. Glad you caught it before we did a wider deploy.
[16:37:11] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 156500 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:38:18] Operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4073238 (elukey)
[16:39:56] marostegui: I created a ticket to keep track of it - https://phabricator.wikimedia.org/T190425
[16:40:11] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4578 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[16:41:05] Niharika: Oh, sorry, I just commented on the other ticket
[16:41:45] That's alright.
[16:42:26] !log installing postgres security updates on netmon*
[16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:46] (PS1) Ema: varnishxcps: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942)
[16:43:17] (CR) jerkins-bot: [V: -1] varnishxcps: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[16:43:48] elukey: Sorry, was on a call. That way of splitting things makes sense, should have that fairly soon.
[16:44:38] (PS2) Ema: varnishxcps: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942)
[16:44:40] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There are 5 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[16:44:52] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073263 (BBlack) >>! In T190410#4072934, @Anomie wrote: >>>!
In T190410#4072879, @BBlack wrote: >> However, I find it a bit specious to use the...
[16:52:37] (PS3) Rduran: Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746
[16:52:39] (PS1) Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340
[16:54:25] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4073341 (BBlack) Yeah I do have concerns here. It's going to take some time before I can loop back and explain them, but I just wanted to put the not...
[16:55:10] !log ppchelko@tin Started deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints
[16:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:58] Operations, Analytics, User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#4073351 (fdans)
[16:59:32] (PS3) Ema: varnishxcps: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942)
[17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T1700).
[17:00:04] No GERRIT patches in the queue for this window AFAICS.
[17:00:25] No ORES today.
[17:03:21] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 404 (expecting: 200): /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) is CRITICAL: Test Get references of a test page returned the unexpected
[17:03:21] ng: 200)
[17:03:49] !log ppchelko@tin Finished deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints (duration: 08m 39s)
[17:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:46] Pchelolo: is the alarm expected ?
[17:07:04] 404 vs 200
[17:07:39] !log ppchelko@tin Started deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints, take 2
[17:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:55] elukey: that's an intermittent issue - the LVS checker got an updated spec while the request was routed to a not-yet-deployed server
[17:07:56] It could've just returned a "200 OK" plus a json message saying "not found" and then we wouldn't have to worry about these alerts!
[17:07:59] cc _joe_ ^
[17:08:03] * bblack in full snark mode today
[17:08:20] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[17:10:20] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[17:10:38] !log ppchelko@tin Finished deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints, take 2 (duration: 03m 00s)
[17:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:14] !log ppchelko@tin Started deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints, take 3
[17:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:20] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[17:12:44] (PS21) Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903)
[17:13:13] (CR) jerkins-bot: [V: -1] coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[17:14:07] !log ppchelko@tin Finished deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints, take 3 (duration: 02m 54s)
[17:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:25] Operations, ops-eqiad, netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4073423 (Cmjohnson) Row A connections are complete cr1 xe-3/0/0 -> xe-2/0/44 #4776 cr1 xe-3/1/0 -> xe-2/0/45 #3452 cr1 xe-4/0/0 -> xe-7/0/44 #1985 cr1 xe-4/1/0 -> xe-7/0/45...
[17:18:15] !log ppchelko@tin Started deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints, with increased timeout
[17:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:42] Operations, Ops-Access-Requests, Ops-Access-Reviews, Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4073436 (DYNKM) I suspect I might still be missing an access somewhere; running a hive query produces: ``` org....
[17:21:30] !log ppchelko@tin Finished deploy [restbase/deploy@93dadf7]: Release metadata and references endpoints, with increased timeout (duration: 03m 15s)
[17:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:40] Operations, ops-eqiad, DBA: db1062 (s7 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190303#4073447 (Marostegui) Open>Resolved a:Cmjohnson All good - thanks Chris! ``` root@db1062:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks:...
[17:27:11] (CR) Elukey: "Some early comments!" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[17:28:35] (PS22) Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903)
[17:29:06] (CR) jerkins-bot: [V: -1] coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[17:30:09] Operations, ops-eqiad, DBA: db1054 (s2 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190302#4073461 (Marostegui) Open>Resolved a:Cmjohnson All good now, thanks @Cmjohnson ``` root@db1054:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtu...
[17:32:10] Operations, ops-eqiad, DBA: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4073467 (Marostegui) Open>Resolved a:Cmjohnson All good now - thanks Chris! ``` root@db1061:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Dis...
[17:42:28] (CR) Pmiazga: [C: 1] Enable VirtualPageViews on s6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421134 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson)
[17:43:54] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4073507 (CCogdill_WMF)
[17:44:16] Operations, Fundraising-Backlog, Traffic, fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4012170 (CCogdill_WMF) Updating task as I want to update the subdomain in the request.
[17:44:48] !log install1002 - restarted dhcp server to confirm there was no syntax error
[17:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:18] Operations, Ops-Access-Requests, Ops-Access-Reviews, Research, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4073520 (DYNKM) Wait, scratch that; made it work!
[17:49:26] (PS23) Elukey: coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[17:54:00] (PS24) Elukey: coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[17:54:28] AaronSchulz: Well it looks like a lot of ops/debs/* have funny inheritance. But they all eventually inherit from ops/debs itself so...harmless?
[17:54:31] If a bit odd
[17:55:50] Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4073546 (Cmjohnson) @ayounsi please check xe-2/0/44 -> cr1-eqiad:xe-3/0/1 #1989 xe-2/0/45 -> cr1-eqiad:xe-4/0/1 #3457 xe-7/0/44 -> cr1-eqiad:xe-4/1/1...
[17:57:15] (CR) Elukey: "Pcc looks happy https://puppet-compiler.wmflabs.org/compiler03/10596/graphite1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[17:57:56] !log ppchelko@tin Started deploy [changeprop/deploy@4f9fbe4]: Purge page metadata and references on html change and page deletion.
[17:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:12] !log ppchelko@tin Finished deploy [changeprop/deploy@4f9fbe4]: Purge page metadata and references on html change and page deletion. (duration: 01m 16s)
[17:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T1800).
[18:00:04] raynor: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:06] (CR) Hoo man: [C: 1] "Looks good."
[mediawiki-config] - https://gerrit.wikimedia.org/r/421333 (https://phabricator.wikimedia.org/T189777) (owner: Lucas Werkmeister (WMDE))
[18:01:18] \o
[18:02:16] (CR) Imarlier: [C: 1] coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:03:55] anyone for swat?
[18:05:03] (PS2) Ladsgroup: Enable VirtualPageViews on s6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421134 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson)
[18:05:10] I can do SWAT
[18:05:12] who can SWAT ?
[18:05:22] ah, @Amir1, hey ;)
[18:05:23] raynor: Is it testable in mwdebug1002?
[18:05:27] yes
[18:05:27] Hey
[18:05:32] okay
[18:05:43] awesome, thanks
[18:06:27] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/421134 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson)
[18:07:36] (Merged) jenkins-bot: Enable VirtualPageViews on s6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421134 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson)
[18:08:05] (PS1) Arturo Borrero Gonzalez: Revert "labtestvirt2002: remove d-i partman preseed config, temporal" [puppet] - https://gerrit.wikimedia.org/r/421347
[18:08:18] (CR) jenkins-bot: Enable VirtualPageViews on s6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421134 (https://phabricator.wikimedia.org/T189906) (owner: Jdlrobson)
[18:08:46] (PS2) Arturo Borrero Gonzalez: Revert "labtestvirt2002: remove d-i partman preseed config, temporal" [puppet] - https://gerrit.wikimedia.org/r/421347
[18:08:54] raynor: it's live in mwdebug1002 :)
[18:09:24] ok, testing, give me 5 mins
[18:10:19] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073589 (Anomie) On the other hand, with all those
different proxies that makes it much more likely one of them is going to throw its own error...
[18:10:32] (CR) Arturo Borrero Gonzalez: [C: 2] Revert "labtestvirt2002: remove d-i partman preseed config, temporal" [puppet] - https://gerrit.wikimedia.org/r/421347 (owner: Arturo Borrero Gonzalez)
[18:11:21] it's suppper slow :/
[18:11:41] sure, take your time, the node on its own is very slow
[18:13:00] (CR) Dzahn: [C: 1] tmux/screen: retry_internal set to 10 mins [puppet] - https://gerrit.wikimedia.org/r/421314 (https://phabricator.wikimedia.org/T187528) (owner: Alexandros Kosiaris)
[18:14:27] Amir1, - s6 wikis -> it's ru, ja and fr right?
[18:14:54] let me double check
[18:14:57] yes
[18:15:03] I know them by heart
[18:15:24] raynor: https://github.com/wikimedia/operations-mediawiki-config/blob/master/dblists/s6.dblist
[18:15:29] yup :)
[18:16:10] I used to know server name of master of several shards by heart but I forgot
[18:16:14] (PS25) Elukey: coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:16:29] ok, somehow I don't see the config change
[18:16:52] s/shards/sections/
[18:17:32] Amir1, could you double check that code hit production?
[18:18:00] the `mw.config.get('wgPopupsVirtualPageViews');` should be set to true
[18:18:02] but it's false
[18:18:17] raynor: did you enable mwdebug1002 extension?
[18:18:38] (CR) Ottomata: [C: 1] "Let's do it!"
[puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:18:59] yes
[18:19:17] (CR) Elukey: "new pcc https://puppet-compiler.wmflabs.org/compiler03/10597/graphite1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:19:24] https://gerrit.wikimedia.org/r/#/c/421134/2/wmf-config/InitialiseSettings.php -> this is our change
[18:19:29] we want to change one settings for s6 wikis
[18:19:33] are we doing it right?
[18:19:34] (CR) Imarlier: [C: 1] coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:20:28] (PS2) Rduran: [WIP] Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340
[18:20:58] I've never seen section name as being used as config variable but that should work
[18:21:02] it's in dblists
[18:21:30] (PS26) Elukey: coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:22:01] marlier: merging!
[18:22:10] (CR) Elukey: [C: 2] coal: Process from Kafka instead of from ZMQ [puppet] - https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: Imarlier)
[18:22:13] (Abandoned) Dzahn: mediawik_deployment: stretch support for PHP packages [puppet] - https://gerrit.wikimedia.org/r/421202 (https://phabricator.wikimedia.org/T175288) (owner: Dzahn)
[18:23:12] raynor: I double checked everything
[18:23:15] marlier: running puppet on graphite1001
[18:23:22] Amir1: yeah, it looks like it doesn't pick the config change
[18:23:31] I don't see it on frontend :(
[18:23:45] it's fine, the only explanation is that s6 can't be used
[18:24:06] I don't see a single error on mwdebug after scap
[18:24:15] can I create a new patch and specify fr,ru and ja one by one?
[18:24:32] marlier: all good! coal is running with kafka now :)
[18:24:46] raynor: that would be great
[18:25:40] sure, can you unmerge the current one?
[18:25:57] raynor: yes, greping the InitialiseSettings.php says there is no similar config
[18:26:00] elukey: thank so much!
[18:26:04] thanks, even :-)
[18:26:17] raynor: no it's not possible.
Make a change on top of that
[18:26:20] and I sync both
[18:26:26] (sync a file)
[18:26:37] when you made it, let me know and I merge it
[18:27:39] raynor: just press revert on gerrit, that will do what you need
[18:27:59] woohooo
[18:28:15] (PS1) Pmiazga: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/421351
[18:28:18] (PS1) Pmiazga: Enable VirtualPageViews on s6 (ja,ru,fr) wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421352 (https://phabricator.wikimedia.org/T189906)
[18:28:20] (PS1) Ottomata: Allow user venv to use system-site-packages [puppet] - https://gerrit.wikimedia.org/r/421353 (https://phabricator.wikimedia.org/T183145)
[18:28:48] editing deployments page
[18:29:02] (PS2) Ladsgroup: Enable VirtualPageViews on s6 (ja,ru,fr) wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421352 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga)
[18:30:02] Amir1, done
[18:30:14] https://gerrit.wikimedia.org/r/#/c/421352/
[18:30:20] (CR) Ottomata: [C: 2] Allow user venv to use system-site-packages [puppet] - https://gerrit.wikimedia.org/r/421353 (https://phabricator.wikimedia.org/T183145) (owner: Ottomata)
[18:30:33] (Abandoned) Ladsgroup: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/421351 (owner: Pmiazga)
[18:30:36] I added that patch to Deployments page
[18:30:57] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/421352 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga)
[18:31:36] ottomata,marlier - https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=coal&from=now-1h&to=now
[18:32:02] (Merged) jenkins-bot: Enable VirtualPageViews on s6 (ja,ru,fr)
wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421352 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga)
[18:32:04] seems already on track
[18:32:12] Yep
[18:32:17] (CR) jenkins-bot: Enable VirtualPageViews on s6 (ja,ru,fr) wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/421352 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga)
[18:32:21] Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4073679 (Cmjohnson) @ayounsi xe-2/0/44 -> cr1-eqiad:xe-3/0/2 #1984 xe-2/0/45 -> cr1-eqiad:xe-3/1/2 #3452 xe-7/0/44 -> cr1-eqiad:xe-4/0/2 2627 xe-7/0/45 -> cr1-eqiad:xe-4/1/2 346...
[18:32:44] Seeing updates to the graphs on https://performance.wikimedia.org/#!/hour as well
[18:32:47] So that's good.
[18:33:28] marlier: I'll wait until tomorrow before nuking eventlog1001, would it be ok if I start decommissioning it if I don't hear anything from you about horrible things happened?
[18:34:06] It does appear that I don't have sudo rights on graphite1001, so I don't actually have a way to check the logs. But, given that the data is updating and seems sane, I'm not too worried.
[18:34:09] I think that's fine, yes.
[18:34:23] where are the logs?
[18:34:57] raynor: your patch is live in mwdebug1002
[18:35:39] marlier: so you can use journalctl and coal on graphite1001 afaics from sudoers
[18:35:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0
[18:35:51] thx, checking
[18:36:15] ok, now I see the config change
[18:36:20] let me test the feature
[18:37:07] elukey: you're right, nevermind
[18:38:32] Amir1, it works \o/
[18:38:37] marlier: weird that on journalctl -u coal I can only see Mar 22 18:23:48 graphite1001 coal[40271]: 2018-03-22 18:23:48,892 [INFO] (run:220) Beginning poll cycle
[18:38:39] yay
[18:38:46] please proceed - deploy config to prod
[18:41:00] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:421352|Enable VirtualPageViews on s6 (ja,ru,fr) wikis (T189906)]] (duration: 01m 16s)
[18:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:06] T189906: Roll out VirtualPageViews to all Wikipedia wikis - https://phabricator.wikimedia.org/T189906
[18:41:15] raynor: ^
[18:41:23] it's live everywhere
[18:41:43] so 's6' as dblists doesn't work?
[18:42:03] jynus_: TIL I learned it too
[18:42:16] I don't see any other instance, which seemed strange
[18:42:23] checking
[18:42:32] but I cannot guess why
[18:42:53] Operations, monitoring: add tftpd monitoring - https://phabricator.wikimedia.org/T190439#4073739 (herron)
[18:42:55] it works
[18:42:56] Operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#4073749 (Dzahn) @faidon @ayounsi Should we keep this open (of course carbon doesn't exist anymore nowadays but install1001 does)
[18:42:59] Amir1, thanks for deployment
[18:43:05] cool
[18:43:13] !log Morning SWAT is done
[18:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:56] elukey: that's expected -- under normal operation, it only really logs anything if something's not working right.
[18:45:04] Operations, netops: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#4073756 (Dzahn)
[18:48:29] Operations, netops: Security audit for tftp on install1001 - https://phabricator.wikimedia.org/T122210#4073763 (ayounsi)
[18:48:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[18:50:22] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073785 (BBlack) >>! In T190410#4073589, @Anomie wrote: > On the other hand, with all those different proxies that makes it much more likely on...
[18:51:36] Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4073787 (Cmjohnson) change cable number for cr1 xe- 4/0/1 to following xe-7/0/44 -> cr1-eqiad:xe-4/0/2 #3509
[18:52:04] !log done with the asw-a/b/c-eqiad switches uplink work
[18:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:36] (PS1) Chad: group2 to wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/421358
[19:00:04] no_justification: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T1900).
[19:00:04] No GERRIT patches in the queue for this window AFAICS.
[19:01:00] (CR) Dzahn: "re: "Keep the /etc/init.d/jenkins sudo rule for less suprise." i'm not sure if that is less surprise or actually MORE surprise. I remember" [puppet] - https://gerrit.wikimedia.org/r/408555 (https://phabricator.wikimedia.org/T190277) (owner: Hashar)
[19:02:23] (CR) Dzahn: "i would actually change all of that to use systemctl (just a general comment, don't mean to block this change or that it should be the sam" [puppet] - https://gerrit.wikimedia.org/r/408555 (https://phabricator.wikimedia.org/T190277) (owner: Hashar)
[19:04:49] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4073815 (BBlack) We've talked about this a bit this week. Basic initial steps of the plan at this point are: 1) T...
[19:05:09] (PS1) BBlack: eqsin: turn-up test, SG-only [dns] - https://gerrit.wikimedia.org/r/421361 (https://phabricator.wikimedia.org/T189252)
[19:06:18] Operations, Puppet, Patch-For-Review, User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4073819 (herron) The reimage problem on puppetmaster1001 was solved by reverting https://gerrit.wikimedia.org/r/#/c/421279/ which had inadverte...
[19:07:52] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073831 (Anomie) >>! In T190410#4073785, @BBlack wrote: > Yes, it's entirely possible any of the proxies might cause or record errors, and we'd...
[19:08:04] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073833 (Anomie)
[19:08:27] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072782 (Anomie) It's now clear to me that this task is a duplicate of T40716, so I'm going to close it as such.
[19:11:50] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100%
[19:13:30] ^ can ignore that
[19:13:36] hrm. ^ that was probably me
[19:14:11] bblack: I am starting to work on lvs1016 and was going to utilize the existing fiber runs as much as possible
[19:14:14] any issues with that?
[19:14:35] from lvs1007-1012
[19:16:10] Operations, ops-eqiad, Cloud-VPS, cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4073863 (ayounsi)
[19:16:25] no, those hosts are dead to me. I just haven't done any decom work on them.
[19:16:37] Operations, ops-eqiad, netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4073864 (ayounsi) [19:16:47] we should probably disabled them in monitoring, though, will do shortly [19:17:16] Operations, ops-eqiad, netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4073866 (ayounsi) [19:18:07] Warning Alert for device mr1-eqiad.wikimedia.org - Inbound interface errors [19:18:59] [done] [19:24:09] (CR) Chad: [C: 2] group2 to wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/421358 (owner: Chad) [19:25:18] (Merged) jenkins-bot: group2 to wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/421358 (owner: Chad) [19:26:25] Did something deploy today that changes the behavior of index.php?action=raw for JSON pages? [19:26:50] This is serving an HTML error where it did not used to: https://meta.wikimedia.org/w/index.php?action=raw&title=User:Sage%20(Wiki%20Ed)/dashboard%20modules/nonexistent.json [19:28:20] (CR) jenkins-bot: group2 to wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/421358 (owner: Chad) [19:28:33] whoops, should've posted to releng.
[19:32:45] PROBLEM - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:45] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:45] PROBLEM - Confd template for /srv/config-master/pybal/codfw/rendering-https on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:45] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:45] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ores on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift-https on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:47] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:32:48] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/dns_rec on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:33:07] 08̶W̶a̶r̶n̶i̶n̶g Device mr1-eqiad.wikimedia.org recovered from Inbound interface errors [19:33:08] O_o [19:33:52] herron: I guess those are the race condition between icinga run and reimage in progress, but please keep an eye on them ^^^ [19:34:14] !log db1052 replacing disk slot 8 [19:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:25] volans: makes sense [19:34:35] PROBLEM - Check size of conntrack table on puppetmaster1001 is 
CRITICAL: Return code of 255 is out of bounds [19:34:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/mathoid on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:35] PROBLEM - Confd template for /srv/config-master/pybal/codfw/restbase on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:35] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/druid-public-broker on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/zotero on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:36] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:37] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/parsoid on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:37] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:38] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:34:38] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/dns_rec_udp on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:35:38] sorry about the noise, downtime expired [19:36:15] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/cxserver on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/misc_web on puppetmaster1001 is CRITICAL: Return code of 255 is out of 
bounds [19:36:15] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apaches on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:16] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/pdfrender on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:16] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventbus on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:17] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:17] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:18] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:18] PROBLEM - confd service on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:36:19] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is CRITICAL: Return code of 255 is out of bounds [19:37:09] Operations, ops-eqiad, DBA: db1052 (s1 master) disks with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4073898 (Cmjohnson) The disk at slot 8 has been swapped and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun...
[19:42:45] (PS1) Herron: Revert "hieradata: take out puppetmaster1001 as frontend" [puppet] - https://gerrit.wikimedia.org/r/421370 [19:43:16] (CR) jerkins-bot: [V: -1] Revert "hieradata: take out puppetmaster1001 as frontend" [puppet] - https://gerrit.wikimedia.org/r/421370 (owner: Herron) [19:43:53] (PS2) Herron: Revert "hieradata: take out puppetmaster1001 as frontend" [puppet] - https://gerrit.wikimedia.org/r/421370 [19:44:33] (CR) Herron: [C: 2] Revert "hieradata: take out puppetmaster1001 as frontend" [puppet] - https://gerrit.wikimedia.org/r/421370 (owner: Herron) [19:44:41] (PS3) Herron: Revert "hieradata: take out puppetmaster1001 as frontend" [puppet] - https://gerrit.wikimedia.org/r/421370 [19:46:43] the yellow from ̶W̶a̶r̶n̶i̶n̶g Device mr1-eqiad.wikimedia.org recovered from Inbound interface errors / is too bright [19:47:15] Its a warning to your eyes! [19:47:58] lol [19:49:52] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073920 (BBlack) >>! In T190410#4073831, @Anomie wrote: > Yes, there's a difference between getting "500 Something broke" and getting a data st...
[19:52:15] !log demon@tin rebuilt and synchronized wikiversions files: group2 to wmf.26 [19:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:42] (PS2) Andrew Bogott: wiki replicas: Update help for maintain-views --table [puppet] - https://gerrit.wikimedia.org/r/421035 (owner: BryanDavis) [19:53:24] (CR) Andrew Bogott: [C: 2] wiki replicas: Update help for maintain-views --table [puppet] - https://gerrit.wikimedia.org/r/421035 (owner: BryanDavis) [19:54:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/cxserver on puppetmaster1001 is OK: No errors detected [19:54:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/misc_web on puppetmaster1001 is OK: No errors detected [19:54:16] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search on puppetmaster1001 is OK: No errors detected [19:54:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apaches on puppetmaster1001 is OK: No errors detected [19:54:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventbus on puppetmaster1001 is OK: No errors detected [19:54:17] RECOVERY - confd service on puppetmaster1001 is OK: OK - confd is active [19:54:36] RECOVERY - Confd template for /srv/config-master/pybal/codfw/citoid on puppetmaster1001 is OK: No errors detected [19:54:36] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on puppetmaster1001 is OK: No errors detected [19:54:36] RECOVERY - Confd template for /srv/config-master/pybal/codfw/mathoid on puppetmaster1001 is OK: No errors detected [19:54:36] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/parsoid on puppetmaster1001 is OK: No errors detected [19:54:36] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/dns_rec_udp on puppetmaster1001 is OK: No errors detected [19:54:37] RECOVERY - Confd template for /srv/config-master/pybal/codfw/zotero on puppetmaster1001 is OK: No errors detected
[19:54:37] RECOVERY - Confd template for /srv/config-master/pybal/codfw/restbase on puppetmaster1001 is OK: No errors detected [19:54:38] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text on puppetmaster1001 is OK: No errors detected [19:54:38] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/druid-public-broker on puppetmaster1001 is OK: No errors detected [19:54:39] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload on puppetmaster1001 is OK: No errors detected [19:54:39] RECOVERY - Check size of conntrack table on puppetmaster1001 is OK: OK: nf_conntrack is 0 % full [19:54:55] RECOVERY - Confd template for /srv/config-master/pybal/codfw/rendering-https on puppetmaster1001 is OK: No errors detected [19:54:55] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on puppetmaster1001 is OK: No errors detected [19:54:55] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ores on puppetmaster1001 is OK: No errors detected [19:54:55] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is OK: No errors detected [19:54:55] RECOVERY - Confd template for /srv/config-master/pybal/codfw/appservers-https on puppetmaster1001 is OK: No errors detected [19:54:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on puppetmaster1001 is OK: No errors detected [19:54:56] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/swift-https on puppetmaster1001 is OK: No errors detected [19:54:57] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/text-https on puppetmaster1001 is OK: No errors detected [19:54:57] (PS1) BryanDavis: Use ::exim4 consistently when applying class [puppet] - https://gerrit.wikimedia.org/r/421375 [19:54:57] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster1001 is OK: No errors detected [19:54:58] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/dns_rec on
puppetmaster1001 is OK: No errors detected [19:55:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/pdfrender on puppetmaster1001 is OK: No errors detected [19:55:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on puppetmaster1001 is OK: No errors detected [19:55:16] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/upload-https on puppetmaster1001 is OK: No errors detected [19:55:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/text-https on puppetmaster1001 is OK: No errors detected [19:55:16] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/text on puppetmaster1001 is OK: No errors detected [19:57:15] huh, I guess recovery notifications ignore downtime [19:59:58] PROBLEM - MegaRAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [19:59:59] ACKNOWLEDGEMENT - MegaRAID on db1052 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T190446 [20:00:04] Operations, ops-eqiad: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T190446#4073946 (ops-monitoring-bot) [20:00:28] <_joe_> ouch [20:02:38] PROBLEM - configured eth on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [20:04:18] PROBLEM - dhclient process on labtestvirt2002 is CRITICAL: Return code of 255 is out of bounds [20:05:59] PROBLEM - kvm ssl cert on labtestvirt2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:07:28] PROBLEM - nova-compute proc minimum on labtestvirt2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:08:24] Operations, ops-eqiad, DBA: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T190446#4073977 (Volans) p:Triage>Normal It's now rebuilding AFAIK there was a disk replaced: ``` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components i...
[20:08:59] RECOVERY - kvm ssl cert on labtestvirt2002 is OK: Cert /etc/ssl/localcerts/labvirt-star.codfw.wmnet.crt will not expire for at least 30 days. [20:09:18] RECOVERY - dhclient process on labtestvirt2002 is OK: PROCS OK: 0 processes with command name dhclient [20:09:28] RECOVERY - nova-compute proc minimum on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [20:13:50] (CR) Chad: [C: 2] Change kind cache: short-circuit on root commits [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/419790 (owner: Paladox) [20:14:02] thanks :) [20:18:08] !log reimaged labtestvirt2002 [20:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:49] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 48165500 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:21:08] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 84094193 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:27:28] (PS1) Chad: Updating most plugins to latest stable-2.14 [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/421382 [20:28:49] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4576 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:29:09] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5933 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:32:52] (CR) Paladox: [C: 1] Updating most plugins to latest stable-2.14 [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/421382 (owner: Chad) [20:34:15] (CR) BryanDavis: [C: 1] "Let's give it a try" [puppet] - https://gerrit.wikimedia.org/r/421321 (https://phabricator.wikimedia.org/T190218) (owner: Bstorm) [20:35:38] (PS2) Bstorm: dynamicproxy: Restrict log size to 2GB
[puppet] - https://gerrit.wikimedia.org/r/421321 (https://phabricator.wikimedia.org/T190218) [20:37:04] Operations, Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4074067 (RobH) p:Triage>Normal [20:37:52] Operations, Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (RobH) @Lucas_Werkmeister_WMDE: Can you review https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups as it outlines what the access groups are... [20:39:59] (CR) Bstorm: [C: 2] dynamicproxy: Restrict log size to 2GB [puppet] - https://gerrit.wikimedia.org/r/421321 (https://phabricator.wikimedia.org/T190218) (owner: Bstorm) [20:46:28] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational [20:47:58] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:55:39] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge.
[20:55:53] Operations, Cloud-Services, Developer-Relations, LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463#4074150 (bd808) [20:58:10] Operations, Traffic: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074178 (Krinkle) [20:58:27] ragesoss: ^ [21:01:00] (CR) Chad: [V: 2 C: 2] Change kind cache: short-circuit on root commits [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/419790 (owner: Paladox) [21:01:05] :) [21:01:42] Operations, Traffic: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074189 (Ragesoss) @Krinkle thanks much! From that description, I'm guessing this won't affect many people. I was querying for arbitrary .json pages and n... [21:03:00] (CR) Ayounsi: [C: 1] eqsin: turn-up test, SG-only [dns] - https://gerrit.wikimedia.org/r/421361 (https://phabricator.wikimedia.org/T189252) (owner: BBlack) [21:03:02] (CR) BBlack: [C: 2] eqsin: turn-up test, SG-only [dns] - https://gerrit.wikimedia.org/r/421361 (https://phabricator.wikimedia.org/T189252) (owner: BBlack) [21:07:49] Operations, Traffic: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074212 (Krinkle) a:Krinkle>None [21:16:57] Operations, Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4074242 (RobH) [21:29:31] (CR) Krinkle: [C: 1] wiki replicas: Add spamblacklist to allowed log types [puppet] - https://gerrit.wikimedia.org/r/418710 (https://phabricator.wikimedia.org/T184483) (owner: BryanDavis) [21:33:34] (CR) Krinkle: labswiki: Replace 'm5-master' CNAME with backing db name (1 comment)
[mediawiki-config] - https://gerrit.wikimedia.org/r/417324 (owner: BryanDavis) [21:34:32] (CR) BryanDavis: [C: -2] labswiki: Replace 'm5-master' CNAME with backing db name [mediawiki-config] - https://gerrit.wikimedia.org/r/417324 (owner: BryanDavis) [21:39:02] (PS2) BryanDavis: toolforge: Add Content-Security-Policy-Report-Only header [puppet] - https://gerrit.wikimedia.org/r/420619 (https://phabricator.wikimedia.org/T130748) [21:42:18] (CR) Dzahn: [C: 2] base/icinga: add Hiera override to skip systemd monitoring [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn) [21:42:31] (PS5) Dzahn: base/icinga: add Hiera override to skip systemd monitoring [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) [21:44:17] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler03/10598/" [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn) [21:47:43] (PS3) Madhuvishy: toolforge: Add Content-Security-Policy-Report-Only header [puppet] - https://gerrit.wikimedia.org/r/420619 (https://phabricator.wikimedia.org/T130748) (owner: BryanDavis) [21:48:24] (PS2) Dzahn: gerrit: skip systemd monitoring on gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/419086 (https://phabricator.wikimedia.org/T176532) [21:49:18] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:49:21] (CR) Madhuvishy: [C: 2] toolforge: Add Content-Security-Policy-Report-Only header [puppet] - https://gerrit.wikimedia.org/r/420619 (https://phabricator.wikimedia.org/T130748) (owner: BryanDavis) [21:51:18] Operations, MediaWiki-API, Traffic: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4074369 (Anomie) >>!
In T190410#4073920, @BBlack wrote: > What exactly is the client going to do differently with ` (CR) Dzahn: [C: 2] "Made possible by https://gerrit.wikimedia.org/r/#/c/419084/" [puppet] - https://gerrit.wikimedia.org/r/419086 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn) [22:07:26] (CR) Bstorm: [C: 1] "Seems legit" [puppet] - https://gerrit.wikimedia.org/r/421375 (owner: BryanDavis) [22:11:38] (PS2) Dzahn: tmux/screen: retry_internal set to 10 mins [puppet] - https://gerrit.wikimedia.org/r/421314 (https://phabricator.wikimedia.org/T187528) (owner: Alexandros Kosiaris) [22:15:34] Operations, DBA, Gerrit, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4074444 (Dzahn) - added parameter to base monitoring class to allow disabling of system... [22:19:20] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:25:51] !log icinga - re-enabling notifications for a LOT of "systemd checks" that were all OK since a longer time but had not been re-enabled after some maintenance [22:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:40] (CR) Chad: [V: 2 C: 2] Updating most plugins to latest stable-2.14 [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/421382 (owner: Chad) [22:30:10] (CR) Dzahn: [C: 2] tmux/screen: retry_internal set to 10 mins [puppet] - https://gerrit.wikimedia.org/r/421314 (https://phabricator.wikimedia.org/T187528) (owner: Alexandros Kosiaris) [22:30:51] (CR) Dzahn: [C: 2] "thanks!"
[puppet] - https://gerrit.wikimedia.org/r/421314 (https://phabricator.wikimedia.org/T187528) (owner: Alexandros Kosiaris) [22:36:17] Operations, monitoring, Patch-For-Review: restbase: skip icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4074487 (Dzahn) I merged that and it works (thanks for review). Then i used it to skip systemd monitoring on gerrit2001 (because of T176532) Meanwhile the syst... [22:36:30] Operations, monitoring, Patch-For-Review: restbase: skip (some) icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4074489 (Dzahn) [22:36:34] Operations, monitoring, Patch-For-Review: restbase: skip (some) icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4029186 (Dzahn) Open>Resolved [22:37:44] Operations, monitoring: Check for long running screen/tmux should mention usernames - https://phabricator.wikimedia.org/T181409#3788998 (Dzahn) a:Dzahn [22:39:10] (PS4) Dzahn: Gerrit: Allow enabling of tls encryption for SMTP [puppet] - https://gerrit.wikimedia.org/r/406145 (owner: Chad) [22:40:05] (CR) Dzahn: [C: 2] Gerrit: Allow enabling of tls encryption for SMTP [puppet] - https://gerrit.wikimedia.org/r/406145 (owner: Chad) [22:43:57] (PS3) Dzahn: Gerrit: Swap git auth to HTTP_LDAP [puppet] - https://gerrit.wikimedia.org/r/410474 (owner: Chad) [22:44:44] (CR) Dzahn: [C: 2] Gerrit: Swap git auth to HTTP_LDAP [puppet] - https://gerrit.wikimedia.org/r/410474 (owner: Chad) [22:44:46] jouncebot: next [22:44:46] In 0 hour(s) and 15 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T2300) [22:47:11] !log restarting Gerrit to apply config changes gerrit:406145 and gerrit:410474 [22:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:52] Gerrit is back [22:51:49] hmm https://gerrit.wikimedia.org is not
loading for me [22:51:57] ah [22:52:00] does now [22:56:55] is it just me, or does gerrit's new ui no longer tell you that someone else has updated the change while you're viewing it? [22:57:52] Cupid that [22:57:54] that [22:57:56] uh [22:58:03] that's fixed in a newer release [22:58:06] i think 2.15+ [22:59:18] o nice [23:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180322T2300). [23:00:05] Amir1, odder, and Cupid: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:14] o/ [23:01:06] \o [23:04:24] (CR) Dzahn: "i would like to know if we can avoid having to almost 2000 lines of php.ini and was wondering what the diff is agains the standard php.ini" [puppet] - https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) (owner: Paladox) [23:06:43] Amir1: can you do it?
[23:06:54] yeah sure [23:07:03] ty :) [23:07:16] thank you for trusting me [23:07:40] let's not go that far :P [23:07:55] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/420934 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:08:42] :))) [23:09:02] (Merged) jenkins-bot: Update logo for Dutch Low Saxon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/420934 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:09:18] Cupid: your patches are in master of two extensions, it's not valid for SWAT [23:09:38] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /srv 61800 MB (12% inode=99%) [23:10:58] (CR) jenkins-bot: Update logo for Dutch Low Saxon Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/420934 (https://phabricator.wikimedia.org/T190051) (owner: Odder) [23:11:23] Does the issue about FlaggedRevs have been noticed? It's flooding fatalmonitor (152 atm) [23:12:11] https://wikitech.wikimedia.org/wiki/Deployments/Inclusion_criteria says that extension updates are supposed to be SWATed tho? [23:14:29] Amir1: yeah, see -core [23:14:40] Cupid: that document doesn't mention SWAT at all. It's about big things that needs a dedicated time slot in the deployment calender (and should not be SWATed) your patches doesn't fit any of it [23:14:42] greg-g: cool [23:15:25] ah [23:15:49] https://wikitech.wikimedia.org/wiki/SWAT_deploys#Guidelines [23:15:53] see allowed types of patches [23:16:34] I thought I saw odder but he is not around [23:16:49] I guess I just deploy his file updates, it's in mwdebug1002 [23:17:45] so I need someone to +2 it then it will be done?
[23:18:56] Cupid: yes, unless you want to backport it (so you don't need to wait for a week for it to get deployed) [23:19:07] that is true mostly for high-priority bugs [23:19:19] ah, thanks [23:19:42] Cupid: sorry I couldn't help you :/ I will add some people to review [23:20:01] fwiw, the odder patch looks fine in mwdebug1002. Will go live [23:20:05] its fine, and thanks [23:21:59] (PS1) Aaron Schulz: [WIP] Initial debianization [debs/dynomite] - https://gerrit.wikimedia.org/r/421447 [23:23:09] !log ladsgroup@tin Synchronized static/images/project-logos/nds_nlwiki-1.5x.png: static/images/project-logos/nds_nlwiki-2x.png static/images/project-logos/nds_nlwiki.png [[gerrit:420934|Update logo for Dutch Low Saxon Wikipedia (T190051)]] (duration: 00m 59s) [23:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:15] T190051: Please update nds-nl Wikipedia logo - https://phabricator.wikimedia.org/T190051 [23:24:40] _joe_: I'm a serious n00b at this ;) [23:25:40] <_joe_> AaronSchulz: I have my hands full atm and I'm going on vacation on wednesday, but when I come back I can surely help [23:26:03] <_joe_> AaronSchulz: or, I can patch up a rudimentary package maybe tomorrow [23:26:11] !log ladsgroup@tin Synchronized static/images/project-logos/nds_nlwiki-1.5x.png: [[gerrit:420934|Update logo for Dutch Low Saxon Wikipedia (T190051)]] (duration: 00m 58s) [23:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:33] <_joe_> given dynomite derives from nutcracker, the debianization for it could be a good starting point maybe?
[23:27:08] _joe_: yeah, that and other things were a reference for https://gerrit.wikimedia.org/r/421447 [23:27:18] <_joe_> we could ask the nutcracker maintainer in debian for some help *cough* *cough* [23:27:44] <_joe_> AaronSchulz: oh nice, I'll take a look tomorrow [23:27:54] !log ladsgroup@tin Synchronized static/images/project-logos/nds_nlwiki-2x.png: [[gerrit:420934|Update logo for Dutch Low Saxon Wikipedia (T190051)]] (duration: 00m 56s) [23:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:16] _joe_: and no replies to your github comment :/ [23:28:38] (PS23) Paladox: Phabricator: Support php 7.2 under stretch [puppet] - https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [23:28:40] <_joe_> which one? [23:28:41] did you already ping the twemproxy people too? [23:29:15] https://github.com/Netflix/dynomite/issues/367 [23:29:22] !log ladsgroup@tin Synchronized static/images/project-logos/nds_nlwiki.png: [[gerrit:420934|Update logo for Dutch Low Saxon Wikipedia (T190051)]] (duration: 00m 58s) [23:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:28] T190051: Please update nds-nl Wikipedia logo - https://phabricator.wikimedia.org/T190051 [23:30:13] <_joe_> AaronSchulz: ah on dynomite, right [23:30:21] <_joe_> yeah, I didn't have high hopes :P [23:30:48] <_joe_> AaronSchulz: btw, http://metadata.ftp-master.debian.org/changelogs/main/n/nutcracker/nutcracker_0.4.1+dfsg-1_changelog <-- we could ask for help to him ;) [23:32:45] <_joe_> so just looking at your lintian issues, the only one that really needs some work is libyaml being statically compiled into dynomite [23:32:48] * AaronSchulz sighs at the dozens of unlicensed files [23:33:26] I guess they mostly fall under the LICENSE/NOTES copyright?
[23:33:29] <_joe_> don't worry about debian/copyright for now, we can work on that later; for now it would be ok to have a semi-decent package we can experiment with [23:33:42] I thought the logos are not sync'd but they were behind varnish : [23:33:42] <_joe_> yeah I assume that for unlicensed files [23:33:48] * :/ [23:34:07] Anyway, the odder's patch is deployed, now mine [23:34:36] I assume they woulnd't pull in a file from someone else that didn't have a copyright notice already ;) [23:35:30] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/421333 (https://phabricator.wikimedia.org/T189777) (owner: Lucas Werkmeister (WMDE)) [23:35:37] _joe_: not sure why they made it static. I didn't really look yet. [23:36:49] (Merged) jenkins-bot: Disable reading wb_terms search fields on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/421333 (https://phabricator.wikimedia.org/T189777) (owner: Lucas Werkmeister (WMDE)) [23:38:19] (CR) jenkins-bot: Disable reading wb_terms search fields on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/421333 (https://phabricator.wikimedia.org/T189777) (owner: Lucas Werkmeister (WMDE)) [23:39:21] !log ladsgroup@tin Synchronized wmf-config/Wikibase-production.php: [[gerrit:421333|Disable reading wb_terms search fields on wikidata (T189777)]] (duration: 00m 58s) [23:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:27] T189777: Disable reading from term_search_key from wb_terms table in wikidata - https://phabricator.wikimedia.org/T189777 [23:39:46] Just to note, if you are seeing any performance regression (specially database-wise) ^ is the fault.
It's very unlikely as we tested it on testwikidata [23:40:26] !log Just to note, if you are seeing any performance regression (specially database-wise) 421333 might be the reason [23:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:47] !log Evening SWAT is done [23:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:10] Amir1: thanks :) [23:42:30] yw, feel free to ping me if you need me for any SWAT [23:45:47] Operations, Commons, MediaWiki-File-management, Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4074620 (kchapman) TechCom discussed this at our last meeting. The problem statement is still valid, but that doesn't mean it needs to be kept open as... [23:49:58] RECOVERY - MegaRAID on db1052 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [23:53:30] (PS24) Paladox: Phabricator: Support php 7.2 under stretch [puppet] - https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [23:58:10] Operations, HHVM, User-Elukey: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4074637 (Bawolff) Another thing to watch out for, is that farsi wikis are using a hack to work around a bug in the old version of libicu. They should probably be moved bac...