[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T0000). Please do the needful. [00:00:26] nothing to deploy again [00:00:33] Good job team, best swat ever [00:02:18] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [00:02:36] (03PS1) 10RobH: decom db2015 [puppet] - 10https://gerrit.wikimedia.org/r/335170 [00:04:10] (03PS1) 10RobH: decom db2015 [dns] - 10https://gerrit.wikimedia.org/r/335171 [00:04:32] (03CR) 10RobH: [C: 032] decom db2015 [puppet] - 10https://gerrit.wikimedia.org/r/335170 (owner: 10RobH) [00:04:57] (03CR) 10RobH: [C: 032] decom db2015 [dns] - 10https://gerrit.wikimedia.org/r/335171 (owner: 10RobH) [00:04:58] PROBLEM - Disk space on elastic1040 is CRITICAL: DISK CRITICAL - free space: / 2240 MB (8% inode=90%) [00:05:38] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: / 1554 MB (5% inode=90%) [00:09:00] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decomission db2015 - https://phabricator.wikimedia.org/T149102#2984103 (10RobH) a:05RobH>03Papaul [00:10:15] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decomission db2015 - https://phabricator.wikimedia.org/T149102#2742190 (10RobH) Ok, this is ready to have the disks wiped and have it pulled from the rack, along with the remaining steps listed in the updated task description. 
[00:10:38] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2984111 (10RobH) [00:10:55] !log mobrovac@tin Started restart [changeprop/deploy@2b980fa]: Service restart for firejail upgrade [00:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:27] !log mobrovac@tin Started restart [citoid/deploy@95df861]: Service restart for firejail upgrade [00:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:24] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2984117 (10Papaul) p:05Triage>03Normal [00:17:48] !log mobrovac@tin Started restart [cxserver/deploy@5ae4f8b]: Service restart for firejail upgrade [00:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:53] 06Operations, 10ops-codfw, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2984118 (10Papaul) a:05Papaul>03elukey installation complete [00:18:18] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:19:46] (03Abandoned) 10Krinkle: Add eventstreams.wikimedia.org to cache misc [puppet] - 10https://gerrit.wikimedia.org/r/322954 (https://phabricator.wikimedia.org/T143925) (owner: 10Ottomata) [00:20:10] !log mobrovac@tin Started restart [graphoid/deploy@da37386]: Service restart for firejail upgrade [00:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:23] !log mobrovac@tin Started restart [mathoid/deploy@ba3217e]: Service restart for firejail upgrade [00:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:27] (03CR) 10RobH: [V: 032 C: 032] decom db2015 [puppet] - 10https://gerrit.wikimedia.org/r/335170 (owner: 10RobH) [00:22:38] !log mobrovac@tin Started restart [mobileapps/deploy@7615bf9]: Service restart for firejail upgrade [00:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:57] !log mobrovac@tin Started restart [electron-render/deploy@f1df2d3]: Service restart for firejail upgrade [00:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:35] !log restarting elasticsearch on elastic1029, got stuck in RemoteTransportException loop again [00:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:38] RECOVERY - Disk space on elastic1029 is OK: DISK OK [00:31:58] RECOVERY - Disk space on elastic1040 is OK: DISK OK [00:47:18] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [00:59:48] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:26:48] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [01:33:55] (03CR) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [01:34:15] (03PS14) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) [01:35:23] (03PS1) 10RobH: adding icinga cert monitoring for *.corp.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/335179 [01:39:58] (03PS1) 10BryanDavis: Enable TorBlock for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335180 [01:48:07] (03CR) 10Chad: [C: 031] "It was disabled as a part of I3c0c1e625e764d99bd22e95ecaab9a4be34b7eeb which gave no rationale or task linked. Barring some legitimate rea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335180 (owner: 10BryanDavis) [02:11:48] PROBLEM - carbon-cache@e service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed [02:11:48] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:16:59] !log restart uwsgi-keystone-admin and uwsgi-keystone-public on labcontrol1001 [02:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:10] (03CR) 10Legoktm: "This needs a cronjob unless wikitech shares the cache space of the rest of the cluster? (I don't think it does...) 
See the mediawiki::main" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335180 (owner: 10BryanDavis) [02:22:14] legoktm: I can bandaid that cronjob on silver until we figure out the right way to apply it [02:22:29] ok, or we just run the maint script manually for now [02:22:36] looks fine otherwise [02:22:45] looks like that puppet role would need a little refactoring [02:23:48] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [02:24:48] RECOVERY - carbon-cache@e service on graphite1003 is OK: OK - carbon-cache@e is active [02:25:48] (03PS3) 10Aude: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329762 (owner: 10Matěj Suchánek) [02:26:52] (03CR) 10BryanDavis: [V: 032 C: 032] "nodepool is messed up so I'm jumping the queue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335180 (owner: 10BryanDavis) [02:27:55] (03CR) 10jenkins-bot: Enable TorBlock for Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335180 (owner: 10BryanDavis) [02:28:00] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.9) (duration: 07m 09s) [02:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:48] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Enable TorBlock for Wikitech (duration: 00m 41s) [02:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:02] (03CR) 10Aude: [C: 031] "this is ready. (scheduled for european swat tomorrow)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329762 (owner: 10Matěj Suchánek) [02:30:37] chasemp: context? 
[02:31:02] yuvipanda: just make a paste sec [02:31:06] !log Manually ran extensions/TorBlock/loadExitNodes.php on silver [02:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:04] yuvipanda: basically keystone keeps denying auth for connections [02:32:05] https://phabricator.wikimedia.org/P4840 [02:32:34] seems pretty consistent but I'm not sure at what layer it's choking, I enabled DEBUG logging in /etc/keystone/logging.conf [02:32:57] but so far it seems like keystone thinks it's being perfectly reasonable and I don't get where the misunderstanding is yet [02:33:12] I have no idea of anything keystone related at all [02:33:17] me neither [02:33:18] except for the fact that it runs a uwsgi service [02:33:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 31 02:33:22 UTC 2017 (duration 5m 23s) [02:33:24] right :) [02:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:34] chasemp: have you tried restarting it? 
:D [02:33:39] I did [02:33:45] ok [02:33:50] I'm out of ideas now :D [02:33:55] going to read logs properly [02:34:20] andrew had some idea it may be related to the password change but obv that was last Friday or so [02:34:31] chasemp: if novaobserver works, then this could be fallout from novaadmin password change [02:34:35] hah [02:34:39] :) [02:34:40] idk how to check if novaobserver works [02:37:46] chasemp: password seems to match between novaenv and keystone.conf [02:45:25] !log Setup temporary cron on silver as user bd808 until T156733 is fixed properly [02:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:30] T156733: Setup TorBlock cron on silver to update exit node list - https://phabricator.wikimedia.org/T156733 [02:47:08] (03CR) 10BryanDavis: "I opened T156733 for the needed cron job" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335180 (owner: 10BryanDavis) [02:58:08] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:20:11] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.108 second response time [03:20:18] PROBLEM - Labs LDAP on serpens is CRITICAL: Could not bind to the LDAP server [03:22:28] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 652.06 seconds [03:26:08] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [03:27:28] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 208.07 seconds [03:28:24] !log (slightly belated) set logging level on serpens higher to see if ldap binding is an issue [03:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:48] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed [03:47:48] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:49:38] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:49:58] !log restarted nova-api on labnet1001 which actually fixed some things [03:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:48] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active [03:54:48] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [03:55:58] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:58:06] (03PS1) 10Andrew Bogott: Revert "Keystone: Turn on caching of tokens and catalog" [puppet] - 10https://gerrit.wikimedia.org/r/335184 [03:58:41] (03CR) 10Rush: [C: 031] Revert "Keystone: Turn on caching of tokens and catalog" [puppet] - 10https://gerrit.wikimedia.org/r/335184 (owner: 10Andrew Bogott) [03:59:48] (03PS2) 10Rush: Revert "Keystone: Turn on caching of tokens and catalog" [puppet] - 10https://gerrit.wikimedia.org/r/335184 (owner: 10Andrew Bogott) [04:01:58] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [04:03:09] (03CR) 10Andrew Bogott: [V: 032 C: 032] "Forcing merge since this is implicated in CI breakage" [puppet] - 10https://gerrit.wikimedia.org/r/335184 (owner: 10Andrew Bogott) [04:17:38] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [04:25:48] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:55:48] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:44:08] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:12:08] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:16:48] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:44:48] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:28] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:56:04] (03PS5) 10Giuseppe Lavagetto: etcd: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334282 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [06:57:28] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:57:55] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#2984832 (10Marostegui) Thanks @jcrespo and @Papaul. 
The server looked good yesterday night when I checked it :-) [06:59:31] (03PS1) 10Marostegui: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335190 (https://phabricator.wikimedia.org/T156478) [07:06:42] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335190 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [07:08:52] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335190 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [07:09:00] (03CR) 10jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335190 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [07:10:11] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2034 - T156478 (duration: 00m 57s) [07:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:15] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [07:11:28] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:15:48] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:20:12] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.126 second response time [07:35:28] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [07:39:28] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:40:19] _joe_ --^ just checked and the "uptime" counter seems to not to be affected by graceful reload.. 
(for example, atm it is 3559616 seconds, or Server uptime: 41 days 4 hours 44 minutes 38 seconds) [07:40:50] there is a restart time date but not in the server-status?auto [07:40:52] sigh [07:43:48] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:44:30] ah no restart is Restart Time: Wednesday, 21-Dec-2016 02:48:54 UTC now that I checked [07:45:27] will revert my commit later on [07:46:18] RECOVERY - cassandra-b CQL 10.64.0.237:9042 on aqs1007 is OK: TCP OK - 0.000 second response time on 10.64.0.237 port 9042 [07:47:03] this one --^ has just finished the bootstrap [07:49:37] <_joe_> heh, ok, cool [07:49:45] !log Reboot db1072 to force BBU recharge - T156226 [07:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:49] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [07:53:38] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:54:05] !log installing chromium security update on osmium [07:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:11] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.387 second response time [08:13:18] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.110 seconds response time [08:18:08] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:21:38] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [08:22:17] 06Operations, 10DBA, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2984963 (10Marostegui) Yesterday I recloned db1072 but it has not been able to catch up with the master whereas db1073 (the server it was recloned from) had no problems. In order to test if the stora... [08:26:24] !log started Cassandra nodetool cleanup for aqs1004-a [08:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:18] 06Operations, 10ops-eqiad: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#2984977 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Cmjohnson [08:32:23] 06Operations, 10DBA, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2984980 (10Marostegui) Looks like forcing it to be WriteBack works (but it is bad anyways if the BBU is really broken): ``` root@db1072:~# megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll Set Wr... 
[08:35:24] (03PS1) 10Marostegui: db-eqiad.php: Add comment about bad BBU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335195 (https://phabricator.wikimedia.org/T156226) [08:37:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add comment about bad BBU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335195 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [08:39:00] (03Merged) 10jenkins-bot: db-eqiad.php: Add comment about bad BBU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335195 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [08:39:08] (03CR) 10jenkins-bot: db-eqiad.php: Add comment about bad BBU [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335195 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [08:40:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add a warning about a possible bad BBU on db1072 - T156226 (duration: 00m 46s) [08:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:37] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [08:44:58] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1007.eqiad.wmnet [08:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:08] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:51:08] 06Operations, 10ops-codfw, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2985018 (10elukey) [09:00:13] !log aligning elasticsearch low watermark to 75% disk space on all clusters (eqiad was at 70%) [09:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:38] (03PS1) 10Marostegui: db-eqiad.php: Repool hosts in C2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335198 (https://phabricator.wikimedia.org/T155999) [09:32:34] 06Operations, 10ops-eqiad, 
06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2985341 (10Gehel) [09:32:48] (03PS1) 10Gehel: wdqs - configure new wdqs1003 node [puppet] - 10https://gerrit.wikimedia.org/r/335201 (https://phabricator.wikimedia.org/T152643) [09:37:30] (03CR) 10Gehel: [C: 032] wdqs - configure new wdqs1003 node [puppet] - 10https://gerrit.wikimedia.org/r/335201 (https://phabricator.wikimedia.org/T152643) (owner: 10Gehel) [09:38:14] !log rolling restart of cassandra in codfw to pick up openjdk and NSS security updates [09:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool hosts in C2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335198 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [09:45:11] (03Merged) 10jenkins-bot: db-eqiad.php: Repool hosts in C2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335198 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [09:45:23] (03CR) 10jenkins-bot: db-eqiad.php: Repool hosts in C2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335198 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [09:46:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool hosts in C2 - T155999 (duration: 00m 40s) [09:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:50] T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999 [09:50:02] (03PS14) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [09:52:50] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Doesn't honour our puppet code organization standards; it should be trivial to fix though, see my inline comments." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [09:55:25] (03CR) 10Filippo Giunchedi: [C: 04-1] "Modulo what Joe said, I don't think crond belongs to this change and should be taken out. +1 otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [09:57:08] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [09:58:08] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.037 second response time on 10.192.16.162 port 9042 [09:58:48] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [09:59:38] (03CR) 10Filippo Giunchedi: [C: 031] Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [09:59:48] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042 [10:10:11] 06Operations, 10MediaWiki-ResourceLoader, 06Performance-Team, 10Traffic, 13Patch-For-Review: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657#2985761 (10Gilles) @bblack can you take a look at the Vagrant VCL patch above (very... [10:16:23] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2985781 (10Gilles) [10:18:03] (03CR) 10Elukey: "Quickly checked the code and found one thing that might be removed, but absolutely nothing blocking. 
Didn't get the time to check the whol" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [10:19:18] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:27:38] (03PS5) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [10:28:00] (03CR) 10Volans: "Thanks elukey!" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [10:30:51] (03PS10) 10Marostegui: Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [10:39:31] 06Operations, 10Traffic, 07HTTPS: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2985832 (10faidon) > I don't think gdnsd supports CAA records yet I checked and indeed it doesn't. I filed this in gdnsd's bug tracker as [[ https://github.com/gdnsd/gdnsd/issues/138 | bug #138 ]]. A... [10:42:35] !log starting reimage of wdqs1003 [10:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:31] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2855891 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1003.eqiad.wmnet'] ``` The log can be found in... [10:47:23] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:51:13] (03CR) 10Volans: [C: 04-1] "See the specific comments inline." 
(0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [10:58:08] 06Operations, 10ops-eqiad: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610#2981044 (10Volans) @elukey: FYI icinga downtime expired, I've set it to downtime for a week, just in case [11:00:42] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2985881 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['wdqs1003.eqiad.wmnet']) ``` [11:02:30] (03PS1) 10Gehel: wdqs - categorize wdqs1003 as a wdqs node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/335207 (https://phabricator.wikimedia.org/T152643) [11:04:08] (03PS2) 10Gehel: wdqs - categorize wdqs1003 as a wdqs node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/335207 (https://phabricator.wikimedia.org/T152643) [11:05:09] (03PS1) 10Elukey: Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) [11:05:30] (03CR) 10Gehel: [C: 032] wdqs - categorize wdqs1003 as a wdqs node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/335207 (https://phabricator.wikimedia.org/T152643) (owner: 10Gehel) [11:06:52] (03CR) 10Volans: [C: 04-1] "Naming convention comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [11:06:55] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2985899 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1003.eqiad.wmnet'] ``` The log can be found in... 
[11:08:18] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2985900 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1003.eqiad.wmnet'] ``` The log can be found in... [11:14:49] !log updating the puppet compiler's facts [11:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:03] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:33:06] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2985916 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1003.eqiad.wmnet'] ``` and were **ALL** successful. [11:37:59] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3978.80 Read Requests/Sec=2326.80 Write Requests/Sec=5.70 KBytes Read/Sec=15134.00 KBytes_Written/Sec=1256.00 [11:40:25] (03PS2) 10Elukey: Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) [11:49:45] (03PS3) 10Elukey: Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) [11:50:35] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/5285/ looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [11:52:08] (03PS4) 10Elukey: Extend role memcached to the new codfw mc hosts [puppet] - 10https://gerrit.wikimedia.org/r/335208 (https://phabricator.wikimedia.org/T155755) [11:54:59] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=154.20 Read Requests/Sec=157.70 Write Requests/Sec=137.20 KBytes Read/Sec=2711.60 KBytes_Written/Sec=4064.80 [11:55:09] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:59:59] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2985944 (10elukey) So a couple of remarks before starting: 1) Role memcached is currently applied to 18 hosts in eqiad and 16 hosts in codfw. mc2001 and mc2016 are runn... [12:00:04] addshore: Respected human, time to deploy TwoColConflict initial deployment to testwikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T1200). Please do the needful. [12:00:04] addshore: A patch you scheduled for TwoColConflict initial deployment to testwikis is about to be deployed. Please be available during the process. [12:03:49] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:05:16] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2953444 (10MoritzMuehlenhoff) I think it would be good to merge https://gerrit.wikimedia.org/r/#/c/319820/ first (possible with an additional hiera knob to only affect th... [12:06:31] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#2985948 (10elukey) @MoritzMuehlenhoff sure! 
[12:07:01] moritzm: about --^ - https://gerrit.wikimedia.org/r/#/c/335208 would need to wait too right? [12:10:07] for codfw it doesn't matter much, the problem is mostly limited to mc1 in eqiad, codfw is unused anyway, so shouldn't matter if we have an additional service reload [12:10:44] (03PS3) 10Addshore: Add twocolconflict to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) [12:10:46] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [12:11:04] (03PS5) 10Addshore: Enable TwoColConflict on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) [12:12:46] (03CR) 10Addshore: [C: 032] Add twocolconflict to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: 10Addshore) [12:13:13] (03PS5) 10Filippo Giunchedi: prometheus: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334306 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [12:13:57] (03Merged) 10jenkins-bot: Add twocolconflict to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: 10Addshore) [12:14:09] (03CR) 10jenkins-bot: Add twocolconflict to wgBetaFeaturesWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: 10Addshore) [12:14:23] (03PS1) 10Elukey: Require cassandra-wmf-tools and jvm-utils for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/335213 [12:14:34] moritzm: okok I thought you wanted to test it [12:15:41] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334306 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) 
[12:16:34] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:332904|Add twocolconflict to wgBetaFeaturesWhitelist (T150184)]] (duration: 00m 41s)
[12:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:38] T150184: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184
[12:18:48] (CR) Addshore: [C: 2] Enable TwoColConflict on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) (owner: Addshore)
[12:19:59] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1056.50 Read Requests/Sec=3045.40 Write Requests/Sec=5.00 KBytes Read/Sec=24598.00 KBytes_Written/Sec=174.40
[12:20:16] (Merged) jenkins-bot: Enable TwoColConflict on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) (owner: Addshore)
[12:20:24] (CR) jenkins-bot: Enable TwoColConflict on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/332908 (https://phabricator.wikimedia.org/T155716) (owner: Addshore)
[12:22:47] !log addshore@tin Synchronized wmf-config/extension-list: [[gerrit:332908|Enable TwoColConflict on test wikis (T155716)]] 1/5 (duration: 00m 40s)
[12:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:51] T155716: Deploy TwoColConflict extension to test-wikis - https://phabricator.wikimedia.org/T155716
[12:23:51] !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:332908|Enable TwoColConflict on test wikis (T155716)]] 2/5 (duration: 00m 40s)
[12:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:28] (CR) Elukey: "https://puppet-compiler.wmflabs.org/5287 looks good" [puppet] - https://gerrit.wikimedia.org/r/335213 (owner: Elukey)
[12:24:45] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:332908|Enable TwoColConflict on test wikis (T155716)]] 3/5 (duration: 00m 42s)
[12:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:40] !log addshore@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:332908|Enable TwoColConflict on test wikis (T155716)]] 4/5 (duration: 00m 40s)
[12:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:34] !log addshore@tin Synchronized wmf-config/CommonSettings-labs.php: [[gerrit:332908|Enable TwoColConflict on test wikis (T155716)]] 5/5 (duration: 00m 40s)
[12:26:38] !log TwoColConflict deploy slot done!
[12:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:42] (PS5) Ladsgroup: dumps: Modernize design of the index page [puppet] - https://gerrit.wikimedia.org/r/334856 (https://phabricator.wikimedia.org/T155697)
[12:31:49] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[12:35:59] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=144.90 Read Requests/Sec=163.00 Write Requests/Sec=0.90 KBytes Read/Sec=2762.80 KBytes_Written/Sec=15.60
[12:45:48] (PS1) Muehlenhoff: Remove access credentials for springle [puppet] - https://gerrit.wikimedia.org/r/335214
[12:49:27] (CR) Muehlenhoff: [C: 2] Remove access credentials for springle [puppet] - https://gerrit.wikimedia.org/r/335214 (owner: Muehlenhoff)
[12:55:46] (CR) Giuseppe Lavagetto: Use caller function module name as default log prefix (1 comment) [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334567 (owner: Ema)
[12:56:19] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[springle]
[12:57:21] (CR) Giuseppe Lavagetto: [C: -1] "If we can selectively call the inspect for errors only, I am ok with that. But this is done on every message, even debug ones, and this wi" [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334567 (owner: Ema)
[12:58:48] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2986030 (Addshore)
[12:58:50] Operations, Electron-PDFs, TCB-Team, Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2986029 (Addshore)
[13:02:19] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:03:44] db1011 puppet failed
[13:04:29] PROBLEM - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:04:59] PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused
[13:05:39] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:05:39] PROBLEM - cassandra-b service on restbase2005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[13:05:44] java.lang.OutOfMemoryError: Java heap space
[13:05:47] running puppet
[13:05:56] this must be the tombstone issue
[13:06:12] urandom: --^
[13:06:39] RECOVERY - Check systemd state on restbase2005 is OK: OK - running: The system is fully operational
[13:06:39] RECOVERY - cassandra-b service on restbase2005 is OK: OK - cassandra-b is active
[13:06:42] fixing db1011
[13:07:19] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[13:09:37] elukey: yeah :(
[13:09:39] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:09:39] PROBLEM - cassandra-b service on restbase2005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[13:10:39] RECOVERY - Check systemd state on restbase2005 is OK: OK - running: The system is fully operational
[13:10:39] RECOVERY - cassandra-b service on restbase2005 is OK: OK - cassandra-b is active
[13:12:39] PROBLEM - puppet last run on wtp1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:12:50] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review, Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2986079 (Paladox)
[13:12:59] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.138 and port 9042: Connection refused
[13:13:09] PROBLEM - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[13:13:09] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:13:39] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:13:39] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[13:13:39] PROBLEM - cassandra-b service on restbase2005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[13:14:39] is there a Phab task for this bug?
[13:14:39] RECOVERY - Check systemd state on restbase2005 is OK: OK - running: The system is fully operational
[13:14:39] RECOVERY - cassandra-b service on restbase2005 is OK: OK - cassandra-b is active
[13:15:35] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review, Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2986084 (Paladox)
[13:15:39] RECOVERY - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-b valid until 2017-09-12 15:35:35 +0000 (expires in 224 days)
[13:15:59] RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.47 port 9042
[13:17:15] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review, Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2824332 (Paladox)
[13:18:46] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review, Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2986089 (Paladox) This should hopefully be fixed in gerrit 2.14. Though I doint know if there will be permenant damage. We...
[13:19:39] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): User[springle]
[13:22:29] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[13:24:39] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active
[13:25:09] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational
[13:25:59] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.138 port 9042
[13:26:09] RECOVERY - cassandra-b SSL 10.192.32.138:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-b valid until 2017-09-12 15:35:25 +0000 (expires in 224 days)
[13:27:02] (PS1) Gehel: elasticsearch - ensure data directory with puppet [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328)
[13:29:44] Operations, ops-eqiad, Discovery, Wikidata, and 2 others: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2986127 (Gehel) Initial installation is complete, data import in progress...
[13:31:34] (CR) DCausse: elasticsearch - ensure data directory with puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328) (owner: Gehel)
[13:31:36] (PS2) Gehel: elasticsearch - ensure data directory with puppet [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328)
[13:32:19] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[13:32:29] sorry I was afk, back again
[13:33:11] moritzm: there is a very detailed description in https://phabricator.wikimedia.org/T144431
[13:33:45] (CR) Gehel: elasticsearch - ensure data directory with puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328) (owner: Gehel)
[13:34:53] elukey: thanks
[13:35:19] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:36:13] (CR) DCausse: elasticsearch - ensure data directory with puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328) (owner: Gehel)
[13:37:58] (CR) DCausse: [C: 1] elasticsearch - ensure data directory with puppet [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328) (owner: Gehel)
[13:39:24] (CR) Gehel: [C: 2] elasticsearch - ensure data directory with puppet [puppet] - https://gerrit.wikimedia.org/r/335218 (https://phabricator.wikimedia.org/T151328) (owner: Gehel)
[13:40:39] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[13:43:18] (PS1) Muehlenhoff: Also remove springle from Icinga config [puppet] - https://gerrit.wikimedia.org/r/335226
[13:45:19] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[13:46:09] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:48:02] (PS1) Jcrespo: sanitarium2: Enable TLS, disable Toku-specific config [puppet] - https://gerrit.wikimedia.org/r/335227 (https://phabricator.wikimedia.org/T111654)
[13:48:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[13:48:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[13:50:09] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:50:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[13:50:19] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 42 probes of 402 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:50:29] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[13:50:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:51:18] (PS2) Jcrespo: sanitarium2: Enable TLS, disable Toku-specific config [puppet] - https://gerrit.wikimedia.org/r/335227 (https://phabricator.wikimedia.org/T111654)
[13:52:00] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review, Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2986228 (hashar)
[13:52:29] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:53:26] jouncebot: next
[13:53:27] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T1400)
[13:53:29] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:54:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[13:55:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 402 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:55:29] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 260 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[13:56:29] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:58:09] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:58:29] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:58:59] !log rebooted analytics1039 to pick up uuids in fstab - T147879
[13:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:06] T147879: Audit fstabs on Kafka and Hadoop nodes to use UUIDs instead of /dev paths - https://phabricator.wikimedia.org/T147879
[14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T1400).
[14:00:04] aude: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[14:00:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:00:12] * aude-wiki can do swat
[14:00:22] [=
[14:00:32] might be only wikidata stuff
[14:00:40] (CR) Aude: [C: 2] Update Wikidata property blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/329762 (owner: Matěj Suchánek)
[14:00:59] aude-wiki: yep, looks like it is your 1 patch only!
[14:00:59] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:02:16] (Merged) jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/329762 (owner: Matěj Suchánek)
[14:02:25] (CR) jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/329762 (owner: Matěj Suchánek)
[14:05:07] (CR) Muehlenhoff: [C: 2] Also remove springle from Icinga config [puppet] - https://gerrit.wikimedia.org/r/335226 (owner: Muehlenhoff)
[14:05:12] (PS2) Muehlenhoff: Also remove springle from Icinga config [puppet] - https://gerrit.wikimedia.org/r/335226
[14:07:25] !log aude@tin Synchronized wmf-config/Wikibase.php: Update property suggester config (duration: 00m 42s)
[14:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:17] * aude-wiki checks (already checked on mwdebug1001)
[14:08:35] looks good
[14:09:02] (PS11) Marostegui: Reporting tests with the private data script [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680)
[14:09:16] (CR) Marostegui: Reporting tests with the private data script (13 comments) [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: Marostegui)
[14:09:39] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:20] checking ---^
[14:10:29] PROBLEM - Nginx local proxy to apache on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:29] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:12:44] !log restarting hhvm on mw1204 (dump debug in /tmp/hhvm.29120.bt)
[14:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:29] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.068 second response time
[14:13:54] (PS12) Marostegui: Reporting tests with the private data script [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680)
[14:13:58] (CR) Marostegui: "This compiles fine: https://puppet-compiler.wmflabs.org/5289/" [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: Marostegui)
[14:14:19] RECOVERY - Nginx local proxy to apache on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.036 second response time
[14:14:19] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 72596 bytes in 0.177 second response time
[14:18:36] (CR) Marostegui: [C: 1] sanitarium2: Enable TLS, disable Toku-specific config [puppet] - https://gerrit.wikimedia.org/r/335227 (https://phabricator.wikimedia.org/T111654) (owner: Jcrespo)
[14:24:30] (PS3) Jcrespo: sanitarium2: Enable TLS, disable Toku-specific config [puppet] - https://gerrit.wikimedia.org/r/335227 (https://phabricator.wikimedia.org/T111654)
[14:24:57] hhvm got stuck on mw1204 for __lll_lock_wait
[14:25:43] I am wondering if we have ever opened a github issue to upstream
[14:27:36] Operations, HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1048540 (elukey) Recurrences still happen from time to time, last one on mw1204 today.
[14:28:10] ah no this might not be the one, it references stat cache
[14:29:00] (PS2) Filippo Giunchedi: prometheus: add aggregation rules for apache and hhvm [puppet] - https://gerrit.wikimedia.org/r/334662
[14:29:05] mmm there are a couple of threads referencing it though
[14:31:04] (CR) Jcrespo: [C: 2] sanitarium2: Enable TLS, disable Toku-specific config [puppet] - https://gerrit.wikimedia.org/r/335227 (https://phabricator.wikimedia.org/T111654) (owner: Jcrespo)
[14:53:08] (CR) Volans: [C: -1] "Sorry, still some issues, see inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: Marostegui)
[14:53:34] (CR) Ottomata: Add JMX port 9986 to the MapReduce History process (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) (owner: Elukey)
[14:53:57] (CR) Eevans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/335213 (owner: Elukey)
[14:54:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 266 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[14:55:24] now also ulsfo?
[14:55:34] (CR) Elukey: Add JMX port 9986 to the MapReduce History process (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) (owner: Elukey)
[14:56:28] (CR) Muehlenhoff: Cumin: allow connection to the targets (6 comments) [puppet] - https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[14:57:04] !log upgrading and restarting db1095 (sanitarium2)
[14:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:19] moritzm: thanks, something strange happened with the last rebase, let me fix it
[15:00:07] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:00:58] (PS4) Elukey: Add JMX port 9986 to the MapReduce History process [puppet/cdh] - https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272)
[15:01:14] (PS5) Elukey: Add JMX port 9986 to the MapReduce History process [puppet/cdh] - https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272)
[15:04:13] (PS1) BBlack: TLS settings for public exim4 [puppet] - https://gerrit.wikimedia.org/r/335232
[15:07:52] (PS2) BBlack: TLS settings for public exim4 [puppet] - https://gerrit.wikimedia.org/r/335232
[15:09:37] (PS1) Jcrespo: mariadb: Add TLS support for tendril [puppet] - https://gerrit.wikimedia.org/r/335233 (https://phabricator.wikimedia.org/T111654)
[15:09:59] Operations, Gerrit, Beta-Cluster-reproducible, Patch-For-Review, Upstream: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2986472 (Paladox) @hashar hi this task T153079 has nothing to do with the task here as the problem there was that the branc...
[15:11:17] (PS2) Jcrespo: mariadb: Add TLS support for tendril [puppet] - https://gerrit.wikimedia.org/r/335233 (https://phabricator.wikimedia.org/T111654)
[15:11:35] (PS15) Volans: Cumin: allow connection to the targets [puppet] - https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[15:14:11] (PS16) Volans: Cumin: allow connection to the targets [puppet] - https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[15:14:52] (PS3) BBlack: TLS settings for public exim4 [puppet] - https://gerrit.wikimedia.org/r/335232
[15:16:49] (CR) BBlack: "Compiler output looks ok: https://puppet-compiler.wmflabs.org/5291/" [puppet] - https://gerrit.wikimedia.org/r/335232 (owner: BBlack)
[15:18:24] (CR) Ottomata: [C: 1] Add JMX port 9986 to the MapReduce History process [puppet/cdh] - https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) (owner: Elukey)
[15:21:31] (PS1) Gehel: elasticsearch - deploy the same administration scripts on Jessie and Trusty [puppet] - https://gerrit.wikimedia.org/r/335236 (https://phabricator.wikimedia.org/T151326)
[15:21:33] (PS1) Ottomata: Temporarily increase refined webrequest retention to 90 days [puppet] - https://gerrit.wikimedia.org/r/335237 (https://phabricator.wikimedia.org/T155141)
[15:23:22] (CR) Gehel: elasticsearch - deploy the same administration scripts on Jessie and Trusty (1 comment) [puppet] - https://gerrit.wikimedia.org/r/335236 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[15:24:50] (CR) Ottomata: [C: 2] Temporarily increase refined webrequest retention to 90 days [puppet] - https://gerrit.wikimedia.org/r/335237 (https://phabricator.wikimedia.org/T155141) (owner: Ottomata)
[15:27:07] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[15:32:13] (CR) Muehlenhoff: [C: 1] "That's pretty much my earlier patches https://gerrit.wikimedia.org/r/#/c/314518/ and https://gerrit.wikimedia.org/r/#/c/314695/ :-) So +1 " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/335236 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[15:33:22] (CR) Gehel: "@moritz: yep, I'm shamelessly stealing your work :)" [puppet] - https://gerrit.wikimedia.org/r/335236 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[15:34:09] (PS2) Gehel: elasticsearch - deploy the same administration scripts on Jessie and Trusty [puppet] - https://gerrit.wikimedia.org/r/335236 (https://phabricator.wikimedia.org/T151326)
[15:34:39] (Abandoned) Muehlenhoff: elasticsearch: Extend version check to also apply to jessie [puppet] - https://gerrit.wikimedia.org/r/314518 (owner: Muehlenhoff)
[15:35:21] (Abandoned) Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/333930 (owner: Marostegui)
[15:35:57] (CR) Gehel: [C: 2] elasticsearch - deploy the same administration scripts on Jessie and Trusty [puppet] - https://gerrit.wikimedia.org/r/335236 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[15:39:42] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2986610 (Gehel) We now have deployment-elastic08 running on jessie and part of the deployment-prep elasticsearch cluster. Cluster is green and no...
[15:39:54] Operations, DBA: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2986612 (jcrespo) a:jcrespo
[15:46:43] (PS17) Volans: Cumin: allow connection to the targets [puppet] - https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588)
[15:49:00] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2986697 (Fjalapeno)
[15:56:43] (CR) Volans: "Puppet compiler for the last review: https://puppet-compiler.wmflabs.org/5293/" (7 comments) [puppet] - https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[15:56:51] moritzm: ^^^ ;)
[15:58:30] sorry about the rebase issues
[15:59:54] volans: I'll have a look tomorrow morning, currently on something else
[16:00:21] no problem, just to let you know
[16:03:52] !log started Cassandra nodetool cleanup for aqs1004-b
[16:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:33] !log started Cassandra nodetool cleanup for aqs1007-a
[16:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:48] (PS1) Giuseppe Lavagetto: Initial commit of etcd2-mirror [software/etcd-mirror] - https://gerrit.wikimedia.org/r/335246
[16:30:07] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 252340.63 seconds
[16:30:07] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 219772.63 seconds
[16:30:17] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320786.97 seconds
[16:30:18] PROBLEM - MariaDB Slave Lag: m3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 484171.09 seconds
[16:30:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 392651.74 seconds
[16:30:27] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180613.78 seconds
[16:30:27] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 266772.14 seconds
[16:39:11] I will downtime that
[16:41:31] yeah, this is my fault, timed out too soon
[16:41:44] I was doing archeology on dbstore1001
[16:42:18] oh, moving it to the new master?
[16:43:10] (PS2) RobH: adding icinga cert monitoring for *.corp.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/335179
[16:43:12] (PS13) Marostegui: Reporting tests with the private data script [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680)
[16:43:20] (CR) RobH: [C: 2] adding icinga cert monitoring for *.corp.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/335179 (owner: RobH)
[16:43:40] (CR) Marostegui: Reporting tests with the private data script (4 comments) [puppet] - https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: Marostegui)
[16:46:32] bleh, i just trusted our icinga config to break my first rule of icinga config changes. 1: run an icinga config check on the config before you merge your change, since puppet/icinga cannot be trusted to alert when it fails to reload icinga properly.
[16:46:51] lets hope whoever touched it last left it in a good....... nope
[16:46:57] either my change broke it or it was in a bad state, lame.
[16:48:18] robh: we have an alarm for that
[16:48:27] volans: folks say that
[16:48:34] but we had it fail to alarm not last week iirc?
[16:48:37] its in a failed state now
[16:48:41] so we should see an alarm =]
[16:48:56] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=einsteinium&service=Check+correctness+of+the+icinga+configuration
[16:49:12] last week it failed and the alarm went off
[16:49:17] oh
[16:49:25] twentyafterfour: What time does the deployment branch usually get cut?
[16:49:28] why isnt it echoing in here? that should be a larger failure (maybe not page)
[16:49:34] but echo in here...
[16:49:44] volans: that page doesnt show anything to me but the host
[16:49:48] not any checks
[16:50:10] robh: maybe splitted link due to IRC? and IIRC it does the check ever 10 minutes
[16:50:26] if you run sudo icinga -v /etc/icinga/icinga.cfg you can see the failures
[16:50:33] ahh
[16:50:38] i can see the check on icinga now
[16:50:43] ok, every 10 minutes then isnt bad.
[16:51:08] yeah its my change
[16:51:24] volans: thanks for pointing out the check
[16:51:28] its appreciated
[16:51:43] yw :)
[16:51:58] im just trying to add in another cert to monitor, so i'll see what i borked up.
[16:52:43] Operations, DBA: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2987013 (jcrespo) stalled>Resolved a:jcrespo>Marostegui I chhanged the master of dbstore1001. Resolving now, but let's monitor dbstore1001 to make sure nothing broke (because its delayed rep...
[16:52:45] Operations, DBA, Labs, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2987016 (jcrespo)
[16:55:27] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:56:07] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors
[16:56:14] Operations, MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#2987019 (jcrespo)
[16:56:18] Operations, DBA, Labs, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2987020 (jcrespo)
[16:56:50] (PS4) Muehlenhoff: Provide a systemd override unit for memcached [puppet] - https://gerrit.wikimedia.org/r/319820
[16:56:59] robh: there you go ^^^ :)
[16:57:06] wooooooo
[16:57:15] still not sure why it hates my addition
[16:57:24] since the output says 'i hate this' and directs me right to what i added
[16:57:28] but what i added seems like it should work.
[16:58:11] * robh ponders https://gerrit.wikimedia.org/r/#/c/335179/
[16:58:56] Operations, DBA, MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2987023 (jcrespo) Open>Resolved a:Marostegui This is resolved, leaving T156226 open for pending issues, non-related. ``` SELECT ct_rc_id, ct_tag...
[16:59:00] Operations, DBA, Labs, netops, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2987028 (jcrespo)
[16:59:00] !log disabled puppet on einsteinium while i try to figure out what i broke in my config for icinga
[16:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T1700).
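Editorial note on the exchange above: the rule stated at 16:46:32 (verify the Icinga config before merging a change) could be scripted roughly as follows. This is only a sketch. The command is the one quoted at 16:50:26; the function name `config_check_passes` is hypothetical, and the assumption that the checker exits non-zero on an invalid config is ours, not something confirmed in the channel.

```python
import subprocess

# Command quoted in the channel at 16:50:26. Running it for real needs
# sudo on the Icinga host; it is only the default argument here.
ICINGA_CHECK = ["icinga", "-v", "/etc/icinga/icinga.cfg"]


def config_check_passes(cmd=ICINGA_CHECK):
    """Return True only if the verification command exits with status 0.

    Assumption: the checker reports an invalid config through a
    non-zero exit status.
    """
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
    except FileNotFoundError:
        # Checker binary is not installed here: treat as "not verified".
        return False
    return result.returncode == 0
```

A pre-merge hook could call `config_check_passes()` and refuse to continue when it returns False, rather than relying on the periodic (every 10 minutes) Icinga check to notice the breakage afterwards.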
[17:00:18] no patches [17:01:31] (03CR) 10Gehel: [C: 031] Update elasticsearch module for es5 compatability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [17:05:06] 06Operations, 10MediaWiki-Database, 10MediaWiki-General-or-Unknown, 10Wikimedia-General-or-Unknown: 504 Gateway Time-out on https://de.wikipedia.org/w/index.php?title=Wikipedia:L%C3%B6schkandidaten&action=info - https://phabricator.wikimedia.org/T156537#2987035 (10jcrespo) Adding @daniel, -not expecting to... [17:06:15] I am going to put down tendril and dbtree for some minutes [17:06:22] unless someone has some objection [17:06:24] (03PS2) 10Gehel: elasticsearch - reimage elasticsearch relforge servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) [17:06:39] (03PS1) 10RobH: Revert "adding icinga cert monitoring for *.corp.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/335253 [17:06:44] this will allow backupd upgrade of software and kernel [17:06:51] *backend [17:06:59] I will log when I start [17:07:14] but I am giving a warning in advance [17:07:24] (03CR) 10RobH: [C: 032] Revert "adding icinga cert monitoring for *.corp.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/335253 (owner: 10RobH) [17:16:43] (03CR) 10Jcrespo: [C: 032] mariadb: Add TLS support for tendril [puppet] - 10https://gerrit.wikimedia.org/r/335233 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [17:16:49] (03PS3) 10Jcrespo: mariadb: Add TLS support for tendril [puppet] - 10https://gerrit.wikimedia.org/r/335233 (https://phabricator.wikimedia.org/T111654) [17:23:27] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:26:07] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [17:30:28] (03CR) 10Gehel: elasticsearch - reimage 
elasticsearch relforge servers to jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:32:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Comments inline, overall looks quite good to me." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/334435 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [17:37:10] !log stopping mysql, upgrading and restarting db1011- temporary outage of tendril & dbtree T111654 [17:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:15] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [17:42:40] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2987139 (10Jgreen) >>! In T137161#2972789, @BBlack wrote: > What about benefactorevents / eventdonations? These are hosted externally by our contrac... [17:42:54] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2987140 (10Jgreen) a:03Jgreen [17:47:09] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2987163 (10Gehel) 05Open>03Resolved @Paladox I'm not sure I understand what you mean with the... 
[17:50:16] (03CR) 10Gehel: elasticsearch - reimage elasticsearch relforge servers to jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:51:45] (03CR) 10Muehlenhoff: "jessie is the default since a few months now, you can simply remove this" [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:51:55] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2987190 (10debt) 05Resolved>03Open [17:54:41] (03CR) 10Faidon Liambotis: [C: 031] "LGTM superficially, but note that unlike the rest of the software we're using ssl_ciphersuite with, exim4 in Debian is built with GnuTLS. " [puppet] - 10https://gerrit.wikimedia.org/r/335232 (owner: 10BBlack) [17:54:58] (03PS3) 10Gehel: elasticsearch - reimage elasticsearch relforge servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) [17:55:17] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:58:11] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2987211 (10Jgreen) @CCogdill_WMF we'd like to make some adjustments to the benefactorevents and eventdonations webserver config to bring them inline... [18:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T1800). Please do the needful. 
[18:06:48] !log end up tendril and dbtree maintenance, things should be back up, report if you see degradations of service [18:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:17] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:27:17] hi yall, i [18:27:20] hi yall [18:27:24] i'd like to get a mediawiki-config change out [18:27:33] (03PS2) 10Ottomata: Enable eventbus RCFeed in production and deployment-prep beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334389 (https://phabricator.wikimedia.org/T152030) [18:27:37] this one ^ :) [18:27:43] Krinkle: if you are there, you can help maybe? [18:28:02] otherwise, who should I ask? I can read on wikitech how to do this (merge + scap sync-file, right?) but i'd rather have someone walk me through it [18:28:06] i've maybe done this once before years ago [18:28:28] ottomata, I can help [18:28:34] I do 10 of those a day [18:28:36] awesome [18:28:38] thanks jynus [18:28:52] let me know which change it is [18:28:57] https://gerrit.wikimedia.org/r/#/c/334389/2 [18:29:58] can that be tested on a single host beforehand? 
[18:30:15] if yes, merge [18:30:24] go to /srv/mediawiki-staging [18:30:30] ja sure it can :) [18:30:34] that would be great [18:30:44] rebase your change [18:30:51] then go to mwdebug [18:30:55] pull [18:30:58] test the change [18:31:04] merge and then fetch && rebase origin/master [18:31:09] great [18:31:09] ok [18:31:12] then scap file-sync [18:31:20] (03CR) 10Ottomata: [C: 032] Enable eventbus RCFeed in production and deployment-prep beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334389 (https://phabricator.wikimedia.org/T152030) (owner: 10Ottomata) [18:31:33] (03CR) 10jenkins-bot: Enable eventbus RCFeed in production and deployment-prep beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334389 (https://phabricator.wikimedia.org/T152030) (owner: 10Ottomata) [18:32:03] jynus: mwdebug? [18:32:11] it is the old canaries [18:32:14] the host names are [18:32:25] I think mwdebug1001.eqiad.wmnet [18:32:44] were recently setup, but things are sent there when on debug mode [18:32:45] cd /srv/mediawiki/wmf-config [18:32:46] and git pull? [18:32:49] no no [18:32:55] wait [18:32:57] k [18:33:05] on mwdebug? [18:33:06] yes [18:33:13] it doesnt matter the current dir there [18:33:20] it is just a script [18:33:23] oh [18:33:25] scap pull [18:33:32] ok, doing [18:33:33] but make sure you rebase on tin first [18:33:40] ja that's done [18:33:56] and either test the functionality or for errors on kibana [18:34:25] I do not know exactly what that does, so that is up to you on how to test [18:34:38] normally, the main thing is at least no full outage :-) [18:34:42] ok that worked, so how do I get requests here? test.wikipedia.org? [18:34:48] no [18:35:02] there is an extension you can use to make life easier [18:35:07] but otherwise [18:35:14] ?debug=true [18:35:14] ? 
[18:35:29] no, I think it is a header [18:35:35] ah, found it [18:35:36] let me find it on wikitech [18:35:39] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Pre-deployment_testing_in_production [18:35:55] yes, I do it with curl [18:36:06] X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet [18:36:06] ok [18:36:16] i need to make an edit or somethign,m but i have a browser extension that will let me set req headers [18:36:29] probably easier [18:36:48] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [18:36:50] ^ [18:36:55] much easier [18:36:57] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:38:19] also looking of errors on: https://logstash.wikimedia.org/app/kibana#/discover/DBQuery?_g=%28refreshInterval:%28display:Off,pause:!f,value:0%29,time:%28from:now-24h,mode:quick,to:now%29%29&_a=%28columns:!%28_source%29,filters:!%28%29,index:%27logstash-*%27,interval:auto,query:%28query_string:%28analyze_wildcard:!t,query:%27host:mwdebug1001%27%29%29,sort:!%28%27@timestamp%27,desc%29%29 [18:38:20] !log arlolra@tin Started deploy [parsoid/deploy@dc2323d]: Updating Parsoid to 734dc996 [18:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:35] when you are done testing [18:39:08] OOoo lookin ggood jynus [18:39:19] just cd /srv/mediawiki-staging scap sync-file wmf-config/your_file "message" [18:39:27] there is a && missing there [18:40:01] and then being around/checking logs to make sure everthing is ok [18:40:12] on tin, of course [18:40:29] hope nothing changed on scap3 [18:40:53] ok cool [18:41:00] checking a couple of other things.. [18:41:43] awesome, working great [18:41:45] ok [18:41:59] PM when less busy/finished [18:42:18] *me [18:42:19] jynus: real quick, this is two files, can I list them both? 
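
Annotation: the header quoted above, `X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet`, is what routes a request to the debug host. The following is a minimal sketch of building such a request without sending it. The function names are hypothetical, and the URL is only a sample taken from elsewhere in this log.

```python
from urllib.request import Request

# Debug host named in the log.
DEBUG_BACKEND = "mwdebug1001.eqiad.wmnet"


def debug_headers(backend=DEBUG_BACKEND):
    """Return the request header that selects a debug backend."""
    return {"X-Wikimedia-Debug": f"backend={backend}"}


def debug_request(url, backend=DEBUG_BACKEND):
    """Build (but do not send) a request carrying the debug header."""
    return Request(url, headers=debug_headers(backend))


req = debug_request("https://test.wikipedia.org/wiki/Special:Version")
print(req.full_url)
print(debug_headers())
```

Passing the result to `urllib.request.urlopen` would perform the actual request. The browser extension mentioned in the log sets the same header interactively.
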
[18:42:33] oh, good question [18:42:37] never tried [18:42:51] twentyafterfour: again about to add a minor CN update to the train... [18:42:55] well, i can do the one then the other [18:42:58] they don't conflict [18:43:45] no, file is for a single file [18:43:55] syncing the -labs one first, it'll be a no-op in prod anyway [18:44:18] otherwise, you can sync the whole dir [18:44:19] !log otto@tin Synchronized wmf-config/CommonSettings-labs.php: Enabling RCFeed -> EventBus (duration: 00m 43s) [18:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:36] which I think that is new with scap3 [18:44:44] uses file, and no dir [18:44:52] aye, i think i saw that email too [18:45:08] !log otto@tin Synchronized wmf-config/CommonSettings.php: Enabling RCFeed -> EventBus (duration: 00m 42s) [18:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:20] yeahhhhhh [18:45:21] it works! [18:45:22] thanks jynus [18:45:26] good [18:46:18] !log recentchange events now flowing into Kafka via EventBus T152030 [18:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:23] T152030: RecentChanges in Kafka - https://phabricator.wikimedia.org/T152030 [18:47:56] (03PS3) 10Gehel: elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) [18:48:11] (03CR) 10Gehel: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) (owner: 10Gehel) [18:48:48] kaldari: AndyRussG: cutting the branch now [18:49:00] cool, thanks for the notice :) [18:49:24] twentyafterfour: one sec........? [18:49:31] as for what time that usually happens, when I'm doing it, then it's usually around this time.... [18:49:34] AndyRussG: ok sure thing [18:49:44] AndyRussG: just let me know when you're ready? 
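
Annotation: as discussed above, `scap sync-file` takes a single file, so the two config files were synced one after the other. The following is a minimal sketch that generates one command per file, using the staging directory, file names and log message that appear in this log. The helper name is hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex

# Staging directory named in the log.
STAGING_DIR = "/srv/mediawiki-staging"


def sync_file_commands(files, message, staging_dir=STAGING_DIR):
    """Return one shell command string per file, each with the `&&` included."""
    return [
        f"cd {shlex.quote(staging_dir)} && "
        + shlex.join(["scap", "sync-file", path, message])
        for path in files
    ]


commands = sync_file_commands(
    ["wmf-config/CommonSettings-labs.php", "wmf-config/CommonSettings.php"],
    "Enabling RCFeed -> EventBus",
)
for cmd in commands:
    print(cmd)
```

Quoting the message with `shlex` keeps characters such as `>` from being interpreted by the shell.
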
[18:50:00] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/335263 [18:50:06] Just as soon as that merges! ^ [18:50:11] Thanks and apologies for lastminuteness!!!! [18:51:18] !log arlolra@tin Finished deploy [parsoid/deploy@dc2323d]: Updating Parsoid to 734dc996 (duration: 12m 58s) [18:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:31] twentyafterfour: merge! The CN deploy branch should now go to 7c0c4932dc410d9809cfc22725b3985639f961a2 [18:54:38] merged, I meant [18:57:16] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2987415 (10Cmjohnson) I received the servers racked them to even out the rows A6 1 -wmf7011 ge-6/0/23 B4 2 - wmf7012/7013 ge-4/0/27 and 4/0/31 C7... [18:57:30] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2987417 (10Cmjohnson) [18:58:31] (03PS1) 10Smalyshev: Add most frequently used aliases for filetype: search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335265 (https://phabricator.wikimedia.org/T156413) [18:58:46] !log Updated Parsoid to version 734dc996 (T98960) [18:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:50] T98960: Parsoid should be able to understand HTML entities in links - https://phabricator.wikimedia.org/T98960 [19:00:04] Deploy window Changed: No SWAT window at this time on Tuesdays going forward (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T1900) [19:01:53] :) [19:02:03] I'll remove that the next time I make the calendar [19:04:35] jouncebot the unreminder [19:04:56] or the reminder of nothingness? 
[19:05:37] (03CR) 10EBernhardson: [C: 031] Add most frequently used aliases for filetype: search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335265 (https://phabricator.wikimedia.org/T156413) (owner: 10Smalyshev) [19:05:50] the reminder of nothingness :) [19:05:57] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:06:44] * AndyRussG is silent [19:36:07] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1114.30 Read Requests/Sec=3167.90 Write Requests/Sec=11.20 KBytes Read/Sec=20670.00 KBytes_Written/Sec=271.20 [19:39:23] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2987580 (10Ottomata) [19:39:57] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:40:07] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3789.90 Read Requests/Sec=2034.50 Write Requests/Sec=31.10 KBytes Read/Sec=16674.40 KBytes_Written/Sec=185.60 [19:40:09] anyone know what "The Program Dashboard Rails application" is ? [19:40:21] and where it runs [19:40:21] context? [19:40:39] https://gerrit.wikimedia.org/r/#/c/334308/4/modules/programdashboard/manifests/app.pp [19:40:43] I'm assuming Education Program [19:40:49] this module does not seem to be used in prod [19:40:59] marxarelli: ^^ [19:41:01] so i can't just enter a server name to test the change [19:41:26] then wanted to check if it's active but in Labs or just not active anymore or.. [19:44:31] 06Operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#2987587 (10Ottomata) @Marostegui ok! So the T125135 auto-increment thing is a very small piece of this larger issue. Let's see if we can hammer out a way to use regular MySQL replication... 
[19:44:45] greg-g: thanks. maybe it's this https://outreachdashboard.wmflabs.org/ [19:45:37] twentyafterfour: in about 5 min I have to be away from the keyboard. The CentralNotice change is about the as trivial and low-risk as you could imagine... In any case, if there's any CentralNotice related questions from the train deploy, pls ping ejegg (Elliot) on #wikimedia-fundraising... Thanks!!! [19:49:05] (03PS5) 10Dzahn: planet/pmacct/programdashboard/pybal lint changes [puppet] - 10https://gerrit.wikimedia.org/r/334308 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:49:13] (03CR) 10Dzahn: [C: 032] planet/pmacct/programdashboard/pybal lint changes [puppet] - 10https://gerrit.wikimedia.org/r/334308 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:49:37] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 266 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:49:41] mutante: i'm not sure whether it's being used as i haven't been involved in the project for a year now [19:50:14] I guess ask the education team? /me shrugs [19:50:35] marxarelli: alright, thanks, not going to remove it or anything. [19:50:40] mutante: check with awight or ragesoss or someone in #wikimedia-ed [19:51:04] ok, thanks [19:51:07] np [19:52:07] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=133.70 Read Requests/Sec=127.70 Write Requests/Sec=98.50 KBytes Read/Sec=1574.00 KBytes_Written/Sec=982.40 [19:54:57] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:47] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T2000). 
[20:01:22] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987618 (10Ottomata) [20:02:20] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987634 (10Ottomata) [20:02:41] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2987618 (10Ottomata) a:05jcrespo>03None [20:10:07] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:18:21] (03PS1) 10Ottomata: Revert some of the changes last week to datasets.wm.org - we will cleanup at analytics.wm.org instead [puppet] - 10https://gerrit.wikimedia.org/r/335273 (https://phabricator.wikimedia.org/T125854) [20:19:20] (03CR) 10jerkins-bot: [V: 04-1] Revert some of the changes last week to datasets.wm.org - we will cleanup at analytics.wm.org instead [puppet] - 10https://gerrit.wikimedia.org/r/335273 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [20:20:31] (03PS2) 10Ottomata: Revert some of the changes last week to datasets.wm.org - we will cleanup at analytics.wm.org instead [puppet] - 10https://gerrit.wikimedia.org/r/335273 (https://phabricator.wikimedia.org/T125854) [20:22:48] (03CR) 10Ottomata: [C: 032] Revert some of the changes last week to datasets.wm.org - we will cleanup at analytics.wm.org instead [puppet] - 10https://gerrit.wikimedia.org/r/335273 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [20:23:48] !install1001 - reboot to add second disk [20:32:06] !log stopping db1063 mariadb before full host reimage [20:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:22] I think that programdashboard module was part of the preliminary work from Wikimania 2015 to prepare the education dashboard to run on WMF production, but I it's 
not currently being used for outreachdashboard (which is just done as a manually configured instance on labs) [20:34:18] but yes, the application that refers to is the Rails app that runs outreachdashboard.wmflabs.org: https://github.com/WikiEducationFoundation/WikiEduDashboard [20:37:32] !log syncing 1.29.0-wmf.10 to test wikis [20:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:56] !log twentyafterfour@tin Started scap: (no justification provided) [20:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:48] mediawiki train right now twentyafterfour ? [20:39:37] TabbyCat: yes [20:39:45] * TabbyCat likes [20:39:52] jouncebot: now [20:39:52] For the next 1 hour(s) and 20 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170131T2000) [20:40:02] jouncebot: next [20:40:02] In 3 hour(s) and 19 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170201T0000) [20:46:07] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:46:18] (03CR) 10Ottomata: Add hardsync shell script (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/334435 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [20:46:47] (03PS2) 10Ottomata: Add hardsync shell script [puppet] - 10https://gerrit.wikimedia.org/r/334435 (https://phabricator.wikimedia.org/T125854) [20:48:05] (03PS3) 10Ottomata: Add hardsync shell script [puppet] - 10https://gerrit.wikimedia.org/r/334435 (https://phabricator.wikimedia.org/T125854) [20:49:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [20:49:52] (03CR) 10Ottomata: [C: 032] Add hardsync shell script [puppet] - 10https://gerrit.wikimedia.org/r/334435 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [21:02:06] 06Operations, 06Collaboration-Team-Triage, 10Flow, 10MediaWiki-Redirects, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#2987730 (10Jdlrobson) [21:05:12] !log twentyafterfour@tin Finished scap: (no justification provided) (duration: 27m 16s) [21:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:47] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:05] hmm... 
[21:09:07] https://test.wikipedia.org/wiki/Special:Version [21:11:05] no ExtensionMessages-1.29.0-wmf.10.php [21:11:45] !log wmf-config/ExtensionMessages-1.29.0-wmf.10.php is missing refs T155525 [21:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:48] T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525 [21:12:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [21:14:03] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in Fundraising HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2987772 (10CCogdill_WMF) Thanks for the ping @Jgreen, we're looking at it. [21:14:07] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:14:20] * twentyafterfour isn't sure why `scap sync` didn't create extensionmessages for wmf.10 [21:15:25] PHP fatal error: [21:15:25] File not found: /srv/mediawiki/php-1.29.0-wmf.10/../wmf-config/ExtensionMessages-1.29.0-wmf.10.php [21:15:27] testwiki [21:15:44] Krinkle: yeah, noticed [21:15:51] running `scap l10n-update` seems to be doing the trick [21:15:53] (03PS1) 10Yuvipanda: docker: Gently wade into new coding guidelines [puppet] - 10https://gerrit.wikimedia.org/r/335278 [21:19:40] twentyafterfour: mediawiki.org still on wmf.9 is on purpose ? [21:19:59] yeah, wmf.10 isn't ready yet [21:20:17] scap didn't create localization cache for wmf.10 so it's broken on testwiki [21:20:29] matanya: should be fixed shortly [21:20:47] ah, cool. 
thanks for clarifying [21:32:06] (03PS2) 10Yuvipanda: docker: Gently wade into new coding guidelines [puppet] - 10https://gerrit.wikimedia.org/r/335278 [21:32:08] (03PS1) 10Yuvipanda: docker: Allow using docker.io's repo directly for debs [puppet] - 10https://gerrit.wikimedia.org/r/335299 [21:32:24] (03PS1) 10Ottomata: Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) [21:33:09] (03PS2) 10Yuvipanda: docker: Allow using docker.io's repo directly for debs [puppet] - 10https://gerrit.wikimedia.org/r/335299 [21:35:47] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:37:28] 06Operations, 10fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#2987838 (10Jgreen) [21:38:20] (03CR) 10jerkins-bot: [V: 04-1] Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [21:39:47] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:41:31] !log twentyafterfour@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 19m 49s) [21:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:47] (03PS2) 10Ottomata: Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) [21:46:47] (03PS3) 10Ottomata: Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) [21:50:08] !log twentyafterfour@tin Synchronized wmf-config/: sync ExtensionMessages-1.29.0-wmf.10.php (duration: 00m 47s) [21:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:41] (03PS4) 10Ottomata: Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) [21:53:40] (03CR) 10jerkins-bot: [V: 04-1] Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [21:54:18] Krinkle: matanya: wmf.10 works now on testwikis, I'm about to sync to group0 [21:54:47] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [21:54:58] (03CR) 10JustBerry: "Extension:TorBlock affected, causing abuse of creating LDAP accounts to use for Phabricator account creation." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/158409 (owner: 10Andrew Bogott) [21:55:42] (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335345 [21:55:44] (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335345 (owner: 1020after4) [21:55:47] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [21:57:12] (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335345 (owner: 1020after4) [21:58:02] (03PS5) 10Ottomata: Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) [21:58:13] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.10 [21:58:14] (03CR) 10jenkins-bot: group0 wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335345 (owner: 1020after4) [21:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:51] (03CR) 10jerkins-bot: [V: 04-1] Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [22:00:01] (03PS6) 10Ottomata: Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) [22:01:52] (03CR) 10Ottomata: [C: 032] Create /srv/published-datasets on stat* boxes, and hardsync them to analytics.wikimedia.org/datasets [puppet] - 10https://gerrit.wikimedia.org/r/335301 (https://phabricator.wikimedia.org/T125854) (owner: 10Ottomata) [22:05:47] PROBLEM - puppet last run on stat1003 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:06:29] (03PS1) 10Ottomata: Require proper file for rsync-published-datasets cron [puppet] - 10https://gerrit.wikimedia.org/r/335363 [22:07:47] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:11:53] (03CR) 10Ottomata: [C: 032] Require proper file for rsync-published-datasets cron [puppet] - 10https://gerrit.wikimedia.org/r/335363 (owner: 10Ottomata) [22:14:47] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:20:08] (03PS1) 1020after4: ssh.job.progress changed: now takes ProgressReporter, not str [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335366 [22:21:17] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:27:45] !log cleaned up old branches: wmf.3 and wmf.4 [22:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:11] so boring now [22:35:17] wikimedia is so enterprisey [22:35:18] :) [22:51:17] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [23:11:15] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:15:03] (03PS1) 10Dzahn: add install1002/2002 to replace 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/335372 (https://phabricator.wikimedia.org/T156440) [23:15:16] (03CR) 10Thcipriani: [C: 031] ssh.job.progress changed: now takes ProgressReporter, not str [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335366 (owner: 1020after4) [23:16:45] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:23:23] (03PS2) 10Dzahn: add install1002/2002 to replace 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/335372 (https://phabricator.wikimedia.org/T156440) [23:23:35] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:25:10] (03CR) 10Dzahn: [C: 032] add install1002/2002 to replace 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/335372 (https://phabricator.wikimedia.org/T156440) (owner: 10Dzahn) [23:25:19] (03PS3) 10Dzahn: add install1002/2002 to replace 1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/335372 (https://phabricator.wikimedia.org/T156440) [23:34:22] (03PS1) 10Rush: wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [23:35:26] (03CR) 10jerkins-bot: [V: 04-1] wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [23:35:34] there's a request to rename a 92k user [23:35:42] anyone here to assist? 
[23:39:15] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:39:47] (03PS2) 10Rush: wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [23:42:32] (03PS3) 10Rush: wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [23:44:45] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [23:46:53] (03PS1) 10Dzahn: add install1002/2002 to replace install1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/335376 (https://phabricator.wikimedia.org/T156440) [23:49:40] (03CR) 10Dzahn: [C: 04-1] "need to use public1-b-codfw for ganeti VMs" [dns] - 10https://gerrit.wikimedia.org/r/335376 (https://phabricator.wikimedia.org/T156440) (owner: 10Dzahn) [23:51:14] !log ppchelko@tin Started deploy [changeprop/deploy@e27c3a0]: Update change-prop to fix wikidata rollback rule [23:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:35] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:52:46] !log ppchelko@tin Finished deploy [changeprop/deploy@e27c3a0]: Update change-prop to fix wikidata rollback rule (duration: 01m 32s) [23:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:58] (03PS2) 10Dzahn: add install1002/2002 to replace install1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/335376 (https://phabricator.wikimedia.org/T156440) [23:54:12] (03CR) 10Papaul: [C: 032] add install1002/2002 to replace install1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/335376 (https://phabricator.wikimedia.org/T156440) (owner: 10Dzahn) [23:59:45] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479