[00:36:31] 10Operations, 10MediaWiki-General-or-Unknown, 10TechCom-RfC: Bump PHP requirement to 5.6 in 1.31 - https://phabricator.wikimedia.org/T178538#3705281 (10Krinkle) [00:38:16] (03PS1) 10Ayounsi: Assigning v4/v6 IPs for eqiad/esams tunnel [dns] - 10https://gerrit.wikimedia.org/r/386119 [00:39:45] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3705296 (10Jack_who_built_the_house) >>! In T173710#3701806, @Ladsgroup wrote: > I think one of the reasons contributing to the problem is the same pro... [01:04:14] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508807047 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4095185 keys, up 4 minutes 4 seconds - replication_delay is 1508807047 [01:04:24] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508807059 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4096420 keys, up 4 minutes 16 seconds - replication_delay is 1508807059 [01:04:44] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:04:55] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [01:05:14] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4091889 keys, up 5 minutes 5 seconds - replication_delay is 0 [01:05:44] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8798819 keys, up 5 minutes 35 seconds - replication_delay is 0 [01:05:55] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4095226 
keys, up 5 minutes 51 seconds - replication_delay is 0 [01:06:25] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4093760 keys, up 6 minutes 17 seconds - replication_delay is 0 [02:09:30] (03PS9) 10Madhuvishy: ssh-key-ldap-lookup: Deny user auth if /etc/block-ldap-key-lookup exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) [02:18:27] (03CR) 10Madhuvishy: ">" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [02:25:53] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.4) (duration: 08m 13s) [02:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:48] (03CR) 10Madhuvishy: [C: 032] toolforge: Update shinken checks [puppet] - 10https://gerrit.wikimedia.org/r/386112 (owner: 10BryanDavis) [02:32:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 24 02:32:25 UTC 2017 (duration 6m 33s) [02:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:44] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 850.16 seconds [03:36:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:36:45] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:54:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:55:55] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:11:55] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 225.19 seconds [04:40:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Use the hiera() value in the message (031 
comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/386110 (owner: 10Volans) [06:12:33] (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386127 (https://phabricator.wikimedia.org/T178460) [06:14:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386127 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [06:15:19] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386127 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [06:16:10] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1082 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386127 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [06:16:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with low weight - T178460 (duration: 00m 47s) [06:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:34] T178460: db1082 storage crashed - https://phabricator.wikimedia.org/T178460 [06:17:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386128 (https://phabricator.wikimedia.org/T164488) [06:19:04] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:19:25] PROBLEM - Nginx local proxy to apache on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:19:34] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:19:55] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386128 (https://phabricator.wikimedia.org/T164488) (owner: 10Marostegui) [06:20:54] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 9.710 second response time [06:21:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386128 (https://phabricator.wikimedia.org/T164488) (owner: 10Marostegui) [06:21:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386128 (https://phabricator.wikimedia.org/T164488) (owner: 10Marostegui) [06:22:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 45s) [06:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:50] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [06:23:55] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:27:35] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time [06:28:35] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.014 second response time [06:28:54] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.125 second response time [06:29:45] PROBLEM - Check systemd state on mw1282 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. [06:32:04] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:33:34] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 74748 bytes in 7.253 second response time [06:33:38] 10Operations, 10Puppet, 10DBA: Switch databases to the future parser - https://phabricator.wikimedia.org/T172498#3705570 (10Joe) Since we switched all of production to the future parser almost 2 months ago, we clearly fixed these issues as part of the more general ticket about the future parser. [06:33:47] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3705572 (10Joe) [06:33:49] 10Operations, 10Puppet, 10DBA: Switch databases to the future parser - https://phabricator.wikimedia.org/T172498#3705571 (10Joe) 05Open>03Resolved [06:33:54] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [06:34:24] RECOVERY - Nginx local proxy to apache on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.031 second response time [06:41:56] (03PS4) 10Giuseppe Lavagetto: Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/386110 (owner: 10Volans) [06:44:04] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:48:54] RECOVERY - Check systemd state on mw1282 is OK: OK - running: The system is fully operational [06:49:01] (03CR) 10Giuseppe Lavagetto: [C: 032] Use the hiera() value in the message [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/386110 (owner: 10Volans) [06:49:29] (03PS2) 10Giuseppe Lavagetto: Add tests for a bad defined type [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382413 [06:49:52] (03CR) 10jerkins-bot: [V: 04-1] Add tests for a bad defined type 
[puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382413 (owner: 10Giuseppe Lavagetto) [06:57:01] (03CR) 10Giuseppe Lavagetto: "> Code seems ok, any easy way to test it?" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382649 (owner: 10Giuseppe Lavagetto) [07:02:55] (03PS3) 10Giuseppe Lavagetto: Add tests for a bad defined type [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382413 [07:02:57] (03PS2) 10Giuseppe Lavagetto: Add checks for nodes [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382649 [07:04:41] (03CR) 10Giuseppe Lavagetto: [C: 032] Add tests for a bad defined type [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382413 (owner: 10Giuseppe Lavagetto) [07:04:57] (03CR) 10Giuseppe Lavagetto: [C: 032] Add checks for nodes [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/382649 (owner: 10Giuseppe Lavagetto) [07:25:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3704733 (10Gilles) upload.beta.wmflabs.org refuses SSL connections right now, I see that it's not on that list [07:28:56] !log Stop replication in sync on db1078 and db1103 to fix data drifts - https://phabricator.wikimedia.org/T164488 [07:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:40] (03PS1) 10Giuseppe Lavagetto: Gemfile: fetch the puppet styleguide check from rubygems.org [puppet] - 10https://gerrit.wikimedia.org/r/386129 [07:31:15] (03PS3) 10Giuseppe Lavagetto: WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [07:31:39] (03CR) 10Giuseppe Lavagetto: [C: 032] Gemfile: fetch the puppet styleguide check from rubygems.org [puppet] - 10https://gerrit.wikimedia.org/r/386129 (owner: 10Giuseppe Lavagetto) [07:31:45] (03CR) 
10jerkins-bot: [V: 04-1] WIP: add puppet package version paramater to puppetmaster module [puppet] - 10https://gerrit.wikimedia.org/r/385999 (https://phabricator.wikimedia.org/T178825) (owner: 10Herron) [07:36:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386130 [07:38:20] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386130 (owner: 10Marostegui) [07:39:34] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386130 (owner: 10Marostegui) [07:39:43] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386130 (owner: 10Marostegui) [07:40:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s) [07:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:38] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [07:43:07] (03PS1) 10Marostegui: db-eqiad.php: Give more weight to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386131 (https://phabricator.wikimedia.org/T178460) [07:45:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give more weight to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386131 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [07:46:37] !log Stop replication in sync on db1103 and db2018 to fix data drifts - T164488 [07:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:44] T164488: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488 [07:46:53] (03Merged) 10jenkins-bot: db-eqiad.php: Give more weight to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386131 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [07:47:01] (03CR) 10jenkins-bot: 
db-eqiad.php: Give more weight to db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386131 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [07:48:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1082 weight - T178460 (duration: 00m 45s) [07:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:22] T178460: db1082 storage crashed - https://phabricator.wikimedia.org/T178460 [07:49:56] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705637 (10hashar) I guess we only fixed the text cache. Puppet fails on deployment-cache-upload04.deployment-prep.eqiad.wmflabs :( ``` Error: /Stage[main]/Nginx/Package[nginx-full]/ensure... [07:57:37] (03PS1) 10Gehel: wdqs: add timestamp to GC logs [puppet] - 10https://gerrit.wikimedia.org/r/386132 (https://phabricator.wikimedia.org/T175919) [07:59:20] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705644 (10hashar) I have applied a similar configuration in hiera for deployment-cache-upload04 While installing nginx-extra, the service failed to restart which blocks puppet: ``` nginx... [08:12:25] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10Performance-Team (Radar): Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10elukey) Removed the apache2 logrotate cron on osmium to avoid the following cronspam: ``` /etc/cron.daily/logrotate: Job... [08:15:07] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705665 (10hashar) In `profile::cache::ssl::unified` I have commented out the `tlsproxy::localssl { 'unified': ... }` to get the Varnish conf updated eg: ``` - new cache_local = vslp.vslp... 
[08:19:24] (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386133 (https://phabricator.wikimedia.org/T178460) [08:22:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1082 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386133 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [08:24:15] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386133 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [08:24:48] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705702 (10hashar) Next error: ``` Notice: /Stage[main]/Cacheproxy::Instance_pair/Varnish::Instance[upload-backend]/Exec[retry-load-new-vcl-file]/returns Command failed with error code 106... [08:26:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1082 original weight - T178460 (duration: 00m 45s) [08:26:08] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1082 with original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386133 (https://phabricator.wikimedia.org/T178460) (owner: 10Marostegui) [08:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:13] T178460: db1082 storage crashed - https://phabricator.wikimedia.org/T178460 [08:26:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1082 storage crashed - https://phabricator.wikimedia.org/T178460#3705705 (10Marostegui) 05Open>03Resolved a:03Marostegui db1082 is fully repooled now, let's close this for now [08:29:43] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705726 (10hashar) ``` # dpkg -S /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so varnish: /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_std.so # apt-cache policy 
varnish varnish:... [08:32:41] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3705727 (10Marostegui) Hi, Is there anything pending here? Thanks! [08:33:55] (03PS3) 10Hashar: beta: hieradata for varnish caches [puppet] - 10https://gerrit.wikimedia.org/r/386077 (https://phabricator.wikimedia.org/T178841) [08:34:36] (03PS4) 10Hashar: beta: hieradata for varnish caches [puppet] - 10https://gerrit.wikimedia.org/r/386077 (https://phabricator.wikimedia.org/T178841) [08:35:56] (03PS1) 10Marostegui: install_server: Reimage db2088 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/386135 (https://phabricator.wikimedia.org/T170662) [08:38:24] (03CR) 10Hashar: "Cherry picked on deployment-puppetmaster02. Puppet and Varnish now seems all happy on the cache instances:" [puppet] - 10https://gerrit.wikimedia.org/r/386077 (https://phabricator.wikimedia.org/T178841) (owner: 10Hashar) [08:39:51] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10Patch-For-Review: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3705759 (10hashar) p:05Triage>03Normal **Status** https://gerrit.wikimedia.org/r/#/c/386077/4 cherry picked on the beta cluster puppetmaster Puppet and Varnish... [08:43:37] (03CR) 10Marostegui: [C: 032] install_server: Reimage db2088 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/386135 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [08:43:59] _joe_: ok to merge your change? [08:44:10] <_joe_> marostegui: yeah sorry [08:44:18] _joe_: merging! thanks [08:44:23] <_joe_> it's the kind of null-in-prod change I forget to merge [08:44:31] no worries at all! 
[08:51:48] (03PS1) 10Marostegui: mariadb: Add db2088 to s1 and s2 [puppet] - 10https://gerrit.wikimedia.org/r/386136 (https://phabricator.wikimedia.org/T178359) [08:54:28] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [08:55:45] (03CR) 10Marostegui: [C: 032] "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/8426/" [puppet] - 10https://gerrit.wikimedia.org/r/386136 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:59:05] (03PS1) 10Marostegui: s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) [09:01:17] (03CR) 10jerkins-bot: [V: 04-1] s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:01:59] (03CR) 10Volans: "Some comments inline, also db1046's compiler results seems to have some spurious Icinga checks it shouldn't have." (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:03:15] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:42] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [09:06:52] Uh? -1? 
[09:07:54] marostegui: I cannot give the asked vote :D [09:08:23] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms [09:08:56] haha, sorry, I was talking about my patch: https://gerrit.wikimedia.org/r/386139 [09:09:02] wondering why jenkins doesn't like it XD [09:09:33] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/ba561eb69f96723b1c960324fa45741e6f707fedcd61a76dcc00ea33171b7512/shm is not accessible: Permission denied [09:09:35] 09:00:48 ./HSources.py:472:13: E722 do not use bare except' [09:09:35] 09:00:48 ./HPassPlugins.py:30:13: E741 ambiguous variable name 'l' [09:09:52] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [09:09:54] and a lot more in other files [09:10:18] that ofc has nothing to do with your change ;) [09:10:39] yeah yeah [09:10:51] Maybe hashar knows about it? [09:12:23] new flake8 [09:12:34] flake8==3.5.0 today, 3.4.1 in yesterday's run [09:14:35] marostegui: as a quick workaround change tox.ini, line 19 with flake8<3.5.0 and the same in line 7 of checkhosts/tox.ini [09:17:14] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/384547 (https://phabricator.wikimedia.org/T178279) (owner: 10Volans) [09:18:04] thanks volans I will do it later if it doesn't get fixed before (it is not urgent at all) [09:18:32] marostegui: in that repository the last many commits are only from DBAs ;) [09:19:25] volans: I guess it might affect others too? [09:19:59] depends on the code and there is no "general fix", each repo must define its own dependencies [09:20:13] Then I will get it :) [09:20:17] (03CR) 10Elukey: "Thanks Riccardo!"
(0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:21:26] (03PS2) 10Marostegui: s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) [09:22:04] !log Stop s1 on db2092 to copy its data to db2088 - T178359 [09:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:12] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [09:22:15] (03CR) 10jerkins-bot: [V: 04-1] s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:23:09] ah, forgot one file [09:23:31] (03PS3) 10Marostegui: s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) [09:24:04] marostegui: wait a sec [09:24:44] RECOVERY - Disk space on copper is OK: DISK OK [09:26:01] (03CR) 10Ppchelko: [C: 04-1] "Not only this needs to wait for deployment train, this also requires a config change to include the host header." [puppet] - 10https://gerrit.wikimedia.org/r/385382 (https://phabricator.wikimedia.org/T175146) (owner: 10Mobrovac) [09:26:13] that one did the trick and got the verified [09:27:46] yeah as a temporary measure is ok [09:28:06] I'll open a task to fix the underlying issue [09:30:57] (03PS11) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [09:32:27] volans: thanks! can you subscribe me to that task? 
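The workaround volans suggests at 09:14:35 can be sketched as a tox.ini fragment. This is a minimal sketch under stated assumptions: the log gives only the constraint (`flake8<3.5.0`) and the line numbers to change (line 19 of tox.ini, line 7 of checkhosts/tox.ini), so the section name `[testenv:flake8]` and the `commands` line below are hypothetical and may not match the real files.

```ini
# Hypothetical excerpt; the actual tox.ini layout is not shown in the log.
[testenv:flake8]
# Keep flake8 below 3.5.0, the release picked up in the failing run.
# The previous day's run used 3.4.1.
deps = flake8<3.5.0
commands = flake8
```

As said in the channel, the pin is only a temporary measure; the underlying flake8 failures in the operations/software repo are tracked separately in T178877.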
[09:32:53] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/1588902eebec2e402838e6abbfb69e093d4777e73a2f247a247351289aacdd54/shm is not accessible: Permission denied [09:33:00] sure [09:35:33] (03PS12) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [09:38:31] (03CR) 10Elukey: "New pcc looks definitely better: https://puppet-compiler.wmflabs.org/compiler02/8429/" [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [09:40:00] 10Operations, 10Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#3705842 (10ema) 05Open>03Resolved a:03ema >>! In T141373#3703459, @BBlack wrote: > Anything left to look at here? I've checked on a text-esams frontend and there's now plenty of... [09:40:46] 10Operations, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3705845 (10elukey) [09:40:57] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3705846 (10elukey) [09:42:18] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3701391 (10elukey) p:05Triage>03Normal Tried to force a learn cycle again, not much joy.. ``` elukey@analytics1029:~$ sudo megacli -AdpBbuCmd -a0 BBU status for Ada... [09:51:17] marostegui: I'm puzzled, all new failures are in files with a .py extension... 
digging more [09:57:24] (03CR) 10Alexandros Kosiaris: [C: 031] "Yeah, if we never share the host with other services, I guess it's fine" [puppet] - 10https://gerrit.wikimedia.org/r/382343 (owner: 10Dzahn) [10:08:52] 10Operations, 10DBA: operations/software repo: flake8 check - https://phabricator.wikimedia.org/T178877#3705904 (10Volans) [10:09:09] (03PS1) 10Elukey: hadoop: raise Xmx/Xms settings for hadoop worker daemons on an1030 [puppet] - 10https://gerrit.wikimedia.org/r/386147 (https://phabricator.wikimedia.org/T178876) [10:09:11] marostegui: ^^^ and I've also solved the mystery ; [10:09:12] ;) [10:12:37] (03PS1) 10Alexandros Kosiaris: Profilize role::package::builder [puppet] - 10https://gerrit.wikimedia.org/r/386148 [10:12:57] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/8430/analytics1030.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/386147 (https://phabricator.wikimedia.org/T178876) (owner: 10Elukey) [10:13:04] RECOVERY - Disk space on copper is OK: DISK OK [10:13:10] akosiaris: -2 for the commit message :-P [10:14:19] 10Operations, 10DBA: operations/software repo: flake8 check - https://phabricator.wikimedia.org/T178877#3705921 (10Volans) p:05Triage>03Normal [10:14:33] lol [10:14:49] hahahaha [10:14:49] ok ok .. I'll make it better [10:16:24] (03PS2) 10Alexandros Kosiaris: Create profile::package_builder [puppet] - 10https://gerrit.wikimedia.org/r/386148 [10:16:29] there :P [10:19:41] thanks sir!
:D [10:24:40] <_joe_> that's still not in our standard format [10:24:41] <_joe_> :P [10:25:05] <_joe_> try with something like role::package::builder: convert to role/profile [10:25:26] <_joe_> wow, I've been more pedantic than volans [10:26:08] _joe_: I avoid to comment on purpose to catch you ;) [10:26:15] *avoided [10:27:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See the inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386148 (owner: 10Alexandros Kosiaris) [10:29:53] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [10:30:24] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 (owner: 10Giuseppe Lavagetto) [10:30:56] <_joe_> meh I wrote python in ruby [10:30:58] <_joe_> as usual [10:31:48] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [10:32:21] (03CR) 10jerkins-bot: [V: 04-1] profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 (owner: 10Giuseppe Lavagetto) [10:34:42] (03PS3) 10Amire80: Deploy Compact Language Links on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384527 (https://phabricator.wikimedia.org/T177836) (owner: 10KartikMistry) [10:36:45] _joe_: better than the opposite ;) [10:37:28] (03CR) 10Joal: "Comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386147 (https://phabricator.wikimedia.org/T178876) (owner: 10Elukey) [10:38:55] (03PS4) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [10:40:11] _joe_: can we use 110% of the reactor? (sorry couldn't resist) [10:40:27] <_joe_> volans: it is possible, but not advisable [10:40:40] test pass! 
[10:41:49] <_joe_> volans: that number is completely out of my ass btw [10:42:01] <_joe_> I will need to tune it a bit [10:43:10] * volans rolling a D100 [10:44:06] (03PS5) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [10:44:13] (03PS1) 10Alexandros Kosiaris: builder: Disable docker's iptables handling [puppet] - 10https://gerrit.wikimedia.org/r/386153 [10:44:57] (03CR) 10Volans: "Much better now the compiler, thanks for the fixes! See replies inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [10:46:47] (03PS3) 10Alexandros Kosiaris: Create profile::package_builder [puppet] - 10https://gerrit.wikimedia.org/r/386148 [10:46:49] (03PS2) 10Alexandros Kosiaris: builder: Disable docker's iptables handling [puppet] - 10https://gerrit.wikimedia.org/r/386153 [10:50:56] !log cirrus/elasticsearch: reindexing of 167 small wikis done (T177871) [10:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:04] T177871: Re-index un-fallbacked languages - https://phabricator.wikimedia.org/T177871 [10:55:13] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:44] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 86%, RTA = 83.82 ms [10:56:24] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:34] ema ^^^ [10:56:45] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 73%, RTA = 83.83 ms [10:58:00] (03CR) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [11:03:09] (03PS13) 10Elukey: [WIP] Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [11:04:53] PROBLEM - puppet last run on
db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:12:30] (03PS6) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [11:19:30] (03PS7) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [11:22:47] (03PS14) 10Elukey: role::mariadb: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [11:27:58] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/8438/" [puppet] - 10https://gerrit.wikimedia.org/r/386138 (owner: 10Giuseppe Lavagetto) [11:32:40] (03PS1) 10ArielGlenn: fix multistream job so inprogress files are moved to real names when done [dumps] - 10https://gerrit.wikimedia.org/r/386155 (https://phabricator.wikimedia.org/T177523) [11:33:03] (03CR) 10jerkins-bot: [V: 04-1] fix multistream job so inprogress files are moved to real names when done [dumps] - 10https://gerrit.wikimedia.org/r/386155 (https://phabricator.wikimedia.org/T177523) (owner: 10ArielGlenn) [11:34:53] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:39:56] (03PS1) 10Giuseppe Lavagetto: jobrunner: decommission old servers [puppet] - 10https://gerrit.wikimedia.org/r/386156 [11:39:59] (03PS1) 10Giuseppe Lavagetto: videoscaler: decom part of the older servers [puppet] - 10https://gerrit.wikimedia.org/r/386157 [11:43:43] (03CR) 10Elukey: [C: 031] profile::mediawiki::jobrunner: use 70% of capacity (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/386138 (owner: 10Giuseppe Lavagetto) [11:45:10] (03CR) 10Elukey: [C: 031] "As disclaimer to whoever reads this, before applying this change we'll need to proper drain the current jobs on the job-runners affected (" [puppet] - 
10https://gerrit.wikimedia.org/r/386156 (owner: 10Giuseppe Lavagetto) [11:45:47] (03CR) 10Elukey: [C: 031] videoscaler: decom part of the older servers [puppet] - 10https://gerrit.wikimedia.org/r/386157 (owner: 10Giuseppe Lavagetto) [11:56:33] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 789.15 seconds [11:59:34] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 99.97 seconds [12:00:19] (03CR) 10Alexandros Kosiaris: [C: 032] Create profile::package_builder [puppet] - 10https://gerrit.wikimedia.org/r/386148 (owner: 10Alexandros Kosiaris) [12:00:20] (03CR) 10Alexandros Kosiaris: [C: 032] builder: Disable docker's iptables handling [puppet] - 10https://gerrit.wikimedia.org/r/386153 (owner: 10Alexandros Kosiaris) [12:05:35] 10Operations, 10monitoring: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3706215 (10fgiunchedi) Another idea for better dashboarding: show vertical lines for events other than deployments, e.g. puppet merges [12:15:25] (03PS1) 10ArielGlenn: use separate path for public/other datasets [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) [12:16:18] 10Operations, 10monitoring: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3699692 (10Volans) >>! In T178690#3706215, @fgiunchedi wrote: > Another idea for better dashboarding: show vertical lines for events other than deployments, e.g. puppet merges I agree, in a s... 
[12:26:50] (03PS1) 10ArielGlenn: one-off scripts for fixing up multistream dump mess [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/386162 [12:27:08] (03CR) 10jerkins-bot: [V: 04-1] one-off scripts for fixing up multistream dump mess [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/386162 (owner: 10ArielGlenn) [12:34:18] (03CR) 10Hashar: extdist: use profile::labs::lvm::srv instead of role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/385477 (owner: 10Hashar) [12:35:51] (03CR) 10Hashar: "Looks like I missed your change back in May :( Lot of contint has been migrated to profile bits par bits. There is a lot more to do thou" [puppet] - 10https://gerrit.wikimedia.org/r/355156 (owner: 10Dzahn) [12:37:44] (03CR) 10Hashar: [C: 031] "The "role" is now just:" [puppet] - 10https://gerrit.wikimedia.org/r/385480 (owner: 10Hashar) [12:41:32] (03CR) 10Hashar: [C: 031] "role::labs::lvm::srv is now a simple wrapper:" [puppet] - 10https://gerrit.wikimedia.org/r/385478 (owner: 10Hashar) [12:41:34] (03PS2) 10ArielGlenn: one-off scripts for fixing up multistream dump mess [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/386162 [12:42:49] (03PS1) 10Ppchelko: [Beta Labs] Use only EventBus for job processing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386164 [12:44:09] (03CR) 10Ppchelko: [C: 04-1] "Self -1 for now as we don't have the system in place in beta yet. Just preparing for the future." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/386164 (owner: 10Ppchelko) [12:46:59] (03PS3) 10ArielGlenn: one-off scripts for fixing up multistream dump mess [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/386162 [12:48:55] (03CR) 10ArielGlenn: [C: 032] one-off scripts for fixing up multistream dump mess [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/386162 (owner: 10ArielGlenn) [12:53:09] (03PS1) 10Alexandros Kosiaris: ci: Use docker class, pass configuration [puppet] - 10https://gerrit.wikimedia.org/r/386166 [12:55:04] (03PS4) 10Marostegui: s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) [12:55:32] (03PS2) 10Alexandros Kosiaris: ci: Use docker class, pass configuration [puppet] - 10https://gerrit.wikimedia.org/r/386166 [12:59:01] (03PS3) 10Alexandros Kosiaris: ci: Use docker class, pass configuration [puppet] - 10https://gerrit.wikimedia.org/r/386166 [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T1300). [13:00:05] Jayprakash12345: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:37] I can SWAT today [13:00:46] Jayprakash12345: around for SWAT? 
[13:00:59] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/compiler02/8441/contint1001.wikimedia.org/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/386166 (owner: 10Alexandros Kosiaris) [13:02:40] (03PS1) 10Gehel: wdqs: GC tuning [puppet] - 10https://gerrit.wikimedia.org/r/386170 (https://phabricator.wikimedia.org/T175919) [13:02:51] (03CR) 10Volans: [C: 031] "LGTM, would be nice to have a +1 from a DBA too" [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [13:04:25] (03CR) 10Marostegui: [C: 032] s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:05:12] (03Merged) 10jenkins-bot: s1,s2.hosts: Add db2088 to s1 and s2 [software] - 10https://gerrit.wikimedia.org/r/386139 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:09:04] (03PS15) 10Elukey: role::mariadb: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) [13:10:08] (03Abandoned) 10Ottomata: Include jmx_exporter_config to make prometheus query Kafka jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/379290 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata) [13:10:33] (03CR) 10Zfilipin: "This was scheduled for EU SWAT today but was deployed since Jayprakash12345 was not available in #wikimedia-operations." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/385843 (https://phabricator.wikimedia.org/T178775) (owner: 10Jayprakash12345) [13:12:17] (03CR) 10Elukey: [C: 032] role::mariadb: Introduce mariadb eventlogging profiles for master/replica [puppet] - 10https://gerrit.wikimedia.org/r/385173 (https://phabricator.wikimedia.org/T177405) (owner: 10Elukey) [13:12:19] (03PS1) 10Marostegui: db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386173 (https://phabricator.wikimedia.org/T178359) [13:17:34] zeljkof: is swat done? [13:17:56] marostegui: sorry, forgot to make it explicit, did not even start :( [13:18:09] hey [13:18:13] yeah, i was wondering if you'd wait for it or not [13:18:13] oh [13:18:13] :) [13:18:19] https://gerrit.wikimedia.org/r/#/c/385843/ [13:18:26] marostegui: oh, looks like Jayprakash12345 has just arrived [13:18:30] haha yeah [13:18:34] I will wait :) [13:18:36] Jayprakash12345: ready for SWAT? [13:18:45] yes [13:18:57] ok, in that case, EU SWAT starts [13:19:24] (03PS3) 10Addshore: ci: jenkins, allow access to computer/.*/builds [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) [13:19:46] (03PS4) 10Jayprakash12345: Add $wgNamespaceRobotPolicies Config for hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385843 (https://phabricator.wikimedia.org/T178775) [13:20:14] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [13:21:04] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 579 bytes in 11.311 second response time [13:22:23] ^ that's me stopping and restarting [13:22:32] Jayprakash12345: can you test the patch at mwdebug1002? 
[13:22:55] (not right now, in general; I need a few minutes to deploy it) [13:24:07] (03PS1) 10Filippo Giunchedi: hieradata: gradual eqiad rollout of syslog-tls [puppet] - 10https://gerrit.wikimedia.org/r/386176 [13:25:21] The config was provided by the task creator (Fuzzy), and I did not find any error in the code. [13:26:30] Jayprakash12345: so you are saying, you do not know how to test it? [13:26:38] or you can not test it? [13:27:40] hashar (or anybody else): could you please take a look at this? https://gerrit.wikimedia.org/r/#/c/385843/ [13:27:53] I am really not sure if it's OK to deploy it [13:29:10] zeljkof: not sure why they would want to disable it solely on some namespaces [13:29:20] I have used X-Wikimedia-Debug with mwdebug1002 to test patches such as logo changes. [13:29:37] hashar: what to do? deploy or not deploy? [13:29:44] (03CR) 10Filippo Giunchedi: "PCC=happy https://puppet-compiler.wmflabs.org/compiler02/8442/" [puppet] - 10https://gerrit.wikimedia.org/r/386176 (owner: 10Filippo Giunchedi) [13:29:45] But I have never tested it with $wgNamespaceRobotPolicies [13:29:51] zeljkof: I don't know anything about robots policy or mediawiki internal settings [13:30:00] zeljkof: so really I can't tell how good/bad it is [13:30:02] hashar: so, no deploy? :) [13:30:22] until somebody that is more familiar with it takes a look? [13:30:40] I can't tell the default for a namespace, nor the difference introduced by 'noindex,follow' or 'noindex,nofollow' [13:30:51] so I guess gotta hunt for someone familiar with the robot policy to review the change [13:31:57] hashar, Jayprakash12345: ok, in that case, no deploy until somebody that is more familiar with it takes a look [13:31:57] https://phabricator.wikimedia.org/T178775 , I made the patch based on the task [13:31:59] sounds good? 
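[Editor's note] On the 'noindex,follow' versus 'noindex,nofollow' question raised above: in a robots policy string, the index/noindex part tells crawlers whether the page itself may be indexed, and the follow/nofollow part tells them whether links on the page may be followed. The sketch below only illustrates that distinction. It is not MediaWiki code: the function names, the namespace IDs, and the 'index,follow' fallback are assumptions made for the example, not taken from the patch under discussion.

```python
# Illustrative sketch only; not MediaWiki code. All names are hypothetical.

def parse_robot_policy(policy):
    """Split a policy string such as 'noindex,follow' into two flags."""
    directives = {part.strip().lower() for part in policy.split(",")}
    return {
        # The page may be indexed unless 'noindex' is present.
        "index": "noindex" not in directives,
        # Links on the page may be followed unless 'nofollow' is present.
        "follow": "nofollow" not in directives,
    }


# Hypothetical per-namespace overrides; the namespace IDs are made up.
NAMESPACE_POLICIES = {100: "noindex,follow", 102: "noindex,nofollow"}

# Assumed fallback for namespaces without an override.
DEFAULT_POLICY = "index,follow"


def policy_for_namespace(ns_id):
    """Return the parsed policy for a namespace, falling back to the default."""
    return parse_robot_policy(NAMESPACE_POLICIES.get(ns_id, DEFAULT_POLICY))
```

With these made-up values, namespace 100 would be kept out of search indexes while its links are still crawled, and namespace 102 would be excluded on both counts.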
[13:32:50] Jayprakash12345: I see that, I am not saying you did anything wrong, but just that nobody knows if the code is doing what it is supposed to do [13:32:52] zeljkof: then 100% sure the patch is harmless for prod [13:32:53] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3706417 (10elukey) Next steps: 1) Create unit files and systemd config for eventlogging_sync.sh and add the guards in puppet to allow trusty/stre... [13:32:53] it is not going to bring the site down [13:32:59] hashar: so, deploy? :) [13:33:02] as for whether it is correct. Really I have no idea. [13:34:05] hashar: the thing that is confusing to me is that the original patch was made in phab, not in gerrit https://phabricator.wikimedia.org/T178775 [13:34:27] so, if the person creating it knew what they were doing, why didn't they submit the patch in gerrit? [13:34:38] they dont know how to use gerrit? [13:34:55] some might be familiar with mediawiki configuration but not with git/gerrit [13:35:12] zeljkof: I guess just deploy it [13:35:21] hashar: ok, deploying then [13:35:26] and they can still follow up if does not match their expectations or if something got missed [13:35:57] Jayprakash12345: deploying your patch, I guess there is nothing for you to do, since you can not test it :) [13:36:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385843 (https://phabricator.wikimedia.org/T178775) (owner: 10Jayprakash12345) [13:37:18] Thank you [13:37:33] (03Merged) 10jenkins-bot: Add $wgNamespaceRobotPolicies Config for hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385843 (https://phabricator.wikimedia.org/T178775) (owner: 10Jayprakash12345) [13:37:43] (03CR) 10jenkins-bot: Add $wgNamespaceRobotPolicies Config for hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385843 
(https://phabricator.wikimedia.org/T178775) (owner: 10Jayprakash12345) [13:38:04] Jayprakash12345: thank you for releasing with #releng :) [13:39:17] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:385843|Add $wgNamespaceRobotPolicies Config for hewikisource (T178775)]] (duration: 00m 46s) [13:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:27] T178775: Update Robot Policies for the Hebrew Wikisource - https://phabricator.wikimedia.org/T178775 [13:39:31] !log EU SWAT finished [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:40] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:39:47] marostegui: eu swat finished [13:40:08] zeljkof: thank you! [13:41:22] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:41:34] (03CR) 10jenkins-bot: db-codfw.php: Depool db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386173 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:41:52] !log Stop MySQL on db2035 to copy its data to db2088 - T178359 [13:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:59] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [13:42:29] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2035 - T178359 (duration: 00m 45s) [13:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:36] (03PS8) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 [13:45:16] (03PS1) 10Marostegui: db2035.yaml: Update path location [puppet] - 
10https://gerrit.wikimedia.org/r/386182 [13:45:40] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: use 70% of capacity [puppet] - 10https://gerrit.wikimedia.org/r/386138 (owner: 10Giuseppe Lavagetto) [13:46:06] (03PS10) 10Rush: ssh-key-ldap-lookup: Deny user auth if /etc/block-ldap-key-lookup exists [puppet] - 10https://gerrit.wikimedia.org/r/384574 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [13:46:45] (03PS2) 10Marostegui: db2035.yaml: Update path location [puppet] - 10https://gerrit.wikimedia.org/r/386182 [13:47:34] (03CR) 10Marostegui: [C: 032] db2035.yaml: Update path location [puppet] - 10https://gerrit.wikimedia.org/r/386182 (owner: 10Marostegui) [13:58:05] 10Operations: Update prod SSH key for Michael Holloway (mholloway-shell) - https://phabricator.wikimedia.org/T178897#3706472 (10Mholloway) [14:02:16] (03PS1) 10Mholloway: Update prod ssh key for user mholloway-shell [puppet] - 10https://gerrit.wikimedia.org/r/386187 (https://phabricator.wikimedia.org/T178897) [14:11:18] (03PS2) 10Giuseppe Lavagetto: jobrunner: decommission old servers [puppet] - 10https://gerrit.wikimedia.org/r/386156 [14:11:57] (03PS1) 10Ottomata: Add default_prometheus_jmx_exporter.yaml [puppet] - 10https://gerrit.wikimedia.org/r/386190 [14:12:02] (03CR) 10Giuseppe Lavagetto: "> As disclaimer to whoever reads this, before applying this change" [puppet] - 10https://gerrit.wikimedia.org/r/386156 (owner: 10Giuseppe Lavagetto) [14:13:43] (03PS2) 10Ottomata: Add default_prometheus_jmx_exporter.yaml [puppet] - 10https://gerrit.wikimedia.org/r/386190 (https://phabricator.wikimedia.org/T175344) [14:14:58] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386191 [14:15:02] (03CR) 10Elukey: [C: 031] "> > As disclaimer to whoever reads this, before applying this change" [puppet] - 10https://gerrit.wikimedia.org/r/386156 (owner: 10Giuseppe Lavagetto) [14:16:49] (03CR) 10Ottomata: "I'd like 
to move the config file rendering into jmx_exporter_instance, which is why I put the .yaml file in the promethues module, instead" [puppet] - 10https://gerrit.wikimedia.org/r/386190 (https://phabricator.wikimedia.org/T175344) (owner: 10Ottomata) [14:17:01] godog: when you have a few mins: https://phabricator.wikimedia.org/T175344#3706529 [14:17:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386191 (owner: 10Marostegui) [14:19:13] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386191 (owner: 10Marostegui) [14:19:21] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386191 (owner: 10Marostegui) [14:19:30] (03PS1) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [14:19:33] (03Abandoned) 10Ema: VCL: Exp cache admission policy for varnish-be [puppet] - 10https://gerrit.wikimedia.org/r/379512 (https://phabricator.wikimedia.org/T144187) (owner: 10Ema) [14:20:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1106 weight (duration: 00m 45s) [14:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:25] (03PS2) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [14:20:59] (03PS3) 10Giuseppe Lavagetto: jobrunner: decommission old servers [puppet] - 10https://gerrit.wikimedia.org/r/386156 [14:22:20] (03PS2) 10Gehel: wdqs: GC tuning [puppet] - 10https://gerrit.wikimedia.org/r/386170 (https://phabricator.wikimedia.org/T175919) [14:23:27] (03CR) 10Gehel: [C: 032] wdqs: GC tuning [puppet] - 10https://gerrit.wikimedia.org/r/386170 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [14:24:30] !log 
restarting blazegraph on wdqs2001 for GC config change - T175919 [14:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:38] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919 [14:25:43] (03PS2) 10Andrew Bogott: prometheus: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385480 (owner: 10Hashar) [14:27:30] (03CR) 10Andrew Bogott: [C: 032] prometheus: use profile::labs::lvm::srv instead of role [puppet] - 10https://gerrit.wikimedia.org/r/385480 (owner: 10Hashar) [14:30:18] (03PS1) 10BBlack: new patch: configurable ssl_do_wait_shutdown [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/386195 [14:30:21] (03PS1) 10BBlack: Release 1.13.6-2+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/386196 [14:30:33] PROBLEM - Check size of conntrack table on mw1309 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [14:30:43] (03PS3) 10Hashar: graphite: cleanup servers.* [puppet] - 10https://gerrit.wikimedia.org/r/377414 [14:31:23] PROBLEM - Check size of conntrack table on mw1311 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [14:31:27] (03PS2) 10BBlack: new patch: configurable ssl_do_wait_shutdown [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/386195 (https://phabricator.wikimedia.org/T163674) [14:31:29] (03PS2) 10BBlack: Release 1.13.6-2+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/386196 [14:31:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386197 (https://phabricator.wikimedia.org/T174509) [14:32:27] RECOVERY - Check size of conntrack table on mw1311 is OK: OK: nf_conntrack is 79 % full [14:32:33] (03PS4) 10Giuseppe Lavagetto: jobrunner: decommission old servers [puppet] - 10https://gerrit.wikimedia.org/r/386156 [14:32:44] !log Optimize pagelinks and templatelinks on db1065 - T174509 [14:32:51] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [14:32:56] <_joe_> didn't we remove the damn conntrack from ferm rules on jobrunners? [14:33:43] !log Optimize ores_classification on enwiki db1065 - T159753 [14:33:45] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: decommission old servers [puppet] - 10https://gerrit.wikimedia.org/r/386156 (owner: 10Giuseppe Lavagetto) [14:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:50] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [14:34:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386197 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [14:35:33] RECOVERY - Check size of conntrack table on mw1309 is OK: OK: nf_conntrack is 77 % full [14:35:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386197 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [14:35:54] (03PS3) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [14:36:16] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386197 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [14:36:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T174509 (duration: 00m 45s) [14:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:47] (03PS4) 10Zoranzoki21: ci: jenkins, allow access to computer/.*/builds [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) (owner: 10Addshore) [14:39:15] !log Optimize 
pagelinks templatelinks and recentchanges on db1030 - T174509 https://phabricator.wikimedia.org/T177772 [14:39:16] (03PS3) 10BBlack: new patch: configurable ssl_do_wait_shutdown [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/386195 (https://phabricator.wikimedia.org/T163674) [14:39:18] (03PS3) 10BBlack: Release 1.13.6-2+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/386196 [14:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:23] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [14:39:35] (03Abandoned) 10Hashar: .gitignore private in case it is a symlink [puppet] - 10https://gerrit.wikimedia.org/r/382154 (owner: 10Hashar) [14:39:43] PROBLEM - Nginx local proxy to apache on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:39:54] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:13] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:01] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3706629 (10Addshore) [14:42:39] RECOVERY - Nginx local proxy to apache on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [14:42:49] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.034 second response time [14:46:02] <_joe_> !log stopping the jobrunner service on mw1161-67 [14:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:33] 10Operations, 10wikidiff2, 10User-Addshore, 10WMDE-QWERTY-Team-Board: Update and use php-wikidiff2 to 1.5 in production - https://phabricator.wikimedia.org/T177891#3706661 (10Addshore) @hashar @greg do you see any reason that I shouldn't set 
wgWikiDiff2MovedParagraphDetectionCutoff based on the return of g... [14:48:19] <_joe_> !log also stopping HHVM, nginx, apache2 , and disabling all the services [14:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:00] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 75217 bytes in 0.100 second response time [14:50:19] PROBLEM - HHVM jobrunner on mw1164 is CRITICAL: connect to address 10.64.32.34 and port 9005: Connection refused [14:50:19] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: connect to address 10.64.32.37 and port 9005: Connection refused [14:50:20] PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: connect to address 10.64.32.36 and port 9005: Connection refused [14:50:29] PROBLEM - HHVM jobrunner on mw1163 is CRITICAL: connect to address 10.64.32.33 and port 9005: Connection refused [14:50:40] PROBLEM - mediawiki-installation DSH group on mw1163 is CRITICAL: Host mw1163 is not in mediawiki-installation dsh group [14:50:49] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: connect to address 10.64.32.32 and port 9005: Connection refused [14:50:59] PROBLEM - HHVM jobrunner on mw1165 is CRITICAL: connect to address 10.64.32.35 and port 9005: Connection refused [14:51:08] <_joe_> uhm I didn't force a run on einsteinium, sorry [14:51:19] PROBLEM - Nginx local proxy to apache on mw1165 is CRITICAL: connect to address 10.64.32.35 and port 443: Connection refused [14:51:49] PROBLEM - Nginx local proxy to apache on mw1166 is CRITICAL: connect to address 10.64.32.36 and port 443: Connection refused [14:51:49] PROBLEM - Nginx local proxy to apache on mw1167 is CRITICAL: connect to address 10.64.32.37 and port 443: Connection refused [14:51:50] PROBLEM - Nginx local proxy to apache on mw1164 is CRITICAL: connect to address 10.64.32.34 and port 443: Connection refused [14:52:10] PROBLEM - Nginx local proxy to apache on mw1162 is CRITICAL: connect to address 10.64.32.32 and port 443: Connection refused 
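[Editor's note] The "nf_conntrack is N % full" alerts in this log compare the number of tracked connections with the maximum size of the conntrack table. A rough sketch of that calculation follows, assuming the standard Linux files nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter. The actual monitoring check is not shown in the log: the 90% critical threshold is only inferred from the alerts here, and the 80% warning threshold is a hypothetical value.

```python
from pathlib import Path

# Standard location of the conntrack counters on Linux.
PROC_DIR = Path("/proc/sys/net/netfilter")


def percent_full(count, maximum):
    """Return table usage as a percentage of the maximum size."""
    return 100.0 * count / maximum


def classify(pct, warning=80.0, critical=90.0):
    """Map a usage percentage to an alert level (thresholds are assumptions)."""
    if pct >= critical:
        return "CRITICAL"
    if pct >= warning:
        return "WARNING"
    return "OK"


def read_conntrack_usage(proc_dir=PROC_DIR):
    """Read the current and maximum entry counts and return percent full."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return percent_full(count, maximum)
```

On a host with the conntrack module loaded, classify(read_conntrack_usage()) would give the current level. Entries in TIME_WAIT count toward the table until their timeout expires, which is why the timeout value matters for these alerts.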
[14:54:30] PROBLEM - Check size of conntrack table on mw1311 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [14:56:31] RECOVERY - Check size of conntrack table on mw1311 is OK: OK: nf_conntrack is 74 % full [15:01:50] (03CR) 10Zoranzoki21: [C: 031] Deploy Compact Language Links on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384527 (https://phabricator.wikimedia.org/T177836) (owner: 10KartikMistry) [15:09:20] !log restarting blazegraph on all wdqs nodes for GC config change - T175919 [15:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:28] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919 [15:17:40] PROBLEM - Check size of conntrack table on mw1311 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [15:17:51] PROBLEM - Check size of conntrack table on mw1309 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [15:19:17] (03PS4) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [15:20:40] RECOVERY - Check size of conntrack table on mw1311 is OK: OK: nf_conntrack is 70 % full [15:22:51] RECOVERY - Check size of conntrack table on mw1309 is OK: OK: nf_conntrack is 77 % full [15:26:03] <_joe_> I'm looking into the conntrack table issue there [15:27:47] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jobrunner&var-instance=mw1309 [15:27:56] time_waits have increased a bit of course [15:28:30] maybe those are temporary timewaits when a lot of connections to redis are made [15:28:42] (in a short time frame I mean) [15:30:41] PROBLEM - Check size of conntrack table on mw1311 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [15:31:20] conntrack -L shows a lot of timewaits, seems to rdb hosts [15:32:40] RECOVERY - Check size of conntrack table on mw1311 is OK: OK: nf_conntrack 
is 67 % full [15:34:00] also net.netfilter.nf_conntrack_tcp_timeout_time_wait is 120 [15:34:10] rather than 65 as it should [15:34:16] (on mw1311) [15:36:10] _joe_ can I set net.netfilter.nf_conntrack_tcp_timeout_time_wait to 65 to mw[1308-1311].eqiad.wmnet or are you working on them? [15:36:27] <_joe_> yeah [15:36:36] <_joe_> elukey: not only to rdb hosts, to be honest [15:36:48] oh yes yes [15:39:16] !log set net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 to mw[1308-1311] - T136094 [15:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] T136094: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094 [15:39:42] Cc: moritzm --^ [15:41:11] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:03] (03PS1) 10Chad: Pylint nitpicks: clean up import orders [software/conftool] - 10https://gerrit.wikimedia.org/r/386206 [15:42:33] (03CR) 10Ayounsi: [C: 032] Assigning v4/v6 IPs for eqiad/esams tunnel [dns] - 10https://gerrit.wikimedia.org/r/386119 (owner: 10Ayounsi) [15:42:45] (03CR) 10jerkins-bot: [V: 04-1] Pylint nitpicks: clean up import orders [software/conftool] - 10https://gerrit.wikimedia.org/r/386206 (owner: 10Chad) [15:48:06] !log drop table Edit_13457736_15423246 from the log database (Eventlogging) on db104[6,7], dbstore1002 [15:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:24] marostegui: --^ this is another huge table that we saved in hdfs [15:51:12] (03PS1) 10Chad: Pylint nitpicks: Whitespace/continuation fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/386211 [15:51:55] (03CR) 10jerkins-bot: [V: 04-1] Pylint nitpicks: Whitespace/continuation fixes [software/conftool] - 10https://gerrit.wikimedia.org/r/386211 (owner: 10Chad) [15:52:39] (03PS5) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 
10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [15:53:38] !log T178845: Rolling Cassandra restart, codfw, row b [15:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:48] T178845: Cassandra cluster name mismatch warnings - https://phabricator.wikimedia.org/T178845 [15:58:22] !log drop MediaViewer_10867062_15423246 and MobileWebUIClickTracking_10742159_15423246 from the log database on db1046 (archived on hadoop) - T168303 [15:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:29] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [15:59:06] nuria_: mind to double check the table names in --^ ? I didn't drop them at the time from db1046 (EL master) but they are already on hdfs [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T1600). [16:00:04] addshore: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:54] elukey: i have been documenting those here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop._Archived_Data [16:01:29] elukey: those two look good [16:01:31] PROBLEM - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.167 and port 9042: Connection refused [16:02:10] PROBLEM - cassandra-c SSL 10.192.16.167:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:02:19] nuria_: thanks! [16:02:22] addshore: o/ [16:02:27] o/ [16:02:35] addshore: holaaa [16:02:39] oh hello nuria_ ! [16:02:44] is releng ok with the change? 
I am trying to review your patch [16:02:51] elukey: let me know when we restart eventlogging purging [16:02:57] elukey: yup, hashar says it is OK in the ticket attached ;) [16:03:10] elukey: I am going to drop 1 more table but this one is not a big one, just for achival purposes [16:04:12] super thanks! [16:05:11] RECOVERY - cassandra-c SSL 10.192.16.167:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-c valid until 2018-07-19 10:52:13 +0000 (expires in 267 days) [16:05:31] RECOVERY - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.167 port 9042 [16:05:55] (03PS1) 10Ayounsi: Fix interface numbers on esams/eqiad tunnel PTR [dns] - 10https://gerrit.wikimedia.org/r/386214 [16:05:56] addshore: reading :) [16:06:21] elukey: do we use logstats much at all? not for hadoop not for eL, right? [16:06:41] PROBLEM - eventlogging_sync processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [16:07:24] nuria_: not much yes, we should build a dashboard for el though [16:07:32] sorry db1047 is me [16:08:21] PROBLEM - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:08:40] PROBLEM - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.176 and port 9042: Connection refused [16:08:44] RECOVERY - eventlogging_sync processes on db1047 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [16:08:57] addshore: seems good to me, releng is aware and you guys are going to check if it causes a hit in perf or not right? [16:09:06] elukey: yup! 
[16:09:21] RECOVERY - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-a valid until 2018-07-19 10:52:33 +0000 (expires in 267 days) [16:09:40] RECOVERY - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.176 port 9042 [16:10:14] addshore: all right let's do it [16:10:16] cc: hasharAway [16:10:51] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.3 [keeping static files] (duration: 01m 20s) [16:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:11:35] (03CR) 10Elukey: [C: 032] "Had a chat with Addshore on IRC. This seems to have been allowed by Releng. There might be a performance hit that will be evaluated, but r" [puppet] - 10https://gerrit.wikimedia.org/r/384960 (https://phabricator.wikimedia.org/T178458) (owner: 10Addshore) [16:12:12] I will match donations by 10%, up to $100 https://www.facebook.com/donate/1983502895227964/1983502898561297/?fundraiser_source=feed [16:14:15] (03CR) 10Ayounsi: [C: 032] Fix interface numbers on esams/eqiad tunnel PTR [dns] - 10https://gerrit.wikimedia.org/r/386214 (owner: 10Ayounsi) [16:15:05] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:15:39] addshore: can you try and see if it works? [16:15:50] Looks good! [16:16:00] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2018-07-19 10:52:36 +0000 (expires in 267 days) [16:16:09] (03PS1) 10Herron: puppetmaster: temporarily pin puppet* to jessie-backports in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386217 (https://phabricator.wikimedia.org/T177254) [16:16:32] addshore: also are we sure that no sensitive data is there by any chance? 
[16:16:57] performance should improve with https://issues.jenkins-ci.org/browse/JENKINS-20046 . though https://integration.wikimedia.org/ci/computer/contint1001/builds loads really fast for me. [16:17:05] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: temporarily pin puppet* to jessie-backports in codfw [puppet] - 10https://gerrit.wikimedia.org/r/386217 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [16:17:09] Hmm, so, it works, however hmmm [16:17:42] https://integration.wikimedia.org/ci/computer/contint1001/builds keeps making more requests to extend its list, which I guess isnt ideal [16:17:46] addshore: ? [16:18:43] elukey: I think revert, the underlying problem isnt solved [16:19:13] addshore i guess that should be reported upstream? [16:19:21] paladox: yeh [16:19:31] like this https://issues.jenkins-ci.org/browse/JENKINS-20046 one [16:19:37] where do you guys see the problem? It seems fast to me [16:19:37] There are probably simple ways to solve it, only load more list if the user actually scrolls down that far [16:19:41] or just page the stuff [16:20:07] elukey: Yeh it's speedy, but the issue is each ajax request to add more items to the table reparses one file that contains a list of all builds on the host [16:20:12] (03PS1) 10EBernhardson: Update CirrusSearch enwiki MLR model [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386219 [16:20:22] (03PS10) 10Gehel: Add profiles/roles for stats/ML on Wikimedia Cloud [puppet] - 10https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [16:20:31] and I believe the further down the list / table you get the slower it will go / more load on the server [16:20:46] (03PS1) 10Elukey: Revert "ci: jenkins, allow access to computer/.*/builds" [puppet] - 10https://gerrit.wikimedia.org/r/386220 [16:21:33] (03CR) 10Elukey: [C: 032] Revert "ci: jenkins, allow access to computer/.*/builds" [puppet] - 10https://gerrit.wikimedia.org/r/386220 (owner: 10Elukey) 
[16:21:49] (03CR) 10Gehel: [C: 032] Add profiles/roles for stats/ML on Wikimedia Cloud [puppet] - 10https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [16:21:56] (03PS11) 10Gehel: Add profiles/roles for stats/ML on Wikimedia Cloud [puppet] - 10https://gerrit.wikimedia.org/r/383916 (https://phabricator.wikimedia.org/T178096) (owner: 10Bearloga) [16:22:06] (03CR) 10Hoo man: "Not sure I totally grasp this, yet." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/386161 (https://phabricator.wikimedia.org/T178888) (owner: 10ArielGlenn) [16:22:20] elukey: damn, you were faster than me on the +2... [16:22:31] :) [16:22:45] Thanks for trying that, I'll write something in the ticket now! [16:23:00] sure, next time more testing is needed.. this is not really good [16:23:18] reading from the task it seemed that it was at least tested [16:23:45] (03PS1) 10Deskana: Updating wikis with consolidate editing feedback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386222 (https://phabricator.wikimedia.org/T168886) [16:23:54] yup, just maybe hashar didn't spot the ajax requests continuing to trickle out [16:24:01] all right https://integration.wikimedia.org/ci/computer/contint1001/builds is forbidden now [16:25:42] puppet swat completed! [16:27:17] elukey: in fact, yeh, using curl of course we would not have seen the JS ajax requests happening, but we did see that the page loaded quickly!
[16:28:37] :) [16:30:10] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [16:31:10] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.137 port 9042 [16:35:19] 10Operations, 10Traffic: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3707139 (10BBlack) [16:39:18] (03PS1) 10Chad: group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386223 [16:39:20] (03CR) 10Chad: [C: 04-2] group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386223 (owner: 10Chad) [16:39:41] PROBLEM - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.143 and port 9042: Connection refused [16:39:57] !log demon@tin Started scap: bootstrap wmf.5 [16:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:26] godog: I am looking at this alerting rule: [16:40:28] check_graphite_threshold!https://graphite-labs.wikimedia.org!10!$HOSTNOTES$.$HOSTNAME$.puppetagent.failed_events!0!0!10min!0min!1!--over [16:40:41] RECOVERY - cassandra-a CQL 10.192.32.143:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.143 port 9042 [16:40:45] That seems to alert immediately any time puppet fails [16:40:53] what I want is for it to alert if puppet fails twice in a row [16:41:13] Any suggestions? Or guess about if that's possible? 
(I thought it was obvious until I tried it) [16:42:41] PROBLEM - puppetmaster backend https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 404 Not Found [16:42:51] PROBLEM - DPKG on puppetmaster2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:43:31] PROBLEM - puppetmaster https on puppetmaster2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 404 Not Found [16:44:31] RECOVERY - puppetmaster https on puppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.185 second response time [16:50:25] !log upgrading puppet packages on codfw puppetmaster T177254 [16:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:32] T177254: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254 [16:51:38] herron: I'm very interested in whether or not all of the cert bits between clients and puppetmasters fall apart when you do that upgrade. [16:53:47] akosiaris: hey! we have a package builder node in tools that seems to be running into some trouble with the new buster stuff in puppet [16:53:50] https://phabricator.wikimedia.org/T178920 [16:54:08] E: No such script: /usr/share/debootstrap/scripts/buster seems to be the thing that's tripping puppet [16:54:34] I saw that you'd added some stuff recently around this (https://github.com/wikimedia/puppet/commit/3767da034ebfbb4e7c3ca332d7cbd86aa5e170c3), and was wondering if you'd have an idea why [16:55:11] PROBLEM - cassandra-a CQL 10.192.48.49:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.49 and port 9042: Connection refused [16:55:24] akosiaris: may be some manual step we're missing? 
[16:56:11] RECOVERY - cassandra-a CQL 10.192.48.49:9042 on restbase2006 is OK: TCP OK - 0.036 second response time on 10.192.48.49 port 9042 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:01:19] I’ll be updating the ORES service to the next major version. Please fasten your seatbelts. [17:02:39] 10Operations, 10Parsoid, 10Traffic, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3702250 (10Deskana) >>! In T178778#3702293, @PlanetKrypton wrote: > This appears to be the response / request and it's accompanying error > > https:... [17:02:40] * Zppix fastens [17:02:58] (03CR) 10DCausse: [C: 031] "don't forget to upload it to deployment-prep otherwise it'll break some browser tests." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/386219 (owner: 10EBernhardson) [17:03:31] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused [17:04:31] RECOVERY - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.54 port 9042 [17:05:46] (03PS6) 10Ema: VCL: Exp cache admission policy for varnish-fe [puppet] - 10https://gerrit.wikimedia.org/r/386192 (https://phabricator.wikimedia.org/T144187) [17:07:02] !log awight@tin Started deploy [ores/deploy@fb55ab8]: Update ORES to revscoring 2.x, T175180 [17:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:11] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [17:09:00] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused [17:10:00] RECOVERY - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is OK: TCP OK - 0.036 second response time on 10.192.48.56 port 9042 [17:11:26] (03CR) 10Jforrester: [C: 031] Updating wikis with consolidate editing feedback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386222 (https://phabricator.wikimedia.org/T168886) (owner: 10Deskana) [17:12:38] (03CR) 10Elukey: "Failing pcc: https://puppet-compiler.wmflabs.org/compiler02/8450/kafka-jumbo1001.eqiad.wmnet/change.kafka-jumbo1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/384586 (https://phabricator.wikimedia.org/T177216) (owner: 10Ottomata) [17:18:23] !log T178845: Rolling Cassandra restart, eqiad [17:18:28] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3707411 (10debt) p:05High>03Normal [17:18:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:29] T178845: Cassandra cluster name mismatch warnings - https://phabricator.wikimedia.org/T178845 [17:28:01] PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.117 and port 9042: Connection refused [17:28:31] !log demon@tin Finished scap: bootstrap wmf.5 (duration: 48m 34s) [17:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:01] RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.117 port 9042 [17:30:20] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:30:50] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused [17:31:21] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [17:31:50] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:31:50] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:31:50] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.118 port 9042 [17:32:10] RECOVERY - DPKG on puppetmaster2001 is OK: All packages OK [17:32:37] (03PS1) 10Dbarratt: Enable Special:EmailUser User Prohibit on All Wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/386231 (https://phabricator.wikimedia.org/T177319) [17:32:41] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [17:32:41] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [17:32:50] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [17:33:40] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [17:38:50] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:39:00] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [17:39:50] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [17:39:51] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [17:46:11] urandom, all good with rb/cass? should we proceed with our parsoid deploy? [17:47:10] nothing critical to deploy .. so, we can hold on for later / tomorrow .. let us know. [17:47:26] subbu: yeah, all good [17:47:50] arlolra, ^ [17:48:06] ok [17:48:20] (03CR) 10Volans: "if you want to order the imports why not following https://wikitech.wikimedia.org/wiki/PythonStyleGuide ?" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/386206 (owner: 10Chad) [17:48:31] !log arlolra@tin Started deploy [parsoid/deploy@af55c4b]: Updating Parsoid to a3be9cfc [17:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:33] 10Operations, 10Parsoid, 10Traffic, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3707527 (10PlanetKrypton) @Deskana I had plugin temporarily disabled so people didn't try to use it. Try now. [17:55:21] PROBLEM - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.205 and port 9042: Connection refused [17:56:21] RECOVERY - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.205 port 9042 [17:57:47] !log arlolra@tin Finished deploy [parsoid/deploy@af55c4b]: Updating Parsoid to a3be9cfc (duration: 09m 16s) [17:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:50] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused [17:59:50] RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.206 port 9042 [18:05:33] !log Updated Parsoid to a3be9cfc (T96923, T177784, T176568, T25467) [18:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:45] T96923: html2wt should not need access to original source - https://phabricator.wikimedia.org/T96923 [18:05:45] T25467: sig preview not accurate, disregards html-tidy effects - https://phabricator.wikimedia.org/T25467 [18:05:45] T176568: Wrapping a link with causes to appear inside the link - https://phabricator.wikimedia.org/T176568 [18:05:45] T177784: Parsoid uses the wikipedia interwiki prefix for english wikipedia instead of the :en interwiki prefix - 
https://phabricator.wikimedia.org/T177784 [18:06:01] ohboy. I’m trying to deploy ORES, and my SSH connection hung. I hadn’t thought to run inside of screen. [18:06:22] Now I’m locked out of redeploying, and the log directory seems to be empty so I don’t even know the status of my previous deploy. [18:06:59] ah—decoy log directory. I found the log [18:07:56] wat. Are logs not written until after the scap process ends? I’m gonna need some help here. [18:08:19] XioNoX: hi, can I ask for your input on ^ ? [18:09:43] awight: I don't know much about scap, but I can probably find someone who does [18:09:44] (03PS3) 10Dzahn: requesttracker: apache resources vs include [puppet] - 10https://gerrit.wikimedia.org/r/382343 [18:09:55] XioNoX: ty! [18:09:57] twentyafterfour maybe? [18:12:17] (03CR) 10Dzahn: [C: 032] requesttracker: apache resources vs include [puppet] - 10https://gerrit.wikimedia.org/r/382343 (owner: 10Dzahn) [18:12:34] The scap .lock file doesn’t include a PID. [18:12:34] ping no_justification twentyafterfour ^ awight above [18:12:39] ty [18:12:54] I guess I just rm the lockfile and try again?> [18:13:13] I’m disturbed by the lack of a logfile, though. I’d like to know the status of my failed deployment... [18:13:55] do you have access to the hosts where it would be deployed? you can look at /srv/deployment/whatever over there and see if there's a partial or full checkout of the new version [18:14:00] awight: hmm.... [18:14:21] apergos: I could, but that’s scary :) [18:14:41] there are symlinks which indicate the status of each revision deployed [18:14:53] I’m going ahead with forcing a redployment if possible. The log thing is the biggest mystery I’d like help with, at this point. [18:14:53] e.g. an in-progress symlink pointing to the partially-deployed revision [18:15:05] cool. I like the symlinks. 
[18:16:02] current and done symlinks should both point to the latest deployed revision [18:16:25] !log awight@tin Started deploy [ores/deploy@fb55ab8]: Update ORES to revscoring 2.x, T175180 (take 2) [18:16:25] this is all in /srv/deployment/${project}/${repo}-cache/ [18:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:32] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [18:17:39] Shouldn't there be logs in logstash? [18:17:43] the log file should be written out as deploy progresses, not buffered until the end [18:17:48] Same with the file [18:19:22] https://logstash.wikimedia.org/app/kibana#/dashboard/scap [18:20:58] twentyafterfour: Neat, thanks for the link. [18:21:05] there are logs in there for sure [18:21:19] I don't see what went wrong (yet) [18:23:00] PROBLEM - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.135 and port 9042: Connection refused [18:23:53] 10Operations, 10hardware-requests: eqiad: (2) hardware access request for labvirt expansion (labvirt1021 & labvirt1022) - https://phabricator.wikimedia.org/T178937#3707659 (10chasemp) [18:24:00] 10Operations, 10hardware-requests: eqiad: (2) hardware access request for labvirt expansion (labvirt1021 & labvirt1022) - https://phabricator.wikimedia.org/T178937#3707659 (10chasemp) p:05Triage>03Normal [18:24:00] RECOVERY - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.135 port 9042 [18:25:54] twentyafterfour: I think the biggest issue was that my SSH connection broke before the deployment completed. 
Note that there’s no corresponding file in /srv/deployment/ores/deploy/scap/log/ [18:26:00] PROBLEM - ores on ores1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.003 second response time [18:26:01] PROBLEM - ores on ores1005 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time [18:26:11] PROBLEM - ores on ores1007 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time [18:26:23] <_joe_> hey, is this an ores outage in eqiad? [18:26:28] !log awight@tin Finished deploy [ores/deploy@fb55ab8]: Update ORES to revscoring 2.x, T175180 (take 2) (duration: 10m 03s) [18:26:30] PROBLEM - ores on ores1002 is CRITICAL: connect to address 10.64.0.52 and port 8081: Connection refused [18:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:35] uh oh [18:26:36] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [18:26:40] _joe_: no, sorry-- [18:26:41] PROBLEM - ores on ores1003 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 136 bytes in 0.001 second response time [18:26:50] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:26:56] I’m deploying to the current production machines, scb*, and that succeeded. [18:27:02] <_joe_> ok [18:27:13] donno why deploying to the new ores* cluster failed, but those machines are not in production yet. [18:27:19] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/8451/" [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [18:27:32] <_joe_> so what's up with ores1*? I'm around because meeting, if there is no fire someone else can look [18:27:59] There’s no fire but it’s a mystery. 
[18:28:40] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:28:53] ores1001 [18:28:54] INFO [18:28:56] deploy-local [18:28:58] Update submodules [18:29:31] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [18:29:51] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [18:29:53] so the deploy is targeting ores* as well as scb* [18:30:08] yes. I added that so I could push stress testing code to the new cluster. [18:30:23] What I should have done was limit this deploy to the current production cluster, though. [18:30:39] <_joe_> win 24 [18:31:19] "PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded" ? [18:31:43] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [18:31:50] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational [18:36:21] PROBLEM - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.139 and port 9042: Connection refused [18:37:21] RECOVERY - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.139 port 9042 [18:39:30] twentyafterfour: train hasn't gone out yet, right? [18:39:40] PROBLEM - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused [18:40:12] Krinkle: right, and it's no_justification :) [18:40:29] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T1900 [18:40:32] greg-g: Thx. Got a patch meant to roll out with the train slowly. No cherry-pick. 
[18:40:40] RECOVERY - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.140 port 9042 [18:40:46] Just backported to wmf.5, but I guess I don't need to do anything beyond that then. [18:40:56] Missed the cut by a few hours :/ [18:41:10] I started it early today, sorry [18:41:25] If you wanna go ahead and sync-file it out ahead of me doing the wikiversions swap, feel free [18:47:07] !log awight@tin Started deploy [ores/deploy@fb55ab8]: Rollback ORES to Revscoring 1.0, T175180 [18:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:15] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [18:49:40] (03PS1) 10Chad: Stop loading FundraiserLandingPage in beta for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386251 [18:49:42] (03CR) 10Chad: [C: 032] Stop loading FundraiserLandingPage in beta for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386251 (owner: 10Chad) [18:53:50] (03CR) 10Chad: "Yeah, I'm aware of the flake8 issues, already fixed them in scap. With the former: wasn't aware that was a thing even ;-)" [software/conftool] - 10https://gerrit.wikimedia.org/r/386206 (owner: 10Chad) [18:57:10] PROBLEM - puppet last run on puppetcompiler1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:57:20] !log awight@tin Finished deploy [ores/deploy@fb55ab8]: Rollback ORES to Revscoring 1.0, T175180 (duration: 10m 13s) [18:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:28] T175180: Deploy ORES (revscoring 2.0) - https://phabricator.wikimedia.org/T175180 [19:00:04] no_justification: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:00:30] It hasn't happened yet this week. Pick better snark jouncebot [19:03:14] we need an ORES backed jouncebot [19:03:22] heheh [19:03:32] (03CR) 10Chad: [C: 032] group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386223 (owner: 10Chad) [19:03:37] * Amir1 runs away [19:03:53] screaming "ORES is taking over the world" [19:04:18] I do think we need an ORES backed CI (judges git diffs for "will this break things" based on where the codes is and who did it and what their history is etc etc ;) ) [19:04:23] 10Operations, 10cloud-services-team: Reboots of cloud servers - https://phabricator.wikimedia.org/T168445#3707819 (10madhuvishy) [19:04:26] 10Operations, 10DBA, 10cloud-services-team, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3707817 (10madhuvishy) 05Resolved>03Open Reopening since we are scheduling the labsdb1001 and 1003 reboots over the next couple weeks. [19:10:30] (03Merged) 10jenkins-bot: group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386223 (owner: 10Chad) [19:10:39] (03CR) 10jenkins-bot: group0 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/386223 (owner: 10Chad) [19:12:10] RECOVERY - puppet last run on puppetcompiler1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:13:00] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.5 [19:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:55] (03CR) 10Smalyshev: [C: 031] wdqs: add timestamp to GC logs [puppet] - 10https://gerrit.wikimedia.org/r/386132 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [19:33:21] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#3707903 (10herron) Notes and observations from upgrading puppetmaster2001 via 
`apt-get install puppetmaster` puppet packages. 1. The p... [19:36:41] (03PS3) 10Smalyshev: Make using CirrusSearch engine default for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379426 (https://phabricator.wikimedia.org/T175741) [19:41:38] paladox: congrats to a merge after PS 152 ;) [19:41:46] heh [19:42:46] (03PS14) 10Dzahn: gerrit-ssh: don't listen on all interfaces, disable on slaves [puppet] - 10https://gerrit.wikimedia.org/r/354074 [19:43:33] (03CR) 10Dzahn: [C: 032] gerrit-ssh: don't listen on all interfaces, disable on slaves [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [19:46:23] mutante: 152 time a charm xD [19:46:25] jouncebot: next [19:46:25] In 3 hour(s) and 13 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T2300) [19:46:29] Zppix: hehe [19:46:56] !log restarting gerrit on cobalt to apply change 354074 [19:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:54] paladox: Zppix: could you confirm gerrit ssh still works [19:47:57] before: [19:48:13] 0 :::29418 [19:48:15] after: [19:48:17] yep [19:48:20] ssh still works [19:48:22] 208.80.154.85:29418 [19:48:22] great [19:48:29] mutante: sorry on mobile :/ [19:48:31] so it just listens where it should listen now [19:48:36] !log arlolra@tin Started deploy [parsoid/deploy@UNKNOWN]: Enabling logging for skwiki posts [19:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:11] !log arlolra@tin Finished deploy [parsoid/deploy@UNKNOWN]: Enabling logging for skwiki posts (duration: 00m 35s) [19:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:30] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [19:49:45] (03CR) 10Dzahn: "netstat -tulpen | grep 29418" [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) 
[19:49:54] mutante ^^ [19:49:58] [20:49:30] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [19:50:04] lol [19:50:26] this should have happened on gerrit2001 [19:50:37] didnt icinga just catch the right moment? [19:50:54] Hmm [19:51:04] i wonder how is it checking for ssh access on cobalt [19:51:10] is it using the actual ip or a local one? [19:51:22] it claims connect to address 208.80.154.81 a [19:51:27] which is where it runs [19:53:25] 10Operations, 10DBA, 10cloud-services-team, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3707945 (10madhuvishy) Proposed timing for the 2 reboots: labsdb1001: Monday Oct 30 2017, 14:30 UTC (16:30 Madrid, 10:30 EST, 07:30 PT) labsdb1003:... [19:54:07] paladox: actually, nevermind [19:54:13] it does check the wrong IP [19:54:22] things work, monitoring needs fix [19:54:35] will fix it [19:54:53] mutante i know why it broke [19:54:58] it checks cobalt.wikimedia.org:29418 and this confirms my change worked :) [19:54:59] we made it so only the domain works [19:55:06] using an ip will fail [19:55:10] 10Operations, 10DBA, 10cloud-services-team, 10Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3707948 (10Marostegui) Looks good to me! Thanks for getting this arranged [19:55:15] git clone ssh://paladox@208.80.154.81:29418/operations/puppet puppet2 [19:55:15] Cloning into 'puppet2'... 
[19:55:15] ssh: connect to host 208.80.154.81 port 29418: Connection refused
[19:55:18] paladox: they are both IPs, it's just the wrong one
[19:55:22] yes, i know
[19:55:28] ok
[19:55:47] 208.80.154.85 is the right one
[19:56:14] that ip works
[19:56:22] yep, works as designed
[19:56:27] just icinga check is wrong
[19:57:02] ok
[19:57:31] check_ssh_port always uses $HOSTADDRESS no matter what
[19:57:41] i guess the ip needs updating here too https://github.com/wikimedia/puppet/blob/b3c6968b3cb81670b58f375480296aa6813fd354/modules/role/manifests/ci/slave/labs.pp#L63 ?
[20:02:56] (Draft1) Paladox: role::ci::slave::labs: Update gerrit.wikimedia.org ip under host_aliases in ssh keys [puppet] - https://gerrit.wikimedia.org/r/386257
[20:02:59] (PS2) Paladox: role::ci::slave::labs: Update gerrit.wikimedia.org ip under host_aliases in ssh keys [puppet] - https://gerrit.wikimedia.org/r/386257
[20:03:41] (PS3) Paladox: ci::slave::labs: Update gerrit.wm ip under host_aliases in ssh keys [puppet] - https://gerrit.wikimedia.org/r/386257
[20:06:19] (PS1) Dzahn: gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258
[20:08:05] (CR) Paladox: gerrit/icinga: fix check_ssh on a custom (service) IP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/386258 (owner: Dzahn)
[20:08:22] (PS2) Dzahn: gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258
[20:09:13] (PS3) Dzahn: gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258
[20:09:15] (CR) Dzahn: gerrit/icinga: fix check_ssh on a custom (service) IP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/386258 (owner: Dzahn)
[20:10:36] (PS4) Dzahn: gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258
[20:11:17] narf
[20:11:52] (PS5) Dzahn: gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258
[20:12:53] (CR) Dzahn: [C: 1] gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258 (owner: Dzahn)
[20:13:18] Operations, MediaWiki-Authentication-and-authorization, Traffic, Security-Core: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604#3708032 (BBlack) Bump, I'd like to see this happen, it seems like a pretty healthy and cheap layer of prote...
[20:13:46] (CR) Volans: "For the former, it's not "a thing", just some sort of guideline draft/proposal I wrote a while ago for the topic of imports that is not co" [software/conftool] - https://gerrit.wikimedia.org/r/386206 (owner: Chad)
[20:14:29] Operations, DBA, cloud-services-team, Scoring-platform-team (Current): Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3708038 (madhuvishy) Thanks @Marostegui. I've updated the lists, and our wiki here -https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_sh...
[20:16:47] (CR) Dzahn: [C: 2] gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258 (owner: Dzahn)
[20:17:20] paladox: did it break jenkins ? arrg
[20:17:36] 5 PSes, no jenkins-bot
[20:17:43] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/386258 (owner: Dzahn)
[20:17:48] i bet you they also talked to the wrong IP
[20:17:54] hmm https://integration.wikimedia.org/zuul/
[20:18:02] aha
[20:18:14] mutante https://gerrit.wikimedia.org/r/#/c/386257/ ?
[20:18:20] or zuul using the wrong ip
[20:18:23] hasharAway ^^
[20:18:54] isnt that going to be the wrong key then
[20:19:35] maybe not sure
[20:19:43] Anyone think its just zuul maybe needing a kick in the reboot?
[20:19:50] nope :)
[20:20:02] Worth a try no?
[20:20:33] not really
[20:21:03] i think we need to see if it can ssh to gerrit.wikimedia.org
[20:21:04] because it uses it's rest api
[20:21:06] is going to revert it partially but not fully
[20:21:18] that is "keep the part that it's disabled on slaves"
[20:21:25] +1
[20:21:32] but "revert the part that it doesnt listen on both IPs"
[20:21:36] +1 too for now.
[20:24:15] Maybe implement the listen on both ips part when releng is around to fix jenkins if needed mutante ?
[20:24:33] (PS1) Dzahn: gerrit: temp re-enable ssh on all IPs [puppet] - https://gerrit.wikimedia.org/r/386265
[20:24:45] (CR) Paladox: [C: 1] gerrit: temp re-enable ssh on all IPs [puppet] - https://gerrit.wikimedia.org/r/386265 (owner: Dzahn)
[20:24:54] Zppix: the purpose of this whole thing was to NOT listen on both IPs :)
[20:25:08] Thats what i meant
[20:25:58] (CR) Dzahn: [C: 2] gerrit: temp re-enable ssh on all IPs [puppet] - https://gerrit.wikimedia.org/r/386265 (owner: Dzahn)
[20:26:05] (CR) Dzahn: [V: 2 C: 2] gerrit: temp re-enable ssh on all IPs [puppet] - https://gerrit.wikimedia.org/r/386265 (owner: Dzahn)
[20:26:22] the Icinga check change can be merged anyways
[20:26:30] will check the right IP and it will be back on both
[20:26:35] (CR) Dzahn: [V: 2 C: 2] gerrit/icinga: fix check_ssh on a custom (service) IP [puppet] - https://gerrit.wikimedia.org/r/386258 (owner: Dzahn)
[20:27:11] (PS1) RobH: install params for labvirt10[19-20] [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538)
[20:27:53] (PS2) RobH: install params for labvirt10[19-20] [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538)
[20:28:18] robh: i have to restart gerrit, i'll let you merge :)
[20:28:24] heh
[20:28:29] i was wondering why i saw no zuul?
[20:28:35] .. to fix that
[20:28:41] yeah you do yer thing
[20:28:44] ok
[20:28:48] i prefer auto v so i shall wait!
[20:29:04] !log restarting gerrit to make gerrit-ssh listen on all IPs again
[20:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:23] :::29418
[20:29:29] expects recovery of jenkins
[20:29:30] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.9-2-g99a8c8bc51-dirty (SSHD-CORE-1.2.0) (protocol 2.0)
[20:29:39] heh ^^
[20:29:55] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538) (owner: RobH)
[20:30:23] o/
[20:30:25] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/386265 (owner: Dzahn)
[20:30:32] go no_justification
[20:30:34] gr
[20:31:07] (missed a /)
[20:31:07] (CR) jerkins-bot: [V: -1] install params for labvirt10[19-20] [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538) (owner: RobH)
[20:31:09] (CR) Zoranzoki21: [C: 1] Enable Special:EmailUser User Prohibit on All Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/386231 (https://phabricator.wikimedia.org/T177319) (owner: Dbarratt)
[20:31:21] there it goes
[20:31:34] robh: ^
[20:31:35] (CR) Zoranzoki21: [C: 1] Stop loading FundraiserLandingPage in beta for now [mediawiki-config] - https://gerrit.wikimedia.org/r/386251 (owner: Chad)
[20:32:03] (CR) Zoranzoki21: [C: 1] Updating wikis with consolidate editing feedback [mediawiki-config] - https://gerrit.wikimedia.org/r/386222 (https://phabricator.wikimedia.org/T168886) (owner: Deskana)
[20:32:18] mutante: that is IPv4 vs IPv6 isnt it ?
[20:32:32] (CR) Zoranzoki21: [C: 1] Update CirrusSearch enwiki MLR model [mediawiki-config] - https://gerrit.wikimedia.org/r/386219 (owner: EBernhardson)
[20:32:37] I mean, was Gerrit made to only listen on its IPv4 address?
[20:32:42] hasharAway: nope, it's cobalt vs gerrit
[20:32:49] no
[20:32:56] okok :)
[20:32:57] it has 2 IPs
[20:33:01] well, 4 :)
[20:33:12] afaik Zuul/CI etc all uses gerrit.wikimedia.org
[20:33:13] a server name and a service name
[20:33:16] it should use the service name
[20:33:42] Nothing should have been using cobalt.wm.o
[20:33:48] And actually, it won't be able to anymore
[20:33:49] :)
[20:33:51] https://gerrit.wikimedia.org/r/#/c/386257/3/modules/role/manifests/ci/slave/labs.pp
[20:33:58] no_justification: well, i reverted that
[20:34:09] name => 'gerrit.wikimedia.org',
[20:34:09] host_aliases => ['208.80.154.81'],
[20:34:12] ^ that's wrong
[20:34:17] or is it
[20:34:30] no wait, it's not.. anyways it broke when we disabled ssh on cobalt.wm.org
[20:34:38] and it recovered as soon as we brought it back on it
[20:34:44] (PS3) RobH: install params for labvirt10[19-20] [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538)
[20:34:56] bleh
[20:34:57] (PS4) RobH: install params for labvirt10[19-20] [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538)
[20:35:09] stupid regex mistake.
[20:36:00] (CR) RobH: [C: 2] install params for labvirt10[19-20] [puppet] - https://gerrit.wikimedia.org/r/386266 (https://phabricator.wikimedia.org/T172538) (owner: RobH)
[20:36:02] thanks mutante :)
[20:36:46] well, i dont know where it uses 208.80.154.81 but it should use 208.80.154.85
[20:36:52] for now it's back to before though and working
[20:37:11] i fixed the Icinga check already for when we re-revert later
[20:37:24] what i will do now is make it skip the monitoring on slave and re-enable puppet on gerrit2001
[20:37:34] sshd on master will still listen on everything
[20:38:11] i am not sure if paladox' change is the fix, maybe
[20:41:32] heh
[20:42:57] (PS1) Dzahn: gerrit: skip gerrit-ssh monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/386267
[20:43:32] (PS2) Dzahn: gerrit: skip gerrit-ssh monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/386267
[20:43:34] (CR) Paladox: [C: 1] gerrit: skip gerrit-ssh monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/386267 (owner: Dzahn)
[20:44:27] (CR) Dzahn: [C: 2] gerrit: skip gerrit-ssh monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/386267 (owner: Dzahn)
[20:47:29] !log arlolra@tin Started deploy [parsoid/deploy@UNKNOWN]: (no justification provided)
[20:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:42] !log arlolra@tin Finished deploy [parsoid/deploy@UNKNOWN]: (no justification provided) (duration: 00m 13s)
[20:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:57] (PS1) Bearloga: role::discovery: Fix stats/ML classes [puppet] - https://gerrit.wikimedia.org/r/386271 (https://phabricator.wikimedia.org/T178096)
[20:53:21] (CR) Dzahn: [C: 2] Gerrit: move LDAP spaces around for hostnames [puppet] - https://gerrit.wikimedia.org/r/385491 (owner: Chad)
[20:53:25] HI! Who deploy patches for mediawiki/core?
[20:53:26] (PS2) Dzahn: Gerrit: move LDAP spaces around for hostnames [puppet] - https://gerrit.wikimedia.org/r/385491 (owner: Chad)
[20:53:37] jouncebot: next
[20:53:37] In 2 hour(s) and 6 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T2300)
[20:53:55] (PS12) Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - https://gerrit.wikimedia.org/r/366910
[20:53:58] Zoranzoki21: ^ see the nicknames on that link above
[20:54:25] where?
[20:54:37] Zoranzoki21: there is no patch in the evening swat window yet, you can add it. i know last time you missed it by 45 minutes
[20:55:00] Zoranzoki21: "Evening Swat" on Oct 24
[20:55:08] in 2hours 6 min
[20:55:12] It is not for wmf.. I ask for mediawiki software
[20:55:44] oh, ok, then i don't know
[20:55:50] OK
[20:56:02] i thought it's still about your config change from recently
[20:56:02] it happens apart of the weekly trains
[20:56:10] unless its a urgent fix for security etc
[20:56:18] mutante, I think on patch for removing tokipona
[20:56:22] Zoranzoki21: I'm not sure what your question actually is. Please help me understand what you are trying to do with the answer.
[20:56:45] I only want to know who can deploy patches for mediawiki software? mediawiki/core
[20:56:57] where?
[20:57:09] on gerrit
[20:57:35] do you mean merge the changes?
[20:58:09] Add +2 on patches for mediawiki/core
[20:58:13] I think on it
[20:58:18] no, deploy where
[20:58:32] Zoranzoki21: again, what do you want to do with the answer?
[20:58:54] What to do? I only want to know
[20:59:01] Nothing another
[20:59:28] Zoranzoki21: so short answer, probably on the order of 100 people
[20:59:38] at least, in that ballpark
[21:00:04] right
[21:00:10] I ask for me to know
[21:00:51] Zoranzoki21: roughly 100 people, give or take 30
[21:01:10] ok
[21:03:26] wwtf
[21:06:53] (PS5) Dzahn: beta: hieradata for varnish caches [puppet] - https://gerrit.wikimedia.org/r/386077 (https://phabricator.wikimedia.org/T178841) (owner: Hashar)
[21:08:20] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3708233 (RobH) So labvirt1020 is being annoying, it gave me ssl errors trying to pull up https (not cert errors, negotiation errors) and then finall...
[21:09:03] (CR) Dzahn: [C: 2] beta: hieradata for varnish caches [puppet] - https://gerrit.wikimedia.org/r/386077 (https://phabricator.wikimedia.org/T178841) (owner: Hashar)
[21:10:39] Platonides: ?
[21:10:54] the request of Zoranzoki21 to "remove tokipona"
[21:11:21] I'm not sure if he wants to remove support for that language (why?) or simply remove the wikimedia project
[21:12:21] * greg-g shrugs
[21:14:33] either way, if he wants to then he is back to mw-config and SWAT and not in mw-core
[21:14:39] well, not SWAT :)
[21:15:49] and a phabricator Task justifying it
[21:15:59] he may have a point, though
[21:16:51] btw mutante
[21:16:55] I just saw https://gerrit.wikimedia.org/r/#/c/372210/
[21:17:15] and it struck me a bit
[21:17:17] wasn't the PGP key included in that laptop?
[21:17:29] Platonides: remove the language
[21:17:39] but we still have two projects (locked) that use it
[21:18:49] Ok, now I see the change & tasks
[21:18:57] I thought tokipona was deleted, not locked
[21:20:00] Platonides: https://phabricator.wikimedia.org/T178730
[21:20:09] and have a look at the submitters userpage https://www.mediawiki.org/w/index.php?title=User:KATMAKROFAN
[21:20:11] Yeah, they're all in deleted.dblist
[21:20:31] no_justification: I was just going by the task
[21:20:36] Hah, wtf.
[21:21:28] i want to keep the language there just for that userpage alone
[21:21:42] "the WMF is trying to pull another Knowledge Engine and revive the Tokipona wiki while Langcom isn't looking" omg
[21:22:39] To be fair, since Amir filed the original removal task I don't see a big deal in merging heh
[21:22:54] Now that Langcom is on to our original evil plan
[21:23:08] Platonides: no, that's not a revert, it's adding a new key
[21:23:23] oh, the PGP key that i used to sign
[21:24:27] well, 2 layers of encryption AND an additional 1:1 with Ariel was added
[21:25:57] Operations, ops-eqiad, Cloud-Services: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3708284 (RobH)
[21:26:17] chasemp: labvirt1019 is all ready, still working on labvirt1020
[21:26:19] 3, LUKS, truecrypt and passphrase
[21:52:05] mutante, or any other ops familiar with deb uploading .. i need help uploading a new parsoid deb pkg ... it is erroring ... https://gist.githubusercontent.com/subbuss/dcfd2d7f8a3e22c9a1e3b23da81f137a/raw/631fa2a6fbf85d52f844a448c15ed38ed1ea71fe/gistfile1.txt
[22:23:51] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.204 second response time
[22:25:52] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.209 second response time
[22:31:40] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 3 others: Make WDQS throttling more aggressive - https://phabricator.wikimedia.org/T178491#3693708 (Smalyshev) Open>Resolved
[22:43:01] (PS1) Smalyshev: Add types to some fields used by nginx [puppet] - https://gerrit.wikimedia.org/r/386317 (https://phabricator.wikimedia.org/T178530)
[22:50:09] (PS1) Andrew Bogott: git-sync-upstream: rewrite in python [puppet] - https://gerrit.wikimedia.org/r/386318 (https://phabricator.wikimedia.org/T177944)
[22:54:42] Operations, Beta-Cluster-Infrastructure, Traffic, Patch-For-Review: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3708461 (Krenair) Are we okay to close this now? Do we want to look into what caused the initial varnish upgrade?
[23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171024T2300).
[23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:01:21] (PS2) Andrew Bogott: git-sync-upstream: rewrite in python [puppet] - https://gerrit.wikimedia.org/r/386318 (https://phabricator.wikimedia.org/T177944)
[23:34:16] _joe_: did you try running the jobrunner in a terminal when those servers had it running (e.g. grabbing the cmd from ps suffices)?
[23:35:05] (CR) EBernhardson: "seems sane to me. logstash should just magically pick this up when deployed, but probably worth double checking in the elasticsearch api a" [puppet] - https://gerrit.wikimedia.org/r/386317 (https://phabricator.wikimedia.org/T178530) (owner: Smalyshev)
[23:45:30] Operations, Beta-Cluster-Infrastructure, Traffic, Patch-For-Review: Beta cluster is down - https://phabricator.wikimedia.org/T178841#3708577 (greg) That's a good question (re what caused the varnish upgrade) so I guess we should figure that out. The timing seems oddly non-deterministic (from my u...
[23:46:15] (Abandoned) Krinkle: [WIP] Document and automate sources of static/project-logos [mediawiki-config] - https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640) (owner: Krinkle)
[23:51:45] (PS1) Jon Harald Søby: Fixing interwiki sort order for Northern Sami [mediawiki-config] - https://gerrit.wikimedia.org/r/386324 (https://phabricator.wikimedia.org/T178965)