[00:00:49] (03PS1) 10Dzahn: move planet2001 to .codfw. as it should be [dns] - 10https://gerrit.wikimedia.org/r/289112
[00:01:56] (03PS2) 10Dzahn: move planet2001 to .codfw. as it should be [dns] - 10https://gerrit.wikimedia.org/r/289112
[00:02:01] (03CR) 10Dzahn: [C: 032] move planet2001 to .codfw. as it should be [dns] - 10https://gerrit.wikimedia.org/r/289112 (owner: 10Dzahn)
[00:02:04] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:31] !log mw1230, restart hhvm
[00:03:32] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:03:54] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.033 second response time
[00:04:51] !log created planet2001 ganeti VM on ganeti2001
[00:04:53] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get
[00:04:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get
[00:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:05:12] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get
[00:05:13] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get
[00:05:14] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 68590 bytes in 0.575 second response time
[00:05:52] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get
[00:06:02] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead responds with malformed body: NoneType object has no attribute get
[00:07:07] mobrovac: ^
[00:07:46] is that because {domain} and {title} are usually replaced by something
[00:08:15] !log ebernhardson@tin Finished scap: Full scap to sync out WikimediaMessages update (duration: 25m 31s)
[00:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:53] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[00:08:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[00:09:12] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[00:09:13] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[00:09:26] better!
[00:09:40] i guess during scap then
[00:09:44] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[00:10:02] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[00:15:33] (03PS1) 10Legoktm: Undeploy Gather extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289114 (https://phabricator.wikimedia.org/T128568)
[00:30:05] legoktm: can you undeploy shorturl too? :D
[00:30:43] AaronSchulz: one at a time :P urlshortener is waiting on https://gerrit.wikimedia.org/r/#/c/285932/ right now, that's the very last thing
[00:30:55] we need to not break all the shorturl things
[00:31:08] also only extension I ever got deployed :(
[00:31:57] YuviPanda: https://phabricator.wikimedia.org/T107188
[00:39:33] RECOVERY - Disk space on elastic1022 is OK: DISK OK
[00:52:14] !log restarted nova-conductor a few mins ago, no help for nodepool
[00:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:52:23] !log restarted nova-scheduler, let's see if this fixes nodepool
[00:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:54:48] !log restarted rabbitmq on labcontrol1001
[00:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:29] (03PS1) 10Bartosz Dziewoński: Fix filebackend-production.php symlink in NOC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289118
[01:07:51] (03PS2) 10Bartosz Dziewoński: Fix filebackend-production.php symlink in NOC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289118
[01:09:03] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 5 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10czar) I'm getting this error repeatedly with https://commons.wikimedia.o...
[01:10:09] (03PS3) 10Bartosz Dziewoński: Fix filebackend-production.php symlink in NOC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289118
[01:11:03] (03CR) 10Alex Monk: [C: 032] Fix filebackend-production.php symlink in NOC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289118 (owner: 10Bartosz Dziewoński)
[01:12:25] (03Merged) 10jenkins-bot: Fix filebackend-production.php symlink in NOC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289118 (owner: 10Bartosz Dziewoński)
[01:13:40] !log krenair@tin Synchronized docroot/noc: https://gerrit.wikimedia.org/r/#/c/289118/ (duration: 00m 27s)
[01:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:13:51] MatmaRex, ^
[01:15:20] Krenair: thanks
[01:15:38] Krenair: oh hell, it's cached. https://noc.wikimedia.org/conf/highlight.php?file=filebackend-production.php i wonder for how long. (https://noc.wikimedia.org/conf/highlight.php?file=filebackend-production.php& shows okay)
[01:16:18] oooh, it's fine now? not for very long then, i guess. :D
[01:17:08] yeah, wfm
[01:31:01] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2277377 (10BBlack) The last fix P3110 may leave it in a state where it can't renew, but honestly I'm not certain. We can...
[01:35:33] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Transit: NTT (service ID 253065) {#11401} [10Gbps]BR
[01:37:32] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0
[01:44:08] (03CR) 10MZMcBride: Cut down on system calls (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250170 (owner: 10Ori.livneh)
[02:06:55] (03PS2) 10BBlack: varnishxcps: prevent junk injection [puppet] - 10https://gerrit.wikimedia.org/r/289067 (https://phabricator.wikimedia.org/T135227)
[02:07:01] (03CR) 10BBlack: [C: 032 V: 032] varnishxcps: prevent junk injection [puppet] - 10https://gerrit.wikimedia.org/r/289067 (https://phabricator.wikimedia.org/T135227) (owner: 10BBlack)
[02:28:49] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.1) (duration: 10m 33s)
[02:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:36] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 17 02:37:35 UTC 2016 (duration 8m 46s)
[02:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:48:15] (03PS2) 10Cmjohnson: Adding dns entries for mw1284-1306 [dns] - 10https://gerrit.wikimedia.org/r/288984
[02:48:46] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for mw1284-1306 [dns] - 10https://gerrit.wikimedia.org/r/288984 (owner: 10Cmjohnson)
[02:58:23] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2299768 (10Cmjohnson)
[03:44:49] !log Upgraded Grafana from 3.0.0-pre1 to 3.0.2
[03:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:45:00] ori: is back!
[04:26:22] half :)
[05:04:26] /topic status: ori: half-back
[05:05:35] (03PS1) 10Dzahn: ssl: remove toolserver.org cert, uses letsencrypt now [puppet] - 10https://gerrit.wikimedia.org/r/289143 (https://phabricator.wikimedia.org/T134798)
[05:06:15] (03PS2) 10Dzahn: ssl: remove toolserver.org cert, uses letsencrypt now [puppet] - 10https://gerrit.wikimedia.org/r/289143 (https://phabricator.wikimedia.org/T134798)
[05:06:32] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2299896 (10Papaul)
[05:10:22] (03CR) 10Dzahn: "@Nemo_bis things have changed meanwhile. toolserver.org and www.toolserver.org now have new certificates from letsencrypt. is this a reque" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis)
[05:10:42] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T134798" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis)
[05:12:20] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2277377 (10Dzahn) added @Nemo_bis because of the open gerrit change https://gerrit.wikimedia.org/r/#/c/227079/ . this mi...
[05:28:49] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2299933 (10Papaul)
[05:30:17] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2299896 (10Papaul)
[05:33:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:38:03] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79691 MB (15% inode=99%)
[05:55:43] RECOVERY - Disk space on elastic1028 is OK: DISK OK
[06:30:13] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:22] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:42] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:53] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:44] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:48] 06Operations, 10Traffic, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2299986 (10jcrespo)
[06:51:14] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2300014 (10Nemo_bis) >>! In T134798#2299914, @Dzahn wrote: > we should add one more cert for wiki.toolserver.org besides...
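
An aside on the mobileapps endpoints health alerts at 00:04-00:06 above: yes, the {domain} and {title} placeholders in the checked path are substituted with sample values before the request is made. Below is a minimal Python sketch of why such a check reports "NoneType object has no attribute get" — this is not the actual service_checker code; the URL template, sample values, and field name are illustrative assumptions:

    import json
    import urllib.request

    # Illustrative template; the checker fills in {domain} and {title}
    # with sample values before issuing the request.
    TEMPLATE = ('https://mobileapps.svc.eqiad.wmnet'
                '/{domain}/v1/page/mobile-sections-lead/{title}')

    def check():
        url = TEMPLATE.format(domain='en.wikipedia.org', title='San_Francisco')
        raw = urllib.request.urlopen(url, timeout=10).read()
        body = json.loads(raw)
        # body is normally a dict; while the service restarts (e.g. mid-scap)
        # the response can be literally "null", json.loads() then yields None,
        # and the next line raises:
        # AttributeError: 'NoneType' object has no attribute 'get'
        return body.get('sections')

This matches the conversation's conclusion ("i guess during scap then"): the malformed-body error comes from parsing whatever the restarting service returned, not from the placeholders themselves.
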
[06:52:32] 06Operations, 10Traffic, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300016 (10jcrespo)
[06:52:34] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2300015 (10jcrespo)
[06:54:12] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2288166 (10jcrespo)
[06:54:53] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:23] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:56:24] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:57:04] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:34] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:14] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:34] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:22] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:52] (03PS1) 10Jcrespo: Empty db1026 except for vslow, dump [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289147 (https://phabricator.wikimedia.org/T135100)
[07:00:49] (03CR) 10Jcrespo: [C: 032] Empty db1026 except for vslow, dump [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289147 (https://phabricator.wikimedia.org/T135100) (owner: 10Jcrespo)
[07:02:23] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reduce db1026 load (duration: 00m 37s)
[07:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:06:44] (03CR) 10Alexandros Kosiaris: [C: 032] ores: precaching goes down sometimes, making it more verbose [puppet] - 10https://gerrit.wikimedia.org/r/289036 (https://phabricator.wikimedia.org/T135444) (owner: 10Ladsgroup)
[07:06:49] (03PS2) 10Alexandros Kosiaris: ores: precaching goes down sometimes, making it more verbose [puppet] - 10https://gerrit.wikimedia.org/r/289036 (https://phabricator.wikimedia.org/T135444) (owner: 10Ladsgroup)
[07:06:56] (03CR) 10Alexandros Kosiaris: [V: 032] ores: precaching goes down sometimes, making it more verbose [puppet] - 10https://gerrit.wikimedia.org/r/289036 (https://phabricator.wikimedia.org/T135444) (owner: 10Ladsgroup)
[07:08:52] <_joe_> !log restarted hhvm on mw1255, stuck in a deadlock on HPHP::Treadmill::getAgeOldestRequest
[07:08:53] RECOVERY - HHVM rendering on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 68174 bytes in 0.084 second response time
[07:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:09:03] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.031 second response time
[07:14:26] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2300081 (10Joe) a:03Joe
[07:15:02] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:15:36] 06Operations, 10Incident-20150825-Redis, 10Monitoring: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#2300082 (10Joe) a:03Joe
[07:17:08] 06Operations, 10Fundraising-Backlog: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2300084 (10Joe) p:05Triage>03Normal
[07:19:24] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[07:20:40] 06Operations, 10Traffic, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300088 (10jcrespo) There were hundreds of que...
[07:21:12] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2300091 (10Joe)
[07:22:05] 06Operations, 10Traffic, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300092 (10jcrespo) Another problem is that th...
[07:22:27] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2300094 (10Joe) @papaul I need to take some time to think of a transition strategy, I'll let you know as soon as I have time to think.
[07:36:24] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: puppet fail
[07:41:32] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:42:01] (03PS1) 10Ema: Revert "cache_misc: downgrade almost all to varnish3" [puppet] - 10https://gerrit.wikimedia.org/r/289148 (https://phabricator.wikimedia.org/T134989)
[07:51:22] PROBLEM - Apache HTTP on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.021 second response time
[07:53:23] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.034 second response time
[07:54:46] (03CR) 10Ema: [C: 032 V: 032] Revert "cache_misc: downgrade almost all to varnish3" [puppet] - 10https://gerrit.wikimedia.org/r/289148 (https://phabricator.wikimedia.org/T134989) (owner: 10Ema)
[07:56:16] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2300145 (10fgiunchedi) indeed this seems to have happened again, also while talking from -a to -b, the rest s...
[07:56:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:57:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[07:57:33] PROBLEM - HHVM rendering on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.008 second response time
[07:58:03] PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time
[07:58:23] there are a bunch of criticals on mw* hosts, is anybody looking into that?
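
An aside on jcrespo's T135470 comments above (the unbatched DELETE on page_restrictions): the usual remedy for this class of replication lag is to delete in small batches and wait for replicas between batches, rather than issuing one unbounded statement. A minimal Python sketch of the pattern, assuming a PyMySQL-style connection whose execute() returns the affected-row count; the batch size and throttle hook are illustrative, and MediaWiki's actual fix lives in PHP:

    def purge_expired_restrictions(conn, cutoff, batch_size=500):
        """Delete expired rows a small batch at a time instead of all at once."""
        while True:
            cur = conn.cursor()
            deleted = cur.execute(
                "DELETE FROM page_restrictions WHERE pr_expiry < %s LIMIT %s",
                (cutoff, batch_size))
            conn.commit()
            if deleted < batch_size:
                break  # last (partial) batch done
            wait_for_replication(conn)  # hypothetical lag-check/throttle hook

Each small transaction replicates quickly, and the pause between batches gives lagging replicas (the s5-api hosts in the task title) a chance to catch up.
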
[07:59:43] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 68177 bytes in 1.169 second response time
[07:59:59] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2300171 (10akosiaris) >>! In T135176#2299349, @ssastry wrote: > Separately, we should check if puppet has hardcoded references to upstart + whether we need to make any updates to prod...
[08:00:12] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.073 second response time
[08:02:00] yeah I'm looking on oxygen, looks like a spike of 5xx too
[08:03:22] PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50405 bytes in 0.004 second response time
[08:03:43] looks like mostly api to me
[08:04:52] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:05:22] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.298 second response time
[08:05:42] yeah api cluster in eqiad memory usage dropped ema _joe_, looks like hhvm restarted?
[08:07:23] I've tried a bunch of http requests against the hosts that are critical according to icinga and they seem to be doing fine
[08:08:50] (03PS1) 10Jcrespo: Set db1026 back as rc node; move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289149 (https://phabricator.wikimedia.org/T135100)
[08:09:14] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=API+application+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=mem_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[08:09:15] yeah looks like hhvm keeps dying though, I'm looking at syslog on lithium
[08:09:24] really weird
[08:09:43] last respawn was May 17 08:07:40 mw1137 kernel: [5339527.186944] init: hhvm main process ended, respawning
[08:10:17] (03CR) 10Jcrespo: [C: 032] Set db1026 back as rc node; move roles around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289149 (https://phabricator.wikimedia.org/T135100) (owner: 10Jcrespo)
[08:11:41] the mw* alerts are all gone apparently
[08:11:42] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Set db1026 back as rc node; move roles around (duration: 00m 31s)
[08:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:11:57] <_joe_> godog: I did not restart anything, no
[08:13:43] !log upgrading eqiad cache_misc to varnish 4 (T126206, T134989)
[08:13:44] T126206: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206
[08:13:44] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989
[08:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:14:41] !log reducing durability and enabling GTID on db1026
[08:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:15] !log reducing durability and enabling GTID on db1026 T135100
[08:15:15] T135100: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100
[08:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:22] sorry for the spam
[08:17:47] godog: there seems to be a correspondent increase in errors in https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm
[08:18:21] SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_POS_WAIT('db1049-bin.002928', 558808376, 10)
[08:18:56] but probably not related
[08:19:01] elukey, I suppose you want to ping me?
[08:19:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:19:36] a spike would be expected
[08:19:48] yeah that seems to line up with the last sync-file
[08:19:49] jynus: nono sorry I was checking with Filippo the 500s and hhvm restarts
[08:20:00] and reporting, not blaming you :)
[08:21:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:22:31] elukey: yeah I think the lost parent lightprocess exiting is expected
[08:22:48] goooood
[08:23:10] the lag was expected for a few second it should be better now, but feel free to ping me if it happens again
[08:23:53] (03PS1) 10Giuseppe Lavagetto: lvs::monitor: monitor all services via service_checker [puppet] - 10https://gerrit.wikimedia.org/r/289151 (https://phabricator.wikimedia.org/T134551)
[08:24:00] it is the problem with not only deploying taking a minute, but connections not getting notified until they end
[08:24:19] <_joe_> mobrovac: ^^
[08:24:25] which is why conftool will help, but not solve the problem 100%
[08:25:37] is the api out or something?
[08:25:59] <_joe_> mobrovac: uh? why are you saying that?
[08:26:03] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:26:28] I'm on cp1061, please ignore the alert
[08:26:46] <_joe_> mobrovac: we had a series of hhvm crashes between 7:55 and 8:10
[08:26:51] <_joe_> from what I can see
[08:27:30] _joe_: ok, i think i know the culprit
[08:27:46] <_joe_> mobrovac: uh? what is it?
[08:27:57] sec, i'll get you the exact api call that errors out for restbase
[08:28:03] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:28:05] an urwiki page
[08:31:56] !log upgrading codfw cache_misc to varnish 4 (T126206, T134989)
[08:31:57] T126206: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206
[08:31:58] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989
[08:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:33:02] PROBLEM - dhclient process on aqs1004 is CRITICAL: Connection refused by host
[08:33:03] PROBLEM - DPKG on aqs1004 is CRITICAL: Connection refused by host
[08:33:03] PROBLEM - RAID on aqs1004 is CRITICAL: Connection refused by host
[08:33:23] ouch this is me
[08:33:32] the host is not live
[08:33:33] PROBLEM - salt-minion processes on aqs1004 is CRITICAL: Connection refused by host
[08:33:40] will silence and ack
[08:33:43] PROBLEM - configured eth on aqs1004 is CRITICAL: Connection refused by host
[08:33:52] PROBLEM - puppet last run on aqs1004 is CRITICAL: Connection refused by host
[08:34:12] PROBLEM - Disk space on aqs1004 is CRITICAL: Connection refused by host
[08:35:22] ACKNOWLEDGEMENT - DPKG on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:22] ACKNOWLEDGEMENT - Disk space on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:22] ACKNOWLEDGEMENT - NTP on aqs1004 is CRITICAL: NTP CRITICAL: No response from NTP server Elukey Re-imaging, host not live
[08:35:22] ACKNOWLEDGEMENT - RAID on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:22] ACKNOWLEDGEMENT - configured eth on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:23] ACKNOWLEDGEMENT - dhclient process on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:23] ACKNOWLEDGEMENT - puppet last run on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:24] ACKNOWLEDGEMENT - salt-minion processes on aqs1004 is CRITICAL: Connection refused by host Elukey Re-imaging, host not live
[08:35:30] sorry for the spam
[08:39:41] (03PS2) 10Giuseppe Lavagetto: lvs::monitor: monitor all services via service_checker [puppet] - 10https://gerrit.wikimedia.org/r/289151 (https://phabricator.wikimedia.org/T134551)
[08:40:21] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] lvs::monitor: monitor all services via service_checker [puppet] - 10https://gerrit.wikimedia.org/r/289151 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto)
[08:53:19] !log filippo@palladium conftool action : set/pooled=no; selector: mw2050.codfw.wmnet
[08:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:53:42] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2300302 (10elukey) All right 1005 booted after restarts, it might be a problem of md arrays taking too much time to bootstrap? Anyhow, after a chat with @robh we dec...
[08:54:03] !log increased pool stall limit to 500 on db1049
[08:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:55:43] (03PS1) 10Giuseppe Lavagetto: lvs::monitor_services: fix restbase path [puppet] - 10https://gerrit.wikimedia.org/r/289154
[08:55:43] PROBLEM - Restbase LVS eqiad on restbase.svc.codfw.wmnet is CRITICAL: Generic error: paths
[08:55:53] <_joe_> that's fixed by the commit above
[08:56:10] 06Operations, 10DNS, 10Traffic: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Sjoerddebruin)
[08:56:14] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: Generic error: paths
[08:56:27] (03PS2) 10Giuseppe Lavagetto: lvs::monitor_services: fix restbase path [puppet] - 10https://gerrit.wikimedia.org/r/289154
[08:56:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] lvs::monitor_services: fix restbase path [puppet] - 10https://gerrit.wikimedia.org/r/289154 (owner: 10Giuseppe Lavagetto)
[08:58:03] RECOVERY - confd service on cp3007 is OK: OK - confd is active
[09:02:23] !log upgrading esams cache_misc to varnish 4 (T126206, T134989)
[09:02:24] T126206: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206
[09:02:24] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989
[09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:04:54] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:07:38] (03PS1) 10Elukey: Reduce root rad10 in aqs-cassandra-8ssd-2srv.cfg after chat with Rob. [puppet] - 10https://gerrit.wikimedia.org/r/289157 (https://phabricator.wikimedia.org/T133785)
[09:10:00] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:11:09] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:11:19] (03CR) 10Elukey: [C: 032] Reduce root rad10 in aqs-cassandra-8ssd-2srv.cfg after chat with Rob. [puppet] - 10https://gerrit.wikimedia.org/r/289157 (https://phabricator.wikimedia.org/T133785) (owner: 10Elukey)
[09:11:19] RECOVERY - Restbase LVS eqiad on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[09:11:49] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:11:51] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:12:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0
[09:13:09] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:14:57] !log upgrading ulsfo cache_misc to varnish 4 (T131501, T134989)
[09:14:58] T131501: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501
[09:14:58] T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989
[09:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:31:09] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:32:36] 06Operations, 07Puppet, 06Commons, 10Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2300524 (10hashar) On **Jessie** we now have: fonts-gujr, fonts-gujr-extra. Dropping all references of those packages and running puppet yields th...
[09:32:49] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: WDQS empty response - transfer clsoed with 15042 bytes remaining to read - https://phabricator.wikimedia.org/T134989#2300527 (10ema) I've upgraded cache_misc to Varnish 4 again. We're now running a patched version of Varnish (4.1.2-1wm5) inclu...
[09:34:12] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10MediaWiki-extensions-WikibaseClient, and 2 others: File deletion problem on commons.wikimedia.org - https://phabricator.wikimedia.org/T135485#2300532 (10Sjoerddebruin)
[09:43:21] RECOVERY - RAID on aqs1004 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0
[09:43:31] RECOVERY - configured eth on aqs1004 is OK: OK - interfaces up
[09:43:34] --^ this is me running puppet
[09:43:40] RECOVERY - DPKG on aqs1004 is OK: All packages OK
[09:43:49] RECOVERY - dhclient process on aqs1004 is OK: PROCS OK: 0 processes with command name dhclient
[09:44:01] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[09:45:10] RECOVERY - Disk space on aqs1004 is OK: DISK OK
[09:46:52] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2300553 (10hashar) >>! In T133992#2298968, @Dzahn wrote: >> [labnodepool1001:~] $ id thcipriani >> uid=116...
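
An aside on the SlowTimer line at 08:18:21 above: MASTER_POS_WAIT(file, pos, timeout) is a MySQL function that blocks on a replica until it has applied events up to the given master binlog position, or until the timeout (10 seconds here, matching the SlowTimer threshold) expires — so a lagging replica surfaces as a slow query on exactly this statement. A minimal Python sketch, assuming a PyMySQL-style connection; the function name and wrapper are illustrative:

    def wait_for_master_pos(replica_conn, binlog_file, position, timeout=10):
        # Blocks until the replica has applied events up to (file, position)
        # from the master's binlog, or until `timeout` seconds pass.
        cur = replica_conn.cursor()
        cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                    (binlog_file, position, timeout))
        (result,) = cur.fetchone()
        # result: number of events waited for; -1 on timeout;
        # NULL/None if replication is not running.
        return result

    # The slow query from the log corresponds to:
    # wait_for_master_pos(conn, 'db1049-bin.002928', 558808376, 10)

This is consistent with jynus's "the lag was expected for a few second" reply: callers waiting on the replica position simply hit the 10-second ceiling during the brief lag window.
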
[09:56:10] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:58:10] RECOVERY - salt-minion processes on aqs1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:59:42] !log filippo@palladium conftool action : set/pooled=yes; selector: mw2050.codfw.wmnet
[09:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:08:41] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 6 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2300651 (10Steinsplitter) * API request failed (internal_api_error_LocalFileLockErr...
[10:22:40] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:22:53] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service, 13Patch-For-Review: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2300681 (10Joe) All active services are now monitored as mobileapps; we can see in a week how many ma...
[10:25:39] (03PS1) 10Ladsgroup: ores: install aspell-sv [puppet] - 10https://gerrit.wikimedia.org/r/289162 (https://phabricator.wikimedia.org/T131450)
[10:30:34] !log running schema change on s1 T130692
[10:30:35] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692
[10:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:30:56] 06Operations, 10DNS, 10Traffic: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Joe) The dns name is missing for the mobile version; adding it should be enough.
[10:31:09] 06Operations, 10DNS, 10Traffic: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300701 (10Joe) p:05Triage>03Low
[10:36:27] 06Operations, 10DNS, 10Traffic: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300711 (10Joe) I am not sure what is our policy about adding mobile records to the dns, but I am posting a change that should enable this. @bblack is there anything...
[10:39:06] (03PS1) 10Giuseppe Lavagetto: Add arbcom-nl.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/289163 (https://phabricator.wikimedia.org/T135480)
[10:41:47] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300723 (10Joe) a:03Joe
[10:41:49] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:44:56] 06Operations, 10Traffic, 07Performance: Lots of Title::purgeExpiredRestriction from API DELETE FROM `page_restrictions` WHERE (pr_expiry < '20160517063108') without batching/throttling potentially causing lag on s5-api - https://phabricator.wikimedia.org/T135470#2300724 (10Joe) p:05Triage>03High
[10:46:39] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet last ran 2 days ago
[10:48:08] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2300729 (10elukey) All the hosts re-installed and working fine, the only issue seems to be occasionally md arrays not available during boot.
[10:48:40] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[10:53:02] (03PS1) 10Hashar: jenkins: allow unsafe parameters [puppet] - 10https://gerrit.wikimedia.org/r/289165 (https://phabricator.wikimedia.org/T133737)
[10:56:38] (03PS2) 10Jcrespo: Remove (almost) all references to db1027 on production puppet [puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253)
[10:58:06] (03CR) 10Jcrespo: [C: 032] Remove (almost) all references to db1027 on production puppet [puppet] - 10https://gerrit.wikimedia.org/r/288943 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo)
[10:59:09] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=82%)
[10:59:27] !log disabling puppet and starting decom process of db1027
[10:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:01:41] ACKNOWLEDGEMENT - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=82%): Filippo Giunchedi swift race with /srv/swift-storage/sdk1 unmounted
[11:04:18] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2300765 (10Joe) Neither labnet1001 nor labnet1002 have glance logs, so I consider that out of socpe for n...
[11:04:43] !log updated facts on puppet compiler following https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs
[11:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:05:30] RECOVERY - Disk space on ms-be2012 is OK: DISK OK
[11:16:50] (03PS1) 10Jcrespo: Remove db1027 from internal dns entries [dns] - 10https://gerrit.wikimedia.org/r/289168 (https://phabricator.wikimedia.org/T135253)
[11:18:59] 06Operations, 10Traffic, 13Patch-For-Review: varnish.clients graphite metric spammed with jsessionid - https://phabricator.wikimedia.org/T135227#2300792 (10fgiunchedi) 05Open>03Resolved cleaned up junk metrics from graphite1001 and graphite2001
[11:19:38] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2300794 (10jcrespo) @Cmjohnson db1027 has been removed from puppet (sites, dhcp -not netboot, as usual, a range is used-, salt, puppet cert). DNS is above pending, as usual, waiting...
[11:25:13] AaronSchulz, you are my biggest fan, do you know that?
[11:26:50] I meant to write that *I* am your more unconditional fan and you have all my praises
[11:38:29] (03PS1) 10Jcrespo: Add interval parameter, and change the default to 1 beat per second [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289177 (https://phabricator.wikimedia.org/T133337)
[11:38:51] (03CR) 10jenkins-bot: [V: 04-1] Add interval parameter, and change the default to 1 beat per second [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289177 (https://phabricator.wikimedia.org/T133337) (owner: 10Jcrespo)
[11:46:36] (03PS1) 10Jcrespo: Enable heartbeat on all masters, even on the pasive datacenter [puppet] - 10https://gerrit.wikimedia.org/r/289178 (https://phabricator.wikimedia.org/T133337)
[11:48:16] (03PS2) 10Jcrespo: Add interval parameter, and change the default to 1 beat per second [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289177 (https://phabricator.wikimedia.org/T133337)
[11:50:02] (03CR) 10Jcrespo: Enable heartbeat on all masters, even on the pasive datacenter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289178 (https://phabricator.wikimedia.org/T133337) (owner: 10Jcrespo)
[11:54:54] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/2823/ - Puppet compiler output" [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey)
[12:05:00] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2300865 (10elukey) Hit ratio for today (remembering that some hosts got restarted during the last maintenance window): mc1004.eqiad.wmnet: 0.9128341632 mc1013....
[12:15:58] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, 07Elasticsearch: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499#2300880 (10Gehel)
[12:16:25] !log mathoid deployed 10c7cb8
[12:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:16:39] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2300894 (10elukey) Stats for mc1009 before the upgrade to -o slab_reassign,slab_automove,lru_crawler,lru_maintainer: [[https://phab.wmfusercontent.org/file/dat...
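
An aside on the pt-heartbeat changes at 11:38-11:50 above (interval parameter, default of 1 beat per second): the scheme has each master update a timestamp row once per interval, and a replica's lag is then simply now() minus the newest replicated timestamp, so a 1-second beat bounds the measurement granularity to about a second. A minimal Python sketch, assuming a PyMySQL-style connection and a heartbeat.heartbeat table keyed by server_id whose ts column comes back as a datetime — illustrative, not the actual pt-heartbeat code:

    from datetime import datetime

    def replica_lag_seconds(replica_conn, master_server_id):
        cur = replica_conn.cursor()
        cur.execute("SELECT ts FROM heartbeat.heartbeat WHERE server_id = %s",
                    (master_server_id,))
        (ts,) = cur.fetchone()
        # The master refreshes ts once per beat, so the difference is the
        # replication delay plus up to one beat of measurement granularity.
        return (datetime.utcnow() - ts).total_seconds()

Unlike SHOW SLAVE STATUS's Seconds_Behind_Master, this measures end-to-end delay through the actual replication stream, which is why enabling it on masters in both datacenters (the second patch) is useful.
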
[12:17:55] 06Operations, 06Performance-Team, 10Thumbor: Backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2300896 (10faidon) (not sure how familiar you are with all of this, so apologies beforehand if I'm being too verbose/repetitive) We don't //have// to upload them to jessie-backpor...
[12:21:16] !log starting rolling restart of Elasticsearch equiad fro Java update (T135499)
[12:21:17] T135499: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499
[12:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:23:42] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2300906 (10elukey) I am going to merge https://gerrit.wikimedia.org/r/#/c/288951/1 to finally check the full potential of 1.4.25. I will use to measure of compa...
[12:24:48] (03CR) 10Mforns: [C: 04-1] "Just a typo, see comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289007 (https://phabricator.wikimedia.org/T126549) (owner: 10Milimetric)
[12:27:25] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2300913 (10elukey) Summary of evictions (0 matches hosts restarted yesterday): ``` mc1004.eqiad.wmnet:...
[12:33:41] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[12:35:23] PROBLEM - RAID on elastic1032 is CRITICAL: Connection refused by host
[12:35:23] PROBLEM - salt-minion processes on elastic1033 is CRITICAL: Connection refused by host
[12:35:52] PROBLEM - DPKG on elastic1034 is CRITICAL: Connection refused by host
[12:35:52] PROBLEM - configured eth on elastic1035 is CRITICAL: Connection refused by host
[12:36:13] PROBLEM - configured eth on elastic1032 is CRITICAL: Connection refused by host
[12:36:13] PROBLEM - dhclient process on elastic1035 is CRITICAL: Connection refused by host
[12:36:13] PROBLEM - Disk space on elastic1034 is CRITICAL: Connection refused by host
[12:36:42] PROBLEM - dhclient process on elastic1032 is CRITICAL: Connection refused by host
[12:36:42] PROBLEM - puppet last run on elastic1035 is CRITICAL: Connection refused by host
[12:36:53] PROBLEM - puppet last run on elastic1032 is CRITICAL: Connection refused by host
[12:36:53] PROBLEM - RAID on elastic1034 is CRITICAL: Connection refused by host
[12:36:53] PROBLEM - salt-minion processes on elastic1035 is CRITICAL: Connection refused by host
[12:37:03] PROBLEM - salt-minion processes on elastic1032 is CRITICAL: Connection refused by host
[12:37:32] PROBLEM - DPKG on elastic1033 is CRITICAL: Connection refused by host
[12:37:32] PROBLEM - configured eth on elastic1034 is CRITICAL: Connection refused by host
[12:37:52] PROBLEM - Disk space on elastic1033 is CRITICAL: Connection refused by host
[12:37:52] PROBLEM - dhclient process on elastic1034 is CRITICAL: Connection refused by host
[12:38:12] PROBLEM - puppet last run on elastic1034 is CRITICAL: Connection refused by host
[12:38:22] PROBLEM - RAID on elastic1033 is CRITICAL: Connection refused by host
[12:38:22] PROBLEM - salt-minion processes on elastic1034 is CRITICAL: Connection refused by host
[12:38:25] 06Operations, 06Performance-Team, 10Thumbor: Backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2300920 (10Gilles) Most of them are maintained by Marcelo Jorge Vieira who packaged Thumbor (or attempted to) a while ago: https://packages.debian.org/stretch/python-derpconf http...
[12:38:42] PROBLEM - DPKG on elastic1035 is CRITICAL: Connection refused by host
[12:38:53] PROBLEM - configured eth on elastic1033 is CRITICAL: Connection refused by host
[12:38:53] PROBLEM - DPKG on elastic1032 is CRITICAL: Connection refused by host
[12:38:53] PROBLEM - Disk space on elastic1035 is CRITICAL: Connection refused by host
[12:39:12] PROBLEM - Disk space on elastic1032 is CRITICAL: Connection refused by host
[12:39:12] PROBLEM - dhclient process on elastic1033 is CRITICAL: Connection refused by host
[12:39:31] PROBLEM - RAID on elastic1035 is CRITICAL: Connection refused by host
[12:39:31] PROBLEM - puppet last run on elastic1033 is CRITICAL: Connection refused by host
[12:41:00] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2300922 (10elukey) So from the past week I can see: - kafka1012 increased steadily its logsize from 12/05 ~20:00 UTC more o...
[12:47:46] 06Operations, 06Discovery, 03Discovery-Search-Sprint, 07Elasticsearch, 13Patch-For-Review: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2244987 (10Gehel) All nodes restarted (including logstash), documentation updated.
[12:48:23] RECOVERY - configured eth on elastic1032 is OK: OK - interfaces up
[12:48:51] RECOVERY - dhclient process on elastic1032 is OK: PROCS OK: 0 processes with command name dhclient
[12:49:02] RECOVERY - DPKG on elastic1032 is OK: All packages OK
[12:49:12] RECOVERY - salt-minion processes on elastic1032 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:49:21] RECOVERY - Disk space on elastic1032 is OK: DISK OK
[12:49:33] RECOVERY - RAID on elastic1032 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:49:43] RECOVERY - configured eth on elastic1034 is OK: OK - interfaces up
[12:50:02] RECOVERY - dhclient process on elastic1034 is OK: PROCS OK: 0 processes with command name dhclient
[12:50:02] RECOVERY - DPKG on elastic1034 is OK: All packages OK
[12:50:02] RECOVERY - configured eth on elastic1035 is OK: OK - interfaces up
[12:50:23] RECOVERY - Disk space on elastic1034 is OK: DISK OK
[12:50:23] RECOVERY - dhclient process on elastic1035 is OK: PROCS OK: 0 processes with command name dhclient
[12:50:32] RECOVERY - salt-minion processes on elastic1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:50:51] RECOVERY - DPKG on elastic1035 is OK: All packages OK
[12:51:11] RECOVERY - Disk space on elastic1035 is OK: DISK OK
[12:51:12] RECOVERY - salt-minion processes on elastic1035 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:51:12] RECOVERY - RAID on elastic1034 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:51:42] RECOVERY - RAID on elastic1035 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:52:43] ^ elastic alerts above are for new (as yet unconfigured) elasticsearch servers. I'm having a look...
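
An aside on the rolling Elasticsearch restart logged at 12:21 above (T135499): the usual per-node sequence is to disable shard allocation, restart the node, re-enable allocation, and wait for the cluster to return to green before moving on, so the cluster never rebalances around a node that is only briefly down. A minimal sketch against the cluster HTTP API; the host below and the polling interval are illustrative assumptions, not the actual restart tooling:

    import json
    import time
    import urllib.request

    BASE = 'http://elastic1001.eqiad.wmnet:9200'  # illustrative host

    def set_allocation(enable):
        # Disable shard allocation before stopping a node so the cluster
        # does not start rebalancing, then re-enable it afterwards.
        body = json.dumps({'transient': {'cluster.routing.allocation.enable':
                                         'all' if enable else 'none'}}).encode()
        req = urllib.request.Request(BASE + '/_cluster/settings',
                                     data=body, method='PUT')
        urllib.request.urlopen(req)

    def wait_for_green(poll=30):
        while True:
            health = json.loads(
                urllib.request.urlopen(BASE + '/_cluster/health').read())
            if health['status'] == 'green':
                return
            time.sleep(poll)

    # Per node: set_allocation(False); restart elasticsearch on the node;
    # set_allocation(True); wait_for_green(); then move to the next node.
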
[12:54:34] (03CR) 10BBlack: [C: 031] Add arbcom-nl.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/289163 (https://phabricator.wikimedia.org/T135480) (owner: 10Giuseppe Lavagetto)
[12:57:49] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2300952 (10elukey) Distribution of the leaders: ``` elukey@kafka1012:~$ kafka topics --describe | grep Leader | awk '{print...
[12:58:11] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:59:31] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[13:00:00] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300953 (10Peachey88) We should probably do the patchset to do all the arbcom wikis at once or none. Have e checked MobileFront handles security...
[13:03:57] (03PS2) 10Elukey: Add new suggested memcached settings to mc1009 as part of perf experiment. [puppet] - 10https://gerrit.wikimedia.org/r/288951 (https://phabricator.wikimedia.org/T129963)
[13:11:32] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300983 (10BBlack) >>! In T135480#2300953, @Peachey88 wrote: > We should probably do the patchset to do all the arbcom wikis at once or none. Is...
[13:13:21] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[13:13:33] PROBLEM - NTP on elastic1033 is CRITICAL: NTP CRITICAL: No response from NTP server
[13:14:16] (03CR) 10Alexandros Kosiaris: [C: 032] ores: install aspell-sv [puppet] - 10https://gerrit.wikimedia.org/r/289162 (https://phabricator.wikimedia.org/T131450) (owner: 10Ladsgroup)
[13:14:20] (03PS2) 10Alexandros Kosiaris: ores: install aspell-sv [puppet] - 10https://gerrit.wikimedia.org/r/289162 (https://phabricator.wikimedia.org/T131450) (owner: 10Ladsgroup)
[13:14:31] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[13:15:02] RECOVERY - Disk space on elastic1033 is OK: DISK OK
[13:15:33] RECOVERY - RAID on elastic1033 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:15:57] (03CR) 10Elukey: [C: 032] "Puppet compiler looks good https://puppet-compiler.wmflabs.org/2824/" [puppet] - 10https://gerrit.wikimedia.org/r/288951 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey)
[13:16:11] RECOVERY - configured eth on elastic1033 is OK: OK - interfaces up
[13:16:22] RECOVERY - dhclient process on elastic1033 is OK: PROCS OK: 0 processes with command name dhclient
[13:16:42] RECOVERY - salt-minion processes on elastic1033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:16:42] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:16:52] RECOVERY - DPKG on elastic1033 is OK: All packages OK
[13:17:54] 06Operations, 10DBA, 13Patch-For-Review: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2301009 (10jcrespo) 05stalled>03Open a:03jcrespo
[13:20:01] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[13:21:09] (03PS2) 10Filippo Giunchedi: jenkins: allow unsafe parameters [puppet] - 10https://gerrit.wikimedia.org/r/289165 (https://phabricator.wikimedia.org/T133737) (owner: 10Hashar)
[13:21:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] jenkins: allow unsafe parameters [puppet] - 10https://gerrit.wikimedia.org/r/289165 (https://phabricator.wikimedia.org/T133737) (owner: 10Hashar)
[13:22:41] !log memcahced restarted on mc1009 with -o slab_reassign,slab_automove,lru_crawler,lru_maintainer as part of a perf experiment (T129963)
[13:22:42] T129963: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963
[13:22:47] Argument 1 passed to PhabricatorPolicyAwareQuery::setViewer() must be an instance of PhabricatorUser, null given, called in /srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56a0970/phabricator/src/applications/conpherence/editor/ConpherenceEditor.php on line 96 and defined
[13:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:23:11] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2301039 (10Peachey88) >>! In T135480#2300983, @BBlack wrote: > Is MobileFrontend installed for them? Is there other configuration required that'...
[13:24:40] !log bounce carbon/frontend-relay on graphite1001 to increase queue size T135385
[13:24:41] T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385
[13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:26:43] (03PS1) 10BBlack: frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199
[13:27:06] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2301061 (10ssastry) >>! In T135176#2300171, @akosiaris wrote: >>>! In T135176#2299349, @ssastry wrote: >> Separately, we should check if puppet has hardcoded references to upstart + w...
[13:28:38] does stashbot do anything else than linking phabricator things?
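
An aside on the mc1009 experiment above (T129963) and the hit-ratio/eviction figures elukey posts at 12:05 and 12:27: those numbers can be read straight from memcached's plain-text "stats" command, where hit ratio is get_hits / (get_hits + get_misses). A minimal Python sketch over a raw socket; the host is illustrative:

    import socket

    def memcached_stats(host='mc1009.eqiad.wmnet', port=11211):
        s = socket.create_connection((host, port), timeout=5)
        s.sendall(b'stats\r\n')
        data = b''
        while not data.endswith(b'END\r\n'):
            data += s.recv(4096)
        s.close()
        # Response lines look like "STAT get_hits 12345"
        return dict(line.split(' ', 2)[1:]
                    for line in data.decode().splitlines()
                    if line.startswith('STAT '))

    stats = memcached_stats()
    hits, misses = int(stats['get_hits']), int(stats['get_misses'])
    print('hit ratio: %.10f' % (hits / (hits + misses)))
    print('evictions:', stats['evictions'])

The -o options in the !log line (slab_reassign, slab_automove, lru_crawler, lru_maintainer) are memcached 1.4.25 server flags aimed at reducing evictions by reclaiming memory across slab classes and expiring dead items proactively, which is exactly what comparing these counters before and after is meant to show.
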
[13:30:50] (03PS2) 10BBlack: frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199 [13:32:56] !log upgrading Jenkins T133737 [13:32:56] T133737: Upgrade Jenkins from 1.642.3 to 1.651.2 - https://phabricator.wikimedia.org/T133737 [13:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:10] (03PS3) 10Alexandros Kosiaris: ores: install aspell-sv [puppet] - 10https://gerrit.wikimedia.org/r/289162 (https://phabricator.wikimedia.org/T131450) (owner: 10Ladsgroup) [13:33:14] (03CR) 10Alexandros Kosiaris: [V: 032] ores: install aspell-sv [puppet] - 10https://gerrit.wikimedia.org/r/289162 (https://phabricator.wikimedia.org/T131450) (owner: 10Ladsgroup) [13:33:18] (03PS1) 10Gehel: Elasticsearch - use unicast for discovery by default [puppet] - 10https://gerrit.wikimedia.org/r/289202 [13:34:38] RECOVERY - NTP on elastic1033 is OK: NTP OK: Offset 0.01162350178 secs [13:35:17] PROBLEM - dhclient process on elastic1036 is CRITICAL: Connection refused by host [13:35:38] PROBLEM - puppet last run on elastic1036 is CRITICAL: Connection refused by host [13:35:38] PROBLEM - DPKG on elastic1043 is CRITICAL: Connection refused by host [13:35:38] PROBLEM - configured eth on elastic1044 is CRITICAL: Connection refused by host [13:35:58] PROBLEM - salt-minion processes on elastic1036 is CRITICAL: Connection refused by host [13:35:58] PROBLEM - Disk space on elastic1043 is CRITICAL: Connection refused by host [13:35:58] PROBLEM - DPKG on elastic1040 is CRITICAL: Connection refused by host [13:35:58] PROBLEM - dhclient process on elastic1044 is CRITICAL: Connection refused by host [13:36:27] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [13:36:27] PROBLEM - puppet last run on elastic1044 is CRITICAL: Connection refused by host [13:36:27] PROBLEM - Disk space on elastic1040 is CRITICAL: Connection refused by host [13:36:38] PROBLEM - salt-minion processes on elastic1044 is CRITICAL: Connection refused by host [13:36:38] PROBLEM - RAID on elastic1043 is CRITICAL: Connection refused by host [13:36:38] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Puppet has 1 failures [13:36:49] PROBLEM - RAID on elastic1040 is CRITICAL: Connection refused by host [13:37:17] PROBLEM - configured eth on elastic1043 is CRITICAL: Connection refused by host [13:37:28] PROBLEM - configured eth on elastic1040 is CRITICAL: Connection refused by host [13:37:28] PROBLEM - dhclient process on elastic1043 is CRITICAL: Connection refused by host [13:37:48] PROBLEM - dhclient process on elastic1040 is CRITICAL: Connection refused by host [13:37:48] PROBLEM - DPKG on elastic1036 is CRITICAL: Connection refused by host [13:37:48] PROBLEM - puppet last run on elastic1043 is CRITICAL: Connection refused by host [13:38:08] PROBLEM - puppet last run on elastic1040 is CRITICAL: Connection refused by host [13:38:08] PROBLEM - Disk space on elastic1036 is CRITICAL: Connection refused by host [13:38:08] PROBLEM - salt-minion processes on elastic1043 is CRITICAL: Connection refused by host [13:38:17] gehel: ^ [13:38:27] PROBLEM - DPKG on elastic1044 is CRITICAL: Connection refused by host [13:38:27] PROBLEM - salt-minion processes on elastic1040 is CRITICAL: Connection refused by host [13:38:38] PROBLEM - Disk space on elastic1044 is CRITICAL: Connection refused by host [13:38:38] PROBLEM - RAID on elastic1036 is CRITICAL: Connection refused by host [13:39:02] chasemp: thanks! 
seems to be the new elasticsearch servers. I'll silence those as they are not yet configured... [13:39:09] PROBLEM - configured eth on elastic1036 is CRITICAL: Connection refused by host [13:39:10] PROBLEM - RAID on elastic1044 is CRITICAL: Connection refused by host [13:39:10] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 1 failures [13:39:18] RECOVERY - dhclient process on elastic1036 is OK: PROCS OK: 0 processes with command name dhclient [13:39:47] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [13:39:49] RECOVERY - DPKG on elastic1036 is OK: All packages OK [13:40:07] RECOVERY - salt-minion processes on elastic1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:40:08] cmjohnson1: are you by any chance working on those new elasticsearch servers? [13:40:08] RECOVERY - Disk space on elastic1036 is OK: DISK OK [13:40:38] RECOVERY - RAID on elastic1036 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [13:40:40] gehel: yes, all but elastic1045 have been installed [13:41:00] cmjohnson1: Ok, so icinga alerts are the servers coming online? [13:41:09] RECOVERY - configured eth on elastic1036 is OK: OK - interfaces up [13:42:19] gehel: yes that's it...once i complete 1045 then you can do what you want with them [13:42:46] 06Operations, 13Patch-For-Review: Cleanup puppet from unneeded and empty single-service "roots" - https://phabricator.wikimedia.org/T135386#2301094 (10Joe) 05Open>03Resolved [13:42:48] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Allow mobrovac to run puppet on SC(A|B) - https://phabricator.wikimedia.org/T134251#2301095 (10Joe) [13:43:17] cmjohnson1: ok, thanks! I'll let Icinga recover... [13:45:39] (03PS3) 10BBlack: frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199 [13:45:41] (03PS1) 10BBlack: VCL: block 10% insecure post on non-"secure_post" clusters [puppet] - 10https://gerrit.wikimedia.org/r/289205 (https://phabricator.wikimedia.org/T105794) [13:46:34] (03PS4) 10BBlack: frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199 [13:46:59] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:47:17] RECOVERY - RAID on elastic1040 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [13:47:17] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2301097 (10Cmjohnson) [13:47:46] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2242648 (10Cmjohnson) 05Open>03Resolved all servers have been setup and installed and ssh accessible. 
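The silencing gehel mentions is typically done by scheduling Icinga downtime for the new hosts. One way to do that on the monitoring host is Icinga's external-command interface, sketched below; the command-file path is the Debian default, and the host name and 24-hour window are illustrative:

```
# schedule downtime for all services on a not-yet-configured host;
# fields: host;start;end;fixed;trigger_id;duration;author;comment
now=$(date +%s); end=$((now + 86400))
printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;elastic1036;%d;%d;1;0;0;gehel;new host, not in service yet\n' \
  "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd
```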
[13:47:59] RECOVERY - configured eth on elastic1040 is OK: OK - interfaces up [13:48:17] RECOVERY - dhclient process on elastic1040 is OK: PROCS OK: 0 processes with command name dhclient [13:48:26] (03PS5) 10BBlack: frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199 [13:48:27] RECOVERY - DPKG on elastic1040 is OK: All packages OK [13:48:28] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:48:47] RECOVERY - salt-minion processes on elastic1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:48:57] RECOVERY - Disk space on elastic1040 is OK: DISK OK [13:50:23] (03PS6) 10BBlack: frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199 [13:50:31] (03CR) 10BBlack: [C: 032 V: 032] frontend VCL: secure_post now affects most methods [puppet] - 10https://gerrit.wikimedia.org/r/289199 (owner: 10BBlack) [13:51:58] (03PS2) 10BBlack: VCL: block 10% insecure post on non-"secure_post" clusters [puppet] - 10https://gerrit.wikimedia.org/r/289205 (https://phabricator.wikimedia.org/T105794) [13:52:16] (03CR) 10BBlack: [C: 04-2] "Do not submit until 2016-06-12" [puppet] - 10https://gerrit.wikimedia.org/r/289205 (https://phabricator.wikimedia.org/T105794) (owner: 10BBlack) [13:57:38] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:58:01] (03PS1) 10Cmjohnson: Adding mac address for mw1284-1304 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/289206 [13:58:48] (03PS3) 10BBlack: VCL: X-Cache simplification [puppet] - 10https://gerrit.wikimedia.org/r/289015 [13:59:18] (03CR) 10BBlack: [C: 032 V: 032] VCL: X-Cache simplification [puppet] - 10https://gerrit.wikimedia.org/r/289015 (owner: 10BBlack) [13:59:29] (03PS3) 10BBlack: VCL: No X-Cache for PURGE in Varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/289018 [14:00:16] (03CR) 10BBlack: [C: 032 V: 032] VCL: No X-Cache for PURGE in Varnish3 [puppet] - 10https://gerrit.wikimedia.org/r/289018 (owner: 10BBlack) [14:01:32] (03PS2) 10Cmjohnson: Adding mac address for mw1284-1304 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/289206 [14:01:45] (03PS1) 10Andrew Bogott: Glance backup: Omit directory timestamps [puppet] - 10https://gerrit.wikimedia.org/r/289207 (https://phabricator.wikimedia.org/T135463) [14:03:08] (03CR) 10Cmjohnson: [C: 032] Adding mac address for mw1284-1304 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/289206 (owner: 10Cmjohnson) [14:03:24] (03PS2) 10Andrew Bogott: Glance backup: Omit directory timestamps [puppet] - 10https://gerrit.wikimedia.org/r/289207 (https://phabricator.wikimedia.org/T135463) [14:04:22] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10MediaWiki-extensions-WikibaseClient, and 2 others: File deletion problem on commons.wikimedia.org - https://phabricator.wikimedia.org/T135485#2300446 (10Storkk) Identical error is happening for https://commons.wikimedia.org/wiki/File:August_Kierspel.jpg [14:05:16] (03CR) 10Andrew Bogott: [C: 032] Glance backup: Omit directory timestamps [puppet] - 10https://gerrit.wikimedia.org/r/289207 (https://phabricator.wikimedia.org/T135463) (owner: 10Andrew Bogott) [14:05:43] gehel: hey, I'm getting search errors on officewiki [14:05:45] "An error has occurred while searching: Search is currently too busy. Please try again later." 
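Cmjohnson's gerrit 289206 above adds DHCP entries so the new mw hosts can be network-installed. An ISC dhcpd host stanza of that general shape looks like the following, with a placeholder MAC address:

```
host mw1284 {
    hardware ethernet 00:11:22:33:44:55;  # placeholder; the patch carries the real MACs
    fixed-address mw1284.eqiad.wmnet;
}
```

One stanza per host ties the NIC's MAC to a stable address and hostname during the installer's PXE/DHCP phase.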
[14:05:47] RECOVERY - dhclient process on elastic1043 is OK: PROCS OK: 0 processes with command name dhclient [14:05:53] happened twice [14:05:55] (03PS3) 10BBlack: [WIP] varnishxcache [puppet] - 10https://gerrit.wikimedia.org/r/289071 [14:05:57] (03PS1) 10BBlack: VCL: minor syntax bugfix [puppet] - 10https://gerrit.wikimedia.org/r/289209 [14:05:58] RECOVERY - RAID on elastic1043 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:05:59] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:06:01] paravoid: checking... [14:06:17] RECOVERY - salt-minion processes on elastic1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:06:19] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 2 failures [14:06:55] (03PS2) 10BBlack: VCL: minor syntax bugfix [puppet] - 10https://gerrit.wikimedia.org/r/289209 [14:06:57] RECOVERY - DPKG on elastic1043 is OK: All packages OK [14:07:05] (03PS3) 10BBlack: VCL: minor syntax bugfix [puppet] - 10https://gerrit.wikimedia.org/r/289209 [14:07:09] RECOVERY - Disk space on elastic1043 is OK: DISK OK [14:07:12] (03CR) 10BBlack: [C: 032 V: 032] VCL: minor syntax bugfix [puppet] - 10https://gerrit.wikimedia.org/r/289209 (owner: 10BBlack) [14:07:19] RECOVERY - configured eth on elastic1043 is OK: OK - interfaces up [14:07:38] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 2 failures [14:08:18] paravoid: elastic1001 load is fairly high ... having a look [14:09:10] (03PS2) 10Milimetric: Support additional reportupdater directories [puppet] - 10https://gerrit.wikimedia.org/r/289007 (https://phabricator.wikimedia.org/T126549) [14:09:17] RECOVERY - configured eth on elastic1044 is OK: OK - interfaces up [14:09:27] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:09:28] RECOVERY - dhclient process on elastic1044 is OK: PROCS OK: 0 processes with command name dhclient [14:09:49] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:09:57] RECOVERY - Disk space on elastic1044 is OK: DISK OK [14:09:58] PROBLEM - NTP on elastic1043 is CRITICAL: NTP CRITICAL: Offset unknown [14:10:08] RECOVERY - salt-minion processes on elastic1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:10:38] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [14:10:39] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures [14:10:48] RECOVERY - DPKG on elastic1044 is OK: All packages OK [14:11:08] RECOVERY - RAID on elastic1044 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:12:18] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 2 failures [14:12:57] PROBLEM - NTP on elastic1044 is CRITICAL: NTP CRITICAL: Offset unknown [14:13:18] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 2 failures [14:13:59] (03PS3) 10Andrew Bogott: Exchange the addresses of labs-recursor0 and labs-recursor1 [dns] - 10https://gerrit.wikimedia.org/r/289080 (https://phabricator.wikimedia.org/T135447) [14:14:17] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 2 failures [14:14:18] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [14:15:03] (03PS1) 10Filippo Giunchedi: graphite: add multiple clusters per 
carbon-c-relay route [puppet] - 10https://gerrit.wikimedia.org/r/289211 [14:15:05] (03CR) 10Andrew Bogott: [C: 032] Exchange the addresses of labs-recursor0 and labs-recursor1 [dns] - 10https://gerrit.wikimedia.org/r/289080 (https://phabricator.wikimedia.org/T135447) (owner: 10Andrew Bogott) [14:15:54] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: high load on elastic1001 - https://phabricator.wikimedia.org/T135509#2301184 (10Gehel) [14:15:58] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 2 failures [14:15:58] !log restarting elastic1001 - high load (T135509) [14:15:59] T135509: high load on elastic1001 - https://phabricator.wikimedia.org/T135509 [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:08] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 2 failures [14:16:17] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures [14:16:17] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 2 failures [14:16:28] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 2 failures [14:16:38] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 2 failures [14:16:45] those are all lagged indicators, puppet isn't actually failing :P [14:16:52] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2301203 (10Ottomata) The increase in log size correlates to the time at which I set `inter.broker.protocol.version=0.9.0.X`.... [14:16:58] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:17:18] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 2 failures [14:17:28] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:18:04] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2301206 (10Ottomata) 30GB for root should be fine, we do that on many other servers. 
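Ottomata's note above about inter.broker.protocol.version (T121562) refers to Kafka's documented two-phase rolling upgrade: brokers first run the new binaries while still speaking the old inter-broker protocol, then a second rolling bounce switches the wire protocol. In server.properties that is one line flipped between the two phases; the exact version strings follow the release notes for the versions involved:

```
# phase 1: new 0.9 binaries, old wire protocol (first rolling restart)
inter.broker.protocol.version=0.8.2.X
# phase 2: once every broker runs 0.9, switch and bounce again
inter.broker.protocol.version=0.9.0.X
```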
[14:18:59] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:27] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:20:09] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:18] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:20:27] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:27] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:20:28] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:28] (03PS2) 10Filippo Giunchedi: graphite: add multiple clusters per carbon-c-relay route [puppet] - 10https://gerrit.wikimedia.org/r/289211 [14:20:38] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:48] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:58] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:24:28] RECOVERY - NTP on elastic1043 is OK: NTP OK: Offset -0.002328872681 secs [14:29:38] RECOVERY - NTP on elastic1044 is OK: NTP OK: Offset 0.006193637848 secs [14:37:35] (03PS5) 10Elukey: Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) [14:37:42] !log change-prop deploying 5b5a07a3 [14:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:03] (03PS1) 10Jdrewniak: T134512 Bumping portals to master. Updating survey banner. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289218 (https://phabricator.wikimedia.org/T134512) [14:46:22] !log taking elastic1001 down for investigation (T135509) [14:46:23] T135509: high load on elastic1001 - https://phabricator.wikimedia.org/T135509 [14:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:23] "No working slave server: Unknown error"... on silver [14:51:54] (03PS2) 10Jforrester: Enable VisualEditor for IP users on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287646 [14:52:03] (03CR) 10Jforrester: [C: 031] "Due now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287646 (owner: 10Jforrester) [14:58:08] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: high load on elastic1001 - https://phabricator.wikimedia.org/T135509#2301312 (10Gehel) Restarting elastic1001 did reduce its CPU load shortly, but load peaked again and response time as well. elastic1001 taken down for further investig... [14:59:25] !log disabled puppet on aqs* nodes as prep step to bootstrap the new testing cluster (https://gerrit.wikimedia.org/r/#/c/288373/5) [14:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160517T1500). Please do the needful. 
[15:00:04] James_F jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:05] (03CR) 10Elukey: [C: 032] Add a new AQS testing environment to play with Cassandra settings before production. [puppet] - 10https://gerrit.wikimedia.org/r/288373 (https://phabricator.wikimedia.org/T124314) (owner: 10Elukey) [15:00:17] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [15:01:30] (03PS1) 10BBlack: caches: refactor around cache::cluster [puppet] - 10https://gerrit.wikimedia.org/r/289222 [15:01:51] I can SWAT today. James_F jan_drewniak ping me when you're available for SWAT. [15:02:09] * James_F waves. [15:02:41] thcipriani: o/ [15:02:41] hiya [15:03:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287646 (owner: 10Jforrester) [15:03:27] godog: graphite web having issues or just me? [15:03:51] (03Merged) 10jenkins-bot: Enable VisualEditor for IP users on the Japanese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287646 (owner: 10Jforrester) [15:05:17] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [15:05:42] chasemp: aye, I'm seeing "Ext is not defined" on graphite.wikimedia.org on the js console indeed [15:06:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for IP users on the Japanese Wikipedia [[gerrit:287646]] (duration: 00m 35s) [15:06:06] ^ James_F check please [15:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:50] (03CR) 10BBlack: [C: 032] "Compiler no-op on all cache roles" [puppet] - 10https://gerrit.wikimedia.org/r/289222 (owner: 10BBlack) [15:06:51] thcipriani: LGTM. [15:06:58] James_F: thanks [15:08:04] godog: https://github.com/graphite-project/graphite-web/issues/1173 ? [15:08:12] (03PS2) 10Thcipriani: T134512 Bumping portals to master. Updating survey banner. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289218 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [15:08:15] (03PS3) 10Andrew Bogott: Exchange labs-recursor0 and labs-recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/289081 (https://phabricator.wikimedia.org/T135447) [15:08:17] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [15:08:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289218 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [15:09:09] (03Merged) 10jenkins-bot: T134512 Bumping portals to master. Updating survey banner. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/289218 (https://phabricator.wikimedia.org/T134512) (owner: 10Jdrewniak) [15:09:28] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures [15:09:53] (03CR) 10Andrew Bogott: [C: 032] Exchange labs-recursor0 and labs-recursor1 [puppet] - 10https://gerrit.wikimedia.org/r/289081 (https://phabricator.wikimedia.org/T135447) (owner: 10Andrew Bogott) [15:09:57] chasemp: possible, I'm assuming the cache misc upgrade uncovered/triggered it [15:10:07] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:11:13] !log running /srv/mediawiki-staging/portals/sync-portals for [[gerrit:289218]] [15:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:02] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 25s) [15:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:28] !log thcipriani@tin Synchronized portals: (no message) (duration: 00m 25s) [15:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:38] ^ jan_drewniak check please [15:13:36] thcipriani: looks good, thanks! [15:13:46] jan_drewniak: thanks for checking [15:14:18] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2301356 (10BBlack) After running for about a day, cp3048-frontend is now at 163G virtual and 86G resident. I've seen the resident part vary up and down in the ~70-90-ish r... [15:17:22] chasemp: if you force-reload in the browser do you get back all 200s ? [15:17:34] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2301358 (10Ottomata) Ok, I believe that when switching `inter.broker.protocol.version` and bouncing brokers, on startup they... [15:17:47] 06Operations, 10DNS, 10Traffic, 13Patch-For-Review: arbcom-nl.wikipedia.org doesn't have a functioning mobile website - https://phabricator.wikimedia.org/T135480#2300317 (10Krenair) >>! In T135480#2300953, @Peachey88 wrote: > Have we checked MobileFrontend handles security properly for the private wikis?... 
[15:18:12] godog: not sure will try, we may have stepped on each others toes there I was trying to add [15:18:12] Alias /content "/usr/share/graphite-web/static/" [15:18:13] to see if it was a path issue [15:19:02] (03PS1) 10Giuseppe Lavagetto: install-server: fix dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/289225 [15:19:04] chasemp: ah, no haven't touched it [15:19:49] (03PS2) 10Thcipriani: Clean old scap code [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) [15:19:58] it's failing on like https://graphite.wikimedia.org/content/js/ext/resources/css/ext-all.css (still for me even w/ force refresh) [15:20:00] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] install-server: fix dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/289225 (owner: 10Giuseppe Lavagetto) [15:20:24] I saw you had puppet disabled there I undid my change and am letting it be [15:20:30] but yeah, weird [15:20:33] (03PS1) 10Andrew Bogott: Exchange the addresses of labs-recursor0 and labs-recursor1 (reverse dns) [dns] - 10https://gerrit.wikimedia.org/r/289226 [15:21:48] (03PS2) 10Andrew Bogott: Exchange the addresses of labs-recursor0 and labs-recursor1 (reverse dns) [dns] - 10https://gerrit.wikimedia.org/r/289226 (https://phabricator.wikimedia.org/T135447) [15:22:38] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:22:42] chasemp: ack, thanks [15:23:41] (03CR) 10Andrew Bogott: [C: 032] Exchange the addresses of labs-recursor0 and labs-recursor1 (reverse dns) [dns] - 10https://gerrit.wikimedia.org/r/289226 (https://phabricator.wikimedia.org/T135447) (owner: 10Andrew Bogott) [15:25:08] (03PS1) 10Faidon Liambotis: mirrors: mirror Tails as well [puppet] - 10https://gerrit.wikimedia.org/r/289228 [15:26:02] (03CR) 10Faidon Liambotis: [C: 032] mirrors: mirror Tails as well [puppet] - 10https://gerrit.wikimedia.org/r/289228 (owner: 10Faidon Liambotis) [15:26:46] <_joe_> paravoid: :)) [15:26:55] they mailed me asking [15:27:37] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:28] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:20] I am checking, but still haven't done anything on 1002/3, so this must be the recurring issue [15:29:54] (03PS3) 10Thcipriani: Clean old scap code [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) [15:30:50] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10MediaWiki-extensions-WikibaseClient, and 3 others: File deletion problem on commons.wikimedia.org - https://phabricator.wikimedia.org/T135485#2301387 (10Krenair) ```2016-05-17 14:04:04 [Vzsk1ApAIDkAABQShV0AAAAQ] mw1187 commonswiki 1.28.0-wmf.1 exception... 
[15:31:57] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [15:32:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:32:39] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [15:32:48] I don't know how to check --^ but I suspect that it is related to AQS [15:32:54] (graphite) [15:33:57] yeah I think it is [15:34:08] that's the pattern lately anyways, for better or worse [15:34:42] bblack: I am trying to spin up the new SSD based cluster, that will help eventually [15:35:09] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:35:25] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2301405 (10akosiaris) >>! In T135176#2301061, @ssastry wrote: >>>! In T135176#2300171, @akosiaris wrote: >>>>! In T135176#2299349, @ssastry wrote: >>> Separately, we should check if p... [15:36:14] (03CR) 10Thcipriani: "Puppet compiler info: https://puppet-compiler.wmflabs.org/2831/" [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) (owner: 10Thcipriani) [15:36:50] elukey, you can use graphite for non-fine-grained percentages, such as type of cache or datacenter [15:37:10] e.g. https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes [15:37:43] and oxygen for fine-grained logs [15:38:43] have we ever thought of sending maybe a sample of those to ELK, or there were performance/size/security issues? [15:38:52] (03PS1) 10BBlack: Revert "Revert "cache_misc: do not deliver expired cached objects"" [puppet] - 10https://gerrit.wikimedia.org/r/289230 [15:38:55] or just "nobody has tried it yet" [15:39:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:40:14] (03CR) 10Ema: [C: 031] Revert "Revert "cache_misc: do not deliver expired cached objects"" [puppet] - 10https://gerrit.wikimedia.org/r/289230 (owner: 10BBlack) [15:40:43] (03CR) 10BBlack: [C: 032] Revert "Revert "cache_misc: do not deliver expired cached objects"" [puppet] - 10https://gerrit.wikimedia.org/r/289230 (owner: 10BBlack) [15:41:31] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2301413 (10Andrew) [15:41:33] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Get labs-ns0, labs-recursor0, and labservices1001 on the same system, and labs-ns1, labs-recursor1, and holmium on another - https://phabricator.wikimedia.org/T135447#2301412 (10Andrew) 05Open>03Resolved [15:43:30] An error has occurred while searching: Search is currently too busy. Please try again later. [15:43:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:43:57] that sounds serious [15:44:10] SPF|Cloud, which wiki? [15:44:20] wikitech [15:44:42] https://ganglia.wikimedia.org/latest/?c=Elasticsearch%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 network traffic over 3 times [15:45:18] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2301419 (10ssastry) @arlolra had been evaluating the migration to service-runner and this seems like a very good reason to consider that.
At this time, most of the significant blocker... [15:45:48] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: Connection refused [15:45:48] PROBLEM - AQS root url on aqs1004 is CRITICAL: Connection refused [15:45:49] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.107, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:46:09] PROBLEM - Analytics Cassanda CQL query interface on aqs1004 is CRITICAL: Connection refused [15:46:12] PROBLEM - cassandra-a service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:46:37] ---^ this is me, sorry [15:46:39] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: Connection refused [15:46:48] PROBLEM - Analytics Cassanda CQL query interface on aqs1006 is CRITICAL: Connection refused [15:46:50] PROBLEM - Analytics Cassanda CQL query interface on aqs1005 is CRITICAL: Connection refused [15:46:54] PROBLEM - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is CRITICAL: Connection timed out [15:47:08] PROBLEM - cassandra-a service on aqs1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:47:08] PROBLEM - cassandra-b service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:47:08] PROBLEM - cassandra-a service on aqs1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:47:28] PROBLEM - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: Connection refused [15:47:29] PROBLEM - cassandra-b CQL 10.64.48.149:9042 on aqs1006 is CRITICAL: Connection refused [15:47:29] testing environment, didn't know that it would have fired [15:47:29] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:47:30] PROBLEM - cassandra-b service on aqs1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:47:30] PROBLEM - cassandra-b service on aqs1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:47:35] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2301422 (10Cmjohnson) @jcrespo Do you know the raid configuration you want for each of these servers? I am assuming it's H/W raid. I will update during the ILO setup [15:48:20] 06Operations, 10Traffic: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515#2301423 (10fgiunchedi) [15:48:29] chasemp: ^ [15:48:35] jynus: haven't been able to reproduce this on any production wiki, so might not be critical. it would still make sense to know if this is actually expected behaviour though [15:48:41] I can reproduce the issue right now [15:48:53] wikitech or prod? [15:48:58] but on the logs doesn't seem too frequent [15:49:03] okay [15:49:08] so probably wikitech only [15:49:18] PROBLEM - Varnishkafka log producer on cp1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:49:59] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2301456 (10jcrespo) HW RAID 10 with 256K stripe or larger. 
Documented on: https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia [15:50:15] ACKNOWLEDGEMENT - AQS root url on aqs1004 is CRITICAL: Connection refused Elukey Testing environment [15:50:15] ACKNOWLEDGEMENT - Analytics Cassanda CQL query interface on aqs1004 is CRITICAL: Connection refused Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - aqs endpoints health on aqs1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.107, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: Connection refused Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - cassandra-a service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: Connection refused Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - cassandra-b service on aqs1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - AQS root url on aqs1005 is CRITICAL: Connection refused Elukey Testing environment [15:50:18] ACKNOWLEDGEMENT - Analytics Cassanda CQL query interface on aqs1005 is CRITICAL: Connection refused Elukey Testing environment [15:50:20] ACKNOWLEDGEMENT - aqs endpoints health on aqs1005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.138, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Elukey Testing environment [15:50:20] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.189:9042 on aqs1005 is CRITICAL: Connection timed out Elukey Testing environment [15:50:21] ACKNOWLEDGEMENT - cassandra-a service on aqs1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Elukey Testing environment [15:50:21] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.32.190:9042 on aqs1005 is CRITICAL: Connection refused Elukey Testing environment [15:50:21] ACKNOWLEDGEMENT - cassandra-b service on aqs1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed Elukey Testing environment [15:50:22] ACKNOWLEDGEMENT - puppet last run on aqs1005 is CRITICAL: CRITICAL: Puppet has 1 failures Elukey Testing environment [15:50:22] ACKNOWLEDGEMENT - AQS root url on aqs1006 is CRITICAL: Connection refused Elukey Testing environment [15:50:23] ACKNOWLEDGEMENT - Analytics Cassanda CQL query interface on aqs1006 is CRITICAL: Connection refused Elukey Testing environment [15:50:24] ACKNOWLEDGEMENT - aqs endpoints health on aqs1006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.146, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Elukey Testing environment [15:50:24] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.148:9042 on aqs1006 is CRITICAL: Connection refused Elukey Testing environment [15:50:30] (03PS6) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [15:50:42] I know please don't shoot me :) [15:51:14] godog: gotcha, some kind of varnish interaction issue maybe? 
[15:51:29] RECOVERY - cassandra-a service on aqs1005 is OK: OK - cassandra-a is active [15:51:29] RECOVERY - cassandra-b service on aqs1004 is OK: OK - cassandra-b is active [15:51:41] ah I see -traffic :) [15:51:50] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:51:59] RECOVERY - cassandra-b service on aqs1005 is OK: OK - cassandra-b is active [15:52:26] SPF|Cloud, I will ask the owner or release if they have upgraded wikitech recently- I also saw some unusual db errors there [15:54:40] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2301477 (10RobH) We could put in a boot delay option (we used to have a similar issue as this on some models of Dells in the past.) I recall it simply applying to th... [15:55:39] RECOVERY - cassandra-a service on aqs1006 is OK: OK - cassandra-a is active [15:56:08] RECOVERY - cassandra-b service on aqs1006 is OK: OK - cassandra-b is active [15:56:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 656.65 seconds [15:56:50] akosiaris: what are your thoughts on T126629 (upgrading Cassandra on Maps)? I'm mostly trying to determine how to plan [15:56:50] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:57:06] ah, ignore that, the downtime expired, I will put it again for another month [15:57:45] wait, dbstore2002, that is not supposed to happen [16:00:04] jynus: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160517T1600). [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:51] o/ [16:01:41] !log branching wmf/1.28.0-wmf.2 [16:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:53] hi there, I briefly saw it, I think it was cleanup's cleanup if I remember correctly [16:02:06] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 4 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2301488 (10TerraCodes) [16:03:06] jynus: mostly it just removes some scripts from /usr/local/bin that are old cruft. [16:03:34] also removes /srv/deployment/scap since scap is now packaged and installed via apt. [16:03:43] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 5 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10TerraCodes) [16:05:25] 06Operations, 10DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2301491 (10TerraCodes) [16:05:29] to test it'd be best to run on mw1017 first, I can verify there. Then on tin and mira and I can verify. From there it can propagate to the other boxes on its own time without any real impact. [16:05:53] I am ok with it, let me run puppet compiler for formal verification [16:06:02] I assume this applies to tin and mira? [16:06:19] RECOVERY - Varnishkafka log producer on cp1061 is OK: PROCS OK: 1 process with command name varnishkafka [16:06:27] oh, I saw you already did that [16:06:30] thank you [16:06:38] Yeah, tin and mira and app servers. Anything that is a scap target.
[16:06:47] true [16:07:00] scap is also on every mediawiki [16:07:09] indeed. [16:08:13] (03PS4) 10Jcrespo: Clean old scap code [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) (owner: 10Thcipriani) [16:09:17] (03CR) 10Jcrespo: [C: 032] Clean old scap code [puppet] - 10https://gerrit.wikimedia.org/r/288630 (https://phabricator.wikimedia.org/T128386) (owner: 10Thcipriani) [16:10:21] let me run it on tin [16:10:28] ok. [16:10:35] thcipriani: sorry I didn't get to review it :( anyways I'm assuming absenting a directory will dtrt [16:11:30] godog: np. Let's hope :) [16:11:43] and if you have deployments later, I would say leave it like this and do the second commit and just add me as reviewer, do not wait for the next puppet swat [16:12:44] "File[/srv/deployment/scap]: Not removing directory" [16:13:03] I assume it will be removed on a second run [16:13:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 45.85 seconds [16:13:45] or is there something else there? [16:15:12] yes, removed on second run: Notice: /Stage[main]/Scap::Clean/File[/srv/deployment/scap]/ensure: removed [16:16:02] no, sorry, same notice [16:16:40] you want to remove all under there, thcipriani ? [16:16:44] ugh. OK. blerg. This is probably puppet not doing the right thing when you put 'ensure => absent' on a directory. [16:17:06] it suggests the force, is that what you want? [16:17:15] it only does an `rmdir`, not an `rm -f` I think [16:17:21] because it can be fixed very easily [16:17:30] yes, bd808 [16:17:44] jynus: thanks for the graphite links above, just read them after the aqs alerts :) [16:17:45] I thought it was an empty dir [16:17:59] no, it was an old directory of code that shouldn't be used anymore. [16:18:36] so we add force, right? and all of README.rst requirements.txt goes away? [16:18:51] jynus: I'm making a patch now if that's ok [16:18:56] sure [16:19:11] there is a funny /srv/deployment/scap/scap/scap [16:19:20] (not a symlink, actual dirs) [16:19:24] yup [16:19:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:19:55] The trebuchet base dir is /srv/deployment/scap/scap and inside it is the "scap" python module [16:20:04] :-) [16:20:18] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2301534 (10mobrovac) In practice supporting both upstart and systemd can be done by using the `base::service_unit` Puppet define, and AFAIK that is the only thing which would need to...
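As the exchange above establishes, `ensure => absent` on a directory only attempts an rmdir, so a non-empty tree survives with the "Not removing directory" notice; removing it as well requires `force`. A minimal sketch of the shape of the fix (the actual change is the "Use force to clean scap directory" patch that follows):

```
# puppet: force permits destroying a non-empty directory on ensure => absent
file { '/srv/deployment/scap':
    ensure => absent,
    force  => true,
}
```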
So much scap [16:21:58] RECOVERY - cassandra-a service on aqs1004 is OK: OK - cassandra-a is active [16:22:27] (03PS1) 10Thcipriani: Use force to clean scap directory [puppet] - 10https://gerrit.wikimedia.org/r/289234 [16:22:51] (03PS4) 10Dzahn: planet: node regex to cover 2001 in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/289108 (https://phabricator.wikimedia.org/T134507) [16:23:06] ^ jynus that is the patch [16:23:10] !log disable mod_deflate and restart apache2 on graphite1001 T135515 [16:23:11] T135515: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515 [16:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:25] (03CR) 10Dzahn: [C: 032] planet: node regex to cover 2001 in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/289108 (https://phabricator.wikimedia.org/T134507) (owner: 10Dzahn) [16:24:45] (03CR) 10Jcrespo: [C: 032] Use force to clean scap directory [puppet] - 10https://gerrit.wikimedia.org/r/289234 (owner: 10Thcipriani) [16:25:25] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2301539 (10ssastry) >>! In T135176#2301534, @mobrovac wrote: > In practice supporting both upstart and systemd can be done by using the `base::service_unit` Puppet define, and AFAIK t... [16:28:29] Notice: /Stage[main]/Scap::Clean/File[/srv/deployment/scap]/ensure: removed [16:29:03] jynus: \o/ thank you! [16:29:18] just checked on tin, looks good. [16:29:47] so whenever you want, after 1 hours or so, send the next patch to remove the clean up [16:29:58] I may apply it tomorrow, though [16:30:25] jynus: will do, should be fine to leave it until whenever. Thank you for your help :) [16:30:49] well, at least half an hour to run on all hosts [16:30:52] :-) [16:31:06] right :P [16:32:09] (03PS5) 10Dzahn: planet: node regex to cover 2001 in codfw as well [puppet] - 10https://gerrit.wikimedia.org/r/289108 (https://phabricator.wikimedia.org/T134507) [16:34:04] (03PS1) 1020after4: WIP: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [16:37:04] (03PS2) 1020after4: WIP: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [16:38:37] (03PS1) 10Dzahn: dhcp: add MAC for planet2001 [puppet] - 10https://gerrit.wikimedia.org/r/289237 (https://phabricator.wikimedia.org/T134507) [16:39:07] (03PS2) 10Dzahn: dhcp: add MAC for planet2001 [puppet] - 10https://gerrit.wikimedia.org/r/289237 (https://phabricator.wikimedia.org/T134507) [16:39:15] (03CR) 10Dzahn: [C: 032] dhcp: add MAC for planet2001 [puppet] - 10https://gerrit.wikimedia.org/r/289237 (https://phabricator.wikimedia.org/T134507) (owner: 10Dzahn) [16:54:43] (03PS1) 10Elukey: Add myself to the aqs-admin group. [puppet] - 10https://gerrit.wikimedia.org/r/289240 [16:59:11] (03CR) 10Mforns: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/289007 (https://phabricator.wikimedia.org/T126549) (owner: 10Milimetric) [17:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160517T1700). Please do the needful. [17:00:37] (03PS2) 10Elukey: Add myself to the aqs-admin group. [puppet] - 10https://gerrit.wikimedia.org/r/289240 [17:09:32] (03Abandoned) 10Elukey: Add myself to the aqs-admin group. 
[puppet] - 10https://gerrit.wikimedia.org/r/289240 (owner: 10Elukey) [17:27:41] (03CR) 1020after4: WIP: keyholder key cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [17:39:05] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2301872 (10Papaul) p:05Triage>03Normal [17:40:50] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: high load on elastic1001 - https://phabricator.wikimedia.org/T135509#2301874 (10Gehel) Thread dumps from elastic1001 below. Minimal analysis: * no deadlock * ~15 threads merging a segment. Those threads seem to be waiting to be aborted... [17:51:50] (03PS1) 10Faidon Liambotis: horizon: make the "Totp token" field a little more clear [puppet] - 10https://gerrit.wikimedia.org/r/289249 [17:54:43] (03CR) 10Andrew Bogott: [C: 031] "I'm going to hotfix this on labtestweb2001 to make sure there aren't any text alignment issues..." [puppet] - 10https://gerrit.wikimedia.org/r/289249 (owner: 10Faidon Liambotis) [17:54:55] andrewbogott: <3 [17:55:29] andrewbogott: so... is this a standard django form? [17:55:36] I wonder if we could use keystone more widely for our web auth [17:55:49] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:55:51] we could at least do so for all of our django apps, couldn't we [17:56:09] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:56:29] The auth is handled by a subproject called openstack_auth [17:56:40] but it's pretty tiny, essentially just a django form [17:57:16] paravoid: looks fine https://labtesthorizon.wikimedia.org/auth/login/ [17:57:18] I'd like to move the 2fa seed into ldap and get rid of the MW database dependency [17:57:25] Yeah [17:57:40] paravoid: enough things are in flux that it's probably not a good time to standardize on keystone [17:57:48] but a couple of versions down, maybe [17:58:04] (e.g. right now we're using custom code for 2fa, but there's similar code in upstream for version M) [17:58:09] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [17:58:43] (03CR) 10Andrew Bogott: [C: 032] horizon: make the "Totp token" field a little more clear [puppet] - 10https://gerrit.wikimedia.org/r/289249 (owner: 10Faidon Liambotis) [17:58:49] okay [17:58:56] and yes, what bd808 said sounds sensible [17:59:05] Also keystone is insane [17:59:10] and I keep thinking it will be less insane [17:59:13] but so far, not [17:59:15] :( [17:59:22] aha [17:59:30] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:31] I haven't looked closely [17:59:44] Well, really I just have one particular beef, which maybe isn't relevant in this case [17:59:45] the form looked nice, and having a standardized form to do auth and 2FA sounded appealing [17:59:51] but if it's insane, then no :) [18:00:08] also an SSO would be nice [18:00:20] having to login over http basic auth to all those small services is not really great [18:00:30] yeah :/ [18:01:00] The main thing I hate about keystone is that you can't delegate the ability to modify roles [18:01:09] In this case that's probably ops-only anyway [18:01:26] anybody know what SSO methods are easy to integrate with both apache and nginx these days? [18:01:34] bd808: yes [18:01:35] so that's moot, I think.
We'd just have a tenant called 'OpsDashboards' or something and give people roles on that. [18:01:36] none [18:01:48] paravoid: kind of what I thought [18:01:50] on my previous job I was administering a SAML federation [18:01:52] not fun [18:02:30] <_joe_> every time you or alex name that [18:02:35] <_joe_> I feel scared :P [18:02:48] you should be [18:02:58] Shibboleth, most convoluted software I've used, ever [18:03:08] <_joe_> java, I guess [18:03:12] yes [18:03:17] about 15-20 configuration files [18:03:19] in XML, obviously [18:03:19] <_joe_> all java software is convoluted by default [18:03:22] <_joe_> all XML [18:03:25] <_joe_> LOL [18:03:32] but this one is especially complicated [18:03:35] <_joe_> how many config options? [18:03:40] hundreds? [18:03:52] plus free-form turing complete options too [18:04:05] with the option to embed lua, javascript or python in the config on later versions [18:04:26] <_joe_> one of the internal softwares at venere had (last I counted) 230 config options scattered around the codebase, and we had an overrides file of like 180 tunings? [18:04:34] <_joe_> paravoid: oh man [18:04:36] but the protocol (SAML) is complicated too [18:04:44] XMLSec for signed XML assertions [18:04:51] <_joe_> yeah it's like soap done to something more complex than http [18:05:05] https://www.switch.ch/aai/support/tools/wayf/wayf-vs-ds.png [18:05:35] <_joe_> paravoid: it's CORBA-level complexity, seen from 1000 km [18:05:57] http://www.cren.net/crenca/onepagers/guidebook/images/s9.gif [18:06:09] <_joe_> you're probably young enough to never have had to see corba :P [18:06:16] the whole path going through the user's browser [18:06:36] so in later versions the IdP sent signed, encrypted XML assertions to the user [18:06:39] to pass them on to the SP [18:06:47] impossible to debug really [18:07:25] paravoid: something like this -- https://neon1.net/mod_auth_pubtkt/ -- looks like it wouldn't be too hard to make more [18:07:40] yeah, I've seen it [18:07:56] and wasn't very impressed, don't remember why [18:08:16] there is a similar but also different nginx module -- https://github.com/heipei/nginx-sso/ [18:09:38] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:09:59] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:12:05] Wondering what extension is making this huge username list on https://test.wikipedia.org/wiki/Sandbox [18:22:24] Krinkle: that's core I think...? 
action=credits [18:23:39] (03PS1) 10Ema: 4.1.2-1wm6: update 0005-handle-eof-http1.1.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/289254 (https://phabricator.wikimedia.org/T135515) [18:34:39] !log Restart restbase2008-a.codfw.wmnet; Hail Mary pass for failed 2008-b bootstraps : T95253 [18:34:41] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:34:44] (03PS3) 10Giuseppe Lavagetto: Initial debianization [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) [18:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:35:31] (03CR) 10Giuseppe Lavagetto: Initial debianization (0312 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317) (owner: 10Giuseppe Lavagetto) [18:36:07] !log Restarting (failed) bootstrap of restbase2008-b.codfw.wmnet : T95253 [18:36:08] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:05] !log Ran manual SQL write in production to work around T122262; see task for query. [18:37:06] T122262: Improve Flow deletion/undeletion resilience - https://phabricator.wikimedia.org/T122262 [18:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:19] (03CR) 10Ema: [C: 032 V: 032] 4.1.2-1wm6: update 0005-handle-eof-http1.1.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/289254 (https://phabricator.wikimedia.org/T135515) (owner: 10Ema) [18:50:37] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2302503 (10RobH) I'll toss my hat in the ring for wanting to test the service deployment step with our documentation? It has been awhile since I pushed an apache from bare metal to service, so I'd... [18:51:29] (03PS1) 10BBlack: V4 XFF Fixup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/289256 [18:51:31] (03PS1) 10BBlack: V4 XFF Fixup 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/289257 [18:51:33] (03PS1) 10BBlack: V4 XFF Fixup 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/289258 [19:00:04] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160517T1900). Please do the needful. 
[19:01:38] * twentyafterfour will be deploying the train today in place of hashar [19:01:58] !log preparing to deploy wmf/1.28.0-wmf.2 [19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:46] (03PS1) 10TheDJ: Enable experimental videojs player on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 [19:14:56] (03PS16) 10Eevans: [WIP] Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:23:48] (03PS17) 10Eevans: [WIP] Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:31:44] (03CR) 10Eevans: "Patch 17 temporarily configures maps-test2001.codfw.wmnet with cassandra::target_version '2.2', and the puppet-compiler output for that ca" [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:34:12] (03PS18) 10Eevans: Cassandra 2.2.6 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:36:09] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2302694 (10GWicke) [19:37:31] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2302694 (10GWicke) [19:54:21] (03PS1) 10Eevans: add CQL interface and port to descriptors [puppet] - 10https://gerrit.wikimedia.org/r/289264 (https://phabricator.wikimedia.org/T132958) [19:58:08] !log starting upgrade of varnish4 packages on cache_misc [19:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:47] 06Operations, 10Traffic, 13Patch-For-Review: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515#2302786 (10BBlack) p:05Triage>03Normal Misc-cluster updated to 4.1.2-1wm6, patch above seems to fix (along with the mod_deflate disable). We may need further comm... 
[20:09:23] !log finished upgrade of varnish4 packages on cache_misc [20:09:26] !log starting upgrade of varnish4 packages on cache_maps [20:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:59] PROBLEM - Host kafka1013 is DOWN: PING CRITICAL - Packet loss = 100% [20:12:06] (03PS1) 1020after4: group0 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289269 [20:13:46] !log twentyafterfour@tin Started scap: (no message) [20:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:14:27] !log syncing testwiki to wmf/1.28.0-wmf.2 [20:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:14] (03CR) 10BBlack: [C: 032] V4 XFF Fixup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/289256 (owner: 10BBlack) [20:19:05] (03CR) 10BBlack: [C: 032] V4 XFF Fixup 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/289257 (owner: 10BBlack) [20:22:49] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:23:09] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:23:10] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:23:10] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:23:29] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:23:30] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:23:30] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:23:38] what's up with kafka1013? 
[20:23:48] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:23:48] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:23:48] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:23:49] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:23:49] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:23:49] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:23:49] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:23:50] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:23:50] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:23:51] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:23:51] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:23:58] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:23:58] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:23:59] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:23:59] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:08] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:24:08] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:24:08] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:24:08] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:24:19] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:24:20] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:24:20] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:24:20] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:24:20] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:24:20] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:20] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:21] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:24:21] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:22] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:22] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:23] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: 
kafka1013_v4,kafka1013_v6 [20:24:29] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1013_v4,kafka1013_v6 [20:24:29] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 53 not-conn: kafka1013_v4,kafka1013_v6 no-xfrm: cp1054_v6 [20:24:29] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:24:30] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:38] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1013_v4,kafka1013_v6 [20:24:38] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:38] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:38] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:38] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:39] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:39] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:24:40] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:24:40] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:24:41] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:41] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:24:42] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:24:42] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:43] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:24:55] in racadm it's showing poweroff, as if it were intentionally halted [20:25:08] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:25:08] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:25:09] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1013_v4,kafka1013_v6 [20:25:09] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1013_v4,kafka1013_v6 [20:25:09] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1013_v4,kafka1013_v6 [20:25:09] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:25:09] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1013_v4,kafka1013_v6 [20:26:11] ottomata: ? [20:26:29] (03CR) 10Alexandros Kosiaris: [C: 031] Cassandra 2.2.6 config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [20:26:54] assuming crash or hwfail [20:26:59] !log rebooting kafka1013 from racadm [20:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:48] bblack: does racadm getraclog say anything ? 
[20:28:49] akosiaris: yeah [20:28:58] 2x events at "crash" time: [20:28:58] Message = Host System is powering off [20:28:58] FQDD = iDRAC.Embedded.1#HostPowerCtrl [20:29:04] Message = Host System is performing a LPC reset [20:29:04] FQDD = iDRAC.Embedded.1#HostPowerCtrl [20:29:11] both are: Timestamp = 2016-05-17 20:07:57 [20:29:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [10.0] [20:29:41] sel had nothing with a recent timestamp [20:30:18] LPC reset... Low Pin Count bus... interesting ... what on earth ? [20:30:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] [20:30:42] probably a hardware fault [20:30:45] yeah [20:30:58] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [10.0] [20:31:03] I tried to boot it with "powerup", but no console output yet [20:31:03] although it seems it did not actually perform a reset [20:31:13] trying cycle [20:31:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] [20:31:50] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [10.0] [20:32:16] ah, the powerup/cycle fails; I missed it the first time due to screen-clear [20:32:19] ERROR: Timeout while waiting for server to perform requested power action. [20:32:30] great. so hardware error ... [20:32:32] trying racreset... [20:32:44] hiii [20:33:09] ottomata1: kafka1013 is hardware-dead, at least temporarily [20:33:16] I assume replica alerts are due to that [20:33:24] reading [20:33:31] bblack that would make sense [20:33:53] looking [20:33:58] I'm already looking at 1013 [20:34:46] bblack you are trying to reboot it? [20:35:04] ottomata: basically, yes, it got complicated [20:35:08] it may come back up soon [20:35:19] as in "facebook" complicated ?
[20:35:20] :P [20:35:38] yeah it's coming back up [20:35:42] hmmm [20:35:59] may have some minor mdadm/fs issues from crash [20:36:18] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [20:36:18] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [20:36:18] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [20:36:18] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [20:36:18] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [20:36:19] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [20:36:19] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [20:36:19] RECOVERY - Host kafka1013 is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms [20:36:20] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [20:36:20] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [20:36:21] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [20:36:21] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [20:36:22] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [20:36:29] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 28 ESP OK [20:36:29] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 28 ESP OK [20:36:29] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [20:36:30] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [20:36:30] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [20:36:30] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [20:36:30] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [20:36:48] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [20:36:48] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [20:36:48] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [20:36:49] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [20:36:49] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [20:36:49] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 28 ESP OK [20:36:49] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [20:36:50] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [20:36:50] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [20:36:51] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [20:36:51] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [20:36:52] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [20:36:59] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 36 ESP OK [20:36:59] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [20:36:59] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [20:37:00] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [20:37:00] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [20:37:00] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 28 ESP OK [20:37:00] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [20:37:00] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 28 ESP OK [20:37:00] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [20:37:01] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 28 ESP OK [20:37:02] making a ticket, so we have some tracking if there's further failure... 
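(For reference, the iDRAC commands walked through above, as a sketch against the management interface; the mgmt hostname is illustrative, and getsel is the "sel" that had nothing recent:)

    ssh root@kafka1013.mgmt.eqiad.wmnet      # iDRAC management shell
    racadm getraclog                         # RAC log: showed the two power events
    racadm getsel                            # System Event Log: nothing recent here
    racadm serveraction powerstatus          # "showing poweroff"
    racadm serveraction powerup              # timed out, as did powercycle
    racadm racreset                          # last resort: reset the iDRAC itself

    # Post-crash mdadm/filesystem checks for the "minor mdadm/fs issues"
    # mentioned once the host came back (device name illustrative):
    cat /proc/mdstat                         # degraded arrays / resync progress
    mdadm --detail /dev/md0
    dmesg | grep -iE 'md/raid|ext4|xfs'      # journal replay or RAID messages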
[20:37:18] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [20:37:19] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 36 ESP OK [20:37:19] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [20:37:24] (03CR) 10Alexandros Kosiaris: [C: 032] add CQL interface and port to descriptors [puppet] - 10https://gerrit.wikimedia.org/r/289264 (https://phabricator.wikimedia.org/T132958) (owner: 10Eevans) [20:37:29] (03PS2) 10Alexandros Kosiaris: add CQL interface and port to descriptors [puppet] - 10https://gerrit.wikimedia.org/r/289264 (https://phabricator.wikimedia.org/T132958) (owner: 10Eevans) [20:37:35] (03CR) 10Alexandros Kosiaris: [V: 032] add CQL interface and port to descriptors [puppet] - 10https://gerrit.wikimedia.org/r/289264 (https://phabricator.wikimedia.org/T132958) (owner: 10Eevans) [20:37:39] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [20:37:39] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 44 ESP OK [20:37:39] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [20:37:40] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 28 ESP OK [20:37:40] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [20:37:40] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [20:37:40] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [20:37:41] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [20:37:41] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [20:37:42] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [20:37:50] thanks bblack [20:37:58] looking at logs seeing if i can find an easy reason [20:38:00] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [20:38:01] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [20:38:01] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [20:38:01] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [20:38:01] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [20:38:01] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [20:38:01] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [20:38:02] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [20:38:02] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [20:40:56] hmm, not finding anything [20:41:28] !log twentyafterfour@tin Finished scap: (no message) (duration: 27m 42s) [20:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:06] 06Operations, 10ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2303007 (10BBlack) [20:44:19] 06Operations, 10ops-eqiad: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2303026 (10Ottomata) Yeah, I can't find anything wrong either. Kafka disk stuff seems fine, and replicas are syncing now. I will check back on this tomorrow, and if it still looks fine, run a replica-election then... 
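(A sketch of verifying the Kafka recovery above and of the replica election planned in T135557, using the stock Kafka tooling of that era; the ZooKeeper address is an assumption, and production wraps these scripts in its own kafka helper:)

    # Partitions whose in-sync replica set is still short of the full set:
    kafka-topics.sh --zookeeper conf1001.eqiad.wmnet:2181/kafka \
        --describe --under-replicated-partitions

    # Once replicas have caught up, hand leadership back to the preferred
    # brokers -- the "replica-election" referred to in the task:
    kafka-preferred-replica-election.sh \
        --zookeeper conf1001.eqiad.wmnet:2181/kafka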
[20:46:48] (03CR) 1020after4: [C: 032] group0 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289269 (owner: 1020after4) [20:46:53] (03PS2) 10BBlack: V4 XFF Fixup 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/289258 [20:47:06] (03CR) 10BBlack: [C: 032 V: 032] V4 XFF Fixup 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/289258 (owner: 10BBlack) [20:47:24] (03Merged) 10jenkins-bot: group0 to 1.28.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289269 (owner: 1020after4) [20:47:58] (03PS1) 10Cmjohnson: Adding mgmt dns entries for mw1305-6 [dns] - 10https://gerrit.wikimedia.org/r/289309 [20:48:55] 06Operations: Create instrumentation to monitor load on geoiplookup.wikimedia.org - https://phabricator.wikimedia.org/T104258#1411912 (10Krinkle) See also {T100902}. [20:49:01] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for mw1305-6 [dns] - 10https://gerrit.wikimedia.org/r/289309 (owner: 10Cmjohnson) [20:49:22] 06Operations, 10Monitoring: Create instrumentation to monitor load on geoiplookup.wikimedia.org - https://phabricator.wikimedia.org/T104258#2303056 (10Krinkle) [20:51:32] !log twentyafterfour@tin Purged l10n cache for 1.27.0-wmf.20 [20:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:51:51] !log twentyafterfour@tin Purged l10n cache for 1.27.0-wmf.21 [20:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:05] !log twentyafterfour@tin Purged l10n cache for 1.27.0-wmf.22 [20:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:58] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.2 [20:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [20:55:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [20:55:28] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [20:55:58] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [20:56:18] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] [20:56:34] (03CR) 10Brion VIBBER: [C: 031] "There are still some rough edges but the closer an environment to production we can start putting it in, the readier it'll get. :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289261 (owner: 10TheDJ) [20:57:37] thanks for responding bblack, appreciate it [20:59:50] np [21:06:48] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [21:14:45] huh? sync-masters no workie? 
it didn't error out when I ran scap [21:15:40] !log ori@tin Synchronized php-1.28.0-wmf.2/includes/api/ApiStashEdit.php: Id1e0808c: Improve edit stash hit rate for logged-out users (duration: 00m 35s) [21:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:50] !log finished deploying wmf/1.28.0-wmf.2 [21:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:50] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:21:09] !log ori@tin Synchronized php-1.28.0-wmf.1/extensions/NavigationTiming: I62e20087c1: Expand coverage of conformance test (duration: 00m 28s) [21:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:19] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 1 failures [21:29:16] (03PS1) 10Andrew Bogott: Organize Ganglia cluster names for labs [puppet] - 10https://gerrit.wikimedia.org/r/289315 [21:43:05] (03PS1) 10Jdrewniak: T135235 Turning off survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289317 (https://phabricator.wikimedia.org/T135235) [21:45:16] !log stopping rolling restart of elasticsearch cluster for the night (T135499) [21:45:17] T135499: Restart elasticsearch clusters for Java update - https://phabricator.wikimedia.org/T135499 [21:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:38] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:55:49] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Puppet has 1 failures [22:18:03] (03PS7) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [22:19:36] (03PS10) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [22:20:56] (03CR) 10jenkins-bot: [V: 04-1] Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) (owner: 10Ottomata) [22:21:38] (03PS11) 10Ottomata: Druid module and analytics_cluster role class [puppet] - 10https://gerrit.wikimedia.org/r/288099 (https://phabricator.wikimedia.org/T131974) [22:26:20] ACKNOWLEDGEMENT - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Puppet has 1 failures andrew bogott Andrew is in the thick of this [22:40:52] !log ori@tin Synchronized php-1.28.0-wmf.1/includes/api/ApiStashEdit.php: Id1e0808c: Improve edit stash hit rate for logged-out users (duration: 00m 32s) [22:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:05] RoanKattouw ostriches Krenair Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160517T2300). Please do the needful. [23:00:05] legoktm RoanKattouw jan_drewniak Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
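(The "Synchronized ..." entries above come from scap's sync commands on the deploy host; a hedged sketch of the shape of those invocations -- subcommand spellings varied across scap versions, so treat these as illustrative:)

    cd /srv/mediawiki-staging
    sync-file php-1.28.0-wmf.2/includes/api/ApiStashEdit.php \
        'Id1e0808c: Improve edit stash hit rate for logged-out users'
    sync-wikiversions 'group0 to 1.28.0-wmf.2'   # pushes wikiversions to the fleet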
[23:00:08] (03PS1) 10Dereckson: Add Puotal: namespace to jam.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289331 (https://phabricator.wikimedia.org/T135479) [23:00:38] * Dereckson adds * [config] {{Gerrit|289331}} Add Puotal: namespace to jam.wikipedia ({{phabT|135479}}) [23:01:09] I'm here too for jan_drewniak [23:01:24] ori: can we SWAT? [23:01:27] jan_drewniak is also here :P [23:01:30] yes [23:01:31] jam is a fun language :) [23:01:36] mutante: aye [23:01:44] Jizas Krais [23:02:30] Okay I'll SWAT this evening. Let's start with jan_drewniak and debt so. [23:03:11] * RoanKattouw waves [23:03:25] Portal change 5e3f584b9ae4e317e68eec4d234d126dcda2b2ee is already merged, so it's fine to proceed [23:03:31] hello [23:03:55] Hello. [23:04:34] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289317 (https://phabricator.wikimedia.org/T135235) (owner: 10Jdrewniak) [23:05:06] (03Merged) 10jenkins-bot: T135235 Turning off survey banner on wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289317 (https://phabricator.wikimedia.org/T135235) (owner: 10Jdrewniak) [23:05:41] RoanKattouw: for 289268 and 289270, 289268 first? [23:06:08] Either order is fine, but one of them needs a scap [23:07:57] jan_drewniak: 289317 Turning off survey banner on www.wikipedia.org (note: please run the sync-portals script at the root of the repo after deploy) [23:08:12] jan_drewniak: this script takes care of sync-dir? [23:08:33] Dereckson: yup. it syncs the dirs [23:08:36] nice [23:09:45] !log dereckson@tin Synchronized portals/prod/wikipedia.org/assets: Turning off survey banner on www.wikipedia.org (T135235) (duration: 00m 25s) [23:09:46] T135235: Wikipedia.org Portal Survey: turn off banner - https://phabricator.wikimedia.org/T135235 [23:10:12] !log dereckson@tin Synchronized portals: Turning off survey banner on www.wikipedia.org (T135235) (duration: 00m 26s) [23:10:13] T135235: Wikipedia.org Portal Survey: turn off banner - https://phabricator.wikimedia.org/T135235 [23:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:28] debt and jan_drewniak > test please ^ [23:11:17] looks good to me - jan_drewniak ? [23:11:45] looks good :) [23:11:52] Perfect, thanks for testing. [23:11:55] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288435 (https://phabricator.wikimedia.org/T127887) (owner: 10Legoktm) [23:12:31] (03PS2) 10Dereckson: Disable $wgCentralAuthCheckSULMigration functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288435 (https://phabricator.wikimedia.org/T127887) (owner: 10Legoktm) [23:12:40] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288435 (https://phabricator.wikimedia.org/T127887) (owner: 10Legoktm) [23:13:22] (03Merged) 10jenkins-bot: Disable $wgCentralAuthCheckSULMigration functionality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288435 (https://phabricator.wikimedia.org/T127887) (owner: 10Legoktm) [23:13:28] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2303490 (10Arlolra) This [[ https://github.com/wikimedia/service-runner/pull/73 | pull ]] leads me to believe that service-runner has a hard dependency on node v4.x, which kind of put... 
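(On the sync-portals exchange above -- "this script takes care of sync-dir?" "yup" -- the two "Synchronized portals..." log entries suggest a wrapper along these lines; a hypothetical sketch, not the actual script:)

    #!/bin/bash
    # Hypothetical sync-portals: push the per-site assets and then the
    # top-level portals directory via scap's sync-dir.
    msg="${1:-portals update}"
    sync-dir portals/prod/wikipedia.org/assets "$msg"
    sync-dir portals "$msg"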
[23:14:23] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Disable $wgCentralAuthCheckSULMigration functionality (T127887) (duration: 00m 25s) [23:14:25] legoktm: here you are, please test ^ [23:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:48] T127887: Disable ability to log in with old user credentials (pre-SUL finalization) - https://phabricator.wikimedia.org/T127887 [23:14:50] (03PS2) 10Dereckson: Undeploy Gather extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289114 (https://phabricator.wikimedia.org/T128568) (owner: 10Legoktm) [23:17:00] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:39] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:11] legoktm: ? [23:19:57] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2303498 (10GWicke) @Arlolra, if you look at the dates & the discussion you'll notice that a) there is no dependency on node v4.x, and there are no plans to introduce such a dependency... [23:20:16] (03PS1) 10Thcipriani: Remove scap::clean now that scap is clean [puppet] - 10https://gerrit.wikimedia.org/r/289333 [23:21:49] Krenair: I'm still getting timed out on all wmf-wikis intermittently... [23:21:59] Also, ConnectionError: HTTPSConnectionPool(host='commons.wikimedia.org', port=443): Max retries exceeded with url: /w/api.php?inprop=protection&titles=File%3A%281646%29+Oak+Hook-tip+%28Watsonalla+binaria%29+-+Flickr+-+Bennyboymothman.jpg&continue=&format=json&prop=info&meta=userinfo&indexpageids=&action=query&maxlag=5&uiprop=blockinfo%7Chasmsg (Caused by [23:22:00] NewConnectionError(': Failed to establish a new connection [23:22:14] Josve05a: when? [23:22:22] just now, and 15 min ago [23:22:42] okay so after wmf-config/CommonSettings.php: Disable $wgCentralAuthCheckSULMigration functionality [23:22:48] it stops working for ~1-5 min, then works agen, then it can take an hour or two, then happen again [23:23:10] again* [23:23:52] RoanKattouw: I'm going to deploy 289270 first so [23:24:01] OK [23:24:07] That's fine, that one doesn't need a scap [23:24:25] !log dereckson@tin Synchronized php-1.28.0-wmf.2/extensions/Echo/includes/formatters/EchoHtmlEmailFormatter.php: HTML email footer shows raw HTML ([[Gerrit:289270]]) (duration: 00m 31s) [23:24:29] RoanKattouw: please test ^ [23:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:33] (03PS1) 10Dzahn: hiera: add variable for the active datacenter [puppet] - 10https://gerrit.wikimedia.org/r/289334 [23:25:38] ugh...can't access Wikipedia now when I actually need it xD [23:25:44] legoktm: I've deployed your wgCentralAuthCheckSULMigration change and I would need you to confirm you're testing it + could you check the Socket timeout after 10 seconds on mw1207 isn't triggered by your change? [23:27:27] (03PS2) 10Dzahn: hiera: add variable for the active datacenter [puppet] - 10https://gerrit.wikimedia.org/r/289334 [23:27:46] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2303505 (10Arlolra) If you say so, but that's not the impression I got from reading it. In that case, I'll write a patch for T90668 to ease the work here.
[23:28:20] (03PS2) 10Dereckson: Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) [23:28:45] (03PS3) 10Dereckson: Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) [23:28:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) (owner: 10Dereckson) [23:29:20] (03CR) 10jenkins-bot: [V: 04-1] Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) (owner: 10Dereckson) [23:29:48] (03CR) 10jenkins-bot: [V: 04-1] Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) (owner: 10Dereckson) [23:31:01] (03PS4) 10Dereckson: Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) [23:31:42] Dereckson: sorry, testing now [23:31:49] (03CR) 10Dereckson: "PS2: rebased / PS4: added a missing comma from a previous wgCopyUploadsDomains change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) (owner: 10Dereckson) [23:31:53] thanks legoktm [23:32:35] Dereckson: yep, worked fine. [23:32:42] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) (owner: 10Dereckson) [23:33:09] I can log into mw1207 just fine... [23:33:19] (03Merged) 10jenkins-bot: Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287063 (https://phabricator.wikimedia.org/T134472) (owner: 10Dereckson) [23:33:20] hhvm probably just locked up? [23:33:40] 55 parent, LightProcess exiting [23:33:45] probably that [23:34:00] legoktm: for Gather, before the Nemo_bis intervention, there was an agreement to notify communities first [23:34:22] This notification is important because we'll remove user content. They might want to create a subpage on their user page to back that up. [23:34:37] Dereckson: the bug said that they already exported the collections? [23:34:40] Look for example https://en.wikivoyage.org/wiki/Special:Gather/id/51/Myanmar [23:34:44] oh? [23:35:12] https://phabricator.wikimedia.org/T128568#2179500 "@Tgr and I are working on exporting the wikivoyage and hebrew collections first. Please bear with us." [23:35:19] Provide export path for enwiki Gather users [23:35:25] which is T128056 [23:35:25] T128056: Provide export path for enwiki Gather users - https://phabricator.wikimedia.org/T128056 [23:35:27] and then 2 weeks later that it's no longer blocked on them [23:36:30] hmmmm [23:36:32] "Public and private lists whose owners requested migration have all been exported. Those public collections now live on talk pages and private collections were emailed to owners (only those owners who had email attached to their wikipedia account- the vast majority)." [23:36:41] Dereckson: https://phabricator.wikimedia.org/T131063#2219261 is the ticket for hewp and envoy [23:37:09] What about private collections for users without an email address?
[23:37:23] At the top of that bug it says there were no private collections [23:37:45] Also, the data is not being deleted out of the database. So if people really want their collections, we can still get it for them [23:38:22] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2303514 (10Arlolra) [23:40:06] legoktm: https://en.wikivoyage.org/w/index.php?title=Wikivoyage:Travellers%27_pub&offset=&limit=500&action=history [23:40:19] the bug speaks about community notifications [23:40:31] Are you sure en.wikivoyage was notified? [23:41:13] Individual users were. [23:41:45] https://phabricator.wikimedia.org/T131063 doesn't mention any noticeboard messages. [23:41:47] Okay, I'm not going to deploy this. Peachey88 asked for community notification, ori offered to do it in Tech News. This was planned. [23:42:06] But then Nemo_bis says "AFAICS, there is no need to wait a single minute more." [23:42:16] And that killed the notifications. [23:42:31] I suggest we choose a date in the next week, and notify the communities it will be disabled on that date. [23:42:46] Really? [23:42:55] These communities didn't even ask for it to be turned on [23:43:05] And everyone who used the feature was notified that it was going away [23:43:23] Is there any emergency to undeploy the extension today instead of after proper notifications are done according to the initial plan? [23:43:27] No, I said, should we ask for community consensus to disable it (https://phabricator.wikimedia.org/T128568#2079378) [23:43:36] If I'm looking at the same comment [23:43:39] I will personally leave a note on en.woy's village pump after it is removed. [23:43:51] Dereckson: Yes, it should have never existed and should have been removed a month ago. [23:44:16] p858snake|L_: "We should still notify the communities as a whole and allow them to comment as appropriate." [23:44:26] that's the step that was skipped, I think. [23:44:27] I can't do anything about the first part, but I can make sure the second part gets done. [23:44:52] It's not a community decision to keep the extension. It's no longer maintained and abandoned. Hence it shouldn't be deployed anymore [23:45:58] If you feel uncomfortable deploying the patch, I can deploy it myself. [23:47:25] greg-g: RoanKattouw: Krenair: any opinion on this matter? ^ [23:48:17] Dereckson: My Echo HTML email patch is working [23:48:20] Reading backscroll [23:48:35] good news [23:49:19] I've CR+2'd 289268 so Zuul can test that. [23:50:37] I'm happy for the Gather undeploy to go ahead [23:50:46] I am a bit surprised though that it never made it to tech news [23:50:59] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [23:51:34] legoktm: And there's an export thing but does that cover recent changes? Some of the collections on enwikivoyage and hewiki were modified in the last few days [23:52:08] I doubt it? I was expecting that the reading team would have undeployed it shortly after doing the export but they dropped the ball on it... [23:52:10] (Also cc ostriches who is the interim greg-g right now) [23:52:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add tasnimnews.com & khamenei.ir to wgCopyUploadsDomains (duration: 00m 36s) [23:52:51] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
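(The mira "Unmerged changes" alert above names exactly what it checks; a sketch of the equivalent manual inspection, with the directory and ref taken from the alert text:)

    git -C /srv/mediawiki-staging log --oneline HEAD..readonly/master
    # Non-empty output: commits on the active deploy master (readonly/master)
    # not yet merged into this host's checkout; the RECOVERY above means
    # the range drained to empty.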
[23:52:55] Oh as part of the export they sent messages to every Gather user? [23:53:00] Yes [23:53:04] That's enough notification as far as I'm concerned [23:53:18] The only unfortunate part is that they quoted April 21 as the switch-off date which clearly didn't happen [23:53:21] https://en.wikivoyage.org/wiki/User_talk:Jdlrobson#Your_collections_have_been_exported / https://en.wikivoyage.org/wiki/User:Jdlrobson/Gather_lists [23:53:28] (for example) [23:53:35] Huh? [23:53:37] did the modified lists exist before those exports? [23:53:43] Gather users with nonzero collections were notified via MassMessage [23:54:21] modifications made to the collections after the export would be lost, obviously [23:55:00] I thought gather was dead already [23:55:12] ostriches: So did I :( [23:55:15] Kill that shit with fire then. [23:55:22] on enwiki I think there was a survey and only those who asked for it there got exports [23:55:32] ostriches: TL;DR: we need to undeploy Gather, legoktm wants to do it today, initial plan was to notify through Tech News, but that has been forgotten, every user who used the extension has been notified by a message, but private stuff was only sent to users with an email address. No notification to the communities' village pumps. No tech news. We have two choices: undeploy it right now, undeploy it [23:55:38] next week so we have time for notices. [23:55:43] Kill it now. [23:55:46] Okay. [23:56:09] Should've happened ages ago anyway [23:56:10] yeah, just do it [23:56:13] that. [23:56:14] on the small wikis there weren't any private collections IIRC [23:56:17] Thanks all for feedback. [23:56:28] on en it's already disabled anyway, I think? [23:56:40] legoktm: sorry for the delay, but notifications and the risk of information loss were serious concerns to address [23:56:54] Dereckson: no worries, thank you for doing due diligence :) [23:57:10] (03PS3) 10Dereckson: Undeploy Gather extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289114 (https://phabricator.wikimedia.org/T128568) (owner: 10Legoktm) [23:57:54] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289114 (https://phabricator.wikimedia.org/T128568) (owner: 10Legoktm) [23:58:31] (03Merged) 10jenkins-bot: Undeploy Gather extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289114 (https://phabricator.wikimedia.org/T128568) (owner: 10Legoktm) [23:58:42] (03PS1) 10Dzahn: planet: only run updates when in active datacenter [puppet] - 10https://gerrit.wikimedia.org/r/289340 [23:59:54] (03CR) 10jenkins-bot: [V: 04-1] planet: only run updates when in active datacenter [puppet] - 10https://gerrit.wikimedia.org/r/289340 (owner: 10Dzahn)
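(On the two planet/hiera patches just above: the idea is to gate the planet update job on whether the host sits in the active datacenter. Sketched here in shell rather than the actual puppet; the variable names and update command are assumptions:)

    #!/bin/bash
    ACTIVE_DC="eqiad"                                   # would come from the new hiera variable
    MY_DC="$(hostname -f | awk -F. '{print $(NF-1)}')"  # e.g. "codfw" for planet2001.codfw.wmnet
    if [ "$MY_DC" = "$ACTIVE_DC" ]; then
        exec /usr/local/bin/planet-update               # hypothetical update entry point
    fi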