[00:13:05] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:13:22] (PS2) Catrope: Remove individual wikis' config for wgOresModels, use 'default' instead [mediawiki-config] - https://gerrit.wikimedia.org/r/312168
[00:37:31] PROBLEM - cassandra CQL 10.192.16.34:9042 on maps-test2003 is CRITICAL: Connection refused
[00:38:07] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[00:40:41] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:52:25] RECOVERY - cassandra CQL 10.192.16.34:9042 on maps-test2003 is OK: TCP OK - 0.037 second response time on port 9042
[00:52:26] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[00:52:27] PROBLEM - cassandra CQL 10.192.0.129:9042 on maps-test2002 is CRITICAL: Connection refused
[00:53:05] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[01:05:13] PROBLEM - cassandra CQL 10.192.16.34:9042 on maps-test2003 is CRITICAL: Connection refused
[01:05:57] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[01:39:30] PROBLEM - kartotherian endpoints health on maps-test2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.34, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[01:39:30] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200):
[01:39:31] PROBLEM - kartotherian endpoints health on maps-test2004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.35, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[01:39:31] PROBLEM - kartotherian endpoints health on maps-test2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.129, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[01:39:38] PROBLEM - cassandra CQL 10.192.16.35:9042 on maps-test2004 is CRITICAL: Connection refused
[01:40:28] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[01:40:29] PROBLEM - tilerator on maps-test2004 is CRITICAL: Connection refused
[01:40:30] PROBLEM - tilerator on maps-test2002 is CRITICAL: Connection refused
[01:41:00] PROBLEM - tilerator on maps-test2003 is CRITICAL: Connection refused
[01:47:46] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:00:15] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[02:07:56] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[02:12:36] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:15:40] (PS1) 20after4: remove reference to undefined wmgMFUseCentralAuthToken [mediawiki-config] - https://gerrit.wikimedia.org/r/313346
[02:24:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:26:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:28:01] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 12m 42s)
[02:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:59] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[02:32:48] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 29 02:32:47 UTC 2016 (duration 4m 46s)
[02:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:42] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[02:37:34] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[02:41:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:51:14] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:16:08] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:24:07] (CR) Krinkle: [C: 1] Remove individual wikis' config for wgOresModels, use 'default' instead [mediawiki-config] - https://gerrit.wikimedia.org/r/312168 (owner: Catrope)
[04:50:54] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:58:29] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:00:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0]
[05:05:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[05:15:32] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:25:27] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:00:43] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Rocket Surgery 2016, and 4 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2676251 (Nikerabbit) I saw this on SoS but nothing seems to point towards Translate extension. @aar...
[06:04:40] (PS2) Hashar: rpc: trick mw into generating a raw exception report [mediawiki-config] - https://gerrit.wikimedia.org/r/312077
[06:10:23] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active
[06:17:45] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[06:34:59] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:45:04] Operations, DNS, Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2666339 (Dzahn) > Is it possible to come to a conclusion on what to do with such domain names, should all current uses be removed or can the redirection functionality of the domai...
[06:46:46] Operations, DNS, Traffic: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619#2676264 (Dzahn) Note that almost a full year has passed since this has been deactivated.
[06:49:57] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active
[06:50:53] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.037 second response time on port 9042
[06:51:12] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:34] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:29:44] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[07:36:59] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:37:29] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[07:54:52] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:01:24] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 752265 msg: ocg_render_job_queue 3006 msg (=3000 critical)
[08:01:24] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 752265 msg: ocg_render_job_queue 3006 msg (=3000 critical)
[08:02:34] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:02:55] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 752756 msg: ocg_render_job_queue 3055 msg (=3000 critical)
[08:20:34] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:28:51] Operations, OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1245233 (ArielGlenn) Just to remind people that these boxes are still flapping with this alert.
[08:34:20] Operations, Mail, OTRS: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#2676457 (Reedy)
[08:36:48] (CR) Reedy: [C: 2] Remove old variable transfers (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/313321 (https://phabricator.wikimedia.org/T146945) (owner: Reedy)
[08:37:19] (Merged) jenkins-bot: Remove old variable transfers [mediawiki-config] - https://gerrit.wikimedia.org/r/313321 (https://phabricator.wikimedia.org/T146945) (owner: Reedy)
[08:38:48] !log reedy@tin Synchronized wmf-config/mobile-labs.php: Remove transfers of non existent $wmg variables (duration: 00m 48s)
[08:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:42:36] (PS2) Urbanecm: Fix hewiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/310798 (https://phabricator.wikimedia.org/T145017)
[08:42:57] <_joe_> oh ocg is failing
[08:43:00] <_joe_> too bad :(
[08:43:12] <_joe_> Reedy: is this an emergency deployment?
[08:43:13] "no one uses it anyway"
[08:43:26] _joe_: It was to fix some noise on beta, yes
[08:43:32] Noop for prod
[08:43:37] <_joe_> that's hardly an emergency
[08:43:51] beta is more widely broken too atm though
[08:44:27] <_joe_> again, "no non emergency deployments this week seems pretty clear to me."
[08:45:04] _joe_: Does this also apply to labs? In the past such restriction never applied to beta
[08:45:51] <_joe_> hoo: not to beta, but a "noop" deployment just happened in prod
[08:46:02] I don't think prod actually loads that file does it?
[08:46:06] <_joe_> I can explain you why "noop" can be not exactly a noop
[08:46:16] well, sure, hhvm is too smart
[08:46:24] <_joe_> Reedy: exactly :)
[08:46:36] <_joe_> Krenair: it shouldn't, no
[08:46:45] <_joe_> let me check
[08:46:59] <_joe_> after I've figured out what the hell is happening with ocg
[08:47:03] if ( $wmfRealm === 'labs' ) {
[08:47:03] require_once( __DIR__ . '/mobile-labs.php' );
[08:47:03] }
[08:47:18] yeah that sync probably doesn't even risk the hhvm crash bug
[08:47:18] hhvm shouldn't even stat (or even read) that file once
[08:47:22] Reedy: you're giving me flashbacks
[08:47:48] Operations, Interactive-Sprint: maps-test* hosts running low on space - https://phabricator.wikimedia.org/T146848#2676486 (Gehel) At this point, the best is probably to trash them and re-image, which we should do anyway to validate once more that re-imaging works and to make sure that the maps-test serve...
[08:47:57] addshore: hey! we're all in europe now :D wanna do the grafana deploy tomorrow?
[08:48:06] yuvipanda: it's a friday :P
[08:48:10] <_joe_> Krenair: yes it shouldn't in fact
[08:48:27] <_joe_> Reedy: discard what I said, that file doesn't get stated, in fact :)
[08:48:41] _joe_: good. though, sorry for any alarm possibly caused :)
[08:48:42] Reedy: it's hackathon day ;)
[08:48:51] <_joe_> so, ocg is just rendering a shitton of collections
[08:48:52] it's only deployed on labs grafana too
[08:49:03] <_joe_> any idea how to see who is requesting those?
[08:49:13] logstash might list them
[08:49:15] I can't remember
[08:49:16] <_joe_> we'll run out of disk space on those servers pretty quick I guess
[08:49:42] <_joe_> I can grep the sampled logs on varnish, of course
[08:49:44] Reedy: you should have some fun and deploy from a plane again!
[08:49:50] p858snake: I'm on a train
[08:50:45] (PS1) Urbanecm: Fix hewiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/313374 (https://phabricator.wikimedia.org/T145017)
[08:53:41] Operations, OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1245233 (Joe) They're not flapping, they are currently processing an enormous amount of requests suddenly today around 8 AM UTC. The source of such requests should probably be investigated, but...
[09:03:31] Operations, OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#2676528 (ArielGlenn) When I looked at syslog on ocg1001 earlier the job queue health varied from a little over 3000 to a little over 1000. The period of time over 3000 was about 8 minutes. If the...
[09:04:30] (PS1) Dzahn: install: add network location to server MOTDs [puppet] - https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518)
[09:08:50] (PS2) Dzahn: install: add network location to server MOTDs [puppet] - https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518)
[09:09:59] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:10:57] (PS3) Dzahn: install: add network location to server MOTDs [puppet] - https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518)
[09:19:40] (CR) MarcoAurelio: [C: 1] "If optiPNG'd then LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/312977 (https://phabricator.wikimedia.org/T146745) (owner: Urbanecm)
[09:24:17] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 760659 msg: ocg_render_job_queue 460 msg
[09:27:56] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 761245 msg: ocg_render_job_queue 383 msg
[09:27:56] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 761245 msg: ocg_render_job_queue 383 msg
[09:30:18] yuvipanda: yes! ? :D
[09:30:32] where in Europe by the way?
[09:30:37] Operations, Ops-Access-Requests: Requesting access to stat1002 and stat1004 for nschaaf - https://phabricator.wikimedia.org/T146924#2674988 (ArielGlenn) Which group is that, statistics-users, statistics-privatedata-users?
[09:30:53] (CR) Urbanecm: "Yes, I ran the command..." [mediawiki-config] - https://gerrit.wikimedia.org/r/312977 (https://phabricator.wikimedia.org/T146745) (owner: Urbanecm)
[09:30:58] addshore: BAAAARCEEELOOOONA
[09:31:04] ooooooh :P
[09:32:30] !log received notification of ulsfo.1.23.pdu flapping power status via united layer icinga, yet checking router shows no power interruption for cr1-ulsfo. seems to be a monitoring false alarm (from united layers end, not ours)
[09:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:35:24] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:39:49] (CR) Dzahn: [C: -1] "i should use module "motd" instead. also Faidon points out the llpdctl command isn't always stable, so value should be cached" [puppet] - https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518) (owner: Dzahn)
[09:42:04] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:47:36] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 703 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3676015 keys - replication_delay is 703
[10:06:35] \
[10:06:44] woops
[10:07:55] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[10:12:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[10:17:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:18:55] I'm stumped about the redis rep delay from above, it tries and fails repeatedly to sync from the master, and that master (rdb2005) is also repeatedly failing to resync from its master (why isn't it whinging in icinga?), with I/O error trying to sync with MASTER: connection lost
[10:19:00] tcp_6479.log on both hosts
[10:19:33] I guess we are stuck with full resync in all cases, can it be that partial resync now takes too long? but if so, why now suddenly?
[10:19:39] s/partial/full/
[10:21:48] on rdb1007 (master for 1005) I see that it tries the full resynch and then "[1932] 29 Sep 10:17:07.607 # Connection with slave 10.192.32.133:6479 lost."
[10:21:51] etc
[10:26:04] I thought maybe a timeout issue (but again, why now?), repl-timeout has no setting that I can find which makes it the default 60 seconds
[10:26:16] and so here we are back to "I'm stumped"...
[10:28:46] !log Upgrading Jenkins plugins with zeljkof :]
[10:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:30:15] <_joe_> apergos: the reason rdb2005 doesn't warn you is mostly a limitation in how the replica check works
[10:30:25] <_joe_> so it's disabled for cross-dc replicas
[10:30:35] ah ha
[10:30:53] <_joe_> and, this is sadly not new
[10:30:59] any thoughts on the above issue?
[10:31:12] <_joe_> just never had the time to look on why rdb1007->rdb2005 fails
[10:31:20] <_joe_> from time to time
[10:32:13] ok
[10:42:47] (CR) Alexandros Kosiaris: [C: -1] "We have facts for LLDP. The code is in" [puppet] - https://gerrit.wikimedia.org/r/313375 (https://phabricator.wikimedia.org/T84518) (owner: Dzahn)
[10:54:00] (Abandoned) Urbanecm: Fix hewiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/313374 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm)
[12:30:30] (PS2) Urbanecm: [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654)
[12:31:04] (CR) jenkins-bot: [V: -1] [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654) (owner: Urbanecm)
[12:32:16] (PS3) Urbanecm: [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654)
[12:32:46] (CR) jenkins-bot: [V: -1] [throttle] Ada Lovelave Day Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/312852 (https://phabricator.wikimedia.org/T146654) (owner: Urbanecm)
[12:36:22] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:56:58] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 6 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[create_user-replication@maps2003-v4],Exec[create_user-replication@maps-test2002-v4],Exec[create_user-replication@maps-test2003-v4],Exec[create_user-replication@maps2002-v4]
[12:59:34] Operations, Developer-Relations (Jul-Sep-2016): Operations Team Offsite - https://phabricator.wikimedia.org/T141940#2676966 (Aklapper) a:Rfarrand This is happening right now. (Assigning to @rfarrand so this task can get closed in a few days.)
[13:00:58] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:10:45] Operations, Ops-Access-Requests: Requesting access to stat1002 and stat1004 for nschaaf - https://phabricator.wikimedia.org/T146924#2674988 (elukey) I think that analytics-privatedata-users or analytics-users should be good, oozie basically relies on Hadoop perms afaik. From T146064 it seems that the o...
[13:21:40] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[13:28:28] (Abandoned) Hashar: wmflib: mute hiera debug log in spec [puppet] - https://gerrit.wikimedia.org/r/297133 (owner: Hashar)
[13:29:03] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[13:49:40] (PS1) Elukey: Avoid unnecessary varnishkafka restarts [puppet] - https://gerrit.wikimedia.org/r/313400
[14:05:12] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 1 failures
[14:10:07] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 1 failures
[14:14:18] rigel ^^^ looking
[14:15:15] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[14:21:57] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[14:29:29] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[14:46:59] (CR) Hashar: "I have cherry picked it on the beta puppet master and then noticed there is not a single message for the 'runJobs' channel :] https://logs" [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[14:47:46] (CR) Hashar: "Bah found out that mediawiki-config has:" [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[14:52:46] (PS1) Hashar: beta: log bucket 'runJobs' to info [mediawiki-config] - https://gerrit.wikimedia.org/r/313408 (https://phabricator.wikimedia.org/T146469)
[14:53:37] (CR) Hashar: [C: 2] beta: log bucket 'runJobs' to info [mediawiki-config] - https://gerrit.wikimedia.org/r/313408 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[14:54:02] (Merged) jenkins-bot: beta: log bucket 'runJobs' to info [mediawiki-config] - https://gerrit.wikimedia.org/r/313408 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[15:00:16] (PS1) Hashar: beta: 'runJobs' to info for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/313410 (https://phabricator.wikimedia.org/T146469)
[15:01:03] Operations, Ops-Access-Requests: Requesting access to stat1002 and stat1004 for nschaaf - https://phabricator.wikimedia.org/T146924#2677233 (schana) @elukey I'm pretty sure it's `analytics-privatedata-users` that I need.
[15:01:18] (CR) Hashar: [C: 2] beta: 'runJobs' to info for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/313410 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[15:01:44] (Merged) jenkins-bot: beta: 'runJobs' to info for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/313410 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[15:11:30] (CR) Hashar: [C: -1] "-1 since it is not ready." [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[15:18:09] !log T133395: restbase staging: decommissioning restbase-test2001-b.codfw.wmnet (test of decomm/bootstrap under time-windowed compaction)
[15:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:18:29] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395
[15:23:57] (PS2) Hashar: logstash: parse runJobs messages [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469)
[15:25:09] (CR) Hashar: [C: -1] "Now handle debug messages for start of a job as well as the duration of jobs completed properly." [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[15:25:45] (PS1) Muehlenhoff: Revert "Update Debian patches for 1.0.2i" [debs/openssl] - https://gerrit.wikimedia.org/r/313411
[15:25:47] (PS1) Muehlenhoff: Revert "Bump changelog for 1.0.2i update" [debs/openssl] - https://gerrit.wikimedia.org/r/313412
[15:25:49] (PS1) Muehlenhoff: Revert "Imported Upstream version 1.0.2i" [debs/openssl] - https://gerrit.wikimedia.org/r/313413
[15:25:51] (PS1) Muehlenhoff: New upstream version 1.0.2i [debs/openssl] - https://gerrit.wikimedia.org/r/313414
[15:25:53] (PS1) Muehlenhoff: New upstream version 1.0.2j [debs/openssl] - https://gerrit.wikimedia.org/r/313415
[15:25:55] (PS1) Muehlenhoff: Remove ca.patch, now obsolete [debs/openssl] - https://gerrit.wikimedia.org/r/313416
[15:25:57] (PS1) Muehlenhoff: Update cloudflare patch for 1.0.2j [debs/openssl] - https://gerrit.wikimedia.org/r/313417
[15:33:35] (PS1) Muehlenhoff: Revert "Update Debian patches for 1.0.2i" [debs/openssl] - https://gerrit.wikimedia.org/r/313418
[15:34:44] (PS3) Hashar: logstash: parse runJobs messages [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469)
[15:36:09] (Abandoned) Muehlenhoff: Revert "Update Debian patches for 1.0.2i" [debs/openssl] - https://gerrit.wikimedia.org/r/313411 (owner: Muehlenhoff)
[15:36:21] (Abandoned) Muehlenhoff: Revert "Bump changelog for 1.0.2i update" [debs/openssl] - https://gerrit.wikimedia.org/r/313412 (owner: Muehlenhoff)
[15:36:34] (Abandoned) Muehlenhoff: Revert "Imported Upstream version 1.0.2i" [debs/openssl] - https://gerrit.wikimedia.org/r/313413 (owner: Muehlenhoff)
[15:36:47] (Abandoned) Muehlenhoff: New upstream version 1.0.2i [debs/openssl] - https://gerrit.wikimedia.org/r/313414 (owner: Muehlenhoff)
[15:36:58] (Abandoned) Muehlenhoff: New upstream version 1.0.2j [debs/openssl] - https://gerrit.wikimedia.org/r/313415 (owner: Muehlenhoff)
[15:37:09] (Abandoned) Muehlenhoff: Remove ca.patch, now obsolete [debs/openssl] - https://gerrit.wikimedia.org/r/313416 (owner: Muehlenhoff)
[15:37:21] (Abandoned) Muehlenhoff: Update cloudflare patch for 1.0.2j [debs/openssl] - https://gerrit.wikimedia.org/r/313417 (owner: Muehlenhoff)
[15:38:38] (CR) Hashar: "In grok, only the first match would be executed. That prevented the filter from capturing the "job_type" from the URL." [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: Hashar)
[15:39:59] (PS4) Hashar: logstash: parse runJobs messages [puppet] - https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469)
[15:42:25] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:59:50] PROBLEM - cassandra-b CQL 10.192.16.155:9042 on restbase-test2001 is CRITICAL: Connection refused
[16:00:45] PROBLEM - cassandra-b SSL 10.192.16.155:7001 on restbase-test2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused
[16:02:18] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.16.155:9042 on restbase-test2001 is CRITICAL: Connection refused eevans decomm/bootstrap test - The acknowledgement expires at: 2016-09-30 16:02:02.
[16:02:58] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.16.155:7001 on restbase-test2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused eevans decomm/bootstrap test - The acknowledgement expires at: 2016-09-30 16:02:47.
[16:03:48] who ack'd it?
[16:04:24] greg-g: ??
[16:04:57] greg-g: are you referring to those last two acknowledgements, or something else?
[16:05:31] yeah, those :) I thought the common practice was to put a name in the comment for who ack'd it? Maybe it's visible in the icinga website and not in IRC, but I thought I remember seeing it in IRC before...
[16:05:53] * greg-g assumes it was you now :)
[16:05:56] it is... "eevans decomm/bootstrap test"
[16:06:01] * urandom is eevans
[16:06:03] cool
[16:06:09] nvm me then :)
[16:06:11] carry on
[16:06:14] :)
[16:07:45] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[16:09:12] Operations, Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628342 (mpopov) @Dzahn: Good morning! @MelodyKramer was having problems connecting to MySQL and asked me to check it out. Can you please add her to the researcher...
[16:10:28] Operations, Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2677351 (MelodyKramer) Thanks @mpopov! @Dzahn I am happy to show you my config file to ensure that it's accurate, as well!
[16:18:44] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:19:29] Operations, Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#2677362 (EBernhardson) Open>Resolved capacity planning is completed. Closing this out.
[16:20:34] Operations, Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2677365 (ArielGlenn) We need manager approval for the stat1003 access. What host are you trying to access mysql from and using what command(s)?
[16:32:40] !log T133395: restbase staging: starting bootstrap of restbase-test2001-b.codfw.wmnet (test of decomm/bootstrap under time-windowed compaction)
[16:32:43] !log executed 'sudo salt -C 'G@cluster:imagescaler and G@site:codfw' cmd.run 'find /var/log/hhvm/ -type f -user root -exec chown www-data:www-data {} \;' to reduce cronspam
[16:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:33:02] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395
[16:33:25] RECOVERY - cassandra-b SSL 10.192.16.155:7001 on restbase-test2001 is OK: SSL OK - Certificate restbase-test2001-b valid until 2017-09-08 16:33:01 +0000 (expires in 343 days)
[16:34:53] !log executed 'sudo salt -C 'G@cluster:imagescaler and G@site:eqiad' cmd.run 'find /var/log/hhvm/ -type f -user root -exec chown www-data:www-data {} \;' to reduce cronspam
[16:35:42] (the chowns that I've done were also executed days ago on app/api servers, root:adm files are created after reimage - there is a tracking task to solve this issue)
[16:37:27] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:39:56] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[16:43:36] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[16:52:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:05:09] RECOVERY - cassandra-b CQL 10.192.16.155:9042 on restbase-test2001 is OK: TCP OK - 0.037 second response time on port 9042
[17:08:09] as usual I am still here but not very much (evening, food, etc)...
[17:08:18] ping if I'm needed
[17:15:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:25:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:35:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:38:19] that bump looks like it was related to redis
[17:38:58] 2 sets of 140 errors of: "Warning: Failed connecting to redis server at rdb1007.eqiad.wmnet: Connection timed out in /srv/mediawiki/php-1.28.0-wmf.20/includes/clientpool/RedisConnectionPool.php on line 245"
[17:40:12] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[18:03:30] so from https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen it seems started one hour and a half ago
[18:03:47] and the spike seems to be 500 related
[18:04:12] but looking back a couple of days we had similar jumps
[18:04:14] mmmm
[18:06:49] and from oxygen 5xx.log seems related to upload.wikimedia.org /wikipedia/commons/thumb etc..
[18:07:20] greg-g: did you see anything specific in logstash?
[18:10:52] (PS3) Thcipriani: Beta: Clean puppetmaster cherry-picks [puppet] - https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427)
[18:18:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[18:18:36] (CR) Thcipriani: "@Volans thanks for the review! Newest iteration takes your suggestions, code has much less boiler-plate." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: Thcipriani)
[18:24:18] !log https://tools.wmflabs.org/sal/ missing some entries for 2016-09-29; consider https://wikitech.wikimedia.org/wiki/Server_Admin_Log canonical
[18:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:25:41] * bd808 needs to make an admin interface that makes backfilling those easier
[18:26:00] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:33:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[18:34:11] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1466.88 seconds
[18:45:54] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:51:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:51:18] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[18:51:24] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:51:26] Operations, Interactive-Sprint: maps-test* hosts running low on space - https://phabricator.wikimedia.org/T146848#2677873 (MaxSem) Meanwhile, the hosts are at zero / space too and therefore services are down.
[18:52:24] I don't even see db1047 in the db config file for any cluster, let alone s1
[18:52:33] I guess that's good but still
[18:53:24] I tried to check on fluorine but I don't get a lot out of it (for the 5xx issue)
[18:54:23] 1047 seems a slave of 1018 from https://tendril.wikimedia.org/host/view/db1047.eqiad.wmnet/3306
[18:54:41] there's an alter table pagelinks going on it
[18:55:13] yeah
[18:57:29] I guess that's the issue, sure not much else happening
[18:58:01] I saw a lot of slow timers logged in the hhvm logs
[18:58:11] but I can't always tell what is happening from them
[18:59:54] the timings from tendril does not match exactly what I can see in https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen
[19:00:02] I think jynus is going to want to know about db1047, although I think it's not impacting users
[19:00:31] in puppet the only things I can find are 7161318da114ebdb3b89559888aed97f6d8a40da
[19:01:09] https://gerrit.wikimedia.org/r/#/c/302884/
[19:02:11] oh it's m4
[19:02:12] ic
[19:08:46] the pagelinks stuff is running from neodymium as root
[19:09:22] it looks like jmm owns it
[19:09:39] moritzm_:
[19:10:00] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3584.77 seconds Jcrespo expected replication delay for a few hours
[19:10:27] excellent
[19:13:43] (CR) Aaron Schulz: [C: 1] rpc: trick mw into generating a raw exception report [mediawiki-config] - https://gerrit.wikimedia.org/r/312077 (owner: Hashar)
[19:21:28] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:27:37] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /home 40745 MB (3% inode=98%)
[19:41:12] Operations, Discovery-Search, Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2279787 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining wh...
[19:41:17] Operations, Mobile-Content-Service, Parsing-Team, Services, and 2 others: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2678027 (greg) This follow-up task from an incident report has not been updated recently. If it is no longe...
[19:41:20] Operations, Wikimedia-Incident: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#1891322 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is stil...
[19:41:26] Operations, ops-codfw, Discovery, Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1960976 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid...
[19:41:29] Operations, Wikimedia-Incident: Add step in start-nfs to ask operator to consider dropping some snapshots - https://phabricator.wikimedia.org/T121890#1890949 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If...
[19:41:32] Operations, Labs, Wikimedia-Incident: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#1884194 (greg) This follow-up task from an incident report has not been updated recently. If it is...
[19:41:59] Operations, Monitoring, Wikimedia-Incident: Monitor redis memory/disk usage - https://phabricator.wikimedia.org/T110169#1570438 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please...
[19:42:06] Operations, Continuous-Integration-Config, Regression, Wikimedia-Incident: operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801#2678045 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid,...
[19:42:22] Operations, Architecture, RESTBase, ArchCom-RfC (ArchCom-Approved), and 5 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1235677 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment e...
[19:42:25] Operations, Traffic, HTTPS, Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1149975 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If...
[19:42:35] Operations, Availability, Patch-For-Review, Wikimedia-Incident: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1018974 (greg) This follow-up task from an incident report has not been updated recently. If it is no longer va...
[19:42:42] (didn't realize how many were in the #operations project :) )
[19:46:39] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:48:25] (CR) Jdlrobson: [C: 1] "go for it" [mediawiki-config] - https://gerrit.wikimedia.org/r/313346 (owner: 20after4)
[20:30:04] RECOVERY - cassandra service on maps-test2004 is OK: OK - cassandra is active
[20:37:45] PROBLEM - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[20:43:44] Operations, ops-codfw, Discovery, Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#2678332 (Smalyshev) I think it's still valid and we have some patches in review. I imagine when @Gehel comes back...
[22:14:41] Operations, ops-codfw, Discovery, Wikidata, and 2 others: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#2678532 (Gehel) Work is happening in sub tasks, but it is happening. This should be done by the end of next week.
[22:40:20] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:51:30] Operations, Collaboration-Team-Triage, Flow, MediaWiki-Redirects, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#2678735 (jmatazzoni)
[23:05:41] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[23:21:33] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[23:25:04] 504 gateway timeout
[23:29:02] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[23:51:42] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active
[23:59:13] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed