[00:25:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:56] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Stirring The Pot, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2679065 (10awight) @RobLa pointed to a deployment on the date this bug was reported, Sep 7: https://www.m... [00:29:17] !log Manually updated the DB to fix already-broken cases caused by since-fixed T138310 [00:29:19] T138310: Flow as a Beta feature: enable, disable and reenable doesn't seem to work - https://phabricator.wikimedia.org/T138310 [00:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:30:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:32] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:33:06] 06Operations, 10media-storage: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2679077 (10Peachey88) [00:35:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:32] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:08] Ops: Anyone available to help with that^^^ [00:41:23] Fundraising has no payment notification listener right now [00:41:48] I think we need a web server restarted, and none of us devs have enough ops powers [00:42:01] ejegg: you might need to find a way to page apergos, with ops all in spain right now, it's about 3am [00:42:16] ebernhardson: oh man, so it is [00:42:23] thanks! [00:45:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:45:32] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:01] if any opsen are somehow around... ^ [00:46:29] afaik it's a pretty limited set who have payments cluster access [00:46:45] but yeah we have a stuck web server or 3 [00:50:26] ebernhardson, isn't apergos in a similar timezone? [00:50:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:50:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:51:17] Krenair: i meant that everyone is in spain, afaik. apergos happens to be mentioned in the title which kinda-sorta makes them on-call this week, but not sure how common that is because typically there isn't a reason to page anyone in ops, there is usually someone around [00:51:51] (the channel title) [00:52:33] if its paging or rather sending a text/ringing: try to find the person in a "awake" timezone first before escalating [00:52:59] p858snake: the thing is, this week is the ops offsite. they flew everyone to one location [00:53:00] p858snake: any idea who that would be during the ops offsite in spain? [00:53:52] well naturally offsites are slightly different [00:54:24] but fundraising should already have protocols for this documented on [00:54:25] i poked around in my emails but i can't find any mention on the ops list about someone that couldn't go and stayed back in the states this week [00:54:26] Is there a public list of ops with access to frack? [00:54:47] Krenair: it will probably be officewiki or collab [00:55:13] what is the impact [00:55:14] ? [00:55:21] that's not exactly what I call public, but... [00:55:24] i'm checking if I have access to frack -- I don't think I do [00:55:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:32] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:55:39] I think you'd know if you did, it's separate to prod puppet etc. [00:56:11] Permission denied (publickey). [00:56:44] the only two I know have access are Jeff_Green and cmjohnson1 [00:56:53] have you called jeff? [00:57:03] about to [00:57:55] good [00:58:52] i did have access at one point because i have a whole config section for proxy jumping via tellurium [00:59:26] might've been copied in from the ssh page on wikitech [01:00:29] ok, woke him up & he's on it [01:00:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:32] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:50] 👍 [01:05:31] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:05:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:31] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:10:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:31] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:15:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:20:31] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:20:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:20:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:31] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:25:31] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:31] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:30:33] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:30:33] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:30:33] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:34] PROBLEM - check_apache2 on thulium is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2 [01:35:34] PROBLEM - check_listener_gc on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:34] PROBLEM - check_listener_ipn on thulium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:06] RECOVERY - check_apache2 on thulium is OK: PROCS OK: 6 processes with command name apache2 [01:40:06] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 272 bytes in 0.094 second response time [01:40:07] RECOVERY - check_listener_ipn on thulium is OK: HTTP OK: HTTP/1.1 302 Found - 292 bytes in 0.008 second response time [01:43:51] (03PS2) 10Mattflaschen: Use === for $wgDBname comparison [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309226 (owner: 10Dereckson) [01:47:41] (03CR) 10Mattflaschen: [C: 031] "Please schedule a SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309226 (owner: 10Dereckson) [01:48:15] (03CR) 10Mattflaschen: "Actually, I will." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309226 (owner: 10Dereckson) [01:48:35] (03PS3) 10Mattflaschen: Set $wgDefaultExternalStore for wikitech before Flow settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309225 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [01:49:27] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:54:07] (03PS2) 10Mattflaschen: Enable Flow on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [02:04:24] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [02:09:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [02:09:29] (03CR) 10Mattflaschen: [C: 031] "Scheduled for Monday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309225 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [02:13:48] (03CR) 10Mattflaschen: Enable Flow on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [02:14:11] (03PS3) 10Dereckson: Enable Flow on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) [02:14:36] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [02:14:37] (03CR) 10Mattflaschen: "Looks good, blocked on table creation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [02:22:38] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Stirring The Pot, and 3 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2679226 (10RobLa-WMF) Here's the chronology of possibly unrelated events: * 2016-08-29 - Branch cut, st... [02:28:23] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 13m 31s) [02:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:12] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Sep 30 02:33:12 UTC 2016 (duration 4m 49s) [02:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:09:35] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active [06:16:51] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:16:53] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [06:22:27] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active [06:28:31] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:29:44] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [06:43:27] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:51:52] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active [06:52:58] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:55:13] !admin [06:59:04] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [07:35:21] RECOVERY - Disk space on stat1002 is OK: DISK OK [07:54:45] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3539867 keys - replication_delay is 43 [07:57:36] (03PS1) 10BBlack: remove redundant require [puppet] - 10https://gerrit.wikimedia.org/r/313554 [08:07:40] (03PS1) 10Alexandros Kosiaris: hiera_lookup: Fix regression introduced in 110aaa2 [puppet] - 10https://gerrit.wikimedia.org/r/313555 [08:07:43] (03CR) 10Volans: "There are also some changes other than the split, it will be much easier to review if the split and the changes are in separate commits, e" [debs/pybal] - 10https://gerrit.wikimedia.org/r/302434 (owner: 10Giuseppe Lavagetto) [08:08:32] (03CR) 10Mark Bergsma: Add netlink-based Ipvsmanager implementation [WiP] (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 (owner: 10Giuseppe Lavagetto) [08:10:13] (03CR) 10Alexandros Kosiaris: [C: 032] hiera_lookup: Fix regression introduced in 110aaa2 [puppet] - 10https://gerrit.wikimedia.org/r/313555 (owner: 10Alexandros Kosiaris) [08:18:22] (03CR) 10Mark Bergsma: Add generic Finite States Machine (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [08:27:30] (03PS2) 10Giuseppe Lavagetto: Split IPVS Manager into the interface and manager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/302434 [08:27:32] (03PS1) 10Giuseppe Lavagetto: Add IPVSError as a generic IPVS-related exception [debs/pybal] - 10https://gerrit.wikimedia.org/r/313556 [08:34:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add generic Finite States Machine (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [08:38:26] (03PS1) 10Andrew Bogott: Puppetize the upstart logrotate script on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/313558 [08:39:59] (03CR) 10jenkins-bot: [V: 04-1] Puppetize the upstart logrotate script on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/313558 (owner: 10Andrew Bogott) [08:46:03] (03PS2) 10Andrew Bogott: Puppetize the upstart logrotate script on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/313558 [08:48:26] (03CR) 10Faidon Liambotis: [C: 032] Puppetize the upstart logrotate script on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/313558 (owner: 10Andrew Bogott) [09:03:53] (03CR) 10Mark Bergsma: Add generic Finite States Machine (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [09:03:55] (03CR) 10Volans: Add generic Finite States Machine (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [09:10:28] (03CR) 10Mark Bergsma: [C: 04-1] Split IPVS Manager into the interface and manager implementation (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302434 (owner: 10Giuseppe Lavagetto) [09:12:33] 06Operations, 13Patch-For-Review: Build poolcounter for jessie - https://phabricator.wikimedia.org/T146277#2655845 (10fgiunchedi) The package builds on jessie now and tests are ran on build. Still marked as Debian native but good enough for now. [09:14:39] (03CR) 10Giuseppe Lavagetto: Add generic Finite States Machine (033 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 (owner: 10Giuseppe Lavagetto) [09:24:46] (03PS1) 10Filippo Giunchedi: celery: use SyslogIdentifier [puppet] - 10https://gerrit.wikimedia.org/r/313562 (https://phabricator.wikimedia.org/T146581) [09:27:22] (03PS1) 10Andrew Bogott: l10nupdate: Add 'su' to logrotate script [puppet] - 10https://gerrit.wikimedia.org/r/313563 (https://phabricator.wikimedia.org/T132324) [09:28:16] (03PS2) 10Andrew Bogott: l10nupdate: Add 'su' to logrotate script [puppet] - 10https://gerrit.wikimedia.org/r/313563 (https://phabricator.wikimedia.org/T132324) [09:29:41] (03CR) 10jenkins-bot: [V: 04-1] l10nupdate: Add 'su' to logrotate script [puppet] - 10https://gerrit.wikimedia.org/r/313563 (https://phabricator.wikimedia.org/T132324) (owner: 10Andrew Bogott) [09:30:55] (03PS1) 10Filippo Giunchedi: poolcounter: move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/313564 (https://phabricator.wikimedia.org/T123734) [09:34:03] (03CR) 10Dzahn: [C: 031] poolcounter: move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/313564 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [09:34:57] (03PS3) 10Andrew Bogott: l10nupdate: Add 'su' to logrotate script [puppet] - 10https://gerrit.wikimedia.org/r/313563 (https://phabricator.wikimedia.org/T132324) [09:37:40] 06Operations, 06Labs: cronspam from labscontrol1001, labstore1001, labnet1002.eqiad.wmnet, labsdb1003.eqiad.wmnet - https://phabricator.wikimedia.org/T132422#2679434 (10elukey) Summary after today's hacking with @Andrew: 1) logrotate errors while zipping should be resolved via https://gerrit.wikimedia.org/r/#... [09:39:17] (03CR) 10Dzahn: "no-op http://puppet-compiler.wmflabs.org/4179/" [puppet] - 10https://gerrit.wikimedia.org/r/313564 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [09:40:13] (03PS3) 10Volans: Split IPVS Manager into the interface and manager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/302434 (owner: 10Giuseppe Lavagetto) [09:41:17] (03CR) 10Volans: Split IPVS Manager into the interface and manager implementation (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/302434 (owner: 10Giuseppe Lavagetto) [09:43:09] (03CR) 10Filippo Giunchedi: "noop in PCC https://puppet-compiler.wmflabs.org/4178/" [puppet] - 10https://gerrit.wikimedia.org/r/313564 (https://phabricator.wikimedia.org/T123734) (owner: 10Filippo Giunchedi) [09:46:26] yuvipanda: so today? [09:47:33] 06Operations, 13Patch-For-Review: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1936635 (10fgiunchedi) Note that in helium case it is also the bacula director/storage. I propose we start with moving a poolcounter to a ganeti VM and move off helium. [09:48:36] 06Operations, 13Patch-For-Review: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#2679464 (10akosiaris) >>! In T123734#2679462, @fgiunchedi wrote: > Note that in helium case it is also the bacula director/storage. I propose we start with moving a poolcounter to a ganeti VM an... [09:48:56] (03PS2) 10Giuseppe Lavagetto: Add generic Finite States Machine [debs/pybal] - 10https://gerrit.wikimedia.org/r/302435 [09:49:30] addshore: yes!! [09:49:36] :D [09:49:39] lets do it! [09:58:22] addshore: we're going to lunch now, maybe at 3? [09:58:38] is it 12 for you now? [10:04:44] addshore: yeah [10:04:58] cool see you at 3 your time and 2 my time :) [10:05:59] addshore: ok :) [10:12:15] PROBLEM - puppet last run on mc1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:48] (03PS1) 10Elukey: Reduce cronspam from graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/313573 (https://phabricator.wikimedia.org/T144797) [10:14:48] 06Operations, 10hardware-requests: Site: (3) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2679484 (10Andrew) [10:15:06] 06Operations, 10hardware-requests: Site: (3) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2679497 (10Andrew) 05Open>03stalled [10:37:04] RECOVERY - puppet last run on mc1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:51:27] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:08] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:26:44] 06Operations, 10hardware-requests: Site: (3) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2679672 (10Andrew) [12:35:22] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 712 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3545229 keys - replication_delay is 712 [12:36:13] (03CR) 10Filippo Giunchedi: [C: 031] Puppetize the upstart logrotate script on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/313558 (owner: 10Andrew Bogott) [12:38:17] (03CR) 10Mark Bergsma: [C: 04-1] "Separate file please, exceptions.py or so?" [debs/pybal] - 10https://gerrit.wikimedia.org/r/313556 (owner: 10Giuseppe Lavagetto) [12:40:41] 07Puppet, 06Labs, 10Labs-Infrastructure: Investigate usage of hiera_hash in our puppet repo - https://phabricator.wikimedia.org/T146621#2679673 (10Andrew) 05Open>03Resolved I just talked this over with Chase and I'm back to being convinced that this does just what we want. Specifically -- first match st... [12:43:08] 06Operations, 13Patch-For-Review: graphite-web cronspam - https://phabricator.wikimedia.org/T144797#2679676 (10elukey) Two code reviews to review/merge: - https://gerrit.wikimedia.org/r/313558 - https://gerrit.wikimedia.org/r/313573 [12:44:40] 06Operations, 10ops-codfw: ms-be2009.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T147060#2679677 (10fgiunchedi) 03NEW [12:46:29] ACKNOWLEDGEMENT - MegaRAID on ms-be2009 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi sdd broken T147060 [12:46:29] ACKNOWLEDGEMENT - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 27 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdd1] Filippo Giunchedi sdd broken T147060 [12:46:47] 06Operations, 10netops: Access to network devices - https://phabricator.wikimedia.org/T147061#2679684 (10elukey) [12:51:41] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active [12:54:16] (03PS7) 10Hashar: zuul: refactor to use hiera [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) [12:54:18] (03PS2) 10Hashar: zuul: migrate server only settings out of merger [puppet] - 10https://gerrit.wikimedia.org/r/309299 [12:54:22] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:58:36] addshore: around? :) [12:59:12] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [13:03:37] yuvipanda: ya [13:03:52] addshore: ok let's go [13:04:02] :D [13:07:37] (03PS10) 10Addshore: Add simple-json-datasource plugin to labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/302119 (https://phabricator.wikimedia.org/T141636) [13:07:53] (03CR) 10Yuvipanda: [C: 032 V: 032] Add simple-json-datasource plugin to labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/302119 (https://phabricator.wikimedia.org/T141636) (owner: 10Addshore) [13:08:32] (03PS5) 10Elukey: Imply Pivot UI's puppetization [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [13:11:53] (03PS1) 10Andrew Bogott: Certcleaner: Add some logging [puppet] - 10https://gerrit.wikimedia.org/r/313578 (https://phabricator.wikimedia.org/T146303) [13:13:07] 06Operations, 10Continuous-Integration-Infrastructure, 10Monitoring: Remove Ganglia Jenkins plugin from gallium - https://phabricator.wikimedia.org/T147065#2679754 (10hashar) [13:14:43] addshore: running puppet now [13:14:50] cool! :) [13:15:19] (03PS2) 10Andrew Bogott: Certcleaner: Add some logging [puppet] - 10https://gerrit.wikimedia.org/r/313578 (https://phabricator.wikimedia.org/T146303) [13:16:35] addshore: Notice: /Stage[main]/Role::Grafana::Labs/Git::Clone[grafana/simple-json-datasource]/Exec[git_clone_grafana/simple-json-datasource]/returns: fatal: repository 'operations/software/grafana/simple-json-datasource' does not exist [13:17:01] what O_o.. *looks* its been a while since i last looked at all of this [13:17:41] https://github.com/wikimedia/operations-software-grafana-simple-json-datasource [13:18:20] https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/grafana/simple-json-datasource [13:19:33] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:20:16] (03PS1) 10Hashar: contint: remove the ganglia jenkins plugin [puppet] - 10https://gerrit.wikimedia.org/r/313579 (https://phabricator.wikimedia.org/T147065) [13:21:12] (03PS3) 10Andrew Bogott: Certcleaner: Add some logging [puppet] - 10https://gerrit.wikimedia.org/r/313578 (https://phabricator.wikimedia.org/T146303) [13:21:27] yuvipanda: any ideas? :D [13:21:52] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_grafana/simple-json-datasource] [13:22:38] (03CR) 10Hashar: "Puppet compilation https://puppet-compiler.wmflabs.org/4183/" [puppet] - 10https://gerrit.wikimedia.org/r/313579 (https://phabricator.wikimedia.org/T147065) (owner: 10Hashar) [13:23:06] (03PS6) 10Elukey: Imply Pivot UI's puppetization [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [13:23:54] addshore: hmm not sure. looking [13:24:42] (03PS1) 10Hashar: contint: remove Jenkins gmond legacy files [puppet] - 10https://gerrit.wikimedia.org/r/313581 (https://phabricator.wikimedia.org/T147065) [13:27:10] (03CR) 10Hashar: "Need to get rid of the plugin material on gallium via parent change: https://gerrit.wikimedia.org/r/313579" [puppet] - 10https://gerrit.wikimedia.org/r/313581 (https://phabricator.wikimedia.org/T147065) (owner: 10Hashar) [13:27:33] (03PS1) 10Yuvipanda: grafana: Attempt to fix the grafana labs plugin [puppet] - 10https://gerrit.wikimedia.org/r/313582 [13:29:41] (03CR) 10Yuvipanda: [C: 032] grafana: Attempt to fix the grafana labs plugin [puppet] - 10https://gerrit.wikimedia.org/r/313582 (owner: 10Yuvipanda) [13:30:58] addshore: that seems to have worked [13:31:45] addshore: check if it works? :) [13:32:11] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:32:33] 06Operations, 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review, 07Technical-Debt: Remove Ganglia Jenkins plugin from gallium - https://phabricator.wikimedia.org/T147065#2679814 (10hashar) The links are no more showing on https://integration.wikimedia.org/ Task is now pending pup... [13:32:40] 06Operations, 10Continuous-Integration-Infrastructure, 10Monitoring, 13Patch-For-Review, 07Technical-Debt: Remove Ganglia Jenkins plugin from gallium - https://phabricator.wikimedia.org/T147065#2679816 (10hashar) p:05Triage>03Normal [13:32:57] hmm, I can't see it in the list of datasources :/ [13:33:26] !log restart grafana-server on labmon1001 [13:33:28] addshore: try now? [13:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:03] I see a GenericDataSource [13:34:04] okay, now I see GenericDatasource! [13:34:16] Plugin Error [13:34:16] Error: XHR error (404) loading https://grafana-labs-admin.wikimedia.org/public/app/plugins/datasource/datasource-plugin-genericdatasource/module.js?bust=1475242413859 Error loading https://grafana-labs-admin.wikimedia.org/public/app/plugins/datasource/datasource-plugin-genericdatasource/module.js [13:36:30] (03CR) 10Ottomata: Imply Pivot UI's puppetization (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:36:47] (03CR) 10Ottomata: [C: 031] "Aside from few little comments, merge away!" [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:39:09] hmm [13:40:02] addshore: do you know if we have any other plugins installed? [13:40:15] yes, but they are all defaults [13:40:21] what version of grafana are we running? [13:40:52] addshore: Version 3.0.4 (commit: v3.0.4) [13:42:33] hmmmm, that should be fine [13:44:30] addshore: boom try now [13:44:52] oooh, that looks better [13:44:54] what was the issue? [13:45:17] addshore: wrong path [13:45:29] I see :D [13:46:07] addshore: https://gerrit.wikimedia.org/r/#/c/313538/ [13:47:11] AaronSchulz: done! [13:47:31] yuvipanda: awesome, so I just added a datasource for the labs tool I created, now lets see if it correctly adds annotations! [13:47:50] addshore: I'd like to take another go at https://gerrit.wikimedia.org/r/#/c/313456/ [13:48:00] addshore: :D ok [13:48:14] (the silly bug is fixed this time) [13:52:35] (03PS1) 10Yuvipanda: grafana: Fix paths for grafana datasource plugin [puppet] - 10https://gerrit.wikimedia.org/r/313587 [13:52:38] (03PS7) 10Elukey: Imply Pivot UI's puppetization [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [13:52:39] addshore: ^ should fix it [13:53:13] (03CR) 10Addshore: [C: 031] grafana: Fix paths for grafana datasource plugin [puppet] - 10https://gerrit.wikimedia.org/r/313587 (owner: 10Yuvipanda) [13:53:19] interesting! [14:00:23] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:19:18] (03PS4) 10Jgreen: Modify secret.rb to accept a file list and use first match, like http://www.puppetcookbook.com/posts/select-a-file-based-on-a-fact.html [puppet] - 10https://gerrit.wikimedia.org/r/294331 [14:19:20] (03PS1) 10Jgreen: ssh public key for jgreen [puppet] - 10https://gerrit.wikimedia.org/r/313589 [14:19:47] (03PS1) 10Alexandros Kosiaris: Remove changeprop.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/313590 [14:22:14] (03CR) 10Jgreen: [C: 032] ssh public key for jgreen [puppet] - 10https://gerrit.wikimedia.org/r/313589 (owner: 10Jgreen) [14:24:27] (03PS2) 10Jgreen: ssh public key for jgreen [puppet] - 10https://gerrit.wikimedia.org/r/313589 [14:26:13] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:28:29] 06Operations: Migrate puppetmaster/backends to jessie - https://phabricator.wikimedia.org/T123730#2679991 (10fgiunchedi) This is essentially done, palladium is still up but unused AFAIK and it will be shutdown soon. cc @akosiaris @Joe [14:40:37] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2680061 (10fgiunchedi) [14:40:40] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2680059 (10fgiunchedi) 05Open>03stalled Setting as stalled, though next steps look like this: [] Flip tools master from labsdb1005 to labsdb1004 [] Decommission labsdb... [14:43:29] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on scandium to 2.5.0-8-gcbc7f62-wmf3jessie1 - https://phabricator.wikimedia.org/T147073#2680070 (10hashar) [14:43:41] 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on scandium to 2.5.0-8-gcbc7f62-wmf3jessie1 - https://phabricator.wikimedia.org/T147073#2680087 (10hashar) [14:50:54] (03PS1) 10Hashar: nodepool: point to Jenkins on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/313599 [14:51:29] (03CR) 10Hashar: [C: 04-1] "Pending maintenance window" [puppet] - 10https://gerrit.wikimedia.org/r/313599 (owner: 10Hashar) [14:57:34] (03PS1) 10Hashar: cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/313600 [14:58:41] 06Operations, 10DBA: Add icinga check for all MySQL/MariaDB hosts to check they have the right read_only value - https://phabricator.wikimedia.org/T111766#1615926 (10fgiunchedi) out of curiosity I tried asking the following questions via prometheus for eqiad ```name='mysql_global_variables_read_only{role="sla... [14:58:50] (03CR) 10Hashar: [C: 04-1] "We will migrate CI services from gallium to contint1001 at some point. This is to switch the web facing entries to the new host." [puppet] - 10https://gerrit.wikimedia.org/r/313600 (owner: 10Hashar) [15:00:53] 06Operations, 06Discovery-Search, 07Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2680125 (10debt) Hi @Gehel and @EBernhardson - can we take a look at this to see if it's still valid? @greg, the answer to your concerns might have to... [15:03:41] (03PS1) 10MarcoAurelio: Raise abusefilter condition limit for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) [15:09:05] 06Operations, 10Ops-Access-Requests, 10Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#2680143 (10fgiunchedi) [15:17:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:18:43] (03PS2) 10Yuvipanda: grafana: Fix paths for grafana datasource plugin [puppet] - 10https://gerrit.wikimedia.org/r/313587 [15:18:50] (03CR) 10Yuvipanda: [C: 032 V: 032] grafana: Fix paths for grafana datasource plugin [puppet] - 10https://gerrit.wikimedia.org/r/313587 (owner: 10Yuvipanda) [15:22:12] (03CR) 10Chad: [C: 031] "Definitely the right approach. Two suggestions though:" [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [15:23:04] (03CR) 10Yuvipanda: [C: 032] k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 (owner: 10Yuvipanda) [15:23:30] (03CR) 10Chad: [C: 031] contint: remove the ganglia jenkins plugin [puppet] - 10https://gerrit.wikimedia.org/r/313579 (https://phabricator.wikimedia.org/T147065) (owner: 10Hashar) [15:23:41] (03CR) 10Chad: [C: 031] contint: remove Jenkins gmond legacy files [puppet] - 10https://gerrit.wikimedia.org/r/313581 (https://phabricator.wikimedia.org/T147065) (owner: 10Hashar) [15:24:58] (03PS3) 10Yuvipanda: k8s: Make the kubernetes user be a member of ssl-certs group [puppet] - 10https://gerrit.wikimedia.org/r/312817 [15:25:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:33:49] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2680201 (10MelodyKramer) @ArielGlenn I'm having trouble accessing stat1002. Here is what I'm seeing: melodykramer@stat1002:~$ cat /etc/mysql/conf.d/stats-research-c... [15:49:36] I'm going to be gone for about an hour: cat-sitting. [15:57:21] * bd808 finds sitting on cats to be uncomfortable due to a combination of claws and allergies [16:09:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:10:29] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2680264 (10elukey) Fine to me to grant the access to `researchers` (that will grant the access to the mysql credentials). [16:13:33] addshore: you don't need anything from me now I think right? [16:13:43] addshore: let me know how it goes :D you still have me on monday [16:13:49] (will still be in same tz) [16:19:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:22:05] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active [16:29:45] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:42:53] bd808: the cat was sitting and I was providing entertainment [16:43:01] to be really precise, the cat was lolling about [17:04:22] (03CR) 1020after4: [C: 031] Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [17:33:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:34:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:46:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:59:19] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [18:04:29] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [18:06:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:09:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:11:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:39:02] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [18:39:29] ^^ is there a problem happening with upload ? [18:40:11] * ori looks [18:40:52] thanks [18:41:14] ori: there was a report of broken upload that wouldn't thumb earlier... But not sure where the break happened [18:41:43] the alert is one summarize(nonNegativeDerivative(keepLastValue(swift.eqiad.containers.mw-media.originals.objects)) [18:41:46] let's see what that looks like [18:42:22] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [18:45:05] yeah, something is spiking https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1475261089.744&target=summarize(nonNegativeDerivative(keepLastValue(swift.eqiad-prod.containers.mw-media.originals.objects))%2C%20%221h%22)&from=-24days [18:45:23] I'm not sure how to interpret that [18:45:54] It could simply be WLM being successful (i.e., lots more people uploading) [18:47:01] Reedy: where did you see that? [18:47:35] https://phabricator.wikimedia.org/T147082#2680419 [18:47:40] ori https://phabricator.wikimedia.org/T147040 [18:49:09] !bug 1 | ori [18:49:09] ori: https://bugzilla.wikimedia.org/show_bug.cgi?id=1 [18:49:36] the second one in https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#File_database_server_broken.3F seems OK [18:49:46] unless I misunderstood the report [18:50:16] ori: well, it looks like someone re-uploaded it [18:50:28] !task [18:50:28] https://phabricator.wikimedia.org/ [18:50:46] greg-g: around? [18:50:46] ori: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [18:50:51] ish [18:50:58] getting ready to go into that lunch interview [18:51:03] what's up? [18:51:04] reading [18:51:17] unobvious monitoring is up [18:51:25] T147082 is an UBN! but impact seems confined to a single image or two [18:51:26] T147082: Unable able to thumb large PNG file - https://phabricator.wikimedia.org/T147082 [18:51:44] MatmaRex is the worst [18:52:00] the alert sounds bad but is hard to interpret and could have plausibly been triggered by higher-than-normal upload activity due to WLM! [18:52:07] theres also https://phabricator.wikimedia.org/T147040 [18:52:11] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active [18:52:20] what now Reedy [18:52:26] that's the same bug [18:52:28] sorry, I meant T147040 [18:52:28] T147040: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040 [18:53:06] greg-g: I don't think either is page-worthy but wanted to make sure you're aware [18:53:27] thanks, I was wondering... [18:53:29] We do now and again get files that just go AWOL [18:53:33] It's happened for years [18:53:36] since I'm not sure when we'll see opsen again [18:53:38] Less so with less nfs [18:53:45] greg-g: WHAT DID YOU DO TO THEM [18:53:52] DADT [18:54:06] He only arranged one bus [18:54:52] most of the files i've seen that are *gone* are from the early years. like 2007-ish. and there was some situation in 2010-ish when we lost a big bunch? but recently uploaded files disappearing sound pretty bad to me :( [18:55:21] Like I say, fsck nfs [18:57:15] MatmaRex: OK, I trust your judgement. What is the right escalation path, in your opinion? [18:57:29] Page ops? E-mail ops@ and ask to have that looked that ASAP? [18:57:30] ori: i have literally no idea [18:57:53] afaik it's just the two files, i wouldn't page people unless we know it's more than that [18:57:59] ^ [18:58:10] agreed, email with request to assist seems wise [18:58:21] if we get 1 more though, we page [18:58:22] but i would like the root cause of this disappearance to be found at some point [18:58:24] :) [18:58:36] someone wanna write this e-mail or should I? [18:58:45] i don't tihnk i'm on ops@ [18:58:48] can you? I'm still reading/digesting a resume [18:59:42] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [19:01:57] ok will do [19:04:02] as a side note, would be nice if someone more knowledgeable than me and with access to more tools searched some logs for traces of T147082. i don't think logstash search can be trusted, and i have no idea how to write any interesting queries there. :/ [19:04:02] T147082: Unable able to thumb large PNG file - https://phabricator.wikimedia.org/T147082 [19:04:23] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [19:05:22] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [19:10:46] can someone acknowledge the maps-test nodes? they're not urgent and not worth to be spammed here [19:18:02] ACKNOWLEDGEMENT - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 25 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[create_user-replication@maps2003-v4],Exec[create_user-replication@maps-test2002-v4],Exec[create_user-replication@maps-test2003-v4],Exec[create_user-replication@maps2002-v4] ori.livneh MaxSem can someone acknowledge the maps-test nodes? theyre not urgent and no [19:18:02] ACKNOWLEDGEMENT - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed ori.livneh MaxSem can someone acknowledge the maps-test nodes? theyre not urgent and not worth to be spammed here [19:18:02] ACKNOWLEDGEMENT - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed ori.livneh MaxSem can someone acknowledge the maps-test nodes? theyre not urgent and not worth to be spammed here [19:18:02] ACKNOWLEDGEMENT - cassandra service on maps-test2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed ori.livneh MaxSem can someone acknowledge the maps-test nodes? theyre not urgent and not worth to be spammed here [19:58:31] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 / webrequest logs for MelodyKramer - https://phabricator.wikimedia.org/T145387#2680846 (10Deskana) >>! In T145387#2677365, @ArielGlenn wrote: > We need manager approval for the stat1003 access. Approved. [19:58:34] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [20:00:23] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [20:02:55] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:03:22] ^^ i wonder why that keeps getting problem then recoverying [20:03:59] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:04:46] paladox: the current assumption is Wiki Loves Monuments is causing more uploads to happen (good!) but it's causing that alarm to go off as it doesn't take into account events like this [20:04:50] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:05:11] Oh [20:05:16] thanks for explaning [20:16:50] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 5 minutes ago with 19 failures. Failed resources (up to 3 shown): Package[molly-guard],Package[lldpd],Package[ncdu],Package[dstat] [20:24:46] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:29:47] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:39:59] if I get an exception on wiki production site, is there some log I can look at the backtrace? [20:40:18] RECOVERY - cassandra service on maps-test2002 is OK: OK - cassandra is active [20:40:46] SMalyshev: Yup [20:41:01] Can you login to fluorine.eqiad.wmnet ? [20:41:03] Else, logstash [20:41:17] Reedy: I tried logstash but can't see that exception... [20:41:22] maybe I'm not looking right [20:41:45] this is the bad req: https://www.wikidata.org/wiki/Special:EntityData/Q19369930.json [20:41:46] they can be hard to find sometimes on logstash [20:41:57] but I can't see anything like it on logstash [20:42:10] do you have the exception id? [20:42:11] don't think I have access to fluorine [20:42:26] do you have shell access? [20:42:28] bd808: I don't know... [20:42:29] Stas isn't a deployer? o_0 [20:42:38] bd808: I have access to tin [20:42:45] but not to fluorine [20:42:54] weird [20:42:56] (03CR) 10Luke081515: [C: 04-1] "Code ok, but theres the consensus question, see task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) (owner: 10MarcoAurelio) [20:43:08] SMalyshev: on logstash, you should be able to do a search for `url:Q19369930` and it brings up a few exceptions recently, dunno if they are what you are looking for though [20:44:25] if i had to guess, it looks like the exception is saying a GET request shuldn't be talking to the master database? [20:45:00] I have no idea... I still can't see the backtrace :) [20:45:14] ahh I see now [20:45:23] the "expectation not met" one I guess [20:45:36] hmm, i get a full stack trace in the `message` field :S [20:45:43] (03PS1) 10Eevans: Extend classpath via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/313619 (https://phabricator.wikimedia.org/T133395) [20:45:51] yeah I found it already [20:46:20] i'm not sure thats actually an exception though, hmm [20:46:45] yea that's just a log, not an exception. hmm [20:47:48] PROBLEM - cassandra service on maps-test2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [20:48:06] I suspect this may be caused by recent changes with regard to master/master replication... [20:48:24] and not getting GET reqs do master connections [20:49:44] still weird that I don't see actual exception [20:49:50] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:52:19] hoo: ping? [20:53:26] SMalyshev: well, if you have terbium access what you can do is (little convoluted...) run `mwrepl wikidatawiki` [20:53:30] SMalyshev: then run SpecialPageFactory::executePath( Title::newFromText( 'Special:EntityData/Q19369930.json' ), new RequestContext ); [20:54:01] you can set breakpoint at /srv/mediawiki/php-1.28.0-wmf.20/extensions/Wikidata/extensions/Wikibase/repo/includes/LinkedData/EntityDataRequestHandler.php:355 so it stops before it throws the exception [20:54:23] nope doesn't look like i have access there [20:54:26] :S [20:55:16] SMalyshev: the relevant output is: https://phabricator.wikimedia.org/P4139 [20:55:34] but looking at the code, it is only supposed to generate log message, not throw.... [20:55:55] tin should be able to run mwrepl, but annoyingly there is some bug where l10n cache doesn't end up in the right place so it doesn't work atm... [20:55:56] ebernhardson: aha, thanks! [20:57:13] hmm weird 404 is legit response for this, so why it's not handled properly> [20:57:14] ? [20:58:33] dunno, i've never used that exception :S [21:01:30] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [21:04:12] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [21:07:39] I suspect that something weird is going on in combination of DB problems and exception handling... [21:07:54] something that breaks exception handling [21:08:19] (03CR) 10Eevans: "Puppet Compiler output here: http://puppet-compiler.wmflabs.org/4187" [puppet] - 10https://gerrit.wikimedia.org/r/313619 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [21:10:35] RoanKattouw, error: request has exceeded memory limit in /srv/mediawiki/php-1.28.0-wmf.20/extensions/Echo/includes/DiscussionParser.php on line 631 [21:10:49] MaxSem: Yes, there's a task about that already [21:10:55] TCB (WMDE) said they'd look at it [21:11:58] AaronSchulz: ping? [21:30:34] SMalyshev: so i took a look into it, it looks like aaron added a new MWExceptionRenderer to core in the last few weeks, and it never tries to use MWException::report(), so the stuff the HttpError class does never gets processed [21:30:53] ebernhardson: yeah I got there too... [21:31:01] and there is no log because HttpError returns false for isLoggable() under the assumption it can create it's own custom log in report...so fun :) [21:31:06] ahh, that's probably why the ping :) [21:31:24] unfortunately it's pretty huge refactoring so I guess I'll have to leave it to AaronSchulz to fix [21:31:55] though that will mean deletes on WDQS would be broken for several days [21:32:03] :( [21:40:52] (03CR) 10Hashar: [C: 04-1] "I have that patch applied on the beta cluster. It is helpful to play with the logs via logstash-beta." [puppet] - 10https://gerrit.wikimedia.org/r/312504 (https://phabricator.wikimedia.org/T146469) (owner: 10Hashar) [21:51:29] RECOVERY - cassandra service on maps-test2003 is OK: OK - cassandra is active [21:58:59] PROBLEM - cassandra service on maps-test2003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [21:59:45] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:02:00] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [22:06:07] !log Re-run mwscript deleteEqualMessages.php on all wikis it was previously run on (T45917) [22:06:08] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [22:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:14] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [22:17:45] PROBLEM - Host mw1207 is DOWN: PING CRITICAL - Packet loss = 100% [22:24:55] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:30:34] (03PS2) 10Kaldari: Deploying PageAssessments to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312869 (https://phabricator.wikimedia.org/T146679) [23:00:56] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3477525 keys - replication_delay is 0 [23:06:38] PROBLEM - puppet last run on elastic1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:24:42] 06Operations, 10Wikimedia-Site-requests, 07I18n, 07Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2681325 (10Liuxinyu970226) >>! In T21986#2666650, @deryckchan wrote: > 3. Change the "traditional" MediaWiki interwiki prefix (not so important because Wikidata has... [23:31:50] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures