[00:05:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[00:07:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[01:05:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:06:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:08:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:09:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[01:24:06] !log Spoken to User:Nirzardp for T150554, set a new password
[01:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:49] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1807.016534 Seconds
[01:35:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 29.717543 Seconds
[01:55:19] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:05:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[02:09:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:16:40] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 05m 34s)
[02:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Nov 13 02:20:58 UTC 2016 (duration 4m 18s)
[02:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:19] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[03:16:39] (PS1) BryanDavis: openstack: Log mwclient failure details [puppet] - https://gerrit.wikimedia.org/r/321167
[03:17:48] (CR) jenkins-bot: [V: -1] openstack: Log mwclient failure details [puppet] - https://gerrit.wikimedia.org/r/321167 (owner: BryanDavis)
[03:22:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 754.90 seconds
[03:33:19] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:43:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 199.99 seconds
[03:45:04] (PS2) BryanDavis: openstack: Log mwclient failure details [puppet] - https://gerrit.wikimedia.org/r/321167
[03:47:19] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:19] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:00:19] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[04:04:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:08:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[04:16:19] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:22:10] (PS1) BryanDavis: openstack: cache mwclient connection in wikistatus [puppet] - https://gerrit.wikimedia.org/r/321169
[04:23:30] (CR) Alex Monk: [C: +1] openstack: Log mwclient failure details [puppet] - https://gerrit.wikimedia.org/r/321167 (owner: BryanDavis)
[04:26:14] (CR) Alex Monk: "so after the user gets logged out for whatever reason, the next attempt to log status will fail?" [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[04:29:19] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[04:34:01] (CR) Andrew Bogott: [C: -1] "This code cached the site object before but I ripped it out because it had races. If we want to cache the session then accesses to self.s" [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[04:40:56] (CR) BryanDavis: "> This code cached the site object before but I ripped it out because" [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[05:01:54] (PS2) BryanDavis: openstack: cache mwclient connection in wikistatus [puppet] - https://gerrit.wikimedia.org/r/321169
[05:20:34] (CR) BryanDavis: openstack: cache mwclient connection in wikistatus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[05:25:02] (PS3) BryanDavis: openstack: cache mwclient connection in wikistatus [puppet] - https://gerrit.wikimedia.org/r/321169
[06:06:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:07:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[06:11:19] (CR) Andrew Bogott: [C: +1] "This looks ok to me -- I'll merge it on a couple of hosts and see how things go when I'm not just about to go to bed :)" [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[06:35:39] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[gdisk]
[06:47:29] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - 531 bytes in 0.026 second response time
[06:52:29] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.068 second response time
[07:03:29] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:05:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:09:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:04:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[09:05:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:04:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[10:08:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:35:44] Operations, Commons, MediaWiki-File-management, Multimedia, Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2790751 (matmarex) (Another report: https...
[11:40:43] Operations, Mail: Update legal-tm-vio@ alias - https://phabricator.wikimedia.org/T150463#2790785 (Peachey88)
[11:48:29] PROBLEM - puppet last run on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:29] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:29] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:51:09] RECOVERY - DPKG on mira is OK: All packages OK
[11:54:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:55:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[11:55:09] RECOVERY - DPKG on mira is OK: All packages OK
[11:58:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:58:19] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:01:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:09:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:10:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:10:56] (CR) Addshore: [C: -1] "Right now the RevisionSlider (in its final version) will be deployed in its ready state on this day." [mediawiki-config] - https://gerrit.wikimedia.org/r/321103 (https://phabricator.wikimedia.org/T150573) (owner: Arseny1992)
[12:13:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:14:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:14:09] PROBLEM - MegaRAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:20:29] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[12:23:29] PROBLEM - puppet last run on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:24:00] RECOVERY - MegaRAID on mira is OK: OK: no disks configured for RAID
[12:27:19] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[12:37:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:39:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[12:48:09] grrrit-wm: force-restart
[12:48:11] re-connecting to gerrit and irc.
[12:48:53] re-connected to gerrit and irc.
[13:05:41] I know it's rather early, and on a Sunday, but anyone around?
[13:05:58] (not urgent)
[13:12:24] depends on what kind of anyone you need
[13:13:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:14:09] RECOVERY - DPKG on mira is OK: All packages OK
[13:15:44] MatmaRex: I'm writing a comment on an existing task, I'll just link it.
[13:19:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:20:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[13:21:48] Operations, Wikimedia-General-or-Unknown, hardware-requests: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2790856 (Revent) Ok. As it now stands, all of the broken transcodes 'exposed' by TimedMediaHandler on Commons (I mean, the list at https://commons.wikimedia.org/w...
[13:22:05] MatmaRex: That.
[13:23:17] MatmaRex: Exposing more of the list of broken transcodes through TimedMediaHandler (so that the backlog can be worked on while other problems are addressed) should not be a huge issue.
[13:24:03] It shows about 50, out of over a third of a million.
[13:27:24] yeah, that should possibly be pageable or something.
[13:27:45] Revent: if you need the full list, you could get it with a Quarry query.
[13:28:09] I was able to put about 150 through before the list got clogged.
[13:28:38] That's not a terrible ratio, and would allow keeping the servers loaded down for quite a while longer.
[13:29:03] (I mean, before the 'moar power' issue was addressed, or the bugs)
[13:29:15] MatmaRex: Umm… how?
[13:30:04] I will… not happily, but productively, keep those machines working if I have a list.
[13:30:46] let me just find out how exactly it generates that list there
[13:31:07] (I am aware of quarry, I grok sql, I know shit all about mediawiki database structures)
[13:33:05] Revent: the query it uses is: select * from commonswiki.transcode WHERE transcode_time_startwork IS NOT NULL AND transcode_time_error IS NOT NULL order by transcode_time_error DESC limit 50
[13:33:12] so… just bump the limit and run it in quarry
[13:33:32] this is actually a really bad query, there's no index on those fields. ugh. it'll take a long time
[13:34:22] SELECT command denied to user 'u2029'@'10.68.17.29' for table 'transcode'
[13:34:28] removing the "order by" actually makes it much better. so, if you don't care about ordering by when they failed… select * from commonswiki.transcode WHERE transcode_time_startwork IS NOT NULL AND transcode_time_error IS NOT NULL limit 50
[13:34:31] (lol)
[13:34:41] Revent: oh, sorry. use commonswiki_p rather than commonswiki
[13:35:26] running...
[13:36:37] \o/
[13:36:53] MatmaRex: Thanks muchly
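For anyone wanting that list outside Quarry: a minimal sketch of the same replica query run from a Tool Labs account, assuming the standard ~/replica.my.cnf credentials file and the commonswiki.labsdb replica alias (the column names come from TimedMediaHandler's transcode table; the host and LIMIT are illustrative).

```python
#!/usr/bin/env python
# Pull the failed-transcode list from the labs DB replicas.
import os
import pymysql

conn = pymysql.connect(
    host='commonswiki.labsdb',        # assumption: the Tool Labs replica alias
    database='commonswiki_p',         # the public view, not commonswiki
    read_default_file=os.path.expanduser('~/replica.my.cnf'),
)
try:
    with conn.cursor() as cur:
        # As noted above, skipping the ORDER BY avoids a filesort over
        # unindexed columns; bump the LIMIT as needed.
        cur.execute("""
            SELECT transcode_image_name, transcode_key, transcode_error
            FROM transcode
            WHERE transcode_time_startwork IS NOT NULL
              AND transcode_time_error IS NOT NULL
            LIMIT 500
        """)
        for name, key, error in cur.fetchall():
            print(name, key, error)
finally:
    conn.close()
```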
[13:39:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:40:09] RECOVERY - DPKG on mira is OK: All packages OK
[13:40:38] https://commons.wikimedia.org/wiki/File:Satu_Nusa_Satu_Bangsa.ogg <- wow, that's broken
[13:41:06] The length is 'not' 0.0s, obviously.
[13:42:40] quite. perhaps the metadata extraction was buggy when this was uploaded.
[13:43:20] From what I have seen, it's more likely actually a bug in the file itself.
[13:43:54] I have 'fixed' files with broken durations (that showed broken locally when downloaded) by re-transcoding them
[13:44:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:44:29] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[13:45:09] RECOVERY - DPKG on mira is OK: All packages OK
[13:45:32] MatmaRex: Still, the ability to load down the scalers with ones that 'will work', while waiting on fixes… priceless.
[13:46:15] :)
[13:54:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:54:29] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:55:09] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:55:09] RECOVERY - DPKG on mira is OK: All packages OK
[14:03:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:07:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[14:07:51] MatmaRex: Heh.
[14:08:03] huh?
[14:08:32] This also shows 'very old' files where, when I purge them, more transcode options show up
[14:09:04] (the config has changed since, for instance, Feb 2013)
[14:13:44] https://commons.wikimedia.org/wiki/File:Loreen_presenting_herself_in_Swedish.ogv <- amusing
[14:14:05] Mostly because of who it is, and the WMF logo.
[14:15:08] She apparently won the Eurovision Song Contest in 2013
[14:16:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:17:09] RECOVERY - DPKG on mira is OK: All packages OK
[14:18:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:20:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[14:23:09] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[14:25:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:26:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[14:43:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:44:19] RECOVERY - configured eth on mira is OK: OK - interfaces up
[14:51:38] (CR) Andrew Bogott: [C: -1] "Seems to work fine in testing. I'm concerned about Alex's question, though -- this login session will expire, right? After which we'll d" [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
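The pattern under review in r/321169 — keep one mwclient connection around and re-login when the cached session lapses — looks roughly like this. This is a sketch of the idea only, not the actual wikistatus code: the wiki host, credentials, and helper names are placeholders.

```python
# Cache the mwclient Site and recover from session expiry on demand.
import mwclient
import mwclient.errors

_cached_site = None

def get_site(fresh=False):
    """Return a cached, logged-in Site; rebuild it when asked."""
    global _cached_site
    if fresh or _cached_site is None:
        site = mwclient.Site('wikitech.wikimedia.org')  # placeholder host
        site.login('StatusBot', 'secret')               # placeholder credentials
        _cached_site = site
    return _cached_site

def log_status(title, text):
    try:
        get_site().pages[title].save(text, summary='instance status update')
    except mwclient.errors.APIError as err:
        # Assumption: edits are made with assert=user, so the API answers
        # 'assertuserfailed' once the login session has lapsed.
        # Retry exactly once with a fresh login.
        if err.code != 'assertuserfailed':
            raise
        get_site(fresh=True).pages[title].save(text, summary='instance status update')
```

Bryan's "taste the session" question later in the review amounts to the retry branch here: rather than probing the session preemptively, let the API's expiry error trigger a single re-login.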
[14:59:26] (CR) Andrew Bogott: [C: +2] openstack: Log mwclient failure details [puppet] - https://gerrit.wikimedia.org/r/321167 (owner: BryanDavis)
[15:01:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:03:29] RECOVERY - configured eth on mira is OK: OK - interfaces up
[15:06:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:08:23] (CR) Andrew Bogott: "btw, the bug referred to in 'MW has a bug' is https://phabricator.wikimedia.org/T95839" [puppet] - https://gerrit.wikimedia.org/r/321167 (owner: BryanDavis)
[15:09:14] (CR) Andrew Bogott: "and/or https://phabricator.wikimedia.org/T150373" [puppet] - https://gerrit.wikimedia.org/r/321167 (owner: BryanDavis)
[15:09:33] MatmaRex: Still around?
[15:10:32] Seems like many, many, many 'ancient' transcodes of small files that successfully complete quite quickly (a minute or two)
[15:10:49] Ancient bugs, perhaps
[15:10:55] huh. neat.
[15:12:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[15:50:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:51:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[16:03:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:03:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:04:09] RECOVERY - DPKG on mira is OK: All packages OK
[16:04:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[16:05:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[16:05:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[16:10:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:11:09] RECOVERY - DPKG on mira is OK: All packages OK
[16:14:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:15:19] RECOVERY - configured eth on mira is OK: OK - interfaces up
[16:24:09] PROBLEM - MegaRAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:33:59] RECOVERY - MegaRAID on mira is OK: OK: no disks configured for RAID
[16:36:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:37:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[16:48:18] MatmaRex: I've kicked over 50 of these off the queue already, as well as other transcodes that 'showed up' when the file page was purged.
[16:49:08] I'm thinking the 'not on the file page' ones were probably due to config changes about what to generate.
[16:50:00] Still… it's not a matter of trying to boot them all manually, it's a matter of doing enough to figure out the problem cases for you guys, I think.
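The purge-and-requeue workflow Revent describes can also be scripted against the query output above, rather than clicking through file pages. A rough sketch pairing the replica results with mwclient's Page.purge() (a plain action=purge); the bot account and file names are placeholders, and a real run should throttle itself since the scalers are the bottleneck.

```python
# Batch-purge file pages so TimedMediaHandler re-examines their transcodes.
import time
import mwclient

site = mwclient.Site('commons.wikimedia.org')
site.login('ExampleBot', 'secret')  # placeholder bot account

# e.g. transcode_image_name values from the replica query
failed = ['Satu_Nusa_Satu_Bangsa.ogg']

for name in failed:
    page = site.pages['File:' + name.replace('_', ' ')]
    page.purge()
    time.sleep(2)  # be gentle; each purge can queue new transcode work
```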
[16:55:58] Operations, Labs, Patch-For-Review, Tracking: Migrate tools to secondary labstore HA cluster (Scheduled on 11/14) [tracking] - https://phabricator.wikimedia.org/T146154#2791023 (madhuvishy)
[16:59:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:01:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:03:29] RECOVERY - configured eth on mira is OK: OK - interfaces up
[17:06:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[17:06:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[17:10:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[17:12:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:14:29] RECOVERY - configured eth on mira is OK: OK - interfaces up
[17:18:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:20:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[17:49:09] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:51:09] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[18:00:04] (CR) BryanDavis: "> Is there a way we can 'taste' the session and refresh the login" [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[18:03:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:05:29] RECOVERY - configured eth on mira is OK: OK - interfaces up
[18:07:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[18:08:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[18:11:09] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:12:09] RECOVERY - DPKG on mira is OK: All packages OK
[18:12:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:13:29] RECOVERY - configured eth on mira is OK: OK - interfaces up
[18:18:29] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:20:29] RECOVERY - configured eth on mira is OK: OK - interfaces up
[18:24:49] PROBLEM - Host mintaka is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:49] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:39] PROBLEM - Host alnitak is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:49] PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:54] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100%
[18:28:55] uh?
[18:29:07] frack is it?
[18:29:23] Is that the same switch in codfw that dies periodically?
[18:29:40] based on names I expect it's the same flaky fw issue of old, but my bouncer is down so I don't have backscroll here
[18:30:35] there isn't much backscroll… mintaka, pay-lvs2002, alnitak, bellatrix, betelgeuse all down
[18:30:37] I don't see those names in my scrollback again but it doesn't go all that far back
[18:30:39] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[18:30:45] RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 36.50 ms
[18:30:45] grrrr
[18:30:47] and a lot of miscellaneous noise from mira which is I think unrelated
[18:30:50] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.54 ms
[18:30:55] RECOVERY - Host alnitak is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms
[18:30:58] and here they come back yep
[18:31:01] RECOVERY - Host mintaka is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms
[18:31:02] yeah, here they come
[18:32:19] there is an ongoing task for this somewhere I can't find yet, to make a note of the recurrence
[18:33:40] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[18:33:45] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:33:51] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[18:34:14] bouncing again?
[18:34:29] I suspect that the task is named after the particular switch, which, I've no idea what that is
[18:36:11] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[18:36:51] a search for asw turned up nothing I recognize, but I probably didn't look at that task either
[18:36:51] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[18:37:41] T126790 ?
[18:37:49] i am on mobile heh
[18:38:57] godog: that could be it but of course that ticket doesn't actually mention whether or not it causes outage alerts :(
[18:40:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: Puppet has 10 failures
[18:40:21] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms
[18:40:26] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms
[18:40:31] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.53 ms
[18:40:36] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms
[18:40:40] true :(
[18:40:41] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms
[18:40:56] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:40:56] seems on its way to recovery tho
[18:43:23] (PS1) Ori.livneh: statsv: use systemd's process watchdog [puppet] - https://gerrit.wikimedia.org/r/321231
[18:45:06] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: Puppet has 10 failures
[18:46:33] the URL shortening service we use for the channel topic is down
[18:47:04] i was going to check the channel logs to see if indeed it alerted
[18:47:09] on the dates paravoid mentioned
[18:47:16] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:48:16] RECOVERY - configured eth on mira is OK: OK - interfaces up
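The statsv patch above (r/321231) relies on systemd's process watchdog: the unit declares WatchdogSec= plus a restart policy, and the service must ping systemd over the notify socket more often than that interval or it is killed and restarted. A minimal sketch of the service side, with illustrative values rather than the actual statsv code:

```python
# Ping systemd's watchdog from a long-running worker loop.
# Assumed unit configuration (not the actual statsv unit):
#   [Service]
#   WatchdogSec=30
#   Restart=always
import os
import socket
import time

def sd_watchdog_ping():
    """Send WATCHDOG=1 to the socket systemd passes in NOTIFY_SOCKET."""
    addr = os.environ.get('NOTIFY_SOCKET')
    if not addr:
        return  # not running under systemd
    if addr.startswith('@'):
        addr = '\0' + addr[1:]  # abstract-namespace socket
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(b'WATCHDOG=1', addr)
    finally:
        sock.close()

# WATCHDOG_USEC carries the WatchdogSec= period in microseconds;
# pinging at half that interval is the usual rule of thumb.
interval = int(os.environ.get('WATCHDOG_USEC', '30000000')) / 2e6

while True:
    # ... consume and process a batch of messages here ...
    sd_watchdog_ping()
    time.sleep(interval)
```

The point of the mechanism is that a hung process stops pinging, so systemd notices and restarts it — a failure mode that a plain Restart=on-failure cannot catch.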
[18:50:06] RECOVERY - check_puppetrun on pay-lvs2002 is OK: OK: Puppet is currently enabled, last run 77 seconds ago with 0 failures
[18:50:06] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 22 failures
[18:50:07] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 15 failures
[18:50:07] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:50:28] can someone add a comment on the task maybe? i have to go
[18:50:31] thanks for making the comment andrewbogott, I think you're correct and it's this issue, and so far afaik no one knows what to do about it
[18:50:34] godog: seems andrewbogott did
[18:51:06] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[18:53:17] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[18:53:18] !log rmmod acpi_pad on mira
[18:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:06] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 203 seconds ago with 0 failures
[18:55:07] RECOVERY - check_puppetrun on alnitak is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[18:56:27] Operations, ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2791091 (Volans) It happened to `mira` too, it was having high load and failing random Icinga checks. I've `rmmod acpi_pad` and it is now back to normal. It started at 2016-11-13 11:40 UTC according to Grafana...
[19:04:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[19:05:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:08:56] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:09:26] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[19:25:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:27:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:37:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:46:33] (Abandoned) Urbanecm: New user right and user group for et.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/319568 (https://phabricator.wikimedia.org/T149610) (owner: Urbanecm)
[19:49:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:50:47] what's up with esams?
[19:52:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:00:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[20:02:18] apergos: ^ ?
[20:05:12] grrrr
[20:05:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[20:07:03] (PS4) Andrew Bogott: openstack: cache mwclient connection in wikistatus [puppet] - https://gerrit.wikimedia.org/r/321169 (owner: BryanDavis)
[20:07:05] (PS1) Andrew Bogott: wikistatus: Handle a few more state changes [puppet] - https://gerrit.wikimedia.org/r/321233
[20:09:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[20:19:28] I'm getting a whole lot of nothing when looking at either varnish or mw dashboards to see about those criticals
[20:20:00] I don't see anything that started around ... 50 minutes ago, now
[20:20:15] here is when I wish bblack would miraculously appear
[20:21:19] I just took a look and didn't see anything clearly abnormal either -- I'm inclined to think it's the anomaly detection logic being too crude again
[20:22:11] maybe so
[20:22:31] but I'll mention it to him next time we're both on, anyways
[20:22:58] he's been looking at some issues over the past week, this might ring a bell
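For context on these alerts: the "N% of data above the critical threshold" message comes from a check that pulls the last few minutes of a metric from graphite and alerts on the fraction of datapoints exceeding a fixed threshold — crude, as noted above, since a brief spike in a quiet window trips it. A simplified sketch of that logic, not the actual check_graphite script; the metric name and cutoffs are illustrative:

```python
# Percent-over-threshold check in the spirit of "Esams HTTP 5xx reqs/min".
import json
from urllib.request import urlopen

GRAPHITE = 'https://graphite.wikimedia.org/render'
METRIC = 'reqstats.5xx'  # assumption: the real target name may differ
CRIT = 1000.0            # threshold on the metric value
CUTOFF = 20.0            # alert when this % of datapoints exceed CRIT

url = '{}?target={}&from=-10min&format=json'.format(GRAPHITE, METRIC)
datapoints = json.load(urlopen(url))[0]['datapoints']  # [[value, ts], ...]
values = [v for v, _ in datapoints if v is not None]

pct = 100.0 * sum(1 for v in values if v > CRIT) / len(values)
if pct >= CUTOFF:
    print('CRITICAL: {:.2f}% of data above the critical threshold [{}]'.format(pct, CRIT))
else:
    print('OK: Less than {:.2f}% above the threshold [{}]'.format(CUTOFF, CRIT))
```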
[20:27:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:34:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:41:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:50:26] (PS1) Urbanecm: [logo] HD for cebwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/321234 (https://phabricator.wikimedia.org/T150616)
[20:53:30] apergos: I've seen these alerts a lot of times here
[20:53:43] they could be something serious, but I doubt it
[20:54:25] we always try to investigate them because there is often a genuine problem behind the alerts
[20:54:47] of course
[20:54:56] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:56:34] actually, looking at them, 5xx / all requests is 0.013% for esams and 0.012% for eqiad, but 0.003% for codfw and ulsfo
[20:58:21] I wonder how that is possible. wrong requests coming from Europe?
[20:59:13] perhaps (if possible) look at the raw requests and get more information?
[21:04:15] well I looked e.g. at the fatals monitor but
[21:04:35] as mentioned above, nothing really that started around the time of the alerts
[21:06:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[21:07:16] It seems to be getting higher ^^
[21:07:24] was 11% before and now 22.22
[21:07:28] %
[21:14:16] (PS1) Urbanecm: [logo] HD for hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/321237 (https://phabricator.wikimedia.org/T150620)
[21:17:47] Urbanecm, if you're going to add HD logos for all projects, I guess it would be better to consolidate tasks. Otherwise we'd end up with identical tasks and commits 900 times
[21:20:27] arseny92: Not all of them (I'm not familiar with creating logos/rewriting them as SVG), so I'm only searching Commons for SVGs that somebody else created. I download them, convert them to 1.5x and 2x, and then I create a task and commit.
[21:21:11] If you think one task for all projects will be better, T150618 exists.
[21:21:12] T150618: [GOAL] HD logo for all Wikipedias - https://phabricator.wikimedia.org/T150618
[21:22:02] But I don't know if I'll create commits for all projects without an SVG (probably not), so I created it as a goal.
[21:22:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:22:27] Urbanecm i have a patch to add support for svg in mediawiki
[21:22:47] https://gerrit.wikimedia.org/r/#/c/193434/
[21:23:05] since it's a tracking goal, you can add all such tasks as subtasks to it
[21:23:08] svg logos, i mean
[21:23:36] paladox, but I don't have an SVG for all projects. I only find SVGs that already exist. I can't create SVG logos. Converting an SVG to a PNG of a certain size is easy...
[21:23:43] But thanks anyway.
[21:23:48] Oh
[21:23:56] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[21:24:16] (CR) Omeras: [C: -1] "this is a test" [mediawiki-config] - https://gerrit.wikimedia.org/r/321237 (https://phabricator.wikimedia.org/T150620) (owner: Urbanecm)
[21:24:53] So should I stop creating a single task for every project and add T150618 to every commit related to this goal?
[21:25:05] paladox, adding the standard wgLogo and HD versions is still needed for backcompat, as not all browsers support svg
[21:25:21] Oh
[21:26:03] Or what should I do instead of what I do now?
[21:26:40] otherwise Commons and file pages on wikis would stop generating png versions of svg files
[21:27:15] arseny92, was your message for me or for paladox?
[21:27:26] paladox
[21:27:30] Oh
[21:27:37] thx
[21:28:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:29:24] Urbanecm, probably best to ask this in the task, referencing some key people, or ask at Meta for opinions, or even start an RFC to get opinions on how to proceed with this
[21:29:58] An RFC for a maintenance-related thing?
[21:30:10] And who are the key people for logos?
[21:30:22] I don't know about any list of key people by theme :)
[21:30:25] arseny92, ^
[21:30:45] well idk, i'm just suggesting; ask on meta VP?
[21:31:34] I only said I don't think an RFC is the best way. I think it would be better to ask certain people... but whom? :)
[21:31:44] Oh, I should ask for the list on meta?
[21:33:04] either way this is going to flood either tasks (if you're going to add subtasks) or commits (if you ref them all to the same task, as gerritbot will flood the task)
[21:33:44] or maybe a better design is to have a project with a workboard for this
[21:34:57] I think flooding commits is better than flooding tasks... because if I have only one commit for all the HD logos that aren't done yet... how long will it take before it can be deployed?
[21:35:07] And of course the best is no flooding.
[21:37:32] a workaround would be to submit in batches, like you commit logos for 15 or so projects at a time (depending on how much you can test at once in the swat time on mw1099)
[21:38:24] Okay. I'll merge the current two patches into one and keep amending it until some SWAT window, then create a new patch after it.
[21:38:30] Is that okay for you, arseny92?
[21:41:24] For me? Well, I'm just suggesting ways to solve design issues. How you do it in the end is obviously up to you.
[21:42:07] Yes, much better i guess
[21:42:17] Okay.
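For the conversion step Urbanecm describes — render each found SVG at 1x, 1.5x, and 2x for the standard and HD logo files — something like the sketch below works. It assumes rsvg-convert (any SVG rasterizer will do) and the conventional 135x155 wiki logo canvas; the file names are placeholders.

```python
# Render a wiki logo SVG at the 1x / 1.5x / 2x sizes used by
# $wgLogo and the '1.5x' / '2x' keys of $wgLogoHD.
import subprocess

BASE_W, BASE_H = 135, 155  # conventional wiki logo canvas; adjust per project

def render(svg_path, out_path, scale):
    subprocess.check_call([
        'rsvg-convert',
        '-w', str(round(BASE_W * scale)),
        '-h', str(round(BASE_H * scale)),
        '-o', out_path,
        svg_path,
    ])

for scale, suffix in ((1, ''), (1.5, '-1.5x'), (2, '-2x')):
    render('example-logo.svg', 'examplewiki%s.png' % suffix, scale)
```

The resulting PNGs are then what a mediawiki-config patch like the ones below wires up via the wiki's wgLogo and wgLogoHD settings.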
[21:46:24] (PS2) Urbanecm: HD logos for multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/321234 (https://phabricator.wikimedia.org/T150618)
[21:46:50] (Abandoned) Urbanecm: [logo] HD for hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/321237 (https://phabricator.wikimedia.org/T150620) (owner: Urbanecm)
[21:50:57] (PS3) Urbanecm: HD logos for multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/321234 (https://phabricator.wikimedia.org/T150618)
[21:56:56] (PS4) Urbanecm: HD logos for multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/321234 (https://phabricator.wikimedia.org/T150618)
[21:58:06] (PS5) Urbanecm: HD logos for multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/321234 (https://phabricator.wikimedia.org/T150618)
[22:01:16] PROBLEM - HTTPS-blog on blog.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate blog.wikimedia.org valid until 2016-12-13 22:01:00 +0000 (expires in 29 days)
[22:06:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[22:08:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:05:29] Operations, Gerrit, grrrit-wm, Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2791411 (Krenair)
[23:06:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[23:09:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:44:00] (PS1) Thcipriani: Bump scap version to 3.3.1-1 [puppet] - https://gerrit.wikimedia.org/r/321339
[23:44:23] (CR) Thcipriani: [C: -1] "Needs new package on carbon 1st" [puppet] - https://gerrit.wikimedia.org/r/321339 (owner: Thcipriani)