[00:04:37] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:05:46] RECOVERY - puppet last run on mw1222 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[00:08:27] (03PS6) 10BryanDavis: debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732)
[00:08:33] (03CR) 10jenkins-bot: [V: 04-1] debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis)
[00:09:42] (03Abandoned) 10BryanDavis: Update beta cluster logging config for namespaced classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201986 (owner: 10BryanDavis)
[00:13:01] (03PS1) 10Dzahn: change store to a CNAME for c.ssl.shopify.com. [dns] - 10https://gerrit.wikimedia.org/r/203497 (https://phabricator.wikimedia.org/T92438)
[00:14:37] PROBLEM - puppet last run on mw2058 is CRITICAL puppet fail
[00:14:50] (03PS7) 10BryanDavis: debug logging: Convert to Monolog logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732)
[00:16:20] (03CR) 10Dzahn: [C: 031] "this is actually nicer than the previous method because now we can just see if it works without touching the existing shops and they say "" [dns] - 10https://gerrit.wikimedia.org/r/203497 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[00:18:05] (03CR) 10Dzahn: [C: 032] "doesn't touch current main shop url and is revertable in minutes, so doing it now" [dns] - 10https://gerrit.wikimedia.org/r/203497 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[00:22:20] (03Abandoned) 10Dzahn: shop URL: change 'shop' to 'store' [dns] - 10https://gerrit.wikimedia.org/r/199796 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[00:22:47] (03CR) 10Dzahn: "see https://gerrit.wikimedia.org/r/#/c/203497/ instead. per shopify the CNAME changed" [dns] - 10https://gerrit.wikimedia.org/r/199796 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[00:23:12] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T92438#1199993" [dns] - 10https://gerrit.wikimedia.org/r/199796 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[00:24:20] (03PS1) 10Dzahn: Revert "change store to a CNAME for c.ssl.shopify.com." [dns] - 10https://gerrit.wikimedia.org/r/203498
[00:25:31] (03CR) 10Dzahn: [C: 032] "shopify said we can change it anytime, but "" [dns] - 10https://gerrit.wikimedia.org/r/203498 (owner: 10Dzahn)
[00:32:15] (03CR) 10BryanDavis: [C: 04-1] "-1 because this isn't safe until after the next train deploy (2015-04-14)" (0312 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis)
[00:32:37] RECOVERY - puppet last run on mw2058 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[00:43:40] 6operations, 6Security, 10Wikimedia-Shop, 7HTTPS, 5Patch-For-Review: Changing the URL for the Wikimedia Shop - https://phabricator.wikimedia.org/T92438#1200521 (10Dzahn) so shopify said "whenever you're ready point your CNAME for store to c.ssl.shopify.com." and that we should use the _new_ CNAME for sto...
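The store record was pointed at Shopify's new CNAME target above and then reverted within minutes; the stated appeal was that the change could be validated without touching the live shop URL. A minimal sketch of how such a record could be spot-checked from any host with dig installed: the hostname and expected target come from the patches above, everything else is a hypothetical helper, not part of any actual deploy process.

```python
#!/usr/bin/env python3
"""Spot-check that a hostname resolves to the expected CNAME target."""
import subprocess
import sys


def cname_target(name):
    # `dig +short CNAME <name>` prints just the target record (or nothing).
    result = subprocess.run(
        ["dig", "+short", "CNAME", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().rstrip(".")


if __name__ == "__main__":
    name = "store.wikimedia.org"      # record touched in r/203497 and r/203498
    expected = "c.ssl.shopify.com"    # target named by Shopify per T92438
    actual = cname_target(name)
    print(f"{name} -> {actual or '(no CNAME)'}")
    sys.exit(0 if actual == expected else 1)
```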
[00:51:40] 6operations, 10hardware-requests: hardware for global ganglia aggregator in eqiad - https://phabricator.wikimedia.org/T95792#1200545 (10Dzahn) 3NEW
[00:53:30] 6operations, 7Monitoring, 5Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1145909 (10Dzahn)
[00:53:32] 6operations, 10hardware-requests: hardware for global ganglia aggregator in eqiad - https://phabricator.wikimedia.org/T95792#1200555 (10Dzahn)
[00:58:37] RECOVERY - Graphite Carbon on graphite2001 is OK All defined Carbon jobs are running.
[01:01:03] (03PS1) 10Dzahn: releases: move zip install out of node into role [puppet] - 10https://gerrit.wikimedia.org/r/203501
[01:01:14] 6operations: unzip on tin - https://phabricator.wikimedia.org/T83213#1200556 (10Dzahn)
[01:13:56] 6operations, 10ops-esams, 5Patch-For-Review: decommission cp3001 & cp3002 - https://phabricator.wikimedia.org/T94215#1200568 (10Dzahn) just mgmt DNS, racktables and maybe switch ports are left of these
[01:54:02] is anyone there who can help with a database issue?
[02:02:45] springle: ping
[02:09:03] ebernhardson: pong
[02:09:12] reading your email to ops@ now
[02:14:11] springle: ebernhardson I’m back in the office if there’s anything I can help with (can’t think of any atm, just a fyi)
[02:17:53] thx, the actual content is guaranteed still in the ES databases, what was overwritten are the pointers from individual flow revisions to the row in the ES database that contains their content (the rev_content field)
[02:22:46] YuviPanda, you're in SF?
[02:22:59] superm401: yes :)
[02:24:37] YuviPanda, yeah, if you want to talk over the DB replication and backup (either in person or otherwise), that would be great. We're in the Collaboration team corner, on the 3rd floor.
[02:24:38] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 37s)
[02:24:50] Logged the message, Master
[02:29:28] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-11 02:28:24+00:00
[02:29:33] Logged the message, Master
[02:37:53] superm401: uh, I can come over but springl.e is your man, I’ve no idea about our db setup as such.
[02:38:34] ebernhardson: these from the maint script? UPDATE /* Flow\Data\Storage\RevisionStorage::update */ `flow_revision` SET rev_content_length = '55',rev_flags = 'external',rev_content = 'DB://cluster24/104561' WHERE rev_id = ''
[02:39:07] springle: looks like what i would expect, yes.
[02:39:53] ebernhardson: your email said "flow_content is not one of those fields". s/flow/rev/ ?
[02:40:18] springle: yes i meant rev_content
[02:40:52] it's not supposed to be updatable, but based on the current state of the db our guess is that it was updated. Thanks for verifying that is what happened.
[02:41:17] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0]
[02:41:29] * YuviPanda damns labstore1001
[02:42:19] "for whatever reason" :)
[02:42:36] greg-g: :) it *is* the catchall
[02:42:50] * YuviPanda looks at https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring
[02:43:11] ebernhardson: as for backup: 30 days of logs but no snapshot dumps older than 8 hours for X1
[02:44:01] (everything seems ok on labstore, btw)
[02:44:03] springle: unfortunate for us, but ok. We will run through some other ideas for how to re-associate this data.
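The statement springle quotes at 02:38:34 shows the shape of what the maintenance script wrote: each UPDATE clobbers rev_content with an External Store address ('DB://cluster24/104561') together with rev_content_length and rev_flags. A rough sketch of how a grepped log of such statements (for example the T90443 dump files handed over later in the night) could be reduced to a table of the overwritten values; the regex mirrors the single statement quoted above, and the script is an illustration of the idea, not a recovery tool that was actually run.

```python
#!/usr/bin/env python3
"""Extract (rev_id, ES address, content length, flags) from a log of UPDATE
statements shaped like the one quoted at 02:38:34."""
import csv
import gzip
import re
import sys

# Field order follows the statement emitted by
# Flow\Data\Storage\RevisionStorage::update; other shapes are skipped.
PATTERN = re.compile(
    r"UPDATE .*?`flow_revision` SET "
    r"rev_content_length = '(?P<length>\d+)',"
    r"rev_flags = '(?P<flags>[^']*)',"
    r"rev_content = '(?P<address>DB://[^']+)' "
    r"WHERE rev_id = '(?P<rev_id>[^']*)'"
)


def main(path):
    writer = csv.writer(sys.stdout)
    writer.writerow(["rev_id", "es_address", "content_length", "flags"])
    with gzip.open(path, mode="rt", errors="replace") as fh:
        for line in fh:
            m = PATTERN.search(line)
            if m:
                writer.writerow([m["rev_id"], m["address"],
                                 m["length"], m["flags"]])


if __name__ == "__main__":
    main(sys.argv[1])  # e.g. a file like T90443_flowdb.sql.gz
```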
[02:44:52] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 05m 28s)
[02:44:59] Logged the message, Master
[02:45:40] springle: i appreciate you checking in on a saturday
[02:46:01] ebernhardson: we could at least use the logs to determine the records touched by that script. would that help speed the process?
[02:46:07] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[02:46:58] springle: hmm, a complete list of the UPDATE statements like that could help a lot, yes
[02:47:14] it should be the only UPDATEs to rev_content
[02:47:23] ok. easy enough to grep
[02:48:13] springle: do comments show up in the sql logs?
[02:48:27] ebernhardson: were the rev_user_* fields clobbered too?
[02:48:39] pastacat: yes
[02:49:00] ebernhardson: and rev_mod_timestamp ?
[02:49:07] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-11 02:48:04+00:00
[02:49:11] Logged the message, Master
[02:49:51] pastacat: not as far as we have noticed yet
[02:49:57] I guess that won't help for really old text though
[02:50:22] ebernhardson: I assume rev_content_length is broken though?
[02:50:32] really what we are missing is a bunch of strings like 'DB://cluster24/104561'
[02:51:14] we have an idea for filtering the ids in external store against the ones that are actually pointed to by core, but not sure if there is a bunch of orphaned content in ES or not
[02:52:33] hmm, actually because it's sharded between cluster24 and cluster25 that might find the ids, but won't be able to associate back to the revisions they belong to
[02:55:36] springle, at any given time, is more than one External Store cluster used?
[02:55:46] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[02:58:01] hmm, labstore *io* is peaking, but not iowait
[02:58:10] superm401: yes, writes balance across two clusters
[02:58:32] $wgDefaultExternalStore = array( 'DB://cluster24', 'DB://cluster25',
[02:58:35] );
[03:00:22] * pastacat remembers bitching about it when it was just one cluster
[03:03:07] ebernhardson: where do you want the list of UPDATEs? somewhere on terbium?
[03:04:26] springle: terbium will work yes, any temp dir i can copy it from
[03:05:55] ebernhardson: /tmp/T90443_flowdb.sql.gz
[03:06:14] springle: got it, thanks
[03:08:46] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[03:14:07] ebernhardson, output from:
[03:14:09] echo "select CONCAT_WS( ',', CONCAT( '''', substring(old_text from 6 for 9), ''''), CONCAT( '''', substring(old_text from 16), '''')) FROM text where substring(old_text from 6 for 9) IN ('cluster24', 'cluster25') order by old_id desc"|mysql -h s3-analytics-slave officewiki|tee officewiki_external_storage_cluster.csv
[03:14:29] is on stat1003 in my home directory.
[03:16:47] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[03:23:46] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 66.67% of data above the critical threshold [35.0]
[03:24:06] (03PS1) 10Aaron Schulz: Lowered "max lag" to 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203508
[03:25:14] springle, are cluster24 and cluster25 mirrored somewhere I can access from stat1003?
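superm401's one-liner above dumps a CSV of (cluster, id) pairs for every External Store address referenced from a wiki's core text table, and the idea floated at 02:51 is to compare those against the rows that actually exist on cluster24/cluster25 to find addresses that only Flow ever pointed at. A hedged sketch of that set difference, assuming two CSV inputs in the same quoted two-column format the query produces; it illustrates the filtering idea only, not whatever the team ultimately ran.

```python
#!/usr/bin/env python3
"""Report External Store addresses present in the ES clusters but not
referenced from core's text table (candidate Flow-only content)."""
import csv
import sys


def load_addresses(path):
    """Read a CSV of 'cluster','id' rows into a set of DB:// addresses."""
    addrs = set()
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) >= 2:
                cluster, blob_id = row[0].strip("'"), row[1].strip("'")
                addrs.add(f"DB://{cluster}/{blob_id}")
    return addrs


if __name__ == "__main__":
    # argv[1]: addresses referenced by core (e.g. officewiki_external_storage_cluster.csv)
    # argv[2]: addresses that exist in the ES blob tables (a hypothetical second dump)
    referenced = load_addresses(sys.argv[1])
    existing = load_addresses(sys.argv[2])
    for addr in sorted(existing - referenced):
        print(addr)
```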
[03:25:37] something’s up with tools-submit
[03:25:40] * YuviPanda continues digging
[03:26:47] PROBLEM - HHVM rendering on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time
[03:27:16] PROBLEM - Apache HTTP on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time
[03:27:47] PROBLEM - HHVM processes on mw1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[03:27:50] superm401: probably not, but we can appropriate a CODFW slave for this today if you need it
[03:28:15] springle, I think we're okay for now.
[03:28:47] superm401: although, stat1003 network access might be the limitation. terbium would be better
[03:28:52] ok
[03:29:14] springle, yeah, ebernhardson is not using stat1003 right now.
[03:29:20] first we will be developing a recovery for officewiki, since it's the smallest data set to work with. But when we do frwiki and enwiki we will probably need those. I don't know if we will get that far tonight
[03:29:27] RECOVERY - HHVM processes on mw1081 is OK: PROCS OK: 25 processes with command name hhvm
[03:29:58] RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 65215 bytes in 0.295 second response time
[03:30:07] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[03:30:27] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time
[03:31:13] ebernhardson: np. just shout
[03:31:44] (03PS1) 10Yuvipanda: tools: Do not allow jlocal in crontab [puppet] - 10https://gerrit.wikimedia.org/r/203510 (https://phabricator.wikimedia.org/T95796)
[03:35:37] (03PS2) 10Yuvipanda: tools: Do not allow jlocal in crontab [puppet] - 10https://gerrit.wikimedia.org/r/203510 (https://phabricator.wikimedia.org/T95796)
[03:36:20] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Do not allow jlocal in crontab [puppet] - 10https://gerrit.wikimedia.org/r/203510 (https://phabricator.wikimedia.org/T95796) (owner: 10Yuvipanda)
[03:41:36] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 87.50% of data above the critical threshold [35.0]
[03:44:24] booooooo
[03:47:56] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[03:50:04] springle: we are going to need a grep for UPDATE and rev_content for one more database: while almost all wikis have their flow data in flowdb on x1, officewiki, being a private wiki, has its data in the main officewiki database
[03:51:03] ebernhardson: same time frame i guess?
[03:53:18] ebernhardson: I didn't realise officewiki had the flow extension locally. we do have full dumps from Apr 6th and Mar 24th for officewiki
[03:55:37] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[03:56:43] springle: no rush on that one, based on the info we have so far we are developing a plan but it will take some time to implement and we will not be doing that tonight
[03:57:05] but we are confident now we have a plan to correlate the ES data back into flow, partially by using the content length and ids from your UPDATE logs
[03:57:31] ebernhardson: ok, do you want the officewiki UPDATE logs in order to test the plan?
[03:57:40] springle: yes
[03:57:47] then possibly use the officewiki dumps to cross check
[03:57:50] ok, coming up
[03:57:56] Thanks, springle
[04:02:00] !log aaron Synchronized php-1.26wmf1/includes/jobqueue/JobRunner.php: 65ff16efa7a69dfbec4c70df22d89a1b12c60762 (duration: 00m 11s)
[04:02:04] Logged the message, Master
[04:03:47] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[04:09:13] ebernhardson: terbium:/tmp/T90443_officewiki.sql.gz
[04:13:52] springle: thanks!
[04:15:00] springle, do you know how pruning works for external storage?
[04:15:08] superm401: nope
[04:16:57] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 66.67% of data above the critical threshold [35.0]
[04:30:06] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[04:32:05] (03CR) 10Yuvipanda: "Alright, I now know why it was allowed, still think we shouldn't." [puppet] - 10https://gerrit.wikimedia.org/r/203510 (https://phabricator.wikimedia.org/T95796) (owner: 10Yuvipanda)
[04:38:07] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0]
[04:39:56] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[04:40:02] (03PS1) 10Yuvipanda: Use isoformat in datetime logs, rather than asctime [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/203513
[04:40:38] PROBLEM - puppet last run on cp3030 is CRITICAL puppet fail
[04:46:35] 6operations, 7Monitoring: Allow customizing the alert message from graphite - https://phabricator.wikimedia.org/T95801#1200702 (10yuvipanda) 3NEW
[04:51:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0]
[04:58:27] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:58:30] springle: I can't wait for RBR, that would cut down on some lag due to avoiding some of the scanning aspects of replicated queries
[05:02:37] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[05:05:47] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[05:09:17] cheesecat: and potentially increase the volume of binlog data for queries that touch many rows. remains to be seen whether we just end up moving the goal posts
[05:09:31] worst case, there is MIXED
[05:09:43] does that actually act clever based on affected row count?
[05:09:59] we're defaulting to MIXED now on the mariadb 10 config
[05:10:19] but masters are still 5.5 right?
[05:10:52] no, afaik RBR only thinks about consistency, and defaults to STATEMENT to keep log size down
[05:10:57] correct
[05:11:54] is there an eta on upgrading those?
[05:12:23] asap after all slaves are upgraded, or we trial a codfw switchover
[05:13:19] probably s1 could happen already, actually
[05:13:37] but hesitant to experiment with s1 :)
[05:14:40] cheesecat: are your nicks going to be reliably *cat? :)
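A side note on the binlog-format exchange just above (04:58-05:13): STATEMENT replays the original query on each replica (compact logs, but the replica repeats any scanning work), ROW ships the changed rows instead (no re-execution, but wide updates can inflate the binlog), and MIXED lets the server pick per statement based on safety rather than row count. A minimal sketch of checking which format a server is running; the pymysql dependency and the host name are assumptions for the example, not how the fleet is actually inspected.

```python
#!/usr/bin/env python3
"""Report a MariaDB/MySQL server's global binlog_format (STATEMENT, ROW or MIXED)."""
import pymysql


def binlog_format(host):
    # Credentials come from the default option file; only a read of a global
    # variable is performed.
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.binlog_format")
            (fmt,) = cur.fetchone()
            return fmt
    finally:
        conn.close()


if __name__ == "__main__":
    host = "db-master.example"  # placeholder host name
    print(f"{host}: binlog_format={binlog_format(host)}")
```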
[05:15:04] for the time being
[05:15:10] lots of nicks stashed up
[05:15:13] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 11 05:14:10 UTC 2015 (duration 14m 9s)
[05:15:19] burritocat was cooler
[05:15:19] Logged the message, Master
[05:23:38] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[05:23:56] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 66.67% of data above the critical threshold [35.0]
[05:24:03] 6operations, 10hardware-requests: hardware for global ganglia aggregator in eqiad - https://phabricator.wikimedia.org/T95792#1200719 (10Dzahn)
[05:24:28] cheesecat: did you intend to leave db-codfw.php maxlag higher?
[05:25:08] no
[05:25:12] i suppose that does make sense, but maybe once we get to 10s
[05:25:31] well it doesn't really make sense, just a mistake
[05:25:47] the lag is the log application lag, not the log reception lag
[05:25:54] * cheesecat amends
[05:26:45] (03PS2) 10Aaron Schulz: Lowered "max lag" to 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203508
[05:28:56] we'll find out
[05:29:46] (03CR) 10Springle: [C: 032] Lowered "max lag" to 15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203508 (owner: 10Aaron Schulz)
[05:30:28] !log springle Synchronized wmf-config/db-eqiad.php: reduce max lag to 15s, gerrit 203508 (duration: 00m 11s)
[05:30:34] Logged the message, Master
[05:30:48] !log springle Synchronized wmf-config/db-codfw.php: reduce max lag to 15s, gerrit 203508 (duration: 00m 12s)
[05:30:51] Logged the message, Master
[05:32:00] and because it's a Friday night, HDD and Floppy Music: Nirvana - Smells Like Teen Spirit - https://www.youtube.com/watch?v=G081hD0nwWE
[05:45:52] springle: I guess when 'max lag' gets low enough, we may want to exempt the vslow boxen, leaving them with a higher value perhaps
[05:51:25] ah true
[05:51:59] huge sorting spikes seem to often correlate with a bit of lag
[06:30:56] PROBLEM - puppet last run on mw1162 is CRITICAL puppet fail
[06:32:27] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures
[06:34:57] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures
[06:34:57] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:34:57] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:35:08] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:35:27] PROBLEM - puppet last run on mw1213 is CRITICAL Puppet has 1 failures
[06:35:27] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures
[06:35:46] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:36:06] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:36:17] PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 1 failures
[06:36:17] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:46:26] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:46:56] RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:47:07] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
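The 'max lag' value merged above (gerrit 203508) is the replication-lag threshold MediaWiki reads from the db config files; the later idea is to leave the vslow hosts at a higher value when the threshold gets tighter. Purely as an illustration of the threshold idea, a sketch that polls replica lag and flags hosts above 15 seconds; MediaWiki itself does not run a script like this, and the host names and pymysql dependency are assumptions.

```python
#!/usr/bin/env python3
"""Poll replica lag and flag hosts that exceed a MediaWiki-style 'max lag'."""
import pymysql

MAX_LAG = 15  # seconds, the value set by gerrit 203508
REPLICAS = ["replica1.example", "replica2.example"]  # placeholder host names


def replica_lag(host):
    """Return Seconds_Behind_Master for a replica, or None if unknown."""
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return None if row is None else row["Seconds_Behind_Master"]
    finally:
        conn.close()


if __name__ == "__main__":
    for host in REPLICAS:
        lag = replica_lag(host)
        state = "unknown" if lag is None else ("LAGGED" if lag > MAX_LAG else "ok")
        print(f"{host}: lag={lag} ({state})")
```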
[06:47:07] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:47:27] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:57] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:57] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:37] RECOVERY - puppet last run on mw1162 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:47] PROBLEM - puppet last run on mw2074 is CRITICAL puppet fail
[07:22:37] RECOVERY - puppet last run on mw2074 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:59:24] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: wikibooks.org redirects to en.wikibooks.org - https://phabricator.wikimedia.org/T87039#1200819 (10Glaisher) 5Open>3Resolved a:3Glaisher @Andrew Thanks for the merge. Looks like the caching issue has been resolved now. ``` HTTP/1.1 301 M...
[09:19:15] (03PS1) 10Yuvipanda: Revert "tools: Do not allow jlocal in crontab" [puppet] - 10https://gerrit.wikimedia.org/r/203525
[09:19:25] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "tools: Do not allow jlocal in crontab" [puppet] - 10https://gerrit.wikimedia.org/r/203525 (owner: 10Yuvipanda)
[10:14:48] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 10.34% of data above the critical threshold [100000000.0]
[10:39:08] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[11:29:06] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[11:35:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 62.50% of data above the critical threshold [35.0]
[11:56:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds
[12:12:46] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[12:20:53] !log krinkle Synchronized php-1.25wmf24/includes/Title.php: T95811 (duration: 00m 11s)
[12:21:01] Logged the message, Master
[12:22:26] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[12:25:26] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[12:26:50] !log krinkle Synchronized php-1.26wmf1/includes/Title.php: T95811 (duration: 00m 12s)
[12:26:57] Logged the message, Master
[13:00:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 7 below the confidence bounds
[14:24:38] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[14:49:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[14:49:11] (03CR) 10Mjbmr: "@Aklapper What did I say exactly here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201977 (owner: 10Mjbmr)
[15:15:01] (03CR) 10John F. Lewis: [C: 031] shop redirects: store instead of shop [puppet] - 10https://gerrit.wikimedia.org/r/199791 (https://phabricator.wikimedia.org/T92438) (owner: 10Dzahn)
[16:19:10] (03PS1) 10John F. Lewis: codfw: add datacenter to dns config [dns] - 10https://gerrit.wikimedia.org/r/203544
[16:19:22] (03CR) 10jenkins-bot: [V: 04-1] codfw: add datacenter to dns config [dns] - 10https://gerrit.wikimedia.org/r/203544 (owner: 10John F. Lewis)
[16:19:35] (03PS2) 10John F. Lewis: codfw: add datacenter to dns config [dns] - 10https://gerrit.wikimedia.org/r/203544
[16:20:39] (03CR) 10John F. Lewis: "submitted https://gerrit.wikimedia.org/r/#/c/203544/ which replaces this change." [dns] - 10https://gerrit.wikimedia.org/r/196076 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn)
[16:20:49] (03CR) 10John F. Lewis: "submitted https://gerrit.wikimedia.org/r/#/c/203544/ which replaces this change." [dns] - 10https://gerrit.wikimedia.org/r/196069 (https://phabricator.wikimedia.org/T92377) (owner: 10Dzahn)
[16:26:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60627 bytes in 2.087 second response time
[16:28:06] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 24.14% of data above the critical threshold [100000000.0]
[16:31:05] JohnLewis, did you intend to add eqiad entries to templates/0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa ?
[16:31:21] Krenair: probably not, let me check
[16:31:41] copy and paste >.>
[16:32:10] :P
[16:32:56] (03PS3) 10John F. Lewis: codfw: add datacenter to dns config [dns] - 10https://gerrit.wikimedia.org/r/203544
[16:33:03] Krenair: thanks for finding that :)
[17:03:46] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[17:40:55] (03PS1) 10BBlack: cache.pp cleanup: un-nest class defs [puppet] - 10https://gerrit.wikimedia.org/r/203553
[17:40:57] (03PS1) 10BBlack: cache.pp cleanup: ws (classes unindent) [puppet] - 10https://gerrit.wikimedia.org/r/203554
[17:40:59] (03PS1) 10BBlack: cache.pp cleanup: a little formatting [puppet] - 10https://gerrit.wikimedia.org/r/203555
[17:41:01] (03PS1) 10BBlack: cache.pp cleanup: ws (some nodes unindent) [puppet] - 10https://gerrit.wikimedia.org/r/203556
[17:41:03] (03PS1) 10BBlack: cache.pp cleanup: -decommed_nodes list (unused?) [puppet] - 10https://gerrit.wikimedia.org/r/203557
[17:41:05] (03PS1) 10BBlack: cache.pp cleanup: class names: s/r::c::varnish::/r::c::/ [puppet] - 10https://gerrit.wikimedia.org/r/203558
[17:41:07] (03PS1) 10BBlack: cache.pp cleanup: class names: s/r::c::ssl::/r::c::ssl_/ [puppet] - 10https://gerrit.wikimedia.org/r/203559
[17:41:09] (03PS1) 10BBlack: cache.pp cleanup: fully qualify ssl-related definition names [puppet] - 10https://gerrit.wikimedia.org/r/203560
[17:41:11] (03PS1) 10BBlack: cache.pp cleanup: decompress ssl def usage [puppet] - 10https://gerrit.wikimedia.org/r/203561
[17:41:13] (03PS1) 10BBlack: cache.pp cleanup: various format nits [puppet] - 10https://gerrit.wikimedia.org/r/203562
[17:41:15] (03PS1) 10BBlack: cache.pp cleanup: split to one global class/def per file [puppet] - 10https://gerrit.wikimedia.org/r/203563
[17:43:00] bblack: hey, around?
[17:43:38] sorta :)
[17:44:09] bblack: okay, since you're the gdnsd god, mind giving https://gerrit.wikimedia.org/r/#/c/203544/ a look over when you get spare time this weekend or Monday perhaps? :)
[17:46:17] JohnLewis: ok, probably monday. Also, there's still https://phabricator.wikimedia.org/T83110 hanging around. I don't remember for sure if I finished up whatever needed to be done with eqiad's IP layout or not, which might be relevant here, I'll need to dig into it a bit.
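The copy/paste mix-up Krenair caught at 16:31 comes from how IPv6 reverse zones are named: an ip6.arpa file name is the prefix's nibbles in reverse order, so templates/0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa covers 2620:0:860::/48 and records for a different site's /48 do not belong in it. A small sketch of that arithmetic with the standard library's ipaddress module; the second prefix is shown only for contrast.

```python
#!/usr/bin/env python3
"""Show which ip6.arpa zone a nibble-aligned IPv6 prefix belongs to."""
import ipaddress


def ip6_arpa_zone(prefix):
    """Return the ip6.arpa zone name covering a nibble-aligned IPv6 prefix."""
    net = ipaddress.ip_network(prefix)
    nibbles = net.network_address.exploded.replace(":", "")  # 32 hex digits
    keep = net.prefixlen // 4                                 # nibbles covered by the prefix
    return ".".join(reversed(nibbles[:keep])) + ".ip6.arpa"


if __name__ == "__main__":
    # 2620:0:860::/48 reverses to exactly the file name flagged above.
    for prefix in ["2620:0:860::/48", "2620:0:861::/48"]:
        print(f"{prefix} -> {ip6_arpa_zone(prefix)}")
```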
[17:47:22] bblack: Monday is fine and re. that bug, is it still in the 'we need to consider it' stage or are there some ideas or so documented somewhere?
[17:47:54] if there are some ideas somewhere, I don't mind interpreting and digging away at improving it for codfw in the patch
[17:48:52] I don't think the reasoning is fully documented well outside the zonefiles themselves. I'd look for any inconsistency in the layouts of the public service IP ranges between esams/ulsfo/eqiad currently, eqiad may be lagging behind on some moves to arrive at the layout that's probably already correct at the others.
[17:49:09] (and make codfw do the right thing even if eqiad is still out of whack a little)
[17:49:56] fixing eqiad for any lingering issue there, if there is any, would be a long two-step process of setting up the correct IPs and migrating traffic to them and then eventually decomming the old ones, etc...
[17:50:35] I don't really remember, it's been a while, but it's possible I already fixed everything there, too.
[17:51:42] anyways, I'm outta here for now, I'm starving :) Thanks for looking at this stuff!
[17:52:02] go get something to eat and poke the patch on Monday with changes :)
[20:56:23] !log aaron Synchronized php-1.26wmf1/includes/jobqueue/JobRunner.php: 2e96dc28ef225441547f4e61acb8a09cb5c0709e (duration: 00m 12s)
[20:56:32] Logged the message, Master
[21:05:55] !log aaron Synchronized php-1.26wmf1/maintenance/Maintenance.php: 103c7f7534b69f7a920edd3b893e25851301e79c (duration: 00m 12s)
[21:06:00] Logged the message, Master
[21:24:40] 6operations, 10hardware-requests: hardware for global ganglia aggregator in eqiad - https://phabricator.wikimedia.org/T95792#1201225 (10Andrew) p:5Triage>3Normal
[21:25:08] 6operations, 7Monitoring: Allow customizing the alert message from graphite - https://phabricator.wikimedia.org/T95801#1201226 (10Andrew) p:5Triage>3Normal
[21:26:14] 6operations, 10Continuous-Integration, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#1201229 (10Andrew) p:5Triage>3High
[21:26:43] 6operations: Allow access to https://archiva.wikimedia.org from analytics nodes. - https://phabricator.wikimedia.org/T95712#1201231 (10Andrew) p:5Triage>3Normal
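bblack's suggestion earlier in the afternoon was to look for inconsistencies in the public service IP layouts across esams/ulsfo/eqiad before laying out codfw. A toy sketch of that kind of cross-site comparison; the assignment data below is a made-up placeholder (in practice it would be extracted by hand from the zone templates), so only the shape of the check is meant to be illustrative.

```python
#!/usr/bin/env python3
"""Report services that are defined at some sites but missing at others."""
from collections import defaultdict

ASSIGNMENTS = [
    # (service, site) pairs -- hypothetical examples only, not the real plan
    ("text-lb", "eqiad"), ("text-lb", "esams"), ("text-lb", "ulsfo"),
    ("upload-lb", "eqiad"), ("upload-lb", "esams"), ("upload-lb", "ulsfo"),
    ("misc-web-lb", "eqiad"), ("misc-web-lb", "esams"),
]


def main():
    sites = sorted({site for _, site in ASSIGNMENTS})
    by_service = defaultdict(set)
    for service, site in ASSIGNMENTS:
        by_service[service].add(site)
    for service, present in sorted(by_service.items()):
        missing = [s for s in sites if s not in present]
        if missing:
            print(f"{service}: missing at {', '.join(missing)}")


if __name__ == "__main__":
    main()
```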
[21:27:14] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1201233 (10Andrew) p:5Triage>3Normal
[21:27:33] 6operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1201235 (10Andrew) p:5Triage>3Normal
[21:28:12] 6operations, 10ops-codfw: mw2128 not rebooting after network driver crash, blank console - https://phabricator.wikimedia.org/T95264#1201237 (10Andrew) p:5Triage>3High
[21:28:20] 6operations, 10ops-codfw: mw2128 not rebooting after network driver crash, blank console - https://phabricator.wikimedia.org/T95264#1184867 (10Andrew) p:5High>3Normal
[21:28:39] 6operations, 10ops-eqiad, 10Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1201240 (10Andrew) p:5Triage>3High
[21:28:57] 6operations: Trigger some sort of alert if the memcache-serious log file is filling up at a greater than usual rate - https://phabricator.wikimedia.org/T95231#1201243 (10Andrew) p:5Triage>3Normal
[21:29:11] 6operations, 10Wikimedia-Apache-configuration: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1201245 (10Andrew) p:5Triage>3High
[21:29:51] 6operations, 6Scrum-of-Scrums, 3Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1201246 (10Andrew) p:5High>3Normal
[21:31:46] PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail
[21:32:33] 6operations, 10Wikimedia-Labs-wikitech-interface, 7Regression: Some wikitech.wikimedia.org thumbnails broken (404) - https://phabricator.wikimedia.org/T93041#1201254 (10Andrew) a:3Andrew
[21:47:57] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[22:27:54] 6operations, 10Wikimedia-General-or-Unknown, 7database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1201301 (10Krenair) I don't have any particularly sane ways of finding out whether it exists but is orphaned. Would the IDs have been...
[23:07:23] (03PS1) 10devunt: Add Josa extension and deploy to Korean language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712)
[23:11:06] (03CR) 10devunt: "This is my first time to deploy a new extension to wmf and I'm very not sure that this is the right way. So please look carefully and plea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203627 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt)
[23:36:22] mutante, ugh, does the rt migration script not keep timestamps?
[23:37:06] oh no, I see what's going on
[23:37:16] ignore me, it's far too late :/
[23:37:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[23:45:35] anyone know what capella was?
[23:46:55] Krenair: OrientDB testing
[23:47:02] looks to have once been called mobile1 in pmtpa, then reappears in codfw
[23:47:07] is that actually the same server?
[23:47:25] Krenair: wikitech page says "capella is a Wikimedia IPv6 tunnel relay (6to4/Teredo) (role::ipv6relay)."
[23:47:41] I'm trying to update the wikitech page
[23:48:16] It's certainly not that anymore
[23:49:10] Krenair: it's a spare in codfw now
[23:49:45] https://wikitech.wikimedia.org/w/index.php?title=Server_Spares&diff=152418&oldid=152413
[23:51:13] But then we have https://phabricator.wikimedia.org/T84901
[23:51:21] old name: solr3
[23:51:21] new name: capella
[23:51:21] asset tag: WMF5835
[23:52:02] so it used to be a solr host
[23:52:16] right, but there also seems to be some earlier history
[23:52:46] can probably track it down via dns I guess
[23:52:51] * JohnLewis looks
[23:53:19] I thought perhaps it was renamed solr3 for a while for solr use
[23:54:37] 6operations, 7HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#1201327 (10Legoktm)
[23:54:53] but then found sdtpa references to a solr3 :/
[23:55:26] I'm not sure really
[23:55:34] https://github.com/wikimedia/operations-dns/commit/00a58b58a7ed8be0391bfb65c2f9bb2ab9694fae is where it was first added back in December 2012
[23:56:32] Krenair: found a paper trail for capella entering pmtpa and leaving pmtpa for codfw
[23:57:17] Entering: https://phabricator.wikimedia.org/T81852 Shipping to eqiad: https://phabricator.wikimedia.org/T83652
[23:57:30] right, but that's eqiad not codfw
[23:57:37] yeah I just realised :?
[23:57:55] I think there are really multiple servers that have had the same name
[23:58:42] probably
[23:59:02] I can't see any evidence of capella leaving eqiad from the old RT stuff
[23:59:36] robh may know something though :)
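The "track it down via dns" approach above amounts to searching the operations/dns history for when the hostname appeared and disappeared, which is what turned up the December 2012 commit. A small sketch of that archaeology using git's pickaxe search against a local checkout; the repository path is an assumption, and `git log -S` is the standard flag for finding commits that add or remove occurrences of a string.

```python
#!/usr/bin/env python3
"""List commits in a local DNS repo checkout that added or removed a hostname."""
import subprocess
import sys


def commits_touching(repo_path, hostname):
    # `git log -S<string>` (the pickaxe) lists commits where the number of
    # occurrences of the string changed, i.e. where it was added or removed.
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"-S{hostname}",
         "--date=short", "--pretty=format:%h %ad %s"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    repo = sys.argv[1] if len(sys.argv) > 1 else "."  # assumed path to a checkout
    for line in commits_touching(repo, "capella"):
        print(line)
```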