[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141106T0000).
[00:01:06] Is someone doing the SWAT?
[00:01:11] (CR) Hoo man: [C: +1] "Looks good at a glance now" [dns] - https://gerrit.wikimedia.org/r/171475 (owner: Dzahn)
[00:01:13] (CR) Dzahn: "15:59 'wmgMobileFrontend' => array(" [dns] - https://gerrit.wikimedia.org/r/171475 (owner: Dzahn)
[00:01:21] Looks like RoanKattouw has a patch in
[00:01:26] I can
[00:04:24] (PS1) Andrew Bogott: Change ocg.log rotation from 15 days to 7 days. [puppet] - https://gerrit.wikimedia.org/r/171477
[00:05:07] (CR) Dzahn: [C: +1] "sounds good, adding cscott" [puppet] - https://gerrit.wikimedia.org/r/171477 (owner: Andrew Bogott)
[00:05:19] ebernhardson, I hope you will not run away during the deployment like last time? :P
[00:05:58] MaxSem: i dont intend to run away :)
[00:06:02] (CR) Andrew Bogott: [C: +2] Change ocg.log rotation from 15 days to 7 days. [puppet] - https://gerrit.wikimedia.org/r/171477 (owner: Andrew Bogott)
[00:06:12] MaxSem: usually i try to get home before swat so that doesn't happen
[00:06:27] ok
[00:08:48] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[00:09:16] !log cleaned up some log files on ocg1001 and reduced logrotations to 7.
[00:09:23] Logged the message, Master
[00:09:47] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[00:13:20] !log ocg1001 is depressingly tiny and will probably keep complaining about disk space until it's rebuilt
[00:13:28] Logged the message, Master
[00:13:49] andrewbogott: :-)
[00:18:16] !log maxsem Synchronized php-1.25wmf6/extensions/Flow/: SWAT (duration: 00m 05s)
[00:18:24] Logged the message, Master
[00:18:26] ebernhardson, ^^^
[00:18:52] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: SWAT (duration: 00m 04s)
[00:18:58] Logged the message, Master
[00:19:03] kaldari, ^^^
[00:20:24] !log maxsem Synchronized php-1.25wmf7/extensions/VisualEditor/: SWAT (duration: 00m 07s)
[00:20:31] Logged the message, Master
[00:20:39] RoanKattouw, ^^^
[00:20:54] MaxSem: Have flagged to etonkovidova.
[00:21:19] grr, hierarchies
[00:21:49] lol
[00:22:04] MaxSem: I have people who have people. ;-)
[00:22:13] Also, Elena should be in here. ;-)
[00:22:19] EVUL
[00:22:30] No no, "James".
[00:22:44] MaxSem: PHP fatal error in /srv/mediawiki/php-1.25wmf6/extensions/MobileFrontend/includes/MobileContext.php line 1024: Call to undefined method WebResponse::getheader() :(
[00:23:31] lol, ori ^
[00:24:41] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: (no message) (duration: 00m 07s)
[00:24:50] Logged the message, Master
[00:24:52] srsly
[00:25:04] MaxSem: thanks
[00:25:15] deployment branches should maintain contact with reality
[00:25:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:25:29] they should not contain stuff reverted yesterday
[00:25:58] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:06] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:09] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:18] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:35] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:04] MaxSem?
[00:27:09] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:29] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:33] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:42] MaxSem: master: https://gerrit.wikimedia.org/r/#/c/171190/ , wmf6: https://gerrit.wikimedia.org/r/#/c/171189/ , wmf5: https://gerrit.wikimedia.org/r/#/c/171188/
[00:27:44] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:12] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:17] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:20] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: (no message) (duration: 00m 04s)
[00:28:23] PROBLEM - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:28] Logged the message, Master
[00:28:29] whaddafuq
[00:28:38] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20964 bytes in 0.009 second response time
[00:29:10] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 21048 bytes in 0.463 second response time
[00:29:47] reverted locally, looking WTF was going on
[00:29:49] what's up?
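For context on the fatal above: the MobileFrontend code in the wmf6 branch called a WebResponse accessor that the deployed core branch did not have (ori's actual fixes are the gerrit changes linked at 00:27:42). Below is a minimal, hypothetical sketch of a defensive guard for that failure mode; the header name and the fallback path are illustrative assumptions, not the deployed patch.

```php
// Hypothetical guard, not the actual MobileFrontend/core fix: avoid assuming
// that the deployed core branch already has the newer WebResponse::getHeader().
$response = $this->getRequest()->response();
if ( method_exists( $response, 'getHeader' ) ) {
	// Newer core branches expose the accessor directly.
	$value = $response->getHeader( 'X-Carrier' ); // header name is illustrative
} else {
	// Older branches: fall back to scanning the headers queued so far.
	$value = null;
	foreach ( headers_list() as $sent ) {
		if ( stripos( $sent, 'X-Carrier:' ) === 0 ) {
			$value = trim( substr( $sent, strlen( 'X-Carrier:' ) ) );
			break;
		}
	}
}
```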
[00:29:52] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21059 bytes in 0.597 second response time
[00:29:58] hi
[00:29:58] I was about to eat, and my phone is making lots of noise
[00:30:05] bblack, PHP error, reverted
[00:30:17] ok
[00:30:26] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 21035 bytes in 0.600 second response time
[00:30:32] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21079 bytes in 0.470 second response time
[00:30:47] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21019 bytes in 0.034 second response time
[00:31:14] that was fun
[00:31:17] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20986 bytes in 0.290 second response time
[00:31:22] thanks for handling, maxsem
[00:31:24] RECOVERY - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20999 bytes in 0.219 second response time
[00:31:54] i was paged 19 times, heh
[00:31:56] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20938 bytes in 0.002 second response time
[00:32:07] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20986 bytes in 0.031 second response time
[00:32:11] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21008 bytes in 0.288 second response time
[00:32:53] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21030 bytes in 0.230 second response time
[00:34:04] (PS1) Dereckson: Adding Ukraine photo sources to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/171484 (https://bugzilla.wikimedia.org/73045)
[00:34:38] i look forward to the postmortem on how that change made it past qa/beta
[00:35:07] with dexterity and subtlety!
[00:35:29] heh
[00:36:44] well, that change was detected in beta and reverted, but restored by a bogus cherrypick/module update
[00:37:01] oops
[00:37:16] currently we're trying to figure out what happened
[00:38:19] we don't have a test environment to test production branches, hehehe
[00:39:14] clearly we need a beta-beta-labs-labs to validate what comes out of beta-labs
[00:39:25] +1
[00:39:28] delta labs
[00:39:41] lambda labs!
[00:39:55] <3 the crowbar
[00:40:52] in lambda labs you can clone an instance to a new one while making one change, but thereafter each instance is immutable.
[00:41:26] too highbrow
[00:41:32] :p
[00:42:08] let deployment-www-99234 = deployment-www-99233 + 5f4bde3a;
[00:44:53] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[00:51:03] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 332 seconds
[00:52:32] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:57:42] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1
[00:57:42] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:42] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:42] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:42] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:52] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2069: active_shards: 6221: relocating_shards: 3: initializing_shards: 3: unassigned_shards: 1 [00:57:53] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2069: active_shards: 6221: relocating_shards: 3: initializing_shards: 3: unassigned_shards: 1 [00:58:53] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:58:53] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:58:53] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:58:53] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:59:02] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[00:59:03] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[00:59:03] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:02:13] (CR) Dzahn: [C: +2] beta: linting and autoload modules [puppet] - https://gerrit.wikimedia.org/r/170484 (owner: John F. Lewis)
[01:05:58] !log git-sync-upstream on deployment-salt for beta puppetmaster
[01:06:08] Logged the message, Master
[01:10:55] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail
[01:11:14] can anyone help with a fundraising emergency?
[01:11:54] I'm hoping to replay commands from our db replication log.
[01:14:56] <^d> "Fatal error: Call to a member function get() on a non-object in /srv/mediawiki/php-1.25wmf6/extensions/Gadgets/Gadgets_body.php on line 357 "
[01:16:29] global ping: any opsen who are familiar with the Fundraising database servers?
[01:17:47] failover ping: anyone who can get to db1025->db1008 replication logs?
[01:19:30] springle: ^
[01:20:20] oh frack
[01:20:30] theoretically yes, I can try
[01:20:41] <^d> How is $wgMemc a non-object?
[01:21:01] ^d: too early in the init process?
[01:21:12] <^d> This is pretty late in a maintenance script.
[01:21:40] :S
[01:21:45] !log restarted gmond on mw1018 and mw1031
[01:21:53] Logged the message, Master
[01:21:57] the script is too hardcore to call Setup.php? :P
[01:22:22] <^d> Time to live-hack terbium and find out :)
[01:22:55] springle: thanks, hopefully you don't regret it any more than I already do :p
[01:23:16] springle: so, I accidentally dropped the table civicrm.wmf_civicrm_extra
[01:23:56] springle: I'm hoping we can restore it from a backup--I'm already in a position to do that
[01:24:17] springle: then, replay the replication log if possible, to fill in the remainder of the missing data.
[01:24:53] I don't know if it's possible to replay at all, but if it is, I'd like to replay only statements which write to that one table
[01:25:05] awight: looking. we may need to page Jeff_Green to avoid making it worse
[01:25:14] sure, that would work for me
[01:25:38] replaying a single table is non-trivial
[01:25:43] I bet...
[01:26:02] <^d> MaxSem, ebernhardson: We have stacktrace! https://phabricator.wikimedia.org/P63
[01:26:17] springle: I'm also open to restoring the entire db, and replaying all tables until the point at which I destroy everything.
[01:26:55] springle: it would be great if you could start a backup of the current, damaged state of db1025:civicrm
[01:35:11] awight: looks like the frack db dumps are per database, which is good, but not linked to a specific replication position, which means we need to ask Jeff_Green how a restore should be handled
[01:35:26] springle: thanks for checking!
[01:35:27] the last one is ~13h old
[01:35:48] springle: I texted a number ending in 5522, is that how to get ahold of Jeff?
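An aside on the $wgMemc fatal being debugged above: ori's quip ("the script is too hardcore to call Setup.php?") refers to the standard maintenance-script bootstrap, in which requiring Maintenance.php and ending with RUN_MAINTENANCE_IF_MAIN is what pulls in Setup.php and initialises globals such as $wgMemc before execute() runs. A minimal sketch of that skeleton, with hypothetical names rather than the actual Gadgets script from the stack trace:

```php
<?php
// Minimal maintenance-script sketch (hypothetical, not the failing Gadgets
// code). The require/RUN_MAINTENANCE_IF_MAIN pair is what runs Setup.php,
// so $wgMemc should already be a real cache object inside execute().
require_once __DIR__ . '/Maintenance.php';

class CachePokeDemo extends Maintenance {
	public function execute() {
		global $wgMemc;
		if ( !is_object( $wgMemc ) ) {
			// The situation being debugged above: a global that Setup.php
			// should have initialised is still not an object.
			$this->error( '$wgMemc is not an object; was Setup.php skipped?', 1 );
		}
		$wgMemc->set( wfMemcKey( 'demo', 'poke' ), time(), 60 );
		$this->output( "wrote demo key\n" );
	}
}

$maintClass = 'CachePokeDemo';
require_once RUN_MAINTENANCE_IF_MAIN;
```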
[01:36:17] springle: also, do you know if the replication logs are going to expire on us?
[01:36:31] mmm, only 6 other digits to pick :P
[01:36:51] awight: we have 10 days before log expiry
[01:36:56] <^d> MaxSem: I'll start with 000-000 :p
[01:37:18] springle: great
[01:37:51] awight: this is important enough to page Jeff_Green in your opinion? (i really have no idea what civicrm handles on frack)
[01:38:06] springle: definitely
[01:38:10] ok then
[01:38:18] springle: we had already chatted about butting heads at about this hour
[01:38:28] it just got... more urgent, though
[01:40:07] (PS1) Dzahn: beta monitoring (labmon): fix graphite class name [puppet] - https://gerrit.wikimedia.org/r/171492
[01:41:55] awight: special civicrm backup on db1008 is started
[01:42:03] are new transactions still going to a fresh table now on top of the current mess?
[01:42:04] may not help us much, but there you are
[01:42:13] there is no fresh table
[01:42:15] ok
[01:42:43] bblack: if you're talking about the FR fubar, no I had to pause the process which records new donations into our DB
[01:42:55] ok
[01:43:02] awight: so we have leeway to wait for Jeff?
[01:43:08] springle: thanks, the backup will be a big time saver for Jeff.
[01:43:54] springle: yep, the donations intake is decoupled from this part of the pipeline, so the donor-facing issue is very minor, just a delay in getting their receipts and other warm salutations.
[01:44:03] ok, good
[01:44:23] (CR) Dzahn: [C: +2] beta monitoring (labmon): fix graphite class name [puppet] - https://gerrit.wikimedia.org/r/171492 (owner: Dzahn)
[01:44:24] * awight kneels to ancestors for a moment
[01:47:11] hmm, can't find an lvm snapshot slave for frack
[01:47:38] maybe we don't do that there. means sql restore + binlog is only choice
[01:47:43] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:49:03] rats, that sounded like a good plan
[01:52:23] springle: ping
[01:52:36] Jeff_Green: pong
[01:52:46] blargh
[01:53:28] * awight weeps softly in the background
[01:54:54] awight, how did it happen?
[01:55:28] MaxSem: first, I did something small and stupid. Then I worked my way up to dropping an entire table.
[01:55:55] I know my own actions brought me to this point, but still I blame Oracle for providing me the tools to hurt myself with.
[01:56:16] ORACLE?
[01:56:31] <^d> SNORACLE.
[01:56:32] * awight mutters mysql...
[01:57:12] it's mariadb :)
[01:57:31] maxsem@tin:/srv/mediawiki-staging/php-1.25wmf6$ mysql --version
[01:57:31] mysql Ver 14.14 Distrib 5.5.35, for debian-linux-gnu (x86_64) using readline 6.2
[01:57:50] depends if we're talking about client or server SW :)
[01:57:56] tin != prod db servers
[01:58:04] there are 3 modules, mysql, mysql_wmf and mariadb
[02:03:53] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[02:24:47] (PS1) Dzahn: (WIP) facilities: move to module [puppet] - https://gerrit.wikimedia.org/r/171493
[03:07:35] (PS1) Dzahn: (WIP) certificates: move to module [puppet] - https://gerrit.wikimedia.org/r/171496
[03:24:06] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:06] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:06] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:06] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:45] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:25:16] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:16] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:16] RECOVERY - ElasticSearch health check on elastic1018 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:16] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:46] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 7: initializing_shards: 0: unassigned_shards: 0 [03:26:47] PROBLEM - ElasticSearch health check on elastic1022 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:48] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:48] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:49] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:49] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:50] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:50] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:52] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:52] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:53] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:31:17] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6243: relocating_shards: 3: initializing_shards: 2: unassigned_shards: 1 [03:33:18] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2077: active_shards: 6247: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [03:34:06] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:08] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:08] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:09] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:09] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:10] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:10] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:11] RECOVERY - ElasticSearch health check on elastic1022 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:11] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:12] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:12] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:13] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:13] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:45:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 72% free (5436 MB out of 7627 MB) [03:50:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5624 MB out of 7627 MB) [03:55:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5642 MB out of 7627 MB) [03:58:27] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [04:00:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5656 MB out of 7627 MB) [04:05:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5659 MB out of 7627 MB) [04:09:48] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [04:10:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5675 MB out of 7627 MB) [04:15:19] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5678 MB out of 7627 MB) [04:20:05] (03CR) 10KartikMistry: "@Reedy, what need to fix to get it work as expected? 
:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [04:20:19] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5684 MB out of 7627 MB) [04:25:19] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5703 MB out of 7627 MB) [04:26:55] kart_: you need to use $wmg instead of $wg [04:27:42] kart_: if you set 'wgFoo', that gets set to $wgFoo, then your extension is loaded, which will set the extension default of $wgFoo, wiping out your intended value [04:27:59] so, set what you want to $wmgFoo, and then after your extension is enabled, do $wgFoo = $wmgFoo [04:30:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5706 MB out of 7627 MB) [04:33:19] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 [04:33:20] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 [04:34:49] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:34:49] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:35:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5714 MB out of 7627 MB) [04:35:38] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:35:38] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:36:08] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2077: active_shards: 6247: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:36:10] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2077: active_shards: 6247: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:40:18] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5729 MB out of 7627 MB) [04:42:17] (03CR) 10Glaisher: "Reedy:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [04:45:20] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5733 MB out of 7627 MB) [04:47:58] legoktm: thanks! [04:50:21] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6009 MB out of 7627 MB) [04:55:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6010 MB out of 7627 MB) [04:56:51] (03PS2) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 [04:57:34] (03PS3) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 [04:58:28] (03PS1) 10Brion VIBBER: Expose Content-Range response header for CORS requests on upload. [puppet] - 10https://gerrit.wikimedia.org/r/171502 [05:00:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6010 MB out of 7627 MB) [05:01:25] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [05:02:34] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 9: initializing_shards: 0: unassigned_shards: 0 [05:04:35] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:05:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6010 MB out of 7627 MB) [05:05:46] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:05:46] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:05:46] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:06:35] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:44] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:44] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:44] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:10:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6011 MB out of 7627 MB) [05:15:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 80% free (6060 MB out of 7627 MB) [05:20:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 80% free (6070 MB out of 7627 MB) [05:25:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6174 MB out of 7627 MB) [05:28:30] (03CR) 10Legoktm: Beta: Enable EventLogging in ContentTranslation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [05:30:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6174 MB out of 7627 MB) [05:35:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6174 MB out of 7627 MB) [05:40:18] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6175 MB out of 7627 MB) [05:45:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6190 MB out of 7627 MB) [05:47:04] legoktm: bah. Thanks. I need more coffee. [05:47:40] (03PS4) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 [05:50:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6226 MB out of 7627 MB) [05:55:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6227 MB out of 7627 MB) [06:00:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6227 MB out of 7627 MB) [06:01:12] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [06:01:32] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [06:01:41] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [06:01:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [06:04:05] (03CR) 10GWicke: [C: 031] "Hi Andrew, based on a cursory look this looks good to me." 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [06:05:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6229 MB out of 7627 MB) [06:10:13] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6230 MB out of 7627 MB) [06:15:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6230 MB out of 7627 MB) [06:18:12] (03PS4) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [06:18:51] (03CR) 10jenkins-bot: [V: 04-1] WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [06:20:49] (03PS5) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [06:26:59] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:27:00] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:28:09] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:39] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:50] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [06:29:29] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 52.96 ms [06:29:49] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.59 ms [06:29:49] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:09] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 51.88 ms [06:34:09] PROBLEM - puppet last run on es1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:29] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures [06:46:11] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:59] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:08] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:09] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:51:09] RECOVERY - puppet last run on es1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:53:38] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:05:48] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:10:40] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail [07:30:18] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:59] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [07:40:53] mutante: whelp, sorry about that. 
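The $wmg/$wg pattern legoktm explains at 04:26–04:28 above (and which kart_'s PS4 of the ContentTranslation patch then follows) works roughly as sketched below. The Foo names are legoktm's hypothetical example, not the actual ContentTranslation setting names, and $wmgUseFoo is likewise an assumed enable flag.

```php
// wmf-config/InitialiseSettings.php — minimal sketch of the pattern described
// above: keep the per-wiki value under a wmg* key so the extension's own
// default cannot clobber it when the extension is loaded.
'wmgFoo' => array(
	'default' => false,
	'testwiki' => true,
),

// wmf-config/CommonSettings.php — load the extension first (its setup file
// assigns the extension default to $wgFoo), then copy the wmg value over it.
if ( $wmgUseFoo ) {
	require_once "$IP/extensions/Foo/Foo.php";
	$wgFoo = $wmgFoo;
}
```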
[07:53:31] (03PS3) 10Glaisher: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [07:54:50] (03CR) 10Glaisher: [C: 031] "Just added bd.m, be.m and nyc.m" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [07:57:09] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [07:57:15] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [07:57:29] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [07:57:39] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:57:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [08:00:01] (03CR) 10Glaisher: "There are other private/small wikis as well (eg. stewardwiki, checkuserwiki). Do we add them as well?" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [08:12:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 104, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-307236) {#10694} [10Gbps wave]BR [08:14:27] (03PS1) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. [puppet] - 10https://gerrit.wikimedia.org/r/171515 [08:15:14] (03CR) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [08:16:06] (03PS2) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. [puppet] - 10https://gerrit.wikimedia.org/r/171515 [08:17:10] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 106, down: 0, dormant: 0, excluded: 0, unused: 0 [08:17:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [08:17:43] <_joe_> ori: mmmh not sure I like this solution, isn't there a way to check if that's activated? 
[08:17:58] <_joe_> uh, sorry, bbiab (if you go, good night) [08:17:59] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 52.00 ms [08:17:59] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.37 ms [08:18:10] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 51.90 ms [08:19:58] _joe_: there isn't (that was what my comment was about) [08:20:23] good night [08:24:49] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [08:25:25] <_joe_> gee [08:29:00] (03CR) 10Steinsplitter: [C: 031] Adding Ukraine photo sources to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171484 (https://bugzilla.wikimedia.org/73045) (owner: 10Dereckson) [08:29:08] <_joe_> so, we seriously need to re-do the ocg servers from scratch [08:29:24] <_joe_> for now, I'm going to bind-mount /var/log into /srv [08:32:26] <_joe_> well, I'll do that after I have created the new HHVM package and tried it [08:33:22] <_joe_> I still can't believe we created a server with a non-LVM root partition of 9 GB, with no space for expansion [08:40:00] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error [08:40:00] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [08:45:14] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [08:47:23] (03PS1) 10Nikerabbit: Revert "Disable l10nupdate for the duration of CLDR 26 plural migration" [puppet] - 10https://gerrit.wikimedia.org/r/171516 [09:01:39] RECOVERY - Disk space on ms-be2003 is OK: DISK OK [09:29:46] (03PS1) 10Yuvipanda: icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 [09:29:47] godog: ^ +1? [09:30:40] YuviPanda: sure, taking a look [09:31:42] YuviPanda: what's with the garbled html in the commit message? :) [09:32:06] lolwut [09:32:23] https://phabricator.wikimedia.org/T789 is how it looks on my console [09:32:37] godog: lolwut, hit edit, and it shows it correctly [09:33:45] mhhh perhaps the auto-linking of issues like RT # is clashing? [09:34:04] perhaps... [09:34:09] let me try to edit [09:34:21] (03PS2) 10Yuvipanda: icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 [09:34:31] (03CR) 10Filippo Giunchedi: [C: 031] "note: the url shows garbled in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/171519 (owner: 10Yuvipanda) [09:34:36] (03PS3) 10Yuvipanda: icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 [09:34:46] godog: seems to be the case now, is fine now [09:34:47] ah nevermind my comment, +1 [09:35:05] so ok T is enough [09:35:21] yeah [09:37:37] (03CR) 10Yuvipanda: [C: 032] icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 (owner: 10Yuvipanda) [09:41:08] YuviPanda: I'd rather we get rid of this icinga/betalabs thing entirely [09:41:13] yeah [09:41:22] should be gone next week [09:41:27] good :) [09:41:58] paravoid: I'm going to go ahead and do (1) from my ops@ email next week, if not much more discussion happens. 
[09:42:28] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:42:29] PROBLEM - HHVM rendering on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:54] I never replied you on-list, but I'm wondering if the two problems to external resources you stated are really unsolvable [09:50:42] <_joe_> interesting, mw1030 will be fixed soon-ish [09:53:51] <_joe_> !log installing the new hhvm package on mw1030 and mw1018 in order to test for stability [09:53:59] Logged the message, Master [09:54:39] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.149 second response time [09:54:49] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 67494 bytes in 0.352 second response time [09:56:40] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [09:56:41] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 67494 bytes in 0.129 second response time [09:57:37] <_joe_> instead of getting more traffic to hhvm, I will raise the weight of those two servers in pybal [10:07:52] <_joe_> !log temporary raising weight of mw1018 and 1030 in pybal to load-test them and check for crashes [10:07:58] Logged the message, Master [11:04:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [11:12:33] (03PS1) 10Filippo Giunchedi: add graphite-related CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/171525 [11:13:54] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [11:14:26] <_joe_> mmmh what's up? [11:33:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:48:04] PROBLEM - nutcracker port on mw1163 is CRITICAL: Connection refused [11:52:06] RECOVERY - nutcracker port on mw1163 is OK: TCP OK - 0.000 second response time on port 11212 [11:57:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "While I do agree with the idea, wouldn't it be better to set this as a local cronjob running every hour?" 
[puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [12:01:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "1) put the new command definition in a separate declaration, like most of the check commands are now" [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [12:02:26] (03PS2) 10Giuseppe Lavagetto: hhvm: remove unnecessary upstart stanza, config option [puppet] - 10https://gerrit.wikimedia.org/r/171244 [12:09:38] !log Depool wtp1001, wtp1003-1006 for trusty upgrade [12:09:46] Logged the message, Master [12:10:07] <_joe_> 5 at a time, w00t [12:10:17] yesterday it was 7 [12:10:19] <_joe_> I usually do 4 [12:10:26] and parsoid did not sweat [12:10:48] <_joe_> yeah resource usage on parsoid is not that high [12:13:35] PROBLEM - check if salt-minion is running on wtp1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:04] PROBLEM - check if salt-minion is running on wtp1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:24] PROBLEM - check if salt-minion is running on wtp1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:28] PROBLEM - check if salt-minion is running on wtp1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:15:48] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: remove unnecessary upstart stanza, config option [puppet] - 10https://gerrit.wikimedia.org/r/171244 (owner: 10Giuseppe Lavagetto) [12:21:44] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 67939 bytes in 0.225 second response time [12:22:05] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [12:22:30] <_joe_> bbl, lunch [12:52:54] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [13:16:14] PROBLEM - puppet last run on wtp1003 is CRITICAL: Timeout while attempting connection [13:16:36] PROBLEM - Host wtp1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:00] PROBLEM - Host wtp1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:19] PROBLEM - Host wtp1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:18:00] PROBLEM - Host wtp1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:09] PROBLEM - Host wtp1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:41] (03PS1) 10Adrian Lang: Add qunit localhost setup to role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/171535 (https://bugzilla.wikimedia.org/72184) [13:48:24] (03PS1) 10Alexandros Kosiaris: Move wtp1001 to wtp1004 to raid1-lvm partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/171536 [13:48:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move wtp1001 to wtp1004 to raid1-lvm partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/171536 (owner: 10Alexandros Kosiaris) [13:53:46] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:11:47] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0 [14:41:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [14:44:35] (03CR) 10Nikerabbit: [C: 031] Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [14:51:38] (03CR) 10Ottomata: "> One is the user/pass for a given cluster; what will the path in the private puppet repos be." [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [14:53:04] (03CR) 10Manybubbles: [C: 031] "No idea on the ensure_packages. I'm not really a skilled puppet developer so I just do whatever the rest of the repository does." [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [14:55:12] !log running performance test for Cirrus taking zhwiki [14:55:21] Logged the message, Master [14:55:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:57:06] !log performance test for zhwiki was good. trying dewiki [14:57:10] ^d: ^^^ [14:57:13] Logged the message, Master [14:57:24] <^d> sweet [14:58:28] yayyy [14:59:16] <^d> manybubbles: i'd try zhwiki again after we rebuild it. it got more shards. [14:59:29] ^d: that _should_ only make it faster [14:59:38] <^d> indeed :) [15:01:01] !log dewiki is fine. trying enwiki. [15:01:08] Logged the message, Master [15:03:05] ^d: I'm replaying 100% of enwiki's traffic now. http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Elasticsearch+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [15:03:40] Nemo_bis: good news! [15:04:07] sweet [15:04:36] all done [15:04:53] well, done with 20k searches. those new disks are much much much much better [15:04:55] ottomata: ^^^^ [15:05:20] woowooo [15:05:40] having doubled the number of machines might have something to do with that too [15:05:56] ottomata: tried replaying 100% enwiki's traffic. Servers saw almost no bump in, well, anything [15:06:00] godog: might just :) [15:06:16] I suppose what I'm saying is that we have more than enough power to cut over [15:06:35] <^d> let's do it today then! [15:06:37] <^d> gogogogogo [15:19:25] (03PS2) 10Faidon Liambotis: Expose Content-Range response header for CORS requests on upload [puppet] - 10https://gerrit.wikimedia.org/r/171502 (owner: 10Brion VIBBER) [15:19:42] (03PS3) 10Faidon Liambotis: Expose Content-Range response header for CORS requests on upload [puppet] - 10https://gerrit.wikimedia.org/r/171502 (owner: 10Brion VIBBER) [15:19:57] (03CR) 10Faidon Liambotis: [C: 032] Expose Content-Range response header for CORS requests on upload [puppet] - 10https://gerrit.wikimedia.org/r/171502 (owner: 10Brion VIBBER) [15:24:41] !log finished with performance testing for cirrus - new servers look like way way more than enough power [15:24:43] <^d> manybubbles: I came across http://www.elasticsearch.org/overview/shield, bleh [15:24:48] Logged the message, Master [15:25:17] ^d: there's been lots of clawing for that [15:25:22] I'm not suprised [15:25:37] <^d> Yeah. But it's like paid add-on stuff which is why I said bleh :\ [15:25:55] <_joe_> enterprise-grade security. Having worked in an enterprise, I won't use that as a marketing pitch :P [15:38:22] (03CR) 10Krinkle: [C: 04-1] "Won't work. 
and duplicate of" [puppet] - 10https://gerrit.wikimedia.org/r/171535 (https://bugzilla.wikimedia.org/72184) (owner: 10Adrian Lang) [15:39:06] Whenever I hear "enterprise" anything, I think Star Trek. [15:39:45] Galaxy-class security. [15:40:41] (03PS1) 10Ottomata: Use mysql::config::client to render a research pw file readable by the stats user. [puppet] - 10https://gerrit.wikimedia.org/r/171543 [15:41:31] (03PS2) 10Ottomata: Use mysql::config::client to render a research pw file readable by the stats user. [puppet] - 10https://gerrit.wikimedia.org/r/171543 [15:43:28] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [15:43:38] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [15:43:48] <_joe_> !log upgrading mw1031,mw1032 to the new package, no crashes seeen since reinstall [15:43:49] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [15:43:49] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [15:43:53] Logged the message, Master [15:43:55] (03CR) 10Ottomata: [C: 032] Use mysql::config::client to render a research pw file readable by the stats user. [puppet] - 10https://gerrit.wikimedia.org/r/171543 (owner: 10Ottomata) [15:44:41] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:44:59] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:10] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:10] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:20] marktraceur, manybubbles, ^d: So who wants to SWAT today? [15:45:29] I can do it! [15:45:29] Hmm [15:45:34] OK then! [15:45:38] ok! [15:48:22] (03PS3) 10Manybubbles: Enable TemplateData GUI for all wikis; move config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) (owner: 10Jforrester) [15:48:57] (03CR) 10Manybubbles: [C: 031] "Rebased clean. Should be ok. -1 seemed to have been cleared before I got to it. Looks fine to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) (owner: 10Jforrester) [15:49:37] <_joe_> !log load-testing hhvm, in particular the servers with the new package [15:49:43] Logged the message, Master [15:50:00] (03CR) 10Manybubbles: [C: 031] "Noop for production so fine by me. Will +2 during SWAT in 10 minutes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [15:52:58] manybubbles: thanks [15:55:38] PROBLEM - NTP on wtp1006 is CRITICAL: NTP CRITICAL: Offset unknown [15:55:39] PROBLEM - NTP on wtp1005 is CRITICAL: NTP CRITICAL: Offset unknown [15:56:22] anyone had this error before in labs? first time I see it but can't find exactly what's wrong [15:56:25] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/memcached.py] at /etc/puppet/modules/memcached/manifests/ganglia.pp:19 [15:56:38] RECOVERY - NTP on wtp1006 is OK: NTP OK: Offset -0.001505970955 secs [15:56:49] RECOVERY - NTP on wtp1005 is OK: NTP OK: Offset -0.007328987122 secs [15:56:53] i.e. ganglia_new should get applied and that dir created inturn [15:59:09] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 8: initializing_shards: 1: unassigned_shards: 1 [15:59:48] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 9: initializing_shards: 1: unassigned_shards: 1 [15:59:57] silly health check [16:00:04] manybubbles, anomie, ^d, marktraceur, James_F: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141106T1600). Please do the needful. [16:00:18] its the stupid health check the complain when we rebuild indexes [16:00:26] (03CR) 10Manybubbles: [C: 032] Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [16:00:34] (03Merged) 10jenkins-bot: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [16:01:20] manybubbles: I never checked but the percent-based check seems to be working ok? [16:01:22] James_F|Away: are you around for https://gerrit.wikimedia.org/r/#/c/157478/3 ? [16:01:29] checked for false positives that is [16:01:42] its much much less likely to complain for no reason [16:01:49] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:02:28] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [16:02:40] yeah, possibly next week we could disable this one [16:02:52] !log manybubbles Synchronized wmf-config/: SWAT deploy some beta configs. Should be noop. (duration: 00m 04s) [16:03:00] Logged the message, Master [16:03:08] kart_: ^^ [16:03:09] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [16:03:09] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [16:03:09] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [16:03:27] elasticsearch is green you silly health check [16:04:02] gi11es: around for your swat? 
[16:04:08] manybubbles: yes [16:05:24] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:05:25] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:05:25] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:06:10] (03CR) 10BBlack: "For the future: is there somewhere we could be auto-generating these from that's upstream in the mediawiki-config sense, both for mobile a" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [16:08:41] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2081: active_shards: 6257: relocating_shards: 7: initializing_shards: 6: unassigned_shards: 4 [16:09:42] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 8: initializing_shards: 0: unassigned_shards: 0 [16:09:42] (03CR) 10Cscott: "That's not actually 7 days, that's "7 log files which are 256M each". But we're probably growing the log more than 256M/day, and logrotat" [puppet] - 10https://gerrit.wikimedia.org/r/171477 (owner: 10Andrew Bogott) [16:11:37] !log manybubbles Synchronized php-1.25wmf7/extensions/MultimediaViewer/: SWAT revert layout changes (duration: 00m 06s) [16:11:38] gi11es: ^^^^ [16:11:41] andrewbogott: ocg1001's / is full again. could you delete /var/log/upstart/ocg* -- those files shouldn't be being created any more. [16:11:42] Logged the message, Master [16:11:58] James_F|Away: last ping for https://gerrit.wikimedia.org/r/#/c/157478/3 [16:12:26] manybubbles: testing [16:13:54] cscott: done, but that box is pretty much doomed regardless. Needs a rebuild with more disk space. [16:14:19] or we just need to make /var/log its own partition or some such. [16:14:26] <_joe_> cscott: they are empty [16:14:38] _joe_: did that today already. I think [16:14:41] <_joe_> cscott: your home on that server has 500 mb of data, mwalkers' 200 [16:14:54] <_joe_> andrewbogott: no I had no time this morning [16:14:58] ah, ok [16:14:59] du /var/log is 1.2G right now, on a root partition that is only 9.1G large. [16:15:04] <_joe_> those servers need reimaging [16:15:40] <_joe_> cscott: can we make ocg log to a different directory than /var/log? 
[16:15:40] manybubbles: not seeing the changes yet, but it's not usual for there to be a delay when I test after the deploy [16:15:45] _joe_: oh, yeah, it's a little weird that /home is on the root partition -- but useful, since we need to rebuild binary modules from time to time and the build process hates flock on nfs. [16:16:03] manybubbles: ah, there it is. all good [16:16:13] gi11es: great! [16:16:24] _joe_: we could; the logrotate config would have to change as well. [16:16:26] manybubbles: thanks for the swat [16:16:58] _joe_: i'm working on a patch now to make logrotate run hourly, that should help some in terms of making logs use a consistent amount of space, so sudden load doesn't cause the partition to fill up [16:17:21] _joe_: we already are using the space-limited logrotate options, instead of the time-limited options, but because logrotate only runs once/day it's not really as effective as it should be [16:17:24] gi11es: your welcome! [16:17:43] <_joe_> cscott: no, can we log to another directory? [16:17:52] RECOVERY - Disk space on ocg1001 is OK: DISK OK [16:18:05] <_joe_> you don't need to do log rotation aggressively if we have room [16:18:20] <_joe_> cscott: say /srv/log/ocg/ [16:18:25] _joe_: i can roll that into the same puppet patch if that's helpful. but i'd still like to be space-limited rather than date-limited. [16:18:48] especially since the /srv directory is used for ocg's cache & etc. they should all be space-limited in theory. [16:19:04] putting logs on /srv will probably create more problems than it solves [16:19:16] <_joe_> cscott: in practice, it will not [16:19:23] since ocg doesn't actually stop or anything if / fills up, but it *would* start giving errors if /srv fills up. [16:19:29] <_joe_> we have a ridiculously large /srv partition for now [16:19:49] _joe_: only ridiculously large because i turned caching in ocg *way* down. [16:21:52] so i'll make the patch :) [16:21:56] <_joe_> cscott: ok :) [16:22:38] <_joe_> cscott: while you're at it, can you make the log file and dir a class variable and use templates for log configs? [16:22:43] also, we're using 70G on /srv for 2 days of cache right now. i'd like to increase the cache lifetime up to 7 days or so, which would use closer to 245G. [19:32:38] <_joe_> which would leave ~ 120 gb free [19:32:38] _joe_: maybe we need two patches. ;) this is starting to sound like advanced puppeting. [19:32:38] <_joe_> :) [19:32:38] <_joe_> ok, make yours, I'll do the other later :) [19:32:39] _joe_: in theory! [19:32:39] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [19:32:39] cscott: can ocg do its own rotation instead? [19:32:39] godog: we should just turn off syslog logging in that case. it's already logging to logstash. [19:32:39] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:32:39] !log manybubbles is done with SWAT [19:32:40] cscott: ah syslogging, nevermind my comment then [19:32:41] syslog is/should just be a small amount of local log mirroring for emergency use. [19:32:41] it's just that / doesn't even allow a "small amount" of logging right now. [19:32:41] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 2 failures [19:32:41] and, like i keep saying, it would be fine if logrotate was actually size-limiting the logs like it is supposed to. but... [19:32:42] on the other hand, someone should be writing a logstash gc/rotation process, if there isn't one already. [19:32:42] logstash keeps 31 days of logs. 
It drops the 32nd day every night [19:32:43] bd808: perhaps you should read backlog ;) [19:32:43] and/or the comment on https://gerrit.wikimedia.org/r/171477 -- and there's another one or two puppet patches where i have this same discussion, fruitlessly. [19:32:43] bd808: oh, wait. never mind. you said 'logstash'. [19:32:43] bd808: i still had syslog on my brain. forgive me. [19:32:43] * cscott is hacking puppet [19:32:44] bd808: good to know, thanks. [19:32:45] bd808: i like the powers of 2 [19:32:45] _joe_: /srv/deployment/ocg/log good? [19:32:45] Apparently it actually drops the 31st day -- https://github.com/wikimedia/operations-puppet/blob/production/modules/logstash/files/logstash_delete_index.sh#L14 [19:32:45] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [19:32:45] bd808: how lunar. [19:32:46] PROBLEM - RAID on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:46] anyone up for a dns change? https://gerrit.wikimedia.org/r/#/c/171525/ [19:32:47] PROBLEM - DPKG on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] * cscott notes, pedantically, that the 30-day period is the lunar synodic period, as opposed to the 28-day lunar orbital period. [19:32:47] PROBLEM - puppet last run on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - check configured eth on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - check if dhclient is running on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - check if salt-minion is running on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - Disk space on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - SSH on labstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:50] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [19:32:58] I'm on it. (labstore1001 just went away) [19:33:00] <_joe_> !log upgrading the hhvm API appservers to use the new package [19:33:04] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:33:05] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:11] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [19:33:14] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:33:14] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.027 second response time [19:33:14] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:33:15] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:33:15] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.006 second response time [19:33:18] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.019 second response time [19:33:18] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:33:18] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2087: active_shards: 6277: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [19:33:18] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2087: active_shards: 6277: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:33:21] PROBLEM - Host wtp1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:33:21] PROBLEM - Host wtp1006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:33:21] PROBLEM - Host wtp1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:21] PROBLEM - Host wtp1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:22] RECOVERY - Host wtp1006 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [19:33:22] RECOVERY - Host wtp1001 is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [19:33:22] RECOVERY - Host wtp1003 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [19:33:22] RECOVERY - Host wtp1004 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [19:33:22] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time [19:33:23] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [19:33:24] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:33:25] * Coren brings labstore1001 up gradually and carefully. [19:33:26] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [19:33:26] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 4.81 ms [19:33:26] !log repool wtp1001, wtp1003-1006 after trusty upgrade [19:33:27] _joe_: a present for you: https://gerrit.wikimedia.org/r/171578 [19:33:27] <_joe_> cscott: thanks a lot [19:33:27] _joe_: pls review carefully -- that's probably the largest puppet patch i've written so far. not improbable that i'm misunderstanding how things work in some way. [19:33:27] <_joe_> cscott: rotate hourly means you need to restart node every hour? or it has a signal for log rotation? [19:33:27] _joe_: Are you happy with that patch as an alternative to remapping /var/log? We shouldn't do both... [19:33:27] <_joe_> andrewbogott: I am [19:33:27] ok [19:33:27] _joe_: rotate hourly means logrotate is run every hour. node doesn't have to be restarted. [19:33:27] <_joe_> andrewbogott: also, the *real* solution is to just reimage those servers with LVM and sensible partitioning [19:33:27] _joe_: logrotate is configured to use copytruncate [19:33:27] rsyslog is probably HUPped or something, but i assume logrotate knows what it's doing there. [19:33:27] <_joe_> cscott: if you do copytruncate, no; [19:33:27] <_joe_> gwicke: ok [19:33:27] the log file in question isn't being written to directly from node. [19:33:27] _joe_: yes! that's what I wrote in the SAL yesterday. "andrewbogott: ocg1001 is depressingly tiny and will probably keeping complaining about disk space until it's rebuilt" [19:33:27] the upstart log file is piped from the console, so that's more "interesting" -- but that log file is empty nowadays [19:33:27] cscott: stdout & stderr is piped into it [19:33:27] <_joe_> cscott: oh ok [19:33:27] <_joe_> but as gwicke told you ^^ [19:33:27] gwicke: yes, that's /var/log/upstart/ocg.* which is a different log file, not the one being tweaked in this patch. 
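A rough sketch of the ocg logging change being discussed here, assuming a parameterised log directory (as _joe_ asks for), a size-limited copytruncate logrotate stanza, and an hourly cron so the size cap is enforced more often than the packaged once-a-day run. The class, template and path names are placeholders, not cscott's actual Gerrit change:

    # Illustrative sketch only. The template name, log directory and
    # size/rotate values are assumptions; the real patch may differ.
    class ocg::logging (
        $log_dir = '/srv/log/ocg',   # assumed; could equally stay under /var/log
    ) {
        file { $log_dir:
            ensure => directory,
        }

        # Size-limited rotation with copytruncate, so neither node nor
        # rsyslog needs a restart or HUP when the file is rotated.
        file { '/etc/logrotate.d/ocg':
            ensure  => present,
            content => template('ocg/logrotate.erb'),   # hypothetical template
        }

        # logrotate normally only runs from /etc/cron.daily; running it
        # hourly keeps the size limit meaningful under sudden load.
        cron { 'ocg-logrotate-hourly':
            command => '/usr/sbin/logrotate /etc/logrotate.d/ocg',
            minute  => 0,
        }
    }

    # The hypothetical template body would be along these lines:
    #   <%= @log_dir %>/*.log {
    #       size 256M
    #       rotate 7
    #       copytruncate
    #       compress
    #       missingok
    #   }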
[19:33:27] <_joe_> cscott: so ocg just sends messages to rsyslog [19:33:27] <_joe_> ok [19:33:27] cscott: kk [19:33:27] yeah. somebody (andrewbogott maybe?) added 'console none' to the upstart file a few days ago in any case, so there isn't actually an upstart log file right now. [19:33:27] wasn't me [19:33:27] <_joe_> cscott: I did [19:33:27] but the latest ocg reconfig (part of the move from winston to bunyan logging) turned off console logging for ocg. [19:33:28] <_joe_> cscott: so I can remove that, I was waiting for this :) [19:33:28] so if 'console none' was removed from the upstart config, i *expect* that very little will actually end up in the upstart logs. but that's a patch for another time. [19:33:28] yeah, let's do one thing at a time. [19:33:28] the rotation of the upstart log files are actually impossible to reconfigure from a puppet module, which is a buglet in its own right. [19:33:28] <_joe_> cscott: upstart dictates one-size-fits-all [19:33:28] <_joe_> mmm the gerrit bot is dead apparently [19:33:28] yeah, so i figure it's upstart's responsibility to cleanly handle its logs being rotated in any case. [19:33:28] but fwiw /var/log/upstart/*.log does *not* have copytruncate configured. [19:33:28] but /etc/logrotate.d/upstart comes directly from the upstart package, so again it's upstream's problem if it doesn't work right. and we can't easily change it in any case. [19:33:28] PROBLEM - Host labstore1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:33:28] <_joe_> andrewbogott: can you care to merge that patch? I got to get back to hhvm stuff [19:33:28] <_joe_> and it's already pretty late here [19:33:28] _joe_: Yep! Will merge as soon as I finish reading [19:33:28] cscott: Those ensure=>absent crons that you're removing… they're a relic of another age? [19:33:28] things that are no longer created? [19:33:29] cscott: I've merged your patch and applied it on ocg1001. So far I don't see any log files. Is it possible the service is totally wrecked due to the full drive earlier? [19:33:31] <_joe_> andrewbogott: you need to restart rsyslog maybe? [19:33:31] _joe_: trying... [19:33:32] <_joe_> sorry but I'm going off for a few hours, I most surely have business to attend to this evening [19:33:32] you should have called it a day after the hhvm upgrade :) [19:33:32] <_joe_> !log load test done on the HHVM pool [19:33:33] 17.28 -!- morebots [~morebots@208.80.155.255] has quit [Ping timeout: 245 seconds] [19:33:34] cscott: please ping me if/when you return [19:33:35] or, hm, Jeff_Green are you about? [19:33:35] ya [19:33:35] You worked on ocg, right? [19:33:35] some yeah [19:33:35] looking at backscroll.... [19:33:35] We've been tinkering with the logfiles there and I'm trying to figure out what, if anything, is currently broken [19:33:35] it had 100% full / earlier today, so it might be that a lot of it is busted. [19:33:35] I don't know what it does enough to check w/not it's working properly [19:33:35] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2087: active_shards: 6277: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [19:33:35] (It is definitely not producing any logs!) [19:33:35] i've had a couple squabbles with rsyslog there before [19:33:35] hateses rsyslog [19:33:35] ocg1001? 
[19:33:35] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2088: active_shards: 6280: relocating_shards: 7: initializing_shards: 0: unassigned_shards: 0 [19:33:35] yep [19:33:35] we can at least test the rsyslog filters using logger [19:33:35] manybubbles: Hey, sorry, just out of meetings. Was https://gerrit.wikimedia.org/r/#/c/157478 listed without a note that it was good to go? Eurgh. :-( [19:33:35] James_F: I just won't SWAT anything without _someone_ supporting it around. [19:33:35] manybubbles: Sure, but why did greg-g list me as supporting it? [19:33:35] checking [19:33:35] I thought you were the supported because you committed it [19:33:35] let me make sure it was you on the list... [19:33:35] yeah - it was you on the calendar too [19:33:35] It was just going to go out in the usual Thursday deploy, but we changed the existence of those. [19:33:35] Maybe I should have just let Reedy push it yesterday instead. [19:33:35] * James_F sighs. [19:33:35] * James_F runs to the next meeting. [19:33:35] lol [19:33:35] oops, forgot about that [19:33:35] Reedy: little help with that ^ :) [19:33:35] Sorry! [19:33:35] James_F: my bad, too [19:33:36] andrewbogott: the syslog user doesn't have write privs to the log dir [19:33:36] Jeff_Green: well, that would do it. One moment... [19:33:36] there are a lot of log dirs on this box [19:33:45] Jeff_Green: it's already owner => root, group => syslog. What should it be? [19:33:46] (Or so says puppet) [19:33:46] g+w [19:33:46] maybe 0775 [19:33:46] oh of course [19:33:47] we should probably purge or move the old logs when all is said and done to reduce confusion [19:33:47] Hm. "To illustrate the costs of permanently storing 1 TB of files: $2000 " http://archiveteam.org/index.php?title=Swipnet#Donating [19:33:48] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/171586/1 [19:33:50] merged [19:33:50] thanks [19:33:50] np [19:33:50] I'll apply on ocg1001 and clean up [19:33:50] unless you already did [19:33:51] nope, go for it [19:33:51] !log reedy Synchronized wmf-config/: Enable TemplateData GUI everywhere (duration: 00m 14s) [19:33:51] Jeff_Green, cscott, ocg logging seems right now. [19:33:52] cool, thanks! [19:34:00] Reedy: Aha, thanks. [19:34:03] !log Coren and cmjohnson frantically working to resolve a Labs NFS failure [19:34:04] andrewbogott: Looks like a dead controller atm. Chris is shuffling hardware around to make sure now. [19:34:05] _joe_: from -tech: 13:43 < mbh_> hi, what's with api? I cannot save pages with api, error 503 [19:34:06] <_joe_> greg-g: I am off, can you ping someone else? [19:34:06] <_joe_> is this hhvm specific? [19:34:06] no idea [19:34:06] <_joe_> ok so... I have dinner in 1 minute :) [19:34:06] opsen ^^ :) [19:34:06] _joe_: enjoy [19:34:06] coren: give it a whirl [19:34:07] cmjohnson: Drac isn't responding? [19:34:07] Ah, there it is. [19:34:11] greg-g: You should champion a global ping word for opsen like the "hhvm-help:" stalk word that is used in #hhvm :) [19:34:11] I'll just put that on my list of things to champion.... [19:34:11] what, !domas doesn't do that? 
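The permission fix Jeff_Green and andrewbogott converge on above (rsyslog writes as the syslog user, so the log directory needs group write) comes down to a one-line mode change on the directory resource. A sketch of its shape; the path is an assumption, not the exact resource in the merged change:

    # Rough shape of the fix discussed above: keep root:syslog ownership
    # but add group write (0775) so rsyslog can create and write the
    # log files. The directory path is illustrative.
    file { '/srv/log/ocg':
        ensure => directory,
        owner  => 'root',
        group  => 'syslog',
        mode   => '0775',
    }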
[19:34:11] hah [19:34:12] * matanya thanks _joe_ [19:34:25] Reedy: maybe we can have a DB on extension1 called 'shared' for these sorts of random global tables [19:34:25] AaronSchulz: sounds sensible [19:34:25] seems better than picking meta or centralauth ;) [19:34:25] sql aawiki -h db1029 [19:34:25] CREATE DATABASE shared; [19:34:25] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [19:34:27] NFS is back up; LABS instances are gradually recovering. [19:34:27] Reedy: grants need tweaking [19:34:27] (03CR) 10Ottomata: [C: 032] Slightly refactor misc::statistics::limn::mobile_data_sync [puppet] - 10https://gerrit.wikimedia.org/r/171553 (owner: 10Ottomata) [19:34:27] (03PS1) 10Ottomata: Fix typo variable name in statistics.pp [puppet] - 10https://gerrit.wikimedia.org/r/171559 [19:34:27] (03CR) 10Ottomata: [C: 032 V: 032] Fix typo variable name in statistics.pp [puppet] - 10https://gerrit.wikimedia.org/r/171559 (owner: 10Ottomata) [19:34:27] well that is delayed! [19:34:27] (03CR) 10Filippo Giunchedi: [C: 04-1] "depends on Id3bc2b4c9" [puppet] - 10https://gerrit.wikimedia.org/r/171547 (owner: 10Filippo Giunchedi) [19:34:27] (03PS1) 10Ottomata: Couple of more fixes for misc::statistics::limn::data::generate refactor [puppet] - 10https://gerrit.wikimedia.org/r/171563 [19:34:27] ottomata: It's actually impressively robust that it came through at all. :-) [19:34:27] (03CR) 10Ottomata: [C: 032 V: 032] Couple of more fixes for misc::statistics::limn::data::generate refactor [puppet] - 10https://gerrit.wikimedia.org/r/171563 (owner: 10Ottomata) [19:34:28] :) [19:34:28] (03PS1) 10Ottomata: Fix source_dir variable reference [puppet] - 10https://gerrit.wikimedia.org/r/171566 [19:34:28] (03CR) 10Ottomata: [C: 032 V: 032] Fix source_dir variable reference [puppet] - 10https://gerrit.wikimedia.org/r/171566 (owner: 10Ottomata) [19:34:28] (03PS1) 10Ottomata: Remove management of source_dir ownership [puppet] - 10https://gerrit.wikimedia.org/r/171568 [19:34:28] (03CR) 10Ottomata: [C: 032 V: 032] Remove management of source_dir ownership [puppet] - 10https://gerrit.wikimedia.org/r/171568 (owner: 10Ottomata) [19:34:28] Reedy: Access denied for user 'wikiadmin'@'10.%' to database 'shared' (10.64.16.18) [19:34:38] (03PS1) 10Ottomata: Separate rsync from generate cron job [puppet] - 10https://gerrit.wikimedia.org/r/171572 [19:34:42] (03CR) 10Ottomata: [C: 032] Separate rsync from generate cron job [puppet] - 10https://gerrit.wikimedia.org/r/171572 (owner: 10Ottomata) [19:34:44] AaronSchulz: heh. I wonder what they're setup as [19:34:57] (03PS1) 10Ottomata: Don't need to manage rsync_from and output directories [puppet] - 10https://gerrit.wikimedia.org/r/171577 [19:35:01] (03CR) 10Ottomata: [C: 032] Don't need to manage rsync_from and output directories [puppet] - 10https://gerrit.wikimedia.org/r/171577 (owner: 10Ottomata) [19:41:11] (03PS1) 10Ori.livneh: apache::monitoring: provision 'links' package [puppet] - 10https://gerrit.wikimedia.org/r/171621 [19:41:32] godog: got a moment for a small patch? 
^ [19:43:35] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [19:45:48] !log restarted ntp on labstore1001 [19:45:54] Logged the message, Master [19:46:03] not that it helped [19:46:25] 'offset unknown' sounds sci-fi-ish [19:46:30] yeah [19:46:33] i like it, it's exciting [19:47:06] Timecop 3: Offset Unknown [19:47:33] RECOVERY - DPKG on labstore1001 is OK: All packages OK [19:47:33] RECOVERY - Disk space on labstore1001 is OK: DISK OK [19:47:34] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical [19:47:52] RECOVERY - check if salt-minion is running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:47:56] RECOVERY - check configured eth on labstore1001 is OK: NRPE: Unable to read output [19:47:56] RECOVERY - check if dhclient is running on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [19:47:56] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:50:45] Reedy: I made 'wikishared', which is usable [19:51:24] (03CR) 10Giuseppe Lavagetto: [C: 031] "please do." [puppet] - 10https://gerrit.wikimedia.org/r/171621 (owner: 10Ori.livneh) [19:51:44] <_joe_> ori: /win 39 [19:51:47] <_joe_> err [19:51:51] hello [19:51:52] <_joe_> ori: merge it! [19:51:58] (03PS2) 10Ori.livneh: apache::monitoring: provision 'links' package [puppet] - 10https://gerrit.wikimedia.org/r/171621 [19:52:05] (03CR) 10Ori.livneh: [C: 032 V: 032] apache::monitoring: provision 'links' package [puppet] - 10https://gerrit.wikimedia.org/r/171621 (owner: 10Ori.livneh) [19:52:11] thanks! [19:54:14] Hi all, i wanted to share some of my time for wiki foundation, how I can do this and what is required? [19:54:55] mihau_: What would you like to do? [19:55:33] I don't know exactly, some scripting in shell maybe, usual operations sth not difficult for starters [19:56:07] (03CR) 10Ottomata: "I refactored the limn data sync puppet code. You should now be able to add a single line to class misc::statistics::limn::data::jobs to d" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [19:57:03] giuseppe (i'm avoiding a ping :P) see my reply on if you're around [20:00:25] (03PS3) 10Ottomata: Create eventlogging-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169195 [20:01:40] Reddy: Can you tell me what entry level can do for wikimedia? [20:04:03] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset 0.000152349472 secs [20:04:25] (03CR) 10Ottomata: [C: 032 V: 032] Create eventlogging-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169195 (owner: 10Ottomata) [20:04:32] (03CR) 10Ori.livneh: Create eventlogging-roots group, add qchris (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169195 (owner: 10Ottomata) [20:04:37] br_mail_timestamp ON /*_*/bounce_records(br_user_email(50), br_timestamp); [20:04:50] Reedy: heh, I doubt mysql is smart enough to use the second part of that index [20:04:53] ottomata: heh, ignore my comment, not worth a follow up [20:05:01] can somebody do something about gerrit's ssh being down (or incredibly slow) for me and aparently other people outside US? [20:05:09] haha [20:05:12] ori, sorry :) [20:05:19] np! [20:05:22] naw i like pedandicness :) [20:05:23] (03CR) 10Tnegrin: "works for me." [puppet] - 10https://gerrit.wikimedia.org/r/169195 (owner: 10Ottomata) [20:05:48] a `git fetch` runs for ten minutes, then times out. 
[20:06:10] MatmaRex: mine responded after about a minute and then finished reasonably quickly [20:06:19] WFM too [20:06:21] mine timed out after ten minutes. [20:06:36] (03PS1) 10Ottomata: Fix capitalization of EventLogging in comments in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/171626 [20:06:53] (03CR) 10Ottomata: [C: 032 V: 032] Fix capitalization of EventLogging in comments in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/171626 (owner: 10Ottomata) [20:06:58] (03CR) 10Ori.livneh: "<3" [puppet] - 10https://gerrit.wikimedia.org/r/171626 (owner: 10Ottomata) [20:07:11] (03PS2) 10Milimetric: Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/171465 [20:07:15] Reedy: it could on an equality with a string of len < 50, but I know it's not clever enough [20:07:19] too bad [20:10:46] (03CR) 10Giuseppe Lavagetto: "It is two operations (crontab -e; puppet agent --disable) vs one; but it's using the right tool for the job... my opposition comes more fr" [puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [20:11:38] <_joe_> ori: even better - make it a cron.d file instead of using the idiotic puppet cron resource, just renaming it will disable it [20:11:52] <_joe_> so we can do that programmatically for multiple hosts [20:12:01] _joe_: that could work. i had another idea too [20:12:33] <_joe_> (mv /etc/cron.d/cover-our-lazy-asses /etc/cron.d/cover-our-lazy-asses.disabled) [20:12:58] we could make it a cron.hourly script that checks if any users are logged in [20:13:07] on the assumption that if you have heap profiling enabled you at least have a screen session open [20:13:26] <_joe_> too complicated :) [20:14:02] <_joe_> I mean, you can disable puppet and move a file, it's no big deal, and we don't risk it not running because I forgot a screen session there [20:14:10] <_joe_> (which I do all the time btw) [20:14:23] MatmaRex: Is it trying to hit Gerrit over IPv6? [20:15:03] RoanKattouw: i doubt it, how do i check? (i'm on windows) [20:15:16] actually, the plain SSH connections to gerrit API succeed [20:15:20] oh OK [20:15:21] it's just the `git fetch` that hangs [20:15:30] oh wow [20:15:34] nevermind, it finished now [20:15:41] only took nine minutes this time [20:15:44] crisis averted [20:15:53] <^d> 10% better! [20:30:26] (03CR) 10Dzahn: "matanya suggested to "write that class in the role but not in the main one, but include only it in site.pp on a specific host". so like on" [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [20:35:03] (03CR) 10Dzahn: "not sure yet about getting it from mediawiki config, but what we could do is make a new template in helpers, like langlist/langs.tmpl, tha" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:43:54] apparently jenkins is stuck? [20:44:03] anyone in here working on that, or should i give it a go? [20:45:26] aude: About? Do you know if mobilefrontend was disabled on wikidata because it didn't work well? Or was it just because there wasn't a mobile domain setup? 
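The cron.d pattern _joe_ floats above (manage the job as a plain file so it can be disabled host-by-host with a rename, after disabling puppet) might look roughly like this; the job name and the command it runs are hypothetical placeholders, not ori's actual change:

    # Sketch of the suggested cron.d approach. Disabling it on one host
    # is then just:
    #   puppet agent --disable && mv /etc/cron.d/hhvm-heap-check{,.disabled}
    # The file name and command are placeholder assumptions.
    file { '/etc/cron.d/hhvm-heap-check':
        ensure  => present,
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        content => "0 * * * * root /usr/local/sbin/check-heap-profiling-off\n",
    }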
[20:47:36] Lydia_WMDE: ^ [20:47:59] Reedy: i think because it is broken [20:48:09] doesn't work very well with our pages [20:48:14] at the moment at least [20:48:36] Thanks :) [20:48:44] That did come to mind when mutante asked about it [20:50:00] (03PS4) 10Dzahn: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 [20:50:14] Lydia_WMDE: ^ so because of that, adding a bunch of missing ".m."'s [20:50:23] but we currently exclude wikitech and wikidata [20:50:36] *nod* please keep it excluded for now [20:50:47] we were wondering if there was no DNS because no mobile frontend, or no MF because no DNS :) [20:50:50] ok! [20:50:55] :) [20:51:16] hopefully the situation will improve with the new design [20:51:20] and then we can see again [20:52:30] Reedy: statements render ugly [20:52:41] what Lydia_WMDE says [20:52:46] or rather Wikidata doesn't work with narrow screens;) [20:53:27] MaxSem: yeah [20:53:30] heh [20:53:46] mutante, a mobile subdomain for WD or anything else is fine, as long as were not starting to redirect people there:) [20:54:01] (03CR) 10Dzahn: "added "steward" and "checkuser" wikis" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:55:00] MaxSem: i think the bug was that for other missing wikis we did redirect people but then it didnt exist [20:55:07] (03CR) 10Matanya: [C: 031] "Please add DFC as well." [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:55:38] dfc? [20:55:49] (03CR) 10Matanya: "FDC, of course." [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:56:05] ottomata: do you have a moment for a puppet / hiera question? [20:57:12] (03PS5) 10Dzahn: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 [20:57:33] (03CR) 10Dzahn: "added fdc" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:57:35] gwicke: sure [20:59:22] so I think we'll need to come up with some path for private info like user/pass per cluster [20:59:40] $db_pass = '<%= scope.lookupvar('passwords::bugzilla::bugzilla_db_pass') %>'; [21:00:18] yeah, for sure [21:00:24] gwicke it will be in the private repo somewhere [21:00:36] but, you will pass that in from the role class, it won't be in any of your modules [21:00:41] (or it will come from hiera...?) [21:00:58] so something like passwords::cassandra::cluster1::user ? [21:00:59] you woudln't do it in a template [21:01:09] yeah, something like that, but the template should just do [21:01:11] yeah, I'll pass it to the module [21:01:16] pass=<%= @db_pass %> [21:01:17] or whatever [21:01:18] yeah [21:01:52] so for now, is there a reason / possibility to put any of this in hiera? [21:02:40] yes, so, hiera would just fill in your module parameters automatically [21:02:52] your module should still work without hiera [21:03:04] e.g. if all required parameters were passed in manually [21:03:26] hmm.. wouldn't that mean that i'd have to pass in a cluster name? [21:03:40] yup [21:04:00] in my vagrant instance wher ei was developing the cassandra module [21:04:04] i have this in my heira common.yaml [21:04:11] cassandra::cluster_name: mediawiki-vagrant [21:04:11] cassandra::listen_address: "%{::ipaddress_eth1}" [21:04:11] cassandra::rpc_address: "%{::ipaddress_eth1}" [21:04:11] cassandra::seeds: ["%{::ipaddress_eth1}"] [21:04:11] cassandra::dc: vagrant [21:04:12] cassandra::rack: 1 [21:04:46] so it's not keyed off the cluster name? [21:04:51] ? [21:04:55] how do you support multiple clusters in this scenario? 
[21:05:15] ah, good q, not entirely sure. if i was not using hiera, i would do it in multiple roles [21:05:22] role::restbase::cassandra, maybe [21:05:26] role::analytics::cassandra [21:05:30] and those would each be users of the cassandra module [21:05:33] and pass in relevant values [21:05:49] yeah, I think I know how to do that [21:05:55] with hiera (which I am still new at)... [21:05:56] hm [21:06:09] let me just do the obvious thing first [21:06:13] i guess there would be a separate yaml file for each? and somehow those yaml files would be applied per role? [21:06:37] I'd think that it's hierarchical info, so we'd key off a cluster name for example [21:06:40] aye ok, i mean, for your module developemnt, it doesn't matter as much, you can test your modules using a role class and manually setting params [21:06:56] ja, gwicke, if there is a way to do that, then yeah [21:06:57] makes sense [21:08:12] I'll manually copy in my WIP module on some of the labs vms [21:08:20] aye cool [21:08:27] can't check out two things at once [21:08:31] gwicke: supposedly you can edit yaml configs in labs now too! [21:08:40] although I don't yet know how it works, YuviPanda|zzz pointed me to it [21:09:04] yaml configs as in hiera? [21:09:06] yes [21:09:08] um um um [21:09:09] kk [21:10:15] pretty brand new though [21:10:15] https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2014-11-03#Hiera [21:10:21] need documentation :) [21:10:35] it will let you edit a special wikipage on wikitech to set hiera yaml for your labs project [21:10:54] cool [21:13:23] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [21:18:50] (03CR) 10Dzahn: [C: 032] add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [21:20:33] (03CR) 10QChris: "The dependent change has been deployed." [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [21:25:23] (03PS2) 10Ottomata: Make varnishkafka pick up Range header [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [21:26:05] (03CR) 10Ottomata: [C: 032 V: 032] Make varnishkafka pick up Range header [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [21:26:56] !log added Range header field to varnishkafka webrequest logs [21:27:01] Logged the message, Master [21:32:44] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:39:05] ottomata: I'm unclear on how I can get from the hiera setup in labs to what you sketched on the patch [21:39:46] the role classes, gwicke [21:39:47] ? [21:40:12] yes [21:40:33] so I'll have to set up some role for the cluster [21:41:42] yes, you'll have to do that anyway, at the very least to include the class...i think there is some fancy hiera mainrole thing that _joe_ made [21:41:49] but i'm not entirely sure how to use it [21:42:06] but, ja, you will need role classes defined [21:42:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [21:42:29] ottomata: so how do I get a role class that I can include per cluster? 
[21:42:29] if you are doing self hosted puppetmaster, you could manually create them in /var/lib/git/operations/puppet/manifests/role/ for now [21:43:06] currently everything seems to be just cassandra::* [21:43:10] gwicke: i'm not sure how to do the hiera yaml stuff we were just talking about [21:43:19] but, traditionally, with roles, you'd have [21:43:28] maybe [21:44:41] role::restbase::cassandra { [21:44:41] class { '::cassandra': cluster_name => 'restbase', ... } [21:44:41] } [21:44:41] role::othercluster::cassandra { [21:44:42] class { '::cassandra': { cluster_name => 'othercluster', ... } [21:44:42] } [21:44:48] and include those roles on whatever nodes you want [21:45:18] not saying you *should* do that, exactly, but that is a way [21:45:24] as for hiera...uhhh [21:45:40] IF you can figure out how to set hiera values based on cluster [21:45:49] or, pick them based on cluster [21:45:53] then you might not need the role classes [21:45:56] you might be able to do [21:45:57] just [21:46:02] class { '::cassandra': } [21:46:05] and that's it [21:46:12] and hiera would fill in the appropriate parameters [21:46:41] okay, the latter sounds less certain [21:46:54] yeah, i mean, its just because I don't have a lot of experience with hiera yet [21:47:01] need someone who does..._joe_? :) [21:47:26] can you explain what the nested stanzas are doing? [21:47:40] especially class { '::cassandra': cluster_name [21:47:42] => 'restbase', ... } [21:47:58] does that write into the cassandra namespace? [21:49:04] ah, no, that is just including a class with parameters [21:49:17] class { 'classname': [21:49:18] parameterA => 'valueA', [21:49:18] ... [21:49:18] } [21:50:03] the :: in ::cassandra means to load the class starting from root scope, [21:50:11] I see, thanks [21:50:25] often needed in similarly named role classes, because puppet allows for nested includes and classes (i wish it didn't) [21:50:30] relative* incldues [21:50:55] e.g. [21:50:55] class role::cassandra { [21:50:55] class { 'cassandra': ... } [21:50:55] } [21:50:58] could likely error [21:51:09] because puppet would think you are trying to include the class from inside itself [21:54:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:55:49] ottomata: *nod*, thx [21:57:24] (03CR) 10Greg Grossmeier: "From: https://bugzilla.wikimedia.org/show_bug.cgi?id=72275#c5" [puppet] - 10https://gerrit.wikimedia.org/r/154710 (owner: 10Ori.livneh) [22:09:32] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [22:09:35] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [22:09:47] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [22:09:48] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [22:10:02] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [22:10:02] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [22:10:22] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [22:10:33] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [22:10:37] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [22:11:03] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [22:11:03] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [22:11:23] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [22:11:32] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [22:11:33] grr ... syntax error in the config file. [22:11:36] hotfixing. 
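A cleaned-up version of the per-cluster role pattern ottomata sketches above, written out as valid Puppet: one thin role class per cluster passes that cluster's parameters into the shared cassandra module, loaded from root scope to avoid the relative-include pitfall he mentions. The class name, seed IPs and datacenter/rack values are placeholder assumptions, not the eventual restbase role:

    # Illustrative only: a per-cluster role class wrapping the cassandra
    # module. Concrete values here are assumptions for the sketch.
    class role::restbase::cassandra {
        class { '::cassandra':
            cluster_name => 'restbase',
            seeds        => ['10.64.0.10', '10.64.0.11'],
            dc           => 'eqiad',
            rack         => 'a1',
        }
    }

    # Credentials would come from the private repo as discussed earlier
    # (e.g. something under passwords::cassandra::*), and with Hiera the
    # role can shrink to a bare `include ::cassandra`, with the same keys
    # (cassandra::cluster_name, cassandra::seeds, ...) supplied from YAML
    # per role or per labs project.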
[22:11:52] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [22:12:13] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [22:12:22] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: Connection refused [22:12:39] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [22:12:39] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [22:12:39] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [22:12:49] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [22:12:55] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [22:12:59] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [22:13:06] <_joe_> subbu: ok [22:13:10] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.022 second response time [22:13:10] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.033 second response time [22:13:11] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time [22:13:19] jshint should have caught it .. how did jenkins let it by. [22:13:19] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [22:13:19] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [22:13:29] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [22:13:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.023 second response time [22:13:40] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time [22:13:49] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time [22:13:49] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [22:14:00] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.015 second response time [22:14:00] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [22:14:00] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.022 second response time [22:14:00] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [22:14:00] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time [22:14:00] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.004 second response time [22:14:01] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.014 second response time [22:14:09] on the bright side, paging works [22:14:13] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.027 second response time [22:14:14] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [22:14:14] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [22:14:14] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.028 second response time [22:14:14] Yup [22:14:14] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [22:14:19] hey, what's up ? [22:14:30] sorry .. it was bad config. [22:14:31] paravoid: It worked very well yesterday too. 
I got like 26 pages for the mobile outage [22:14:54] fixed. we have to look at our jenkins setting why jshint didn't catch it. [22:14:55] subbu: ok, no worries. [22:15:12] RoanKattouw: why are you getting pages? :) [22:15:17] <_joe_> akosiaris: I was worried the upgrade did have something to do with this [22:15:25] Because we have separate alerts for mobile-lb.{eqiad,esams,ulsfo} IPv{4,6} [22:15:28] you gave up your root recently, didn't you? [22:15:35] paravoid: Yeah and I asked to only get Parsoid pages [22:15:43] But mutante said he wasn't sure that was possibel [22:15:48] _joe_: me too [22:15:49] s/Parsoid/*oid/ [22:16:35] * akosiaris going back to sleep [22:16:42] akosiaris, really sorry! [22:17:17] we wil have to fix that config testing hole next in jenkins. [22:17:19] (03PS1) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [22:17:19] <_joe_> akosiaris: you shouldn't get pages after midnight! [22:18:01] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [22:18:05] (03CR) 10John F. Lewis: "It one of these cases of 'it works for what we want it to do' really. As the current version, puppet-lint gives a warning about indentatio" [puppet] - 10https://gerrit.wikimedia.org/r/170493 (owner: 10John F. Lewis) [22:19:15] _joe_: we're on CET, for simplicity [22:19:36] (iirc) [22:19:48] !log updated parsoid to d23d2be6 (+ a hotfix to the production localsettings config file) [22:19:50] ok, I'm going to sleep as well [22:19:51] (03PS2) 10John F. Lewis: dataset: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170492 [22:19:53] Logged the message, Master [22:21:29] (03CR) 10John F. Lewis: bacula: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis) [22:23:36] Hi RoanKattouw, I'm taking you up on ur offer of letting me bug you with more questions :) here are some probably silly ones: - How is it determined whether a "version" param will be included in the call to bits for ResourceLoader modules? (asking because I see that impacts on caching) ...and (incidentally) how does bits know which version of MW to serve files for? [22:24:42] (03Abandoned) 10John F. Lewis: backup: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170474 (owner: 10John F. Lewis) [22:24:44] The version parameter is purely for cache busting, it is never read by the server [22:24:51] Just so that's clear [22:25:11] We put in a version parameter whenever possible [22:25:24] greg-g, what is the protocol here? does that config snafu require an email to the ops list? [22:25:30] The value of the version parameter is determined by JS on the client, as the max(...) of the timestamps of the modules in the request [22:25:37] subbu: how long was the outage? [22:25:39] it was a javascript syntax error that wasn't caught. [22:25:40] (03PS2) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [22:25:51] ~3 mins or so. [22:26:04] Some requests are initiated without using JS, e.g. from . Those do not have version parameters because we can't use dynamic URL composition there [22:26:14] subbu: I'm curious why Jenkins let it happen, as you are, maybe that itself is worth the outage report/bug report [22:26:23] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [22:26:52] yes, cscott is starting investigating that. 
[22:26:56] * greg-g nods [22:27:00] Also, the request for the startup module (load.php?modules=startup) does not have a version parameter, because of bootstrapping: the startup module is what contains this timestamp information in the first place, so we cannot give it a version parameter because we don't have any timestamps yet [22:27:11] Generally we try to avoid version-less requests except for startup [22:27:20] Ah hmmm [22:27:25] greg-g, ok, will email. [22:27:39] RoanKattouw_away: thanks! :) [22:29:00] (03CR) 10John F. Lewis: authdns: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170473 (owner: 10John F. Lewis) [22:29:23] subbu: ty sir [22:49:39] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [22:57:01] Reedy, Unknown site ID configured: maiwiki [22:57:11] MaxSem: Where's that? [22:57:25] logstash/exception log [22:57:33] link/stacktrace? [22:58:00] /what is it from? [22:58:07] learn to use logstash? :P [22:58:41] {"file":"/srv/mediawiki/php-1.25wmf6/extensions/Wikidata/extensions/Wikibase/client/includes/ChangeHandler.php","line":67,"function":"__construct","class":"Wikibase\\ChangeHandler","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.25wmf6/extensions/Wikidata/extensions/Wikibase/lib/includes/ChangeNotificationJob.php","line":128,"function":"singleton","class":"Wikibase\\ChangeHandler","type":"::","args":[]}, {"file":"/srv/mediawiki/php-1.25w [22:58:41] mf6/includes/jobqueue/JobRunner.php","line":136,"function":"run","class":"Wikibase\\ChangeNotificationJob","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.25wmf6/maintenance/runJobs.php","line":80,"function":"run","class":"JobRunner","type":"->","args":["array"]}, {"file":"/srv/mediawiki/php-1.25wmf6/maintenance/doMaintenance.php","line":101,"function":"execute","class":"RunJobs","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.25wm [22:58:42] f6/maintenance/runJobs.php","line":95,"args":["string"],"function":"require_once"}, {"file":"/srv/mediawiki/multiversion/MWScript.php","line":97,"args":["string"],"function":"require_once"} [22:58:56] https://logstash.wikimedia.org/#/dashboard/elasticsearch/default is showing absolutely nothing [22:59:33] blah. Stupid cluster rebalance strikes again :( [22:59:50] aude: ^^ looks like something didn't take for maiwiki :( [23:00:23] call me maybe [23:00:34] https://logstash.wikimedia.org/#dashboard/temp/yYZV9Vo4TiG4wSmM_yvqHQ [23:01:12] Looks like they're all jobrunner? [23:02:22] !log restarted logstash on logstash1001 for the usual reason (no events making it to elasticsearch) [23:02:29] Logged the message, Master [23:02:30] Reedy: oh noes [23:02:36] better link: https://logstash.wikimedia.org/#dashboard/temp/afps2EJsTESbcVXBEx4kSg [23:02:39] -em looks [23:02:46] (03PS1) 10Giuseppe Lavagetto: hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 [23:03:11] bd808, create a monitoring metric for it? [23:03:32] <_joe_> ori: ^^ I'm merging it [23:03:42] MaxSem: Yeah. We really should. 
[23:03:48] aude: Ah, its sites table is empty [23:03:54] (03CR) 10Ori.livneh: [C: 031] hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 (owner: 10Giuseppe Lavagetto) [23:03:55] _joe_: please do [23:04:00] i can take care of ti [23:04:02] it* [23:04:18] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 (owner: 10Giuseppe Lavagetto) [23:04:26] (03CR) 10Giuseppe Lavagetto: [V: 032] hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 (owner: 10Giuseppe Lavagetto) [23:04:27] I did run foreachwikiindblist wikidataclient.dblist extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --strip-protocols [23:04:28] :/ [23:04:55] why not just everywhere? [23:05:12] Not everywhere has wikidataclient [23:06:05] You get Fatal error: Class 'SiteMatrixParser' not found in /srv/mediawiki-staging/php-1.25wmf7/extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php on line 55 [23:06:39] enwiki knows about maiwiki for example [23:09:07] it needs the --site-group thing [23:09:23] * aude really needs to fix it.... getting annoyed [23:10:18] Does the autodetect stuff not work right? [23:10:23] $this->addOption( 'site-group', 'Site group that this wiki is a member of. Used to populate ' [23:10:24] . ' local interwiki identifiers in the site identifiers table. If not set and --wiki' [23:10:24] . ' is set, the script will try to determine which site group the wiki is part of' [23:10:24] . ' and populate interwiki ids for sites in that group.', false, true ); [23:12:50] any zuul experts here? Mobile's Jenkins is broken again.. https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/6812/console [23:12:54] it says that? [23:13:07] indeed [23:13:11] 22:31:45 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/p/mediawiki/core [23:13:45] https://github.com/wikimedia/mediawiki-extensions-Wikidata/blob/master/extensions/Wikibase/lib/maintenance/populateSitesTable.php#L29-L32 [23:14:05] mutante: yup :( [23:14:16] 22:31:45 IOError: Lock for file '/srv/ssd/jenkins-slave/workspace/mwext-MobileFrontend-qunit-mobile/src/.git/config' did already exist, delete '/srv/ssd/jenkins-slave/workspace/mwext-MobileFrontend-qunit-mobile/src/.git/config.lock' in case the lock is illegal [23:15:07] http://upload.wikimedia.org/wikipedia/commons/thumb/6/65/Kmii_logo_en.gif/220px-Kmii_logo_en.gif [23:15:39] aude: ah. I see the sitegroup for maiwiki on enwiki is mai [23:15:57] oh really? [23:16:12] http://p.defau.lt/?613TLEkJ_tIKaGjfVIO5CQ [23:16:39] the rest looks right/sane in comparison to aawiki row [23:16:51] (03PS3) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:17:10] jdlrobson: I deleted the git lock file. Let's try re-running the job [23:17:17] thanks bd808 [23:17:23] * jdlrobson crosses fingers [23:17:34] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:18:13] jdlrobson: Well it's at least broken differently now :/ [23:18:20] bd808: lolz [23:18:36] bd808: parsoid and ocg logging on logstash appears to be down [23:18:43] subbu: ^ [23:19:07] cscott: I'll kick the logstash instance. That goes to 1002 correct? 
[23:19:15] bd808, parsoid to 1003 [23:19:24] i just checked 1002 and it said the cluster is green, fwiw [23:19:26] is there a plan in place for figuring out the root of the problem? (i presume it is unknown) [23:20:05] seems to say site_group 'mai' on all wikis [23:20:26] i wish logstash's own log was more useful [23:20:47] jgage: The logstash service that feeds the es backend gets hung up sometimes [23:21:00] !log restarted logstash on logstash1003 [23:21:03] (03PS7) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [23:21:05] bd808: https://logstash.wikimedia.org/#dashboard/temp/Pqjx___vQ-GeeAb0Y3xaUQ says ocg logging stopped around 21:00 [23:21:06] Logged the message, Master [23:21:08] utc, presumably [23:21:11] bd808, how do you detemrine if it's in a bad state? [23:21:12] (03PS4) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:21:21] parsoid logs seem to be back [23:21:42] bd808: ocg logs are still missing [23:21:59] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:22:35] !log restarted logstash on logstash1002 [23:22:39] Logged the message, Master [23:22:56] bd808: yup, i'm seeing logs again. thanks. [23:22:57] viwikivoyage looks ok [23:23:05] cscott: np [23:24:11] !log Killed 3 hung /usr/local/bin/logstash_optimize_index.sh processes on logstash1002 [23:24:17] Logged the message, Master [23:24:42] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [23:26:18] (03PS5) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:26:39] oho, logstash_optimize_index.sh is called by root's crontab [23:26:50] !log deleted corrupt mediawki/core clone in workspace/mwext-MobileFrontend-qunit-mobile on gallium [23:26:52] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:52] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:52] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:53] PROBLEM - ElasticSearch health check on elastic1027 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:53] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:53] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:54] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:54] PROBLEM - ElasticSearch health check on elastic1022 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:56] Logged the message, Master [23:26:58] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:27:51] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:27:51] RECOVERY - ElasticSearch health check on elastic1027 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:27:51] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:27:52] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:30:03] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:06] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:06] RECOVERY - ElasticSearch health check on elastic1022 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:06] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:22] (03CR) 10Ori.livneh: [C: 04-1] "Looks good. I left comments inline, but they are all for cosmetic issues, except the one about the service resource." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [23:30:48] i went to a talk last night by an elasticsearch author, he was excited that we have a 31-node cluster. apparently that's large :) [23:31:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [23:31:21] <^demon|away> jgage: How was the talk? [23:31:37] bd808, https://gerrit.wikimedia.org/r/#/c/170935/ can probably be abandoned now. [23:31:42] it was pretty good, i need to check my notes against our config [23:31:46] jgage: that's always worrisome [23:31:52] he gave several specific tuning suggestions [23:32:08] lemme see if the deck is online [23:32:14] jgage: That cron job comes from this puppet config -- not a big deal if it hangs but I haven't seen that before [23:32:45] cscott: Not that worrisome. We have more content than most folks would by a long shot [23:33:12] And many.bubbles is practically a core contributor to the project :) [23:33:49] thanks bd808. /etc/cron.daily/ would probably be better than a user crontab, but it's a minor point. [23:33:59] bd808: i assume you've seen http://aphyr.com/posts/317-call-me-maybe-elasticsearch ? [23:34:06] <^demon|away> bd808: github's elastic setup is probably pretty decently sized, but i don't know details [23:34:37] (03Abandoned) 10BryanDavis: logstash: Drop spammy parsoid messages [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [23:34:38] <^demon|away> cscott: A ton of call me maybe was the motivation behind the zen discovery improvements in 1.4.0 [23:35:36] looks like this is the ~same talk i watched, but you have to register to watch the video, haven't found slides: http://www.elasticsearch.org/webinars/elk-stack-devops-environment/ [23:35:48] (03PS6) 10Ori.livneh: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:35:58] ^ gwicke -- made some lint fixes for you [23:36:11] <^demon|away> cscott: I think they've got a whole set of integration tests to simulate those network failure conditions now. [23:36:22] i would hope they are actually running jepsen [23:36:48] <^demon|away> Not a clue. [23:36:57] (03CR) 10Ori.livneh: "Avoid class names with dashes in them ('otto-cass'); they're a headache." [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:37:03] ^demon|away: A bit dated but said 44 EC2 instances with 2T of SSD :) [23:37:26] <^demon|away> 2T x 44 EC2 instances? 
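On the cron question raised earlier (the hung /usr/local/bin/logstash_optimize_index.sh processes found in root's crontab, and the aside that /etc/cron.daily/ might be a better home for the job), a rough sketch of both options in Puppet terms. The class name and schedule are invented; only the script path comes from the log above.

    class logstash::optimize_cron {
        # Option 1: a cron resource, which Puppet writes into root's
        # crontab (roughly the arrangement described above).
        cron { 'logstash_optimize_index':
            ensure  => present,
            user    => 'root',
            command => '/usr/local/bin/logstash_optimize_index.sh >/dev/null 2>&1',
            hour    => 5,
            minute  => 15,
        }

        # Option 2, per the suggestion in channel: drop the job into
        # /etc/cron.daily/ and let run-parts schedule it. run-parts only
        # executes names without dots by default, hence no '.sh' suffix.
        # file { '/etc/cron.daily/logstash-optimize-index':
        #     ensure => link,
        #     target => '/usr/local/bin/logstash_optimize_index.sh',
        # }
    }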
[23:37:28] oops 2T SSD per machine [23:37:41] <^demon|away> Oh ok, 88T makes more sense. [23:37:50] <^demon|away> I was like "what could you be indexing in only 2T?" [23:37:59] "That one is running elasticsearch 0.2, The volume of data there is 30 terabytes of primary data." [23:38:06] so old data for sure [23:38:57] <^demon|away> "Behind the scenes, we actually have probably a good 40 to 50 search indexes" [23:39:16] ori: thanks, I'm tweaking & testing currently as well [23:42:33] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:43:03] (03PS4) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [23:43:33] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:47:23] (03CR) 10Ori.livneh: Add class and role for Openstack Horizon (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170340 (owner: 10Andrew Bogott) [23:49:34] (03PS1) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [23:50:55] (03PS2) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [23:51:13] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [23:51:42] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [23:51:42] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [23:51:43] PROBLEM - ElasticSearch health check on elastic1027 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [23:52:34] (03PS7) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:52:42] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [23:52:43] RECOVERY - ElasticSearch health check on elastic1027 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [23:52:43] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [23:53:05] bd808 mutante zuul seems to have fixed itself :) [23:53:12] (03PS3) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [23:53:28] (03PS8) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [23:53:35] (03PS8) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:53:40] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [23:55:01] jdlrobson: I deleted the git clone of mw/core and let it check it out again. It got corrupted somehow. [23:55:04] andrewbogott: do you remember anything re: what the issue was for parsoid/ve on wikitech? [23:55:36] ori: I'm not sure we ever knew -- I think it didn't work and we turned it off and moved on. [23:56:18] ori: sorry for the incurious response; at the time many things were broken all at once :) [23:56:55] andrewbogott: no worries, i totally understand [23:57:16] andrewbogott: plus "incurious" is a wonderful word so you get bonus points for that ;) [23:57:56] Reedy: https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json [23:58:07] maiwiki is listed as a special site [23:58:59] I wonder why [23:59:09] the dblists it is in are sane [23:59:33] yeah [23:59:45] oh [23:59:49] did I sync langlist!?