[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141106T0000).
[00:01:06] Is someone doing the SWAT?
[00:01:11] (CR) Hoo man: [C: +1] "Looks good at a glance now" [dns] - https://gerrit.wikimedia.org/r/171475 (owner: Dzahn)
[00:01:13] (CR) Dzahn: "15:59 'wmgMobileFrontend' => array(" [dns] - https://gerrit.wikimedia.org/r/171475 (owner: Dzahn)
[00:01:21] Looks like RoanKattouw has a patch in
[00:01:26] I can
[00:04:24] (PS1) Andrew Bogott: Change ocg.log rotation from 15 days to 7 days. [puppet] - https://gerrit.wikimedia.org/r/171477
[00:05:07] (CR) Dzahn: [C: +1] "sounds good, adding cscott" [puppet] - https://gerrit.wikimedia.org/r/171477 (owner: Andrew Bogott)
[00:05:19] ebernhardson, I hope you will not run away during the deployment like last time? :P
[00:05:58] MaxSem: i dont intend to run away :)
[00:06:02] (CR) Andrew Bogott: [C: +2] Change ocg.log rotation from 15 days to 7 days. [puppet] - https://gerrit.wikimedia.org/r/171477 (owner: Andrew Bogott)
[00:06:12] MaxSem: usually i try to get home before swat so that doesn't happen
[00:06:27] ok
[00:08:48] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[00:09:16] !log cleaned up some log files on ocg1001 and reduced logrotations to 7.
[00:09:23] Logged the message, Master
[00:09:47] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[00:13:20] !log ocg1001 is depressingly tiny and will probably keep complaining about disk space until it's rebuilt
[00:13:28] Logged the message, Master
[00:13:49] andrewbogott: :-)
[00:18:16] !log maxsem Synchronized php-1.25wmf6/extensions/Flow/: SWAT (duration: 00m 05s)
[00:18:24] Logged the message, Master
[00:18:26] ebernhardson, ^^^
[00:18:52] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: SWAT (duration: 00m 04s)
[00:18:58] Logged the message, Master
[00:19:03] kaldari, ^^^
[00:20:24] !log maxsem Synchronized php-1.25wmf7/extensions/VisualEditor/: SWAT (duration: 00m 07s)
[00:20:31] Logged the message, Master
[00:20:39] RoanKattouw, ^^^
[00:20:54] MaxSem: Have flagged to etonkovidova.
[00:21:19] grr, hierarchies
[00:21:49] lol
[00:22:04] MaxSem: I have people who have people. ;-)
[00:22:13] Also, Elena should be in here. ;-)
[00:22:19] EVUL
[00:22:30] No no, "James".
[00:22:44] MaxSem: PHP fatal error in /srv/mediawiki/php-1.25wmf6/extensions/MobileFrontend/includes/MobileContext.php line 1024: Call to undefined method WebResponse::getheader() :(
[00:23:31] lol, ori ^
[00:24:41] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: (no message) (duration: 00m 07s)
[00:24:50] Logged the message, Master
[00:24:52] srsly
[00:25:04] MaxSem: thanks
[00:25:15] deployment branches should maintain contact with reality
[00:25:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:25:29] they should not contain stuff reverted yesterday
[00:25:58] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:06] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:09] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:18] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:26:35] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:04] MaxSem?
[00:27:09] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:29] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:33] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:42] MaxSem: master: https://gerrit.wikimedia.org/r/#/c/171190/ , wmf6: https://gerrit.wikimedia.org/r/#/c/171189/ , wmf5: https://gerrit.wikimedia.org/r/#/c/171188/
[00:27:44] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:12] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:17] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:20] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: (no message) (duration: 00m 04s)
[00:28:23] PROBLEM - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:28:28] Logged the message, Master
[00:28:29] whaddafuq
[00:28:38] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20964 bytes in 0.009 second response time
[00:29:10] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 21048 bytes in 0.463 second response time
[00:29:47] reverted locally, looking WTF was going on
[00:29:49] what's up?
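For context on the fatal above: the MobileFrontend code in the wmf6 branch called a WebResponse accessor that the deployed core branch did not have (ori's actual fixes are the gerrit changes linked at 00:27:42). Below is a minimal, hypothetical sketch of a defensive guard for that failure mode; the header name and the fallback path are illustrative assumptions, not the deployed patch.

```php
// Hypothetical guard, not the actual MobileFrontend/core fix: avoid assuming
// that the deployed core branch already has the newer WebResponse::getHeader().
$response = $this->getRequest()->response();
if ( method_exists( $response, 'getHeader' ) ) {
	// Newer core branches expose the accessor directly.
	$value = $response->getHeader( 'X-Carrier' ); // header name is illustrative
} else {
	// Older branches: fall back to scanning the headers queued so far.
	$value = null;
	foreach ( headers_list() as $sent ) {
		if ( stripos( $sent, 'X-Carrier:' ) === 0 ) {
			$value = trim( substr( $sent, strlen( 'X-Carrier:' ) ) );
			break;
		}
	}
}
```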
[00:29:52] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21059 bytes in 0.597 second response time
[00:29:58] hi
[00:29:58] I was about to eat, and my phone is making lots of noise
[00:30:05] bblack, PHP error, reverted
[00:30:17] ok
[00:30:26] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 21035 bytes in 0.600 second response time
[00:30:32] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21079 bytes in 0.470 second response time
[00:30:47] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21019 bytes in 0.034 second response time
[00:31:14] that was fun
[00:31:17] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20986 bytes in 0.290 second response time
[00:31:22] thanks for handling, maxsem
[00:31:24] RECOVERY - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20999 bytes in 0.219 second response time
[00:31:54] i was paged 19 times, heh
[00:31:56] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20938 bytes in 0.002 second response time
[00:32:07] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20986 bytes in 0.031 second response time
[00:32:11] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21008 bytes in 0.288 second response time
[00:32:53] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 21030 bytes in 0.230 second response time
[00:34:04] (PS1) Dereckson: Adding Ukraine photo sources to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/171484 (https://bugzilla.wikimedia.org/73045)
[00:34:38] i look forward to the postmortem on how that change made it past qa/beta
[00:35:07] with dexterity and subtlety!
[00:35:29] heh
[00:36:44] well, that change was detected in beta and reverted, but restored by a bogus cherrypick/module update
[00:37:01] oops
[00:37:16] currently we're trying to figure out what happened
[00:38:19] we don't have a test environment to test production branches, hehehe
[00:39:14] clearly we need a beta-beta-labs-labs to validate what comes out of beta-labs
[00:39:25] +1
[00:39:28] delta labs
[00:39:41] lambda labs!
[00:39:55] <3 the crowbar
[00:40:52] in lambda labs you can clone an instance to a new one while making one change, but thereafter each instance is immutable.
[00:41:26] too highbrow
[00:41:32] :p
[00:42:08] let deployment-www-99234 = deployment-www-99233 + 5f4bde3a;
[00:44:53] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[00:51:03] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 332 seconds
[00:52:32] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:57:42] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1
[00:57:42] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:42] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:42] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:42] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2068: active_shards: 6220: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [00:57:52] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2069: active_shards: 6221: relocating_shards: 3: initializing_shards: 3: unassigned_shards: 1 [00:57:53] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2069: active_shards: 6221: relocating_shards: 3: initializing_shards: 3: unassigned_shards: 1 [00:58:53] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:58:53] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:58:53] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:58:53] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [00:59:02] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[00:59:03] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[00:59:03] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2070: active_shards: 6226: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:02:13] (CR) Dzahn: [C: +2] beta: linting and autoload modules [puppet] - https://gerrit.wikimedia.org/r/170484 (owner: John F. Lewis)
[01:05:58] !log git-sync-upstream on deployment-salt for beta puppetmaster
[01:06:08] Logged the message, Master
[01:10:55] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail
[01:11:14] can anyone help with a fundraising emergency?
[01:11:54] I'm hoping to replay commands from our db replication log.
[01:14:56] <^d> "Fatal error: Call to a member function get() on a non-object in /srv/mediawiki/php-1.25wmf6/extensions/Gadgets/Gadgets_body.php on line 357 "
[01:16:29] global ping: any opsen who are familiar with the Fundraising database servers?
[01:17:47] failover ping: anyone who can get to db1025->db1008 replication logs?
[01:19:30] springle: ^
[01:20:20] oh frack
[01:20:30] theoretically yes, I can try
[01:20:41] <^d> How is $wgMemc a non-object?
[01:21:01] ^d: too early in the init process?
[01:21:12] <^d> This is pretty late in a maintenance script.
[01:21:40] :S
[01:21:45] !log restarted gmond on mw1018 and mw1031
[01:21:53] Logged the message, Master
[01:21:57] the script is too hardcore to call Setup.php? :P
[01:22:22] <^d> Time to live-hack terbium and find out :)
[01:22:55] springle: thanks, hopefully you don't regret it any more than I already do :p
[01:23:16] springle: so, I accidentally dropped the table civicrm.wmf_civicrm_extra
[01:23:56] springle: I'm hoping we can restore it from a backup--I'm already in a position to do that
[01:24:17] springle: then, replay the replication log if possible, to fill in the remainder of the missing data.
[01:24:53] I don't know if it's possible to replay at all, but if it is, I'd like to replay only statements which write to that one table
[01:25:05] awight: looking. we may need to page Jeff_Green to avoid making it worse
[01:25:14] sure, that would work for me
[01:25:38] replaying a single table is non-trivial
[01:25:43] I bet...
[01:26:02] <^d> MaxSem, ebernhardson: We have stacktrace! https://phabricator.wikimedia.org/P63
[01:26:17] springle: I'm also open to restoring the entire db, and replaying all tables until the point at which I destroy everything.
[01:26:55] springle: it would be great if you could start a backup of the current, damaged state of db1025:civicrm
[01:35:11] awight: looks like the frack db dumps are per database, which is good, but not linked to a specific replication position, which means we need to ask Jeff_Green how a restore should be handled
[01:35:26] springle: thanks for checking!
[01:35:27] the last one is ~13h old
[01:35:48] springle: I texted a number ending in 5522, is that how to get ahold of Jeff?
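An aside on the $wgMemc fatal being debugged above: ori's quip ("the script is too hardcore to call Setup.php?") refers to the standard maintenance-script bootstrap, in which requiring Maintenance.php and ending with RUN_MAINTENANCE_IF_MAIN is what pulls in Setup.php and initialises globals such as $wgMemc before execute() runs. A minimal sketch of that skeleton, with hypothetical names rather than the actual Gadgets script from the stack trace:

```php
<?php
// Minimal maintenance-script sketch (hypothetical, not the failing Gadgets
// code). The require/RUN_MAINTENANCE_IF_MAIN pair is what runs Setup.php,
// so $wgMemc should already be a real cache object inside execute().
require_once __DIR__ . '/Maintenance.php';

class CachePokeDemo extends Maintenance {
	public function execute() {
		global $wgMemc;
		if ( !is_object( $wgMemc ) ) {
			// The situation being debugged above: a global that Setup.php
			// should have initialised is still not an object.
			$this->error( '$wgMemc is not an object; was Setup.php skipped?', 1 );
		}
		$wgMemc->set( wfMemcKey( 'demo', 'poke' ), time(), 60 );
		$this->output( "wrote demo key\n" );
	}
}

$maintClass = 'CachePokeDemo';
require_once RUN_MAINTENANCE_IF_MAIN;
```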
[01:36:17] springle: also, do you know if the replication logs are going to expire on us?
[01:36:31] mmm, only 6 other digits to pick :P
[01:36:51] awight: we have 10 days before log expiry
[01:36:56] <^d> MaxSem: I'll start with 000-000 :p
[01:37:18] springle: great
[01:37:51] awight: this is important enough to page Jeff_Green in your opinion? (i really have no idea what civicrm handles on frack)
[01:38:06] springle: definitely
[01:38:10] ok then
[01:38:18] springle: we had already chatted about butting heads at about this hour
[01:38:28] it just got... more urgent, though
[01:40:07] (PS1) Dzahn: beta monitoring (labmon): fix graphite class name [puppet] - https://gerrit.wikimedia.org/r/171492
[01:41:55] awight: special civicrm backup on db1008 is started
[01:42:03] are new transactions still going to a fresh table now on top of the current mess?
[01:42:04] may not help us much, but there you are
[01:42:13] there is no fresh table
[01:42:15] ok
[01:42:43] bblack: if you're talking about the FR fubar, no I had to pause the process which records new donations into our DB
[01:42:55] ok
[01:43:02] awight: so we have leeway to wait for Jeff?
[01:43:08] springle: thanks, the backup will be a big time saver for Jeff.
[01:43:54] springle: yep, the donations intake is decoupled from this part of the pipeline, so the donor-facing issue is very minor, just a delay in getting their receipts and other warm salutations.
[01:44:03] ok, good
[01:44:23] (CR) Dzahn: [C: +2] beta monitoring (labmon): fix graphite class name [puppet] - https://gerrit.wikimedia.org/r/171492 (owner: Dzahn)
[01:44:24] * awight kneels to ancestors for a moment
[01:47:11] hmm, can't find an lvm snapshot slave for frack
[01:47:38] maybe we don't do that there. means sql restore + binlog is only choice
[01:47:43] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:49:03] rats, that sounded like a good plan
[01:52:23] springle: ping
[01:52:36] Jeff_Green: pong
[01:52:46] blargh
[01:53:28] * awight weeps softly in the background
[01:54:54] awight, how did it happen?
[01:55:28] MaxSem: first, I did something small and stupid. Then I worked my way up to dropping an entire table.
[01:55:55] I know my own actions brought me to this point, but still I blame Oracle for providing me the tools to hurt myself with.
[01:56:16] ORACLE?
[01:56:31] <^d> SNORACLE.
[01:56:32] * awight mutters mysql...
[01:57:12] it's mariadb :)
[01:57:31] maxsem@tin:/srv/mediawiki-staging/php-1.25wmf6$ mysql --version
[01:57:31] mysql Ver 14.14 Distrib 5.5.35, for debian-linux-gnu (x86_64) using readline 6.2
[01:57:50] depends if we're talking about client or server SW :)
[01:57:56] tin != prod db servers
[01:58:04] there are 3 modules, mysql, mysql_wmf and mariadb
[02:03:53] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[02:24:47] (PS1) Dzahn: (WIP) facilities: move to module [puppet] - https://gerrit.wikimedia.org/r/171493
[03:07:35] (PS1) Dzahn: (WIP) certificates: move to module [puppet] - https://gerrit.wikimedia.org/r/171496
[03:24:06] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:06] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:06] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:06] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:24:45] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [03:25:16] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:16] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:16] RECOVERY - ElasticSearch health check on elastic1018 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:16] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [03:25:46] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 7: initializing_shards: 0: unassigned_shards: 0 [03:26:47] PROBLEM - ElasticSearch health check on elastic1022 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:47] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:48] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:48] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:49] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:49] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:50] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:50] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:52] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:52] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:26:53] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6239: relocating_shards: 9: initializing_shards: 3: unassigned_shards: 1 [03:31:17] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6243: relocating_shards: 3: initializing_shards: 2: unassigned_shards: 1 [03:33:18] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2077: active_shards: 6247: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [03:34:06] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:07] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:08] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:08] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:09] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:09] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:10] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:10] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:11] RECOVERY - ElasticSearch health check on elastic1022 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:11] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:12] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:12] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:13] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:34:13] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:45:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 72% free (5436 MB out of 7627 MB) [03:50:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5624 MB out of 7627 MB) [03:55:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 74% free (5642 MB out of 7627 MB) [03:58:27] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [04:00:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5656 MB out of 7627 MB) [04:05:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5659 MB out of 7627 MB) [04:09:48] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [04:10:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5675 MB out of 7627 MB) [04:15:19] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5678 MB out of 7627 MB) [04:20:05] (03CR) 10KartikMistry: "@Reedy, what need to fix to get it work as expected? 
:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [04:20:19] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5684 MB out of 7627 MB) [04:25:19] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5703 MB out of 7627 MB) [04:26:55] kart_: you need to use $wmg instead of $wg [04:27:42] kart_: if you set 'wgFoo', that gets set to $wgFoo, then your extension is loaded, which will set the extension default of $wgFoo, wiping out your intended value [04:27:59] so, set what you want to $wmgFoo, and then after your extension is enabled, do $wgFoo = $wmgFoo [04:30:10] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5706 MB out of 7627 MB) [04:33:19] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 [04:33:20] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 [04:34:49] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:34:49] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:35:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 75% free (5714 MB out of 7627 MB) [04:35:38] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:35:38] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:36:08] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2077: active_shards: 6247: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:36:10] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2077: active_shards: 6247: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [04:40:18] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5729 MB out of 7627 MB) [04:42:17] (03CR) 10Glaisher: "Reedy:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [04:45:20] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 76% free (5733 MB out of 7627 MB) [04:47:58] legoktm: thanks! [04:50:21] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6009 MB out of 7627 MB) [04:55:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6010 MB out of 7627 MB) [04:56:51] (03PS2) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 [04:57:34] (03PS3) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 [04:58:28] (03PS1) 10Brion VIBBER: Expose Content-Range response header for CORS requests on upload. [puppet] - 10https://gerrit.wikimedia.org/r/171502 [05:00:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6010 MB out of 7627 MB) [05:01:25] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [05:02:34] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 9: initializing_shards: 0: unassigned_shards: 0 [05:04:35] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:04:35] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:05:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6010 MB out of 7627 MB) [05:05:46] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:05:46] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:05:46] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2075: active_shards: 6241: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [05:06:35] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:35] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:44] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:44] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:06:44] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2076: active_shards: 6244: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:10:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 79% free (6011 MB out of 7627 MB) [05:15:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 80% free (6060 MB out of 7627 MB) [05:20:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 80% free (6070 MB out of 7627 MB) [05:25:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6174 MB out of 7627 MB) [05:28:30] (03CR) 10Legoktm: Beta: Enable EventLogging in ContentTranslation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [05:30:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6174 MB out of 7627 MB) [05:35:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6174 MB out of 7627 MB) [05:40:18] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 81% free (6175 MB out of 7627 MB) [05:45:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6190 MB out of 7627 MB) [05:47:04] legoktm: bah. Thanks. I need more coffee. [05:47:40] (03PS4) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 [05:50:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6226 MB out of 7627 MB) [05:55:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6227 MB out of 7627 MB) [06:00:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6227 MB out of 7627 MB) [06:01:12] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [06:01:32] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [06:01:41] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [06:01:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [06:04:05] (03CR) 10GWicke: [C: 031] "Hi Andrew, based on a cursory look this looks good to me." 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [06:05:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6229 MB out of 7627 MB) [06:10:13] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6230 MB out of 7627 MB) [06:15:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 82% free (6230 MB out of 7627 MB) [06:18:12] (03PS4) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [06:18:51] (03CR) 10jenkins-bot: [V: 04-1] WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [06:20:49] (03PS5) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [06:26:59] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:27:00] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:28:09] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:39] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:50] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:09] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [06:29:29] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 52.96 ms [06:29:49] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.59 ms [06:29:49] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:09] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 51.88 ms [06:34:09] PROBLEM - puppet last run on es1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:29] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures [06:46:11] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:59] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:08] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:09] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:51:09] RECOVERY - puppet last run on es1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:53:38] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:05:48] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:10:40] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail [07:30:18] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:59] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [07:40:53] mutante: whelp, sorry about that. 
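The $wmg/$wg pattern legoktm explains at 04:26–04:28 above (and which kart_'s PS4 of the ContentTranslation patch then follows) works roughly as sketched below. The Foo names are legoktm's hypothetical example, not the actual ContentTranslation setting names, and $wmgUseFoo is likewise an assumed enable flag.

```php
// wmf-config/InitialiseSettings.php — minimal sketch of the pattern described
// above: keep the per-wiki value under a wmg* key so the extension's own
// default cannot clobber it when the extension is loaded.
'wmgFoo' => array(
	'default' => false,
	'testwiki' => true,
),

// wmf-config/CommonSettings.php — load the extension first (its setup file
// assigns the extension default to $wgFoo), then copy the wmg value over it.
if ( $wmgUseFoo ) {
	require_once "$IP/extensions/Foo/Foo.php";
	$wgFoo = $wmgFoo;
}
```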
[07:53:31] (03PS3) 10Glaisher: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [07:54:50] (03CR) 10Glaisher: [C: 031] "Just added bd.m, be.m and nyc.m" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [07:57:09] PROBLEM - Host ns1-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::e [07:57:15] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:2:d6ae:52ff:fead:5610 [07:57:29] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:1:d6ae:52ff:feac:4dc8 [07:57:39] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:57:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [08:00:01] (03CR) 10Glaisher: "There are other private/small wikis as well (eg. stewardwiki, checkuserwiki). Do we add them as well?" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [08:12:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 104, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-307236) {#10694} [10Gbps wave]BR [08:14:27] (03PS1) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. [puppet] - 10https://gerrit.wikimedia.org/r/171515 [08:15:14] (03CR) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [08:16:06] (03PS2) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. [puppet] - 10https://gerrit.wikimedia.org/r/171515 [08:17:10] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 106, down: 0, dormant: 0, excluded: 0, unused: 0 [08:17:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [08:17:43] <_joe_> ori: mmmh not sure I like this solution, isn't there a way to check if that's activated? 
[08:17:58] <_joe_> uh, sorry, bbiab (if you go, good night) [08:17:59] RECOVERY - Host 2620:0:860:2:d6ae:52ff:fead:5610 is UP: PING OK - Packet loss = 0%, RTA = 52.00 ms [08:17:59] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 52.37 ms [08:18:10] RECOVERY - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is UP: PING OK - Packet loss = 0%, RTA = 51.90 ms [08:19:58] _joe_: there isn't (that was what my comment was about) [08:20:23] good night [08:24:49] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [08:25:25] <_joe_> gee [08:29:00] (03CR) 10Steinsplitter: [C: 031] Adding Ukraine photo sources to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171484 (https://bugzilla.wikimedia.org/73045) (owner: 10Dereckson) [08:29:08] <_joe_> so, we seriously need to re-do the ocg servers from scratch [08:29:24] <_joe_> for now, I'm going to bind-mount /var/log into /srv [08:32:26] <_joe_> well, I'll do that after I have created the new HHVM package and tried it [08:33:22] <_joe_> I still can't believe we created a server with a non-LVM root partition of 9 GB, with no space for expansion [08:40:00] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error [08:40:00] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [08:45:14] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [08:47:23] (03PS1) 10Nikerabbit: Revert "Disable l10nupdate for the duration of CLDR 26 plural migration" [puppet] - 10https://gerrit.wikimedia.org/r/171516 [09:01:39] RECOVERY - Disk space on ms-be2003 is OK: DISK OK [09:29:46] (03PS1) 10Yuvipanda: icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 [09:29:47] godog: ^ +1? [09:30:40] YuviPanda: sure, taking a look [09:31:42] YuviPanda: what's with the garbled html in the commit message? :) [09:32:06] lolwut [09:32:23] https://phabricator.wikimedia.org/T789 is how it looks on my console [09:32:37] godog: lolwut, hit edit, and it shows it correctly [09:33:45] mhhh perhaps the auto-linking of issues like RT # is clashing? [09:34:04] perhaps... [09:34:09] let me try to edit [09:34:21] (03PS2) 10Yuvipanda: icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 [09:34:31] (03CR) 10Filippo Giunchedi: [C: 031] "note: the url shows garbled in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/171519 (owner: 10Yuvipanda) [09:34:36] (03PS3) 10Yuvipanda: icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 [09:34:46] godog: seems to be the case now, is fine now [09:34:47] ah nevermind my comment, +1 [09:35:05] so ok T is enough [09:35:21] yeah [09:37:37] (03CR) 10Yuvipanda: [C: 032] icinga: Send betalabs alerts to alerts list [puppet] - 10https://gerrit.wikimedia.org/r/171519 (owner: 10Yuvipanda) [09:41:08] YuviPanda: I'd rather we get rid of this icinga/betalabs thing entirely [09:41:13] yeah [09:41:22] should be gone next week [09:41:27] good :) [09:41:58] paravoid: I'm going to go ahead and do (1) from my ops@ email next week, if not much more discussion happens. 
[09:42:28] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:42:29] PROBLEM - HHVM rendering on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:54] I never replied you on-list, but I'm wondering if the two problems to external resources you stated are really unsolvable [09:50:42] <_joe_> interesting, mw1030 will be fixed soon-ish [09:53:51] <_joe_> !log installing the new hhvm package on mw1030 and mw1018 in order to test for stability [09:53:59] Logged the message, Master [09:54:39] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.149 second response time [09:54:49] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 67494 bytes in 0.352 second response time [09:56:40] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [09:56:41] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 67494 bytes in 0.129 second response time [09:57:37] <_joe_> instead of getting more traffic to hhvm, I will raise the weight of those two servers in pybal [10:07:52] <_joe_> !log temporary raising weight of mw1018 and 1030 in pybal to load-test them and check for crashes [10:07:58] Logged the message, Master [11:04:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [11:12:33] (03PS1) 10Filippo Giunchedi: add graphite-related CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/171525 [11:13:54] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [11:14:26] <_joe_> mmmh what's up? [11:33:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:48:04] PROBLEM - nutcracker port on mw1163 is CRITICAL: Connection refused [11:52:06] RECOVERY - nutcracker port on mw1163 is OK: TCP OK - 0.000 second response time on port 11212 [11:57:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "While I do agree with the idea, wouldn't it be better to set this as a local cronjob running every hour?" 
[puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [12:01:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "1) put the new command definition in a separate declaration, like most of the check commands are now" [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [12:02:26] (03PS2) 10Giuseppe Lavagetto: hhvm: remove unnecessary upstart stanza, config option [puppet] - 10https://gerrit.wikimedia.org/r/171244 [12:09:38] !log Depool wtp1001, wtp1003-1006 for trusty upgrade [12:09:46] Logged the message, Master [12:10:07] <_joe_> 5 at a time, w00t [12:10:17] yesterday it was 7 [12:10:19] <_joe_> I usually do 4 [12:10:26] and parsoid did not sweat [12:10:48] <_joe_> yeah resource usage on parsoid is not that high [12:13:35] PROBLEM - check if salt-minion is running on wtp1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:04] PROBLEM - check if salt-minion is running on wtp1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:24] PROBLEM - check if salt-minion is running on wtp1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:14:28] PROBLEM - check if salt-minion is running on wtp1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:15:48] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: remove unnecessary upstart stanza, config option [puppet] - 10https://gerrit.wikimedia.org/r/171244 (owner: 10Giuseppe Lavagetto) [12:21:44] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 67939 bytes in 0.225 second response time [12:22:05] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [12:22:30] <_joe_> bbl, lunch [12:52:54] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [13:16:14] PROBLEM - puppet last run on wtp1003 is CRITICAL: Timeout while attempting connection [13:16:36] PROBLEM - Host wtp1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:00] PROBLEM - Host wtp1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:19] PROBLEM - Host wtp1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:18:00] PROBLEM - Host wtp1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:09] PROBLEM - Host wtp1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:26:41] (03PS1) 10Adrian Lang: Add qunit localhost setup to role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/171535 (https://bugzilla.wikimedia.org/72184) [13:48:24] (03PS1) 10Alexandros Kosiaris: Move wtp1001 to wtp1004 to raid1-lvm partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/171536 [13:48:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move wtp1001 to wtp1004 to raid1-lvm partman scheme [puppet] - 10https://gerrit.wikimedia.org/r/171536 (owner: 10Alexandros Kosiaris) [13:53:46] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:11:47] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0 [14:41:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [14:44:35] (03CR) 10Nikerabbit: [C: 031] Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [14:51:38] (03CR) 10Ottomata: "> One is the user/pass for a given cluster; what will the path in the private puppet repos be." [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [14:53:04] (03CR) 10Manybubbles: [C: 031] "No idea on the ensure_packages. I'm not really a skilled puppet developer so I just do whatever the rest of the repository does." [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [14:55:12] !log running performance test for Cirrus taking zhwiki [14:55:21] Logged the message, Master [14:55:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [14:57:06] !log performance test for zhwiki was good. trying dewiki [14:57:10] ^d: ^^^ [14:57:13] Logged the message, Master [14:57:24] <^d> sweet [14:58:28] yayyy [14:59:16] <^d> manybubbles: i'd try zhwiki again after we rebuild it. it got more shards. [14:59:29] ^d: that _should_ only make it faster [14:59:38] <^d> indeed :) [15:01:01] !log dewiki is fine. trying enwiki. [15:01:08] Logged the message, Master [15:03:05] ^d: I'm replaying 100% of enwiki's traffic now. http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Elasticsearch+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [15:03:40] Nemo_bis: good news! [15:04:07] sweet [15:04:36] all done [15:04:53] well, done with 20k searches. those new disks are much much much much better [15:04:55] ottomata: ^^^^ [15:05:20] woowooo [15:05:40] having doubled the number of machines might have something to do with that too [15:05:56] ottomata: tried replaying 100% enwiki's traffic. Servers saw almost no bump in, well, anything [15:06:00] godog: might just :) [15:06:16] I suppose what I'm saying is that we have more than enough power to cut over [15:06:35] <^d> let's do it today then! [15:06:37] <^d> gogogogogo [15:19:25] (03PS2) 10Faidon Liambotis: Expose Content-Range response header for CORS requests on upload [puppet] - 10https://gerrit.wikimedia.org/r/171502 (owner: 10Brion VIBBER) [15:19:42] (03PS3) 10Faidon Liambotis: Expose Content-Range response header for CORS requests on upload [puppet] - 10https://gerrit.wikimedia.org/r/171502 (owner: 10Brion VIBBER) [15:19:57] (03CR) 10Faidon Liambotis: [C: 032] Expose Content-Range response header for CORS requests on upload [puppet] - 10https://gerrit.wikimedia.org/r/171502 (owner: 10Brion VIBBER) [15:24:41] !log finished with performance testing for cirrus - new servers look like way way more than enough power [15:24:43] <^d> manybubbles: I came across http://www.elasticsearch.org/overview/shield, bleh [15:24:48] Logged the message, Master [15:25:17] ^d: there's been lots of clawing for that [15:25:22] I'm not suprised [15:25:37] <^d> Yeah. But it's like paid add-on stuff which is why I said bleh :\ [15:25:55] <_joe_> enterprise-grade security. Having worked in an enterprise, I won't use that as a marketing pitch :P [15:38:22] (03CR) 10Krinkle: [C: 04-1] "Won't work. 
and duplicate of" [puppet] - 10https://gerrit.wikimedia.org/r/171535 (https://bugzilla.wikimedia.org/72184) (owner: 10Adrian Lang) [15:39:06] Whenever I hear "enterprise" anything, I think Star Trek. [15:39:45] Galaxy-class security. [15:40:41] (03PS1) 10Ottomata: Use mysql::config::client to render a research pw file readable by the stats user. [puppet] - 10https://gerrit.wikimedia.org/r/171543 [15:41:31] (03PS2) 10Ottomata: Use mysql::config::client to render a research pw file readable by the stats user. [puppet] - 10https://gerrit.wikimedia.org/r/171543 [15:43:28] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [15:43:38] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [15:43:48] <_joe_> !log upgrading mw1031,mw1032 to the new package, no crashes seeen since reinstall [15:43:49] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [15:43:49] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [15:43:53] Logged the message, Master [15:43:55] (03CR) 10Ottomata: [C: 032] Use mysql::config::client to render a research pw file readable by the stats user. [puppet] - 10https://gerrit.wikimedia.org/r/171543 (owner: 10Ottomata) [15:44:41] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:44:59] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:10] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:10] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:20] marktraceur, manybubbles, ^d: So who wants to SWAT today? [15:45:29] I can do it! [15:45:29] Hmm [15:45:34] OK then! [15:45:38] ok! [15:48:22] (03PS3) 10Manybubbles: Enable TemplateData GUI for all wikis; move config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) (owner: 10Jforrester) [15:48:57] (03CR) 10Manybubbles: [C: 031] "Rebased clean. Should be ok. -1 seemed to have been cleared before I got to it. Looks fine to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157478 (https://bugzilla.wikimedia.org/60158) (owner: 10Jforrester) [15:49:37] <_joe_> !log load-testing hhvm, in particular the servers with the new package [15:49:43] Logged the message, Master [15:50:00] (03CR) 10Manybubbles: [C: 031] "Noop for production so fine by me. Will +2 during SWAT in 10 minutes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [15:52:58] manybubbles: thanks [15:55:38] PROBLEM - NTP on wtp1006 is CRITICAL: NTP CRITICAL: Offset unknown [15:55:39] PROBLEM - NTP on wtp1005 is CRITICAL: NTP CRITICAL: Offset unknown [15:56:22] anyone had this error before in labs? first time I see it but can't find exactly what's wrong [15:56:25] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/memcached.py] at /etc/puppet/modules/memcached/manifests/ganglia.pp:19 [15:56:38] RECOVERY - NTP on wtp1006 is OK: NTP OK: Offset -0.001505970955 secs [15:56:49] RECOVERY - NTP on wtp1005 is OK: NTP OK: Offset -0.007328987122 secs [15:56:53] i.e. ganglia_new should get applied and that dir created inturn [15:59:09] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 8: initializing_shards: 1: unassigned_shards: 1 [15:59:48] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 9: initializing_shards: 1: unassigned_shards: 1 [15:59:57] silly health check [16:00:04] manybubbles, anomie, ^d, marktraceur, James_F: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141106T1600). Please do the needful. [16:00:18] its the stupid health check the complain when we rebuild indexes [16:00:26] (03CR) 10Manybubbles: [C: 032] Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [16:00:34] (03Merged) 10jenkins-bot: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [16:01:20] manybubbles: I never checked but the percent-based check seems to be working ok? [16:01:22] James_F|Away: are you around for https://gerrit.wikimedia.org/r/#/c/157478/3 ? [16:01:29] checked for false positives that is [16:01:42] its much much less likely to complain for no reason [16:01:49] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:02:28] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [16:02:40] yeah, possibly next week we could disable this one [16:02:52] !log manybubbles Synchronized wmf-config/: SWAT deploy some beta configs. Should be noop. (duration: 00m 04s) [16:03:00] Logged the message, Master [16:03:08] kart_: ^^ [16:03:09] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [16:03:09] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [16:03:09] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [16:03:27] elasticsearch is green you silly health check [16:04:02] gi11es: around for your swat? 
[16:04:08] manybubbles: yes [16:05:24] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:05:25] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:05:25] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2086: active_shards: 6274: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [16:06:10] (03CR) 10BBlack: "For the future: is there somewhere we could be auto-generating these from that's upstream in the mediawiki-config sense, both for mobile a" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [16:08:41] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2081: active_shards: 6257: relocating_shards: 7: initializing_shards: 6: unassigned_shards: 4 [16:09:42] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2085: active_shards: 6271: relocating_shards: 8: initializing_shards: 0: unassigned_shards: 0 [16:09:42] (03CR) 10Cscott: "That's not actually 7 days, that's "7 log files which are 256M each". But we're probably growing the log more than 256M/day, and logrotat" [puppet] - 10https://gerrit.wikimedia.org/r/171477 (owner: 10Andrew Bogott) [16:11:37] !log manybubbles Synchronized php-1.25wmf7/extensions/MultimediaViewer/: SWAT revert layout changes (duration: 00m 06s) [16:11:38] gi11es: ^^^^ [16:11:41] andrewbogott: ocg1001's / is full again. could you delete /var/log/upstart/ocg* -- those files shouldn't be being created any more. [16:11:42] Logged the message, Master [16:11:58] James_F|Away: last ping for https://gerrit.wikimedia.org/r/#/c/157478/3 [16:12:26] manybubbles: testing [16:13:54] cscott: done, but that box is pretty much doomed regardless. Needs a rebuild with more disk space. [16:14:19] or we just need to make /var/log its own partition or some such. [16:14:26] <_joe_> cscott: they are empty [16:14:38] _joe_: did that today already. I think [16:14:41] <_joe_> cscott: your home on that server has 500 mb of data, mwalkers' 200 [16:14:54] <_joe_> andrewbogott: no I had no time this morning [16:14:58] ah, ok [16:14:59] du /var/log is 1.2G right now, on a root partition that is only 9.1G large. [16:15:04] <_joe_> those servers need reimaging [16:15:40] <_joe_> cscott: can we make ocg log to a different directory than /var/log? 
[16:15:40] manybubbles: not seeing the changes yet, but it's not usual for there to be a delay when I test after the deploy [16:15:45] _joe_: oh, yeah, it's a little weird that /home is on the root partition -- but useful, since we need to rebuild binary modules from time to time and the build process hates flock on nfs. [16:16:03] manybubbles: ah, there it is. all good [16:16:13] gi11es: great! [16:16:24] _joe_: we could; the logrotate config would have to change as well. [16:16:26] manybubbles: thanks for the swat [16:16:58] _joe_: i'm working on a patch now to make logrotate run hourly, that should help some in terms of making logs use a consistent amount of space, so sudden load doesn't cause the partition to fill up [16:17:21] _joe_: we already are using the space-limited logrotate options, instead of the time-limited options, but because logrotate only runs once/day it's not really as effective as it should be [16:17:24] gi11es: your welcome! [16:17:43] <_joe_> cscott: no, can we log to another directory? [16:17:52] RECOVERY - Disk space on ocg1001 is OK: DISK OK [16:18:05] <_joe_> you don't need to do log rotation aggressively if we have room [16:18:20] <_joe_> cscott: say /srv/log/ocg/ [16:18:25] _joe_: i can roll that into the same puppet patch if that's helpful. but i'd still like to be space-limited rather than date-limited. [16:18:48] especially since the /srv directory is used for ocg's cache & etc. they should all be space-limited in theory. [16:19:04] putting logs on /srv will probably create more problems than it solves [16:19:16] <_joe_> cscott: in practice, it will not [16:19:23] since ocg doesn't actually stop or anything if / fills up, but it *would* start giving errors if /srv fills up. [16:19:29] <_joe_> we have a ridiculously large /srv partition for now [16:19:49] _joe_: only ridiculously large because i turned caching in ocg *way* down. [16:21:52] so i'll make the patch :) [16:21:56] <_joe_> cscott: ok :) [16:22:38] <_joe_> cscott: while you're at it, can you make the log file and dir a class variable and use templates for log configs? [16:22:43] also, we're using 70G on /srv for 2 days of cache right now. i'd like to increase the cache lifetime up to 7 days or so, which would use closer to 245G. [19:32:38] <_joe_> which would leave ~ 120 gb free [19:32:38] _joe_: maybe we need two patches. ;) this is starting to sound like advanced puppeting. [19:32:38] <_joe_> :) [19:32:38] <_joe_> ok, make yours, I'll do the other later :) [19:32:39] _joe_: in theory! [19:32:39] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [19:32:39] cscott: can ocg do its own rotation instead? [19:32:39] godog: we should just turn off syslog logging in that case. it's already logging to logstash. [19:32:39] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:32:39] !log manybubbles is done with SWAT [19:32:40] cscott: ah syslogging, nevermind my comment then [19:32:41] syslog is/should just be a small amount of local log mirroring for emergency use. [19:32:41] it's just that / doesn't even allow a "small amount" of logging right now. [19:32:41] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Puppet has 2 failures [19:32:41] and, like i keep saying, it would be fine if logrotate was actually size-limiting the logs like it is supposed to. but... [19:32:42] on the other hand, someone should be writing a logstash gc/rotation process, if there isn't one already. [19:32:42] logstash keeps 31 days of logs. 
It drops the 32nd day every night [19:32:43] bd808: perhaps you should read backlog ;) [19:32:43] and/or the comment on https://gerrit.wikimedia.org/r/171477 -- and there's another one or two puppet patches where i have this same discussion, fruitlessly. [19:32:43] bd808: oh, wait. never mind. you said 'logstash'. [19:32:43] bd808: i still had syslog on my brain. forgive me. [19:32:43] * cscott is hacking puppet [19:32:44] bd808: good to know, thanks. [19:32:45] bd808: i like the powers of 2 [19:32:45] _joe_: /srv/deployment/ocg/log good? [19:32:45] Apparently it actually drops the 31st day -- https://github.com/wikimedia/operations-puppet/blob/production/modules/logstash/files/logstash_delete_index.sh#L14 [19:32:45] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [19:32:45] bd808: how lunar. [19:32:46] PROBLEM - RAID on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:46] anyone up for a dns change? https://gerrit.wikimedia.org/r/#/c/171525/ [19:32:47] PROBLEM - DPKG on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] * cscott notes, pedantically, that the 30-day period is the lunar synodic period, as opposed to the 28-day lunar orbital period. [19:32:47] PROBLEM - puppet last run on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - check configured eth on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - check if dhclient is running on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - check if salt-minion is running on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - Disk space on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:47] PROBLEM - SSH on labstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:50] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [19:32:58] I'm on it. (labstore1001 just went away) [19:33:00] <_joe_> !log upgrading the hhvm API appservers to use the new package [19:33:04] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:33:05] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:11] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [19:33:14] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:33:14] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.027 second response time [19:33:14] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:33:15] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:33:15] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.006 second response time [19:33:18] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.019 second response time [19:33:18] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:33:18] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2087: active_shards: 6277: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [19:33:18] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2087: active_shards: 6277: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:33:21] PROBLEM - Host wtp1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:33:21] PROBLEM - Host wtp1006 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:33:21] PROBLEM - Host wtp1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:21] PROBLEM - Host wtp1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:22] RECOVERY - Host wtp1006 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [19:33:22] RECOVERY - Host wtp1001 is UP: PING OK - Packet loss = 0%, RTA = 2.86 ms [19:33:22] RECOVERY - Host wtp1003 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [19:33:22] RECOVERY - Host wtp1004 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [19:33:22] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.011 second response time [19:33:23] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [19:33:24] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:33:25] * Coren brings labstore1001 up gradually and carefully. [19:33:26] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [19:33:26] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 4.81 ms [19:33:26] !log repool wtp1001, wtp1003-1006 after trusty upgrade [19:33:27] _joe_: a present for you: https://gerrit.wikimedia.org/r/171578 [19:33:27] <_joe_> cscott: thanks a lot [19:33:27] _joe_: pls review carefully -- that's probably the largest puppet patch i've written so far. not improbable that i'm misunderstanding how things work in some way. [19:33:27] <_joe_> cscott: rotate hourly means you need to restart node every hour? or it has a signal for log rotation? [19:33:27] _joe_: Are you happy with that patch as an alternative to remapping /var/log? We shouldn't do both... [19:33:27] <_joe_> andrewbogott: I am [19:33:27] ok [19:33:27] _joe_: rotate hourly means logrotate is run every hour. node doesn't have to be restarted. [19:33:27] <_joe_> andrewbogott: also, the *real* solution is to just reimage those servers with LVM and sensible partitioning [19:33:27] _joe_: logrotate is configured to use copytruncate [19:33:27] rsyslog is probably HUPped or something, but i assume logrotate knows what it's doing there. [19:33:27] <_joe_> cscott: if you do copytruncate, no; [19:33:27] <_joe_> gwicke: ok [19:33:27] the log file in question isn't being written to directly from node. [19:33:27] _joe_: yes! that's what I wrote in the SAL yesterday. "andrewbogott: ocg1001 is depressingly tiny and will probably keeping complaining about disk space until it's rebuilt" [19:33:27] the upstart log file is piped from the console, so that's more "interesting" -- but that log file is empty nowadays [19:33:27] cscott: stdout & stderr is piped into it [19:33:27] <_joe_> cscott: oh ok [19:33:27] <_joe_> but as gwicke told you ^^ [19:33:27] gwicke: yes, that's /var/log/upstart/ocg.* which is a different log file, not the one being tweaked in this patch. 
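A rough sketch of the ocg logging change being discussed here, assuming a parameterised log directory (as _joe_ asks for), a size-limited copytruncate logrotate stanza, and an hourly cron so the size cap is enforced more often than the packaged once-a-day run. The class, template and path names are placeholders, not cscott's actual Gerrit change:

    # Illustrative sketch only. The template name, log directory and
    # size/rotate values are assumptions; the real patch may differ.
    class ocg::logging (
        $log_dir = '/srv/log/ocg',   # assumed; could equally stay under /var/log
    ) {
        file { $log_dir:
            ensure => directory,
        }

        # Size-limited rotation with copytruncate, so neither node nor
        # rsyslog needs a restart or HUP when the file is rotated.
        file { '/etc/logrotate.d/ocg':
            ensure  => present,
            content => template('ocg/logrotate.erb'),   # hypothetical template
        }

        # logrotate normally only runs from /etc/cron.daily; running it
        # hourly keeps the size limit meaningful under sudden load.
        cron { 'ocg-logrotate-hourly':
            command => '/usr/sbin/logrotate /etc/logrotate.d/ocg',
            minute  => 0,
        }
    }

    # The hypothetical template body would be along these lines:
    #   <%= @log_dir %>/*.log {
    #       size 256M
    #       rotate 7
    #       copytruncate
    #       compress
    #       missingok
    #   }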
[19:33:27] <_joe_> cscott: so ocg just sends messages to rsyslog [19:33:27] <_joe_> ok [19:33:27] cscott: kk [19:33:27] yeah. somebody (andrewbogott maybe?) added 'console none' to the upstart file a few days ago in any case, so there isn't actually an upstart log file right now. [19:33:27] wasn't me [19:33:27] <_joe_> cscott: I did [19:33:27] but the latest ocg reconfig (part of the move from winston to bunyan logging) turned off console logging for ocg. [19:33:28] <_joe_> cscott: so I can remove that, I was waiting for this :) [19:33:28] so if 'console none' was removed from the upstart config, i *expect* that very little will actually end up in the upstart logs. but that's a patch for another time. [19:33:28] yeah, let's do one thing at a time. [19:33:28] the rotation of the upstart log files are actually impossible to reconfigure from a puppet module, which is a buglet in its own right. [19:33:28] <_joe_> cscott: upstart dictates one-size-fits-all [19:33:28] <_joe_> mmm the gerrit bot is dead apparently [19:33:28] yeah, so i figure it's upstart's responsibility to cleanly handle its logs being rotated in any case. [19:33:28] but fwiw /var/log/upstart/*.log does *not* have copytruncate configured. [19:33:28] but /etc/logrotate.d/upstart comes directly from the upstart package, so again it's upstream's problem if it doesn't work right. and we can't easily change it in any case. [19:33:28] PROBLEM - Host labstore1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:33:28] <_joe_> andrewbogott: can you care to merge that patch? I got to get back to hhvm stuff [19:33:28] <_joe_> and it's already pretty late here [19:33:28] _joe_: Yep! Will merge as soon as I finish reading [19:33:28] cscott: Those ensure=>absent crons that you're removing… they're a relic of another age? [19:33:28] things that are no longer created? [19:33:29] cscott: I've merged your patch and applied it on ocg1001. So far I don't see any log files. Is it possible the service is totally wrecked due to the full drive earlier? [19:33:31] <_joe_> andrewbogott: you need to restart rsyslog maybe? [19:33:31] _joe_: trying... [19:33:32] <_joe_> sorry but I'm going off for a few hours, I most surely have business to attend to this evening [19:33:32] you should have called it a day after the hhvm upgrade :) [19:33:32] <_joe_> !log load test done on the HHVM pool [19:33:33] 17.28 -!- morebots [~morebots@208.80.155.255] has quit [Ping timeout: 245 seconds] [19:33:34] cscott: please ping me if/when you return [19:33:35] or, hm, Jeff_Green are you about? [19:33:35] ya [19:33:35] You worked on ocg, right? [19:33:35] some yeah [19:33:35] looking at backscroll.... [19:33:35] We've been tinkering with the logfiles there and I'm trying to figure out what, if anything, is currently broken [19:33:35] it had 100% full / earlier today, so it might be that a lot of it is busted. [19:33:35] I don't know what it does enough to check w/not it's working properly [19:33:35] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2087: active_shards: 6277: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [19:33:35] (It is definitely not producing any logs!) [19:33:35] i've had a couple squabbles with rsyslog there before [19:33:35] hateses rsyslog [19:33:35] ocg1001? 
[19:33:35] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2088: active_shards: 6280: relocating_shards: 7: initializing_shards: 0: unassigned_shards: 0 [19:33:35] yep [19:33:35] we can at least test the rsyslog filters using logger [19:33:35] manybubbles: Hey, sorry, just out of meetings. Was https://gerrit.wikimedia.org/r/#/c/157478 listed without a note that it was good to go? Eurgh. :-( [19:33:35] James_F: I just won't SWAT anything without _someone_ supporting it around. [19:33:35] manybubbles: Sure, but why did greg-g list me as supporting it? [19:33:35] checking [19:33:35] I thought you were the supported because you committed it [19:33:35] let me make sure it was you on the list... [19:33:35] yeah - it was you on the calendar too [19:33:35] It was just going to go out in the usual Thursday deploy, but we changed the existence of those. [19:33:35] Maybe I should have just let Reedy push it yesterday instead. [19:33:35] * James_F sighs. [19:33:35] * James_F runs to the next meeting. [19:33:35] lol [19:33:35] oops, forgot about that [19:33:35] Reedy: little help with that ^ :) [19:33:35] Sorry! [19:33:35] James_F: my bad, too [19:33:36] andrewbogott: the syslog user doesn't have write privs to the log dir [19:33:36] Jeff_Green: well, that would do it. One moment... [19:33:36] there are a lot of log dirs on this box [19:33:45] Jeff_Green: it's already owner => root, group => syslog. What should it be? [19:33:46] (Or so says puppet) [19:33:46] g+w [19:33:46] maybe 0775 [19:33:46] oh of course [19:33:47] we should probably purge or move the old logs when all is said and done to reduce confusion [19:33:47] Hm. "To illustrate the costs of permanently storing 1 TB of files: $2000 " http://archiveteam.org/index.php?title=Swipnet#Donating [19:33:48] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/171586/1 [19:33:50] merged [19:33:50] thanks [19:33:50] np [19:33:50] I'll apply on ocg1001 and clean up [19:33:50] unless you already did [19:33:51] nope, go for it [19:33:51] !log reedy Synchronized wmf-config/: Enable TemplateData GUI everywhere (duration: 00m 14s) [19:33:51] Jeff_Green, cscott, ocg logging seems right now. [19:33:52] cool, thanks! [19:34:00] Reedy: Aha, thanks. [19:34:03] !log Coren and cmjohnson frantically working to resolve a Labs NFS failure [19:34:04] andrewbogott: Looks like a dead controller atm. Chris is shuffling hardware around to make sure now. [19:34:05] _joe_: from -tech: 13:43 < mbh_> hi, what's with api? I cannot save pages with api, error 503 [19:34:06] <_joe_> greg-g: I am off, can you ping someone else? [19:34:06] <_joe_> is this hhvm specific? [19:34:06] no idea [19:34:06] <_joe_> ok so... I have dinner in 1 minute :) [19:34:06] opsen ^^ :) [19:34:06] _joe_: enjoy [19:34:06] coren: give it a whirl [19:34:07] cmjohnson: Drac isn't responding? [19:34:07] Ah, there it is. [19:34:11] greg-g: You should champion a global ping word for opsen like the "hhvm-help:" stalk word that is used in #hhvm :) [19:34:11] I'll just put that on my list of things to champion.... [19:34:11] what, !domas doesn't do that? 
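The permission fix Jeff_Green and andrewbogott converge on above (rsyslog writes as the syslog user, so the log directory needs group write) comes down to a one-line mode change on the directory resource. A sketch of its shape; the path is an assumption, not the exact resource in the merged change:

    # Rough shape of the fix discussed above: keep root:syslog ownership
    # but add group write (0775) so rsyslog can create and write the
    # log files. The directory path is illustrative.
    file { '/srv/log/ocg':
        ensure => directory,
        owner  => 'root',
        group  => 'syslog',
        mode   => '0775',
    }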
[19:34:11] hah [19:34:12] * matanya thanks _joe_ [19:34:25] Reedy: maybe we can have a DB on extension1 called 'shared' for these sorts of random global tables [19:34:25] AaronSchulz: sounds sensible [19:34:25] seems better than picking meta or centralauth ;) [19:34:25] sql aawiki -h db1029 [19:34:25] CREATE DATABASE shared; [19:34:25] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [19:34:27] NFS is back up; LABS instances are gradually recovering. [19:34:27] Reedy: grants need tweaking [19:34:27] (03CR) 10Ottomata: [C: 032] Slightly refactor misc::statistics::limn::mobile_data_sync [puppet] - 10https://gerrit.wikimedia.org/r/171553 (owner: 10Ottomata) [19:34:27] (03PS1) 10Ottomata: Fix typo variable name in statistics.pp [puppet] - 10https://gerrit.wikimedia.org/r/171559 [19:34:27] (03CR) 10Ottomata: [C: 032 V: 032] Fix typo variable name in statistics.pp [puppet] - 10https://gerrit.wikimedia.org/r/171559 (owner: 10Ottomata) [19:34:27] well that is delayed! [19:34:27] (03CR) 10Filippo Giunchedi: [C: 04-1] "depends on Id3bc2b4c9" [puppet] - 10https://gerrit.wikimedia.org/r/171547 (owner: 10Filippo Giunchedi) [19:34:27] (03PS1) 10Ottomata: Couple of more fixes for misc::statistics::limn::data::generate refactor [puppet] - 10https://gerrit.wikimedia.org/r/171563 [19:34:27] ottomata: It's actually impressively robust that it came through at all. :-) [19:34:27] (03CR) 10Ottomata: [C: 032 V: 032] Couple of more fixes for misc::statistics::limn::data::generate refactor [puppet] - 10https://gerrit.wikimedia.org/r/171563 (owner: 10Ottomata) [19:34:28] :) [19:34:28] (03PS1) 10Ottomata: Fix source_dir variable reference [puppet] - 10https://gerrit.wikimedia.org/r/171566 [19:34:28] (03CR) 10Ottomata: [C: 032 V: 032] Fix source_dir variable reference [puppet] - 10https://gerrit.wikimedia.org/r/171566 (owner: 10Ottomata) [19:34:28] (03PS1) 10Ottomata: Remove management of source_dir ownership [puppet] - 10https://gerrit.wikimedia.org/r/171568 [19:34:28] (03CR) 10Ottomata: [C: 032 V: 032] Remove management of source_dir ownership [puppet] - 10https://gerrit.wikimedia.org/r/171568 (owner: 10Ottomata) [19:34:28] Reedy: Access denied for user 'wikiadmin'@'10.%' to database 'shared' (10.64.16.18) [19:34:38] (03PS1) 10Ottomata: Separate rsync from generate cron job [puppet] - 10https://gerrit.wikimedia.org/r/171572 [19:34:42] (03CR) 10Ottomata: [C: 032] Separate rsync from generate cron job [puppet] - 10https://gerrit.wikimedia.org/r/171572 (owner: 10Ottomata) [19:34:44] AaronSchulz: heh. I wonder what they're setup as [19:34:57] (03PS1) 10Ottomata: Don't need to manage rsync_from and output directories [puppet] - 10https://gerrit.wikimedia.org/r/171577 [19:35:01] (03CR) 10Ottomata: [C: 032] Don't need to manage rsync_from and output directories [puppet] - 10https://gerrit.wikimedia.org/r/171577 (owner: 10Ottomata) [19:41:11] (03PS1) 10Ori.livneh: apache::monitoring: provision 'links' package [puppet] - 10https://gerrit.wikimedia.org/r/171621 [19:41:32] godog: got a moment for a small patch? 
^ [19:43:35] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [19:45:48] !log restarted ntp on labstore1001 [19:45:54] Logged the message, Master [19:46:03] not that it helped [19:46:25] 'offset unknown' sounds sci-fi-ish [19:46:30] yeah [19:46:33] i like it, it's exciting [19:47:06] Timecop 3: Offset Unknown [19:47:33] RECOVERY - DPKG on labstore1001 is OK: All packages OK [19:47:33] RECOVERY - Disk space on labstore1001 is OK: DISK OK [19:47:34] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical [19:47:52] RECOVERY - check if salt-minion is running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:47:56] RECOVERY - check configured eth on labstore1001 is OK: NRPE: Unable to read output [19:47:56] RECOVERY - check if dhclient is running on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [19:47:56] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:50:45] Reedy: I made 'wikishared', which is usable [19:51:24] (03CR) 10Giuseppe Lavagetto: [C: 031] "please do." [puppet] - 10https://gerrit.wikimedia.org/r/171621 (owner: 10Ori.livneh) [19:51:44] <_joe_> ori: /win 39 [19:51:47] <_joe_> err [19:51:51] hello [19:51:52] <_joe_> ori: merge it! [19:51:58] (03PS2) 10Ori.livneh: apache::monitoring: provision 'links' package [puppet] - 10https://gerrit.wikimedia.org/r/171621 [19:52:05] (03CR) 10Ori.livneh: [C: 032 V: 032] apache::monitoring: provision 'links' package [puppet] - 10https://gerrit.wikimedia.org/r/171621 (owner: 10Ori.livneh) [19:52:11] thanks! [19:54:14] Hi all, i wanted to share some of my time for wiki foundation, how I can do this and what is required? [19:54:55] mihau_: What would you like to do? [19:55:33] I don't know exactly, some scripting in shell maybe, usual operations sth not difficult for starters [19:56:07] (03CR) 10Ottomata: "I refactored the limn data sync puppet code. You should now be able to add a single line to class misc::statistics::limn::data::jobs to d" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [19:57:03] giuseppe (i'm avoiding a ping :P) see my reply on if you're around [20:00:25] (03PS3) 10Ottomata: Create eventlogging-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169195 [20:01:40] Reddy: Can you tell me what entry level can do for wikimedia? [20:04:03] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset 0.000152349472 secs [20:04:25] (03CR) 10Ottomata: [C: 032 V: 032] Create eventlogging-roots group, add qchris [puppet] - 10https://gerrit.wikimedia.org/r/169195 (owner: 10Ottomata) [20:04:32] (03CR) 10Ori.livneh: Create eventlogging-roots group, add qchris (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169195 (owner: 10Ottomata) [20:04:37] br_mail_timestamp ON /*_*/bounce_records(br_user_email(50), br_timestamp); [20:04:50] Reedy: heh, I doubt mysql is smart enough to use the second part of that index [20:04:53] ottomata: heh, ignore my comment, not worth a follow up [20:05:01] can somebody do something about gerrit's ssh being down (or incredibly slow) for me and aparently other people outside US? [20:05:09] haha [20:05:12] ori, sorry :) [20:05:19] np! [20:05:22] naw i like pedandicness :) [20:05:23] (03CR) 10Tnegrin: "works for me." [puppet] - 10https://gerrit.wikimedia.org/r/169195 (owner: 10Ottomata) [20:05:48] a `git fetch` runs for ten minutes, then times out. 
[20:06:10] MatmaRex: mine responded after about a minute and then finished reasonably quickly [20:06:19] WFM too [20:06:21] mine timed out after ten minutes. [20:06:36] (03PS1) 10Ottomata: Fix capitalization of EventLogging in comments in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/171626 [20:06:53] (03CR) 10Ottomata: [C: 032 V: 032] Fix capitalization of EventLogging in comments in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/171626 (owner: 10Ottomata) [20:06:58] (03CR) 10Ori.livneh: "<3" [puppet] - 10https://gerrit.wikimedia.org/r/171626 (owner: 10Ottomata) [20:07:11] (03PS2) 10Milimetric: Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/171465 [20:07:15] Reedy: it could on an equality with a string of len < 50, but I know it's not clever enough [20:07:19] too bad [20:10:46] (03CR) 10Giuseppe Lavagetto: "It is two operations (crontab -e; puppet agent --disable) vs one; but it's using the right tool for the job... my opposition comes more fr" [puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [20:11:38] <_joe_> ori: even better - make it a cron.d file instead of using the idiotic puppet cron resource, just renaming it will disable it [20:11:52] <_joe_> so we can do that programmatically for multiple hosts [20:12:01] _joe_: that could work. i had another idea too [20:12:33] <_joe_> (mv /etc/cron.d/cover-our-lazy-asses /etc/cron.d/cover-our-lazy-asses.disabled) [20:12:58] we could make it a cron.hourly script that checks if any users are logged in [20:13:07] on the assumption that if you have heap profiling enabled you at least have a screen session open [20:13:26] <_joe_> too complicated :) [20:14:02] <_joe_> I mean, you can disable puppet and move a file, it's no big deal, and we don't risk it not running because I forgot a screen session there [20:14:10] <_joe_> (which I do all the time btw) [20:14:23] MatmaRex: Is it trying to hit Gerrit over IPv6? [20:15:03] RoanKattouw: i doubt it, how do i check? (i'm on windows) [20:15:16] actually, the plain SSH connections to gerrit API succeed [20:15:20] oh OK [20:15:21] it's just the `git fetch` that hangs [20:15:30] oh wow [20:15:34] nevermind, it finished now [20:15:41] only took nine minutes this time [20:15:44] crisis averted [20:15:53] <^d> 10% better! [20:30:26] (03CR) 10Dzahn: "matanya suggested to "write that class in the role but not in the main one, but include only it in site.pp on a specific host". so like on" [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [20:35:03] (03CR) 10Dzahn: "not sure yet about getting it from mediawiki config, but what we could do is make a new template in helpers, like langlist/langs.tmpl, tha" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:43:54] apparently jenkins is stuck? [20:44:03] anyone in here working on that, or should i give it a go? [20:45:26] aude: About? Do you know if mobilefrontend was disabled on wikidata because it didn't work well? Or was it just because there wasn't a mobile domain setup? 
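The cron.d pattern _joe_ floats above (manage the job as a plain file so it can be disabled host-by-host with a rename, after disabling puppet) might look roughly like this; the job name and the command it runs are hypothetical placeholders, not ori's actual change:

    # Sketch of the suggested cron.d approach. Disabling it on one host
    # is then just:
    #   puppet agent --disable && mv /etc/cron.d/hhvm-heap-check{,.disabled}
    # The file name and command are placeholder assumptions.
    file { '/etc/cron.d/hhvm-heap-check':
        ensure  => present,
        owner   => 'root',
        group   => 'root',
        mode    => '0444',
        content => "0 * * * * root /usr/local/sbin/check-heap-profiling-off\n",
    }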
[20:47:36] Lydia_WMDE: ^ [20:47:59] Reedy: i think because it is broken [20:48:09] doesn't work very well with our pages [20:48:14] at the moment at least [20:48:36] Thanks :) [20:48:44] That did come to mind when mutante asked about it [20:50:00] (03PS4) 10Dzahn: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 [20:50:14] Lydia_WMDE: ^ so because of that, adding a bunch of missing ".m."'s [20:50:23] but we currently exclude wikitech and wikidata [20:50:36] *nod* please keep it excluded for now [20:50:47] we were wondering if there was no DNS because no mobile frontend, or no MF because no DNS :) [20:50:50] ok! [20:50:55] :) [20:51:16] hopefully the situation will improve with the new design [20:51:20] and then we can see again [20:52:30] Reedy: statements render ugly [20:52:41] what Lydia_WMDE says [20:52:46] or rather Wikidata doesn't work with narrow screens;) [20:53:27] MaxSem: yeah [20:53:30] heh [20:53:46] mutante, a mobile subdomain for WD or anything else is fine, as long as were not starting to redirect people there:) [20:54:01] (03CR) 10Dzahn: "added "steward" and "checkuser" wikis" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:55:00] MaxSem: i think the bug was that for other missing wikis we did redirect people but then it didnt exist [20:55:07] (03CR) 10Matanya: [C: 031] "Please add DFC as well." [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:55:38] dfc? [20:55:49] (03CR) 10Matanya: "FDC, of course." [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:56:05] ottomata: do you have a moment for a puppet / hiera question? [20:57:12] (03PS5) 10Dzahn: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 [20:57:33] (03CR) 10Dzahn: "added fdc" [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [20:57:35] gwicke: sure [20:59:22] so I think we'll need to come up with some path for private info like user/pass per cluster [20:59:40] $db_pass = '<%= scope.lookupvar('passwords::bugzilla::bugzilla_db_pass') %>'; [21:00:18] yeah, for sure [21:00:24] gwicke it will be in the private repo somewhere [21:00:36] but, you will pass that in from the role class, it won't be in any of your modules [21:00:41] (or it will come from hiera...?) [21:00:58] so something like passwords::cassandra::cluster1::user ? [21:00:59] you woudln't do it in a template [21:01:09] yeah, something like that, but the template should just do [21:01:11] yeah, I'll pass it to the module [21:01:16] pass=<%= @db_pass %> [21:01:17] or whatever [21:01:18] yeah [21:01:52] so for now, is there a reason / possibility to put any of this in hiera? [21:02:40] yes, so, hiera would just fill in your module parameters automatically [21:02:52] your module should still work without hiera [21:03:04] e.g. if all required parameters were passed in manually [21:03:26] hmm.. wouldn't that mean that i'd have to pass in a cluster name? [21:03:40] yup [21:04:00] in my vagrant instance wher ei was developing the cassandra module [21:04:04] i have this in my heira common.yaml [21:04:11] cassandra::cluster_name: mediawiki-vagrant [21:04:11] cassandra::listen_address: "%{::ipaddress_eth1}" [21:04:11] cassandra::rpc_address: "%{::ipaddress_eth1}" [21:04:11] cassandra::seeds: ["%{::ipaddress_eth1}"] [21:04:11] cassandra::dc: vagrant [21:04:12] cassandra::rack: 1 [21:04:46] so it's not keyed off the cluster name? [21:04:51] ? [21:04:55] how do you support multiple clusters in this scenario? 
[21:05:15] ah, good q, not entirely sure. if i was not using hiera, i would do it in multiple roles [21:05:22] role::restbase::cassandra, maybe [21:05:26] role::analytics::cassandra [21:05:30] and those would each be users of the cassandra module [21:05:33] and pass in relevant values [21:05:49] yeah, I think I know how to do that [21:05:55] with hiera (which I am still new at)... [21:05:56] hm [21:06:09] let me just do the obvious thing first [21:06:13] i guess there would be a separate yaml file for each? and somehow those yaml files would be applied per role? [21:06:37] I'd think that it's hierarchical info, so we'd key off a cluster name for example [21:06:40] aye ok, i mean, for your module developemnt, it doesn't matter as much, you can test your modules using a role class and manually setting params [21:06:56] ja, gwicke, if there is a way to do that, then yeah [21:06:57] makes sense [21:08:12] I'll manually copy in my WIP module on some of the labs vms [21:08:20] aye cool [21:08:27] can't check out two things at once [21:08:31] gwicke: supposedly you can edit yaml configs in labs now too! [21:08:40] although I don't yet know how it works, YuviPanda|zzz pointed me to it [21:09:04] yaml configs as in hiera? [21:09:06] yes [21:09:08] um um um [21:09:09] kk [21:10:15] pretty brand new though [21:10:15] https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2014-11-03#Hiera [21:10:21] need documentation :) [21:10:35] it will let you edit a special wikipage on wikitech to set hiera yaml for your labs project [21:10:54] cool [21:13:23] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [21:18:50] (03CR) 10Dzahn: [C: 032] add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [21:20:33] (03CR) 10QChris: "The dependent change has been deployed." [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [21:25:23] (03PS2) 10Ottomata: Make varnishkafka pick up Range header [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [21:26:05] (03CR) 10Ottomata: [C: 032 V: 032] Make varnishkafka pick up Range header [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [21:26:56] !log added Range header field to varnishkafka webrequest logs [21:27:01] Logged the message, Master [21:32:44] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:39:05] ottomata: I'm unclear on how I can get from the hiera setup in labs to what you sketched on the patch [21:39:46] the role classes, gwicke [21:39:47] ? [21:40:12] yes [21:40:33] so I'll have to set up some role for the cluster [21:41:42] yes, you'll have to do that anyway, at the very least to include the class...i think there is some fancy hiera mainrole thing that _joe_ made [21:41:49] but i'm not entirely sure how to use it [21:42:06] but, ja, you will need role classes defined [21:42:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [21:42:29] ottomata: so how do I get a role class that I can include per cluster? 
[21:42:29] if you are doing self hosted puppetmaster, you could manually create them in /var/lib/git/operations/puppet/manifests/role/ for now [21:43:06] currently everything seems to be just cassandra::* [21:43:10] gwicke: i'm not sure how to do the hiera yaml stuff we were just talking about [21:43:19] but, traditionally, with roles, you'd have [21:43:28] maybe [21:44:41] role::restbase::cassandra { [21:44:41] class { '::cassandra': cluster_name => 'restbase', ... } [21:44:41] } [21:44:41] role::othercluster::cassandra { [21:44:42] class { '::cassandra': { cluster_name => 'othercluster', ... } [21:44:42] } [21:44:48] and include those roles on whatever nodes you want [21:45:18] not saying you *should* do that, exactly, but that is a way [21:45:24] as for hiera...uhhh [21:45:40] IF you can figure out how to set hiera values based on cluster [21:45:49] or, pick them based on cluster [21:45:53] then you might not need the role classes [21:45:56] you might be able to do [21:45:57] just [21:46:02] class { '::cassandra': } [21:46:05] and that's it [21:46:12] and hiera would fill in the appropriate parameters [21:46:41] okay, the latter sounds less certain [21:46:54] yeah, i mean, its just because I don't have a lot of experience with hiera yet [21:47:01] need someone who does..._joe_? :) [21:47:26] can you explain what the nested stanzas are doing? [21:47:40] especially class { '::cassandra': cluster_name [21:47:42] => 'restbase', ... } [21:47:58] does that write into the cassandra namespace? [21:49:04] ah, no, that is just including a class with parameters [21:49:17] class { 'classname': [21:49:18] parameterA => 'valueA', [21:49:18] ... [21:49:18] } [21:50:03] the :: in ::cassandra means to load the class starting from root scope, [21:50:11] I see, thanks [21:50:25] often needed in similarly named role classes, because puppet allows for nested includes and classes (i wish it didn't) [21:50:30] relative* incldues [21:50:55] e.g. [21:50:55] class role::cassandra { [21:50:55] class { 'cassandra': ... } [21:50:55] } [21:50:58] could likely error [21:51:09] because puppet would think you are trying to include the class from inside itself [21:54:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:55:49] ottomata: *nod*, thx [21:57:24] (03CR) 10Greg Grossmeier: "From: https://bugzilla.wikimedia.org/show_bug.cgi?id=72275#c5" [puppet] - 10https://gerrit.wikimedia.org/r/154710 (owner: 10Ori.livneh) [22:09:32] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [22:09:35] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [22:09:47] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [22:09:48] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [22:10:02] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [22:10:02] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [22:10:22] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [22:10:33] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [22:10:37] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [22:11:03] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [22:11:03] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [22:11:23] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [22:11:32] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [22:11:33] grr ... syntax error in the config file. [22:11:36] hotfixing. 
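A cleaned-up version of the per-cluster role pattern ottomata sketches above, written out as valid Puppet: one thin role class per cluster passes that cluster's parameters into the shared cassandra module, loaded from root scope to avoid the relative-include pitfall he mentions. The class name, seed IPs and datacenter/rack values are placeholder assumptions, not the eventual restbase role:

    # Illustrative only: a per-cluster role class wrapping the cassandra
    # module. Concrete values here are assumptions for the sketch.
    class role::restbase::cassandra {
        class { '::cassandra':
            cluster_name => 'restbase',
            seeds        => ['10.64.0.10', '10.64.0.11'],
            dc           => 'eqiad',
            rack         => 'a1',
        }
    }

    # Credentials would come from the private repo as discussed earlier
    # (e.g. something under passwords::cassandra::*), and with Hiera the
    # role can shrink to a bare `include ::cassandra`, with the same keys
    # (cassandra::cluster_name, cassandra::seeds, ...) supplied from YAML
    # per role or per labs project.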
[22:11:52] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [22:12:13] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [22:12:22] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: Connection refused [22:12:39] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [22:12:39] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [22:12:39] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [22:12:49] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [22:12:55] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [22:12:59] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [22:13:06] <_joe_> subbu: ok [22:13:10] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.022 second response time [22:13:10] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.033 second response time [22:13:11] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time [22:13:19] jshint should have caught it .. how did jenkins let it by. [22:13:19] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [22:13:19] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [22:13:29] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [22:13:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.023 second response time [22:13:40] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.021 second response time [22:13:49] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time [22:13:49] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [22:14:00] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.015 second response time [22:14:00] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [22:14:00] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.022 second response time [22:14:00] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.016 second response time [22:14:00] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time [22:14:00] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.004 second response time [22:14:01] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.014 second response time [22:14:09] on the bright side, paging works [22:14:13] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.027 second response time [22:14:14] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [22:14:14] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [22:14:14] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.028 second response time [22:14:14] Yup [22:14:14] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.007 second response time [22:14:19] hey, what's up ? [22:14:30] sorry .. it was bad config. [22:14:31] paravoid: It worked very well yesterday too. 
I got like 26 pages for the mobile outage [22:14:54] fixed. we have to look at our jenkins setting why jshint didn't catch it. [22:14:55] subbu: ok, no worries. [22:15:12] RoanKattouw: why are you getting pages? :) [22:15:17] <_joe_> akosiaris: I was worried the upgrade did have something to do with this [22:15:25] Because we have separate alerts for mobile-lb.{eqiad,esams,ulsfo} IPv{4,6} [22:15:28] you gave up your root recently, didn't you? [22:15:35] paravoid: Yeah and I asked to only get Parsoid pages [22:15:43] But mutante said he wasn't sure that was possibel [22:15:48] _joe_: me too [22:15:49] s/Parsoid/*oid/ [22:16:35] * akosiaris going back to sleep [22:16:42] akosiaris, really sorry! [22:17:17] we wil have to fix that config testing hole next in jenkins. [22:17:19] (03PS1) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [22:17:19] <_joe_> akosiaris: you shouldn't get pages after midnight! [22:18:01] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [22:18:05] (03CR) 10John F. Lewis: "It one of these cases of 'it works for what we want it to do' really. As the current version, puppet-lint gives a warning about indentatio" [puppet] - 10https://gerrit.wikimedia.org/r/170493 (owner: 10John F. Lewis) [22:19:15] _joe_: we're on CET, for simplicity [22:19:36] (iirc) [22:19:48] !log updated parsoid to d23d2be6 (+ a hotfix to the production localsettings config file) [22:19:50] ok, I'm going to sleep as well [22:19:51] (03PS2) 10John F. Lewis: dataset: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170492 [22:19:53] Logged the message, Master [22:21:29] (03CR) 10John F. Lewis: bacula: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis) [22:23:36] Hi RoanKattouw, I'm taking you up on ur offer of letting me bug you with more questions :) here are some probably silly ones: - How is it determined whether a "version" param will be included in the call to bits for ResourceLoader modules? (asking because I see that impacts on caching) ...and (incidentally) how does bits know which version of MW to serve files for? [22:24:42] (03Abandoned) 10John F. Lewis: backup: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170474 (owner: 10John F. Lewis) [22:24:44] The version parameter is purely for cache busting, it is never read by the server [22:24:51] Just so that's clear [22:25:11] We put in a version parameter whenever possible [22:25:24] greg-g, what is the protocol here? does that config snafu require an email to the ops list? [22:25:30] The value of the version parameter is determined by JS on the client, as the max(...) of the timestamps of the modules in the request [22:25:37] subbu: how long was the outage? [22:25:39] it was a javascript syntax error that wasn't caught. [22:25:40] (03PS2) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [22:25:51] ~3 mins or so. [22:26:04] Some requests are initiated without using JS, e.g. from . Those do not have version parameters because we can't use dynamic URL composition there [22:26:14] subbu: I'm curious why Jenkins let it happen, as you are, maybe that itself is worth the outage report/bug report [22:26:23] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [22:26:52] yes, cscott is starting investigating that. 
[22:26:56] * greg-g nods [22:27:00] Also, the request for the startup module (load.php?modules=startup) does not have a version parameter, because of bootstrapping: the startup module is what contains this timestamp information in the first place, so we cannot give it a version parameter because we don't have any timestamps yet [22:27:11] Generally we try to avoid version-less requests except for startup [22:27:20] Ah hmmm [22:27:25] greg-g, ok, will email. [22:27:39] RoanKattouw_away: thanks! :) [22:29:00] (03CR) 10John F. Lewis: authdns: lint fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170473 (owner: 10John F. Lewis) [22:29:23] subbu: ty sir [22:49:39] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [22:57:01] Reedy, Unknown site ID configured: maiwiki [22:57:11] MaxSem: Where's that? [22:57:25] logstash/exception log [22:57:33] link/stacktrace? [22:58:00] /what is it from? [22:58:07] learn to use logstash? :P [22:58:41] {"file":"/srv/mediawiki/php-1.25wmf6/extensions/Wikidata/extensions/Wikibase/client/includes/ChangeHandler.php","line":67,"function":"__construct","class":"Wikibase\\ChangeHandler","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.25wmf6/extensions/Wikidata/extensions/Wikibase/lib/includes/ChangeNotificationJob.php","line":128,"function":"singleton","class":"Wikibase\\ChangeHandler","type":"::","args":[]}, {"file":"/srv/mediawiki/php-1.25w [22:58:41] mf6/includes/jobqueue/JobRunner.php","line":136,"function":"run","class":"Wikibase\\ChangeNotificationJob","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.25wmf6/maintenance/runJobs.php","line":80,"function":"run","class":"JobRunner","type":"->","args":["array"]}, {"file":"/srv/mediawiki/php-1.25wmf6/maintenance/doMaintenance.php","line":101,"function":"execute","class":"RunJobs","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.25wm [22:58:42] f6/maintenance/runJobs.php","line":95,"args":["string"],"function":"require_once"}, {"file":"/srv/mediawiki/multiversion/MWScript.php","line":97,"args":["string"],"function":"require_once"} [22:58:56] https://logstash.wikimedia.org/#/dashboard/elasticsearch/default is showing absolutely nothing [22:59:33] blah. Stupid cluster rebalance strikes again :( [22:59:50] aude: ^^ looks like something didn't take for maiwiki :( [23:00:23] call me maybe [23:00:34] https://logstash.wikimedia.org/#dashboard/temp/yYZV9Vo4TiG4wSmM_yvqHQ [23:01:12] Looks like they're all jobrunner? [23:02:22] !log restarted logstash on logstash1001 for the usual reason (no events making it to elasticsearch) [23:02:29] Logged the message, Master [23:02:30] Reedy: oh noes [23:02:36] better link: https://logstash.wikimedia.org/#dashboard/temp/afps2EJsTESbcVXBEx4kSg [23:02:39] -em looks [23:02:46] (03PS1) 10Giuseppe Lavagetto: hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 [23:03:11] bd808, create a monitoring metric for it? [23:03:32] <_joe_> ori: ^^ I'm merging it [23:03:42] MaxSem: Yeah. We really should. 
[23:03:48] aude: Ah, its sites table is empty [23:03:54] (03CR) 10Ori.livneh: [C: 031] hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 (owner: 10Giuseppe Lavagetto) [23:03:55] _joe_: please do [23:04:00] i can take care of ti [23:04:02] it* [23:04:18] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 (owner: 10Giuseppe Lavagetto) [23:04:26] (03CR) 10Giuseppe Lavagetto: [V: 032] hhvm: remove jemalloc profiling config due to a bug in HHVM [puppet] - 10https://gerrit.wikimedia.org/r/171763 (owner: 10Giuseppe Lavagetto) [23:04:27] I did run foreachwikiindblist wikidataclient.dblist extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --strip-protocols [23:04:28] :/ [23:04:55] why not just everywhere? [23:05:12] Not everywhere has wikidataclient [23:06:05] You get Fatal error: Class 'SiteMatrixParser' not found in /srv/mediawiki-staging/php-1.25wmf7/extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php on line 55 [23:06:39] enwiki knows about maiwiki for example [23:09:07] it needs the --site-group thing [23:09:23] * aude really needs to fix it.... getting annoyed [23:10:18] Does the autodetect stuff not work right? [23:10:23] $this->addOption( 'site-group', 'Site group that this wiki is a member of. Used to populate ' [23:10:24] . ' local interwiki identifiers in the site identifiers table. If not set and --wiki' [23:10:24] . ' is set, the script will try to determine which site group the wiki is part of' [23:10:24] . ' and populate interwiki ids for sites in that group.', false, true ); [23:12:50] any zuul experts here? Mobile's Jenkins is broken again.. https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-qunit-mobile/6812/console [23:12:54] it says that? [23:13:07] indeed [23:13:11] 22:31:45 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/p/mediawiki/core [23:13:45] https://github.com/wikimedia/mediawiki-extensions-Wikidata/blob/master/extensions/Wikibase/lib/maintenance/populateSitesTable.php#L29-L32 [23:14:05] mutante: yup :( [23:14:16] 22:31:45 IOError: Lock for file '/srv/ssd/jenkins-slave/workspace/mwext-MobileFrontend-qunit-mobile/src/.git/config' did already exist, delete '/srv/ssd/jenkins-slave/workspace/mwext-MobileFrontend-qunit-mobile/src/.git/config.lock' in case the lock is illegal [23:15:07] http://upload.wikimedia.org/wikipedia/commons/thumb/6/65/Kmii_logo_en.gif/220px-Kmii_logo_en.gif [23:15:39] aude: ah. I see the sitegroup for maiwiki on enwiki is mai [23:15:57] oh really? [23:16:12] http://p.defau.lt/?613TLEkJ_tIKaGjfVIO5CQ [23:16:39] the rest looks right/sane in comparison to aawiki row [23:16:51] (03PS3) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:17:10] jdlrobson: I deleted the git lock file. Let's try re-running the job [23:17:17] thanks bd808 [23:17:23] * jdlrobson crosses fingers [23:17:34] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:18:13] jdlrobson: Well it's at least broken differently now :/ [23:18:20] bd808: lolz [23:18:36] bd808: parsoid and ocg logging on logstash appears to be down [23:18:43] subbu: ^ [23:19:07] cscott: I'll kick the logstash instance. That goes to 1002 correct? 
[23:19:15] bd808, parsoid to 1003 [23:19:24] i just checked 1002 and it said the cluster is green, fwiw [23:19:26] is there a plan in place for figuring out the root of the problem? (i presume it is unknown) [23:20:05] seems to say site_group 'mai' on all wikis [23:20:26] i wish logstash's own log was more useful [23:20:47] jgage: The logstash service that feeds the es backend gets hung up sometimes [23:21:00] !log restarted logstash on logstash1003 [23:21:03] (03PS7) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [23:21:05] bd808: https://logstash.wikimedia.org/#dashboard/temp/Pqjx___vQ-GeeAb0Y3xaUQ says ocg logging stopped around 21:00 [23:21:06] Logged the message, Master [23:21:08] utc, presumably [23:21:11] bd808, how do you detemrine if it's in a bad state? [23:21:12] (03PS4) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:21:21] parsoid logs seem to be back [23:21:42] bd808: ocg logs are still missing [23:21:59] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:22:35] !log restarted logstash on logstash1002 [23:22:39] Logged the message, Master [23:22:56] bd808: yup, i'm seeing logs again. thanks. [23:22:57] viwikivoyage looks ok [23:23:05] cscott: np [23:24:11] !log Killed 3 hung /usr/local/bin/logstash_optimize_index.sh processes on logstash1002 [23:24:17] Logged the message, Master [23:24:42] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [23:26:18] (03PS5) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:26:39] oho, logstash_optimize_index.sh is called by root's crontab [23:26:50] !log deleted corrupt mediawki/core clone in workspace/mwext-MobileFrontend-qunit-mobile on gallium [23:26:52] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:52] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:52] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:53] PROBLEM - ElasticSearch health check on elastic1027 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:53] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:53] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:54] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:54] PROBLEM - ElasticSearch health check on elastic1022 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:26:56] Logged the message, Master [23:26:58] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:27:51] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:27:51] RECOVERY - ElasticSearch health check on elastic1027 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:27:51] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:27:52] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [23:30:03] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:06] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:06] RECOVERY - ElasticSearch health check on elastic1022 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:06] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:30:22] (03CR) 10Ori.livneh: [C: 04-1] "Looks good. I left comments inline, but they are all for cosmetic issues, except the one about the service resource." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [23:30:48] i went to a talk last night by an elasticsearch author, he was excited that we have a 31-node cluster. apparently that's large :) [23:31:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [23:31:21] <^demon|away> jgage: How was the talk? [23:31:37] bd808, https://gerrit.wikimedia.org/r/#/c/170935/ can probably be abandoned now. [23:31:42] it was pretty good, i need to check my notes against our config [23:31:46] jgage: that's always worrisome [23:31:52] he gave several specific tuning suggestions [23:32:08] lemme see if the deck is online [23:32:14] jgage: That cron job comes from this puppet config -- not a big deal if it hangs but I haven't seen that before [23:32:45] cscott: Not that worrisome. We have more content than most folks would by a long shot [23:33:12] And many.bubbles is practically a core contributor to the project :) [23:33:49] thanks bd808. /etc/cron.daily/ would probably be better than a user crontab, but it's a minor point. [23:33:59] bd808: i assume you've seen http://aphyr.com/posts/317-call-me-maybe-elasticsearch ? [23:34:06] <^demon|away> bd808: github's elastic setup is probably pretty decently sized, but i don't know details [23:34:37] (03Abandoned) 10BryanDavis: logstash: Drop spammy parsoid messages [puppet] - 10https://gerrit.wikimedia.org/r/170935 (owner: 10BryanDavis) [23:34:38] <^demon|away> cscott: A ton of call me maybe was the motivation behind the zen discovery improvements in 1.4.0 [23:35:36] looks like this is the ~same talk i watched, but you have to register to watch the video, haven't found slides: http://www.elasticsearch.org/webinars/elk-stack-devops-environment/ [23:35:48] (03PS6) 10Ori.livneh: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:35:58] ^ gwicke -- made some lint fixes for you [23:36:11] <^demon|away> cscott: I think they've got a whole set of integration tests to simulate those network failure conditions now. [23:36:22] i would hope they are actually running jepsen [23:36:48] <^demon|away> Not a clue. [23:36:57] (03CR) 10Ori.livneh: "Avoid class names with dashes in them ('otto-cass'); they're a headache." [puppet] - 10https://gerrit.wikimedia.org/r/171741 (owner: 10GWicke) [23:37:03] ^demon|away: A bit dated but said 44 EC2 instances with 2T of SSD :) [23:37:26] <^demon|away> 2T x 44 EC2 instances? 
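On the cron question raised earlier (the hung /usr/local/bin/logstash_optimize_index.sh processes found in root's crontab, and the aside that /etc/cron.daily/ might be a better home for the job), a rough sketch of both options in Puppet terms. The class name and schedule are invented; only the script path comes from the log above.

    class logstash::optimize_cron {
        # Option 1: a cron resource, which Puppet writes into root's
        # crontab (roughly the arrangement described above).
        cron { 'logstash_optimize_index':
            ensure  => present,
            user    => 'root',
            command => '/usr/local/bin/logstash_optimize_index.sh >/dev/null 2>&1',
            hour    => 5,
            minute  => 15,
        }

        # Option 2, per the suggestion in channel: drop the job into
        # /etc/cron.daily/ and let run-parts schedule it. run-parts only
        # executes names without dots by default, hence no '.sh' suffix.
        # file { '/etc/cron.daily/logstash-optimize-index':
        #     ensure => link,
        #     target => '/usr/local/bin/logstash_optimize_index.sh',
        # }
    }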
[23:37:28] oops 2T SSD per machine [23:37:41] <^demon|away> Oh ok, 88T makes more sense. [23:37:50] <^demon|away> I was like "what could you be indexing in only 2T?" [23:37:59] "That one is running elasticsearch 0.2, The volume of data there is 30 terabytes of primary data." [23:38:06] so old data for sure [23:38:57] <^demon|away> "Behind the scenes, we actually have probably a good 40 to 50 search indexes" [23:39:16] ori: thanks, I'm tweaking & testing currently as well [23:42:33] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [23:43:03] (03PS4) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [23:43:33] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:47:23] (03CR) 10Ori.livneh: Add class and role for Openstack Horizon (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170340 (owner: 10Andrew Bogott) [23:49:34] (03PS1) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [23:50:55] (03PS2) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [23:51:13] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [23:51:42] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [23:51:42] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [23:51:43] PROBLEM - ElasticSearch health check on elastic1027 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2099: active_shards: 6313: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [23:52:34] (03PS7) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:52:42] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [23:52:43] RECOVERY - ElasticSearch health check on elastic1027 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [23:52:43] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2100: active_shards: 6316: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [23:53:05] bd808 mutante zuul seems to have fixed itself :) [23:53:12] (03PS3) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [23:53:28] (03PS8) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [23:53:35] (03PS8) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [23:53:40] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [23:55:01] jdlrobson: I deleted the git clone of mw/core and let it check it out again. It got corrupted somehow. [23:55:04] andrewbogott: do you remember anything re: what the issue was for parsoid/ve on wikitech? [23:55:36] ori: I'm not sure we ever knew -- I think it didn't work and we turned it off and moved on. [23:56:18] ori: sorry for the incurious response; at the time many things were broken all at once :) [23:56:55] andrewbogott: no worries, i totally understand [23:57:16] andrewbogott: plus "incurious" is a wonderful word so you get bonus points for that ;) [23:57:56] Reedy: https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json [23:58:07] maiwiki is listed as a special site [23:58:59] I wonder why [23:59:09] the dblists it is in are sane [23:59:33] yeah [23:59:45] oh [23:59:49] did I sync langlist!?