[00:00:01] I think not :(
[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem, ebernhardson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141107T0000). Please do the needful.
[00:00:10] !log reedy Synchronized langlist: mai (duration: 00m 14s)
[00:00:18] Logged the message, Master
[00:00:36] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 14s)
[00:00:41] Logged the message, Master
[00:00:56] aude: that fixed it. lol
[00:01:11] Shall I re-run populateSitesTable.php?
[00:01:27] (03PS1) 10GWicke: Add restbase deploy target [puppet] - 10https://gerrit.wikimedia.org/r/171772
[00:01:49] Who's doing SWAT?
[00:02:18] Reedy: sure, we can try
[00:02:30] it might get confused that maiwiki is there already though
[00:02:45] Nope :)
[00:02:49] ah
[00:02:56] Just re-ran it on enwiki
[00:03:01] | 889 | maiwiki | mediawiki | wikipedia | local
[00:03:02] yay
[00:03:22] ok :)
[00:03:29] * aude see if it's in site_identifiers
[00:03:35] !log running foreachwikiindblist wikidataclient.dblist extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --strip-protocols
[00:03:40] Logged the message, Master
[00:03:53] yep
[00:03:56] looks good
[00:04:33] (03PS5) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340
[00:04:34] Should take ~20 minutes to run based on last time
[00:04:39] ok
[00:05:10] (03PS6) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340
[00:11:50] Reedy: hhvm.log has lots of random "undefined index" errors
[00:12:46] I think brad was reporting some of those
[00:13:17] think he fixed it
[00:13:23] but maybe not deployed yet in wmf6
[00:13:59] https://bugzilla.wikimedia.org/show_bug.cgi?id=72764
[00:14:08] If there is a load maybe it should be backported
[00:14:22] it is several patches
[00:14:26] (03PS7) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340
[00:14:34] hmm, that's property, not index
[00:14:45] ori: what's the deal with eventual consistency, init & trebuchet?
[00:14:57] it sounds like enabling a service on boot would be a bad idea
[00:15:02] oh
[00:16:19] I see the ones Brad fixed
[00:16:27] Then we have Nov 6 23:52:22 mw1163: message repeated 2 times: [ #012Notice: Undefined index: 0 in /srv/mediawiki/php-1.25wmf6/extensions/Echo/includes/DiscussionParser.php on line 873]
[00:16:55] reedy@fluorine:/a/mw-log$ grep -c "ApiPageSet.php on line 712" hhvm.log
[00:16:55] 7650
[00:16:55] reedy@fluorine:/a/mw-log$ wc -l hhvm.log
[00:16:55] 11993 hhvm.log
[00:16:59] I think we should backport
[00:17:01] (03CR) 10Gage: [C: 032] Add restbase deploy target [puppet] - 10https://gerrit.wikimedia.org/r/171772 (owner: 10GWicke)
[00:17:31] i don't know which patch... maybe the geo one
[00:17:33] or all
[00:17:53] would be nice if it had a stack trace so we could see where it is coming from
[00:17:54] Still time to get stuff in swat
[00:17:59] :)
[00:18:08] * James_F coughs.
[00:18:33] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[00:18:47] we've wanted stacktraces for notice/warnings for aaages
[00:19:02] James_F: No one seems to be doing anything towards it yet :P
[00:19:45] Reedy: :-)
[00:19:56] (03CR) 10GWicke: WIP: RESTBase puppet module (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke)
[00:20:24] (03PS9) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213
[00:21:14] (03PS9) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741
[00:21:57] Has anyone actually stepped up to do swat?
[00:22:16] * James_F nominates Reedy.
[00:23:17] RoanKattouw, ^d, marktraceur, MaxSem, ebernhardson: any of you want to swat? Or shall I do it?
[00:23:45] Reedy: RoanKattouw_away's away (in a meeting).
[00:23:47] reedy@fluorine:/a/mw-log$ grep -c "Undefined index" hhvm.log
[00:23:47] 2110
[00:23:54] uh, I already accidentally wikipedia this week
[00:24:01] heh
[00:24:21] * Reedy gets on the jfdi train
[00:24:36] James_F: Are you here for his patch then?
[00:24:39] Reedy: Yeah.
[00:24:44] cool
[00:24:56] Just like I wasn't here for my patch this morning. :-)
[00:24:57] * James_F coughs.
[00:26:25] !log reedy Synchronized php-1.25wmf6/extensions/GeoData: (no message) (duration: 00m 14s)
[00:26:32] Logged the message, Master
[00:27:13] !log reedy Synchronized php-1.25wmf7/extensions/VisualEditor/: (no message) (duration: 00m 15s)
[00:27:17] Logged the message, Master
[00:27:34] James_F: ^^
[00:28:35] Reedy: Testing now.
[00:31:25] Reedy: Looks good. Close.
[00:34:16] Reedy, I've prepared a core commit for kaldari|2's change
[00:34:24] updated the deployments page
[00:34:26] lol
[00:34:36] I'm waiting for jenkins to do whatever it is jenkins pretends to do
[00:35:03] "queued 9 min ago"
[00:35:04] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1
[00:35:04] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1
[00:35:04] PROBLEM - ElasticSearch health check on elastic1020 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1
[00:35:04] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1
[00:35:04] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1
[00:35:18] manybubbles, ^
[00:36:12] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2104: active_shards: 6328: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0
[00:36:12] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2104: active_shards: 6328: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0
[00:36:12] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2104: active_shards: 6328: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0
[00:36:12] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2104: active_shards: 6328: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0
[00:36:12] RECOVERY - ElasticSearch health check on elastic1020 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2104: active_shards: 6328: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0
[00:36:27] !log reedy Synchronized php-1.25wmf7/extensions/MobileFrontend/: (no message) (duration: 00m 16s)
[00:36:32] Logged the message, Master
[00:37:43] MaxSem: ^
[00:37:50] kaldari|2, ^
[00:37:57] thanks Reedy
[00:46:45] (03CR) 10Dzahn: [C: 031] authdns: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170473 (owner: 10John F. Lewis)
[00:49:23] (03PS2) 10Dzahn: deployment: fix lint [puppet] - 10https://gerrit.wikimedia.org/r/170493 (owner: 10John F. Lewis)
[00:49:30] !log reedy Synchronized php-1.25wmf6/includes/api/: (no message) (duration: 00m 14s)
[00:49:35] Logged the message, Master
[00:49:55] (03CR) 10Dzahn: "ugly but does it ?! hrmm" [puppet] - 10https://gerrit.wikimedia.org/r/170493 (owner: 10John F. Lewis)
[00:50:43] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2101: active_shards: 6319: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1
[00:50:43] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2101: active_shards: 6319: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1
[00:50:44] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2101: active_shards: 6319: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1
[00:50:46] Just https://gerrit.wikimedia.org/r/#/c/171724/ left
[00:51:52] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 312 seconds
[00:52:14] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 351 seconds
[00:52:14] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2102: active_shards: 6322: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[00:52:14] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2102: active_shards: 6322: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[00:52:14] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2102: active_shards: 6322: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[00:53:20] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[00:54:11] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:57:46] (03CR) 10Dzahn: "can't really decide, it's ugly but technically correct. Alex ?" [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis)
[01:01:16] !log reedy Synchronized php-1.25wmf7/extensions/Flow/: (no message) (duration: 00m 16s)
[01:01:23] Logged the message, Master
[01:01:24] Reedy: because fuck log messages
[01:01:32] yarly
[01:13:24] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[01:21:38] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2102: active_shards: 6322: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1
[01:21:38] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2102: active_shards: 6322: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1
[01:26:44] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2102: active_shards: 6322: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1
[01:27:43] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0
[01:27:53] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [01:27:55] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2103: active_shards: 6325: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [01:35:40] (03PS1) 10BryanDavis: logstash: Use conditional instead of deprecated grep filter [puppet] - 10https://gerrit.wikimedia.org/r/171790 [01:37:21] (03PS10) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [01:42:35] (03PS10) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [01:47:33] (03PS11) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [01:50:07] (03PS11) 10GWicke: WIP: Add restbase role [puppet] - 10https://gerrit.wikimedia.org/r/171741 [01:53:11] MaxSem: its just the broken monitoring. will talk to ops about it. [02:10:35] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2110: active_shards: 6346: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [02:10:37] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2110: active_shards: 6346: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [02:11:36] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2111: active_shards: 6349: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [02:11:36] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2111: active_shards: 6349: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [02:13:10] (03PS1) 10Ori.livneh: dotfiles update for ori [puppet] - 10https://gerrit.wikimedia.org/r/171792 [02:13:57] (03CR) 10Ori.livneh: [C: 032 V: 032] dotfiles update for ori [puppet] - 10https://gerrit.wikimedia.org/r/171792 (owner: 10Ori.livneh) [02:14:01] (03PS4) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [02:14:11] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [02:16:45] PROBLEM - puppet last run on elastic1003 is CRITICAL: CRITICAL: Puppet has 2 failures [02:23:47] (03PS5) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [02:23:55] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [02:30:05] (03PS6) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [02:30:13] (03CR) 10jenkins-bot: [V: 04-1] (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [02:32:17] (03PS7) 10Dzahn: (WIP) generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [02:34:15] RECOVERY - puppet last run on elastic1003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [02:37:24] (03CR) 10Dzahn: [C: 031] dataset: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170492 (owner: 10John F. Lewis) [02:40:37] (03PS1) 10Dzahn: ocg templates: retab [puppet] - 10https://gerrit.wikimedia.org/r/171797 [02:44:51] (03PS1) 10Dzahn: wikistats: retab Apache template [puppet] - 10https://gerrit.wikimedia.org/r/171798 [02:45:28] (03CR) 10Dzahn: [C: 032] wikistats: retab Apache template [puppet] - 10https://gerrit.wikimedia.org/r/171798 (owner: 10Dzahn) [02:48:48] (03PS1) 10Dzahn: eventlogging: fix lint errors [puppet] - 10https://gerrit.wikimedia.org/r/171799 [03:22:50] (03PS1) 10Dzahn: site.pp: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/171801 [03:55:51] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [03:55:51] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [03:55:51] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [03:55:51] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [03:55:51] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [03:55:51] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [03:55:51] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6357: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:52] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:52] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:53] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:53] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:07:54] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 2: initializing_shards: 2: unassigned_shards: 1 [04:09:52] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [04:09:52] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6354: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [04:11:31] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:11:31] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:11:31] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:11:31] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:11:31] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:11:32] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:11:32] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [04:12:08] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:08] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:08] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:08] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:08] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:08] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:09] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:09] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:10] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:10] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:11] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:11] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:12] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:12:12] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [04:14:10] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:28:11] PROBLEM - Host mw1169 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[04:34:11] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0
[06:28:31] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:50] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:31] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[06:45:32] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:51] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:51] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:55:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:57:50] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:00:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures
[07:01:16] (03CR) 10Yuvipanda: [C: 032] checkhosts: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169249 (owner: 10Tim Landscheidt)
[07:02:59] (03CR) 10Yuvipanda: [C: 032] udpprofile: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169256 (owner: 10Tim Landscheidt)
[07:03:56] (03CR) 10Yuvipanda: [C: 032] swiftcleaner: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169255 (owner: 10Tim Landscheidt)
[07:05:13] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures
[07:06:04] is jenkins dead again?
[07:08:02] hmm, I don't know what lutetium is, but I can't seem to ssh to it.
[07:08:49] oh, lutetium is in frack?
[07:08:58] well, that explains
[07:09:00] * YuviPanda lets it be
[07:10:11] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures
[07:15:03] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures
[07:20:11] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:44:51] <_joe_> !log upgrading the hhvm appservers to the new package version, it seems stable enough
[07:44:59] Logged the message, Master
[07:52:41] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time
[07:52:49] <_joe_> !log killed the master apache process on mw1191, stuck in a futex wait, restarted apache
[07:52:54] Logged the message, Master
[07:53:03] <_joe_> just in case this happens again soon
[07:58:41] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:00:36] <_joe_> !log powercycled mw1169, console unresponsive, not responding to pings
[08:00:42] Logged the message, Master
[08:03:22] RECOVERY - Host mw1169 is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms
[08:17:45] PROBLEM - NTP on mw1169 is CRITICAL: NTP CRITICAL: Offset unknown
[08:20:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6399 MB out of 7627 MB)
[08:20:59] RECOVERY - NTP on mw1169 is OK: NTP OK: Offset -0.006458044052 secs
[08:25:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6400 MB out of 7627 MB)
[08:30:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6400 MB out of 7627 MB)
[08:35:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6400 MB out of 7627 MB)
[08:35:20] (03PS1) 10Giuseppe Lavagetto: Move 10% of anons to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171807
[08:36:23] <_joe_> paravoid: can I ask your opinion? ^^
[08:36:37] sure
[08:36:39] what's the status?
[08:36:58] <_joe_> new packages are on 4 machines since yesterday morning, no crashes
[08:37:25] <_joe_> I also kept those servers serving most of the 5% of traffic for ~ 6 hours with no effect
[08:37:52] <_joe_> so I'm pretty convinced moving the whole cluster to 10% would feel like a breeze
[08:38:05] <_joe_> if we don't have crashes that magically happen right now
[08:38:45] <_joe_> I'm pretty sure we can handle 15% whenever we want with these 18 servers, but I'd like to make that move on monday if the weekend was ok
[08:39:05] I think crashes is what we're worrying about here most, not the ability to handle traffic, right?
[08:39:11] <_joe_> right
[08:39:57] ori mentioned something about very specific patterns being the ones that were crashing HHVM?
[08:40:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6401 MB out of 7627 MB)
[08:40:16] <_joe_> yes
[08:40:32] <_joe_> (I can't log into lutetium to take a look btw)
[08:41:18] <_joe_> paravoid: the issue is, we are putting a patch on a fundamentally broken memory model, so we feared we didn't catch all the angles
[08:42:50] ok
[08:43:24] and all of the 5% of traffic was served by this new version?
[08:44:09] <_joe_> no, during the day yesterday I kept the 4 servers with the new version at weight 30 in pybal, and 5 other machines at weight 1
[08:44:18] <_joe_> so that most of it was served by them
[08:44:27] <_joe_> but in case of crashes, we had a fallback
[08:45:03] <_joe_> I did that to accumulate traffic and possible problems
[08:45:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6404 MB out of 7627 MB)
[08:45:22] <_joe_> however, ori found one of the crashes was originated by MobileFrontend
[08:45:32] <_joe_> so I guess that's pretty easy to encounter
[08:48:09] ok
[08:48:33] then go for it I suppose
[08:48:42] I have a side question
[08:48:55] <_joe_> shoot
[08:49:04] I thought you had coded varnish to restart 503s and direct them to zend appservers
[08:49:14] <_joe_> me too
[08:49:21] however, on yesterday's outage, users were reporting here seeing 503s
[08:49:35] <_joe_> it was yesterday?
[08:49:44] <_joe_> however, yes, we have to check that
[08:49:56] no, the day before I guess
[08:50:03] <_joe_> it was already on the list of the things I wanted to check
[08:50:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6404 MB out of 7627 MB)
[08:50:12] <_joe_> I am so sleep deprived that I lost count
[08:50:20] don't do that? :)
[08:52:05] <_joe_> https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/text-backend.inc.vcl.erb#L42 <-- this should make it retry the request if the first one was an error
[08:52:12] <_joe_> I guess it doesn't work with timeouts
[08:52:41] <_joe_> because I'm not sure if the varnish timeout is shorter or longer than the backend one
[08:52:48] <_joe_> mmmh, I got to check this
[08:53:08] <_joe_> (don't do what?)
[08:53:14] not sleep
[08:53:38] (03CR) 10Giuseppe Lavagetto: mediawiki: simplify apache config (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170300 (owner: 10Giuseppe Lavagetto)
[08:53:54] <_joe_> ahahah
[08:54:05] <_joe_> isn't it like 1 am there?
[08:54:06] <_joe_> :P
[08:55:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6406 MB out of 7627 MB)
[08:55:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Move 10% of anons to HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171807 (owner: 10Giuseppe Lavagetto)
[08:55:26] we never call 'restart;'
[08:55:36] so request.restarts is always zero
[08:55:45] *req.restarts
[08:56:22] lol
[08:56:28] so yeah, that VCL change is a noop right now
[08:56:33] <_joe_> ori: I thought we did on errors, mark told me
[08:57:05] <_joe_> actually, that was mark's suggestion :)
[08:57:46] oh no, it's in modules/varnish/vcl/wikimedia.vcl.erb
[08:57:56] <_joe_> I was pretty sure we did
[08:58:20] <_joe_> I don't remember the specifics, but I checked - adding that fallback caused us quite the headaches
[08:58:30] but not for text-backend
[08:58:57] <_joe_> ori: wikimedia.vcl is included everywhere
[08:59:06] <_joe_> it's the common part
[08:59:06] 'retry503' => 1,
[08:59:12] different than retry5xx
[08:59:14] yes, but the block that does the restart has a guard
[08:59:21] however
[08:59:27] <_joe_> ori: oh.
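(For context, a rough VCL 3 sketch of the guarded restart block being discussed, reconstructed only from the fragments quoted in this conversation — the 'retry503' => 1 option and the obj.status test; the ERB variable name is a guess, not the actual contents of wikimedia.vcl.erb:)

    sub vcl_error {
    <% if @vcl_config.fetch("retry503", 0) > 0 -%>
        # vcl_error only sees errors varnish generated itself, i.e. when it
        # could not talk to the backend at all -- a 503 page *returned* by
        # apache never passes through here.
        if (obj.status == 503 && req.restarts < <%= @vcl_config.fetch("retry503", 0) %>) {
            return (restart);
        }
    <% end -%>
    }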
[08:59:30] for text-backend, that is a noop
[08:59:42] apache/hhvm wouldn't restart 503
[09:00:00] I haven't tried but apache would most likely return 502 if it lost the fastcgi socket
[09:00:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:00:15] <_joe_> no it returns 503
[09:00:24] <_joe_> I'm almost sure but I'll check
[09:00:47] !log oblivian Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 07s)
[09:00:55] Logged the message, Master
[09:01:11] <_joe_> 503 Service Unavailable
[09:01:14] nginx uses 502 for that
[09:01:16] <_joe_> confirmed
[09:01:19] but I haven't tested apache lately
[09:01:23] <_joe_> paravoid: nginx does the right thing
[09:01:40] <_joe_> just tested now in my home test env, where hhvm is currently down
[09:02:12] <_joe_> and I have results from test runs where I saw 503s when I called /stop on the admin interface
[09:02:18] <_joe_> which resets the socket
[09:02:54] <_joe_> (also, I suspect we're facing either a mod_proxy_fcgi WTF, or another bug in HHVM in that situation, but I had no time to dig deeper)
[09:04:15] hm, I'm wondering if vcl_error is the right place to catch this there
[09:05:05] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:05:40] <_joe_> If I ack this, it will pop up again as soon as the swap usage varies by 1 mb
[09:05:57] <_joe_> also, I shouldn't ack alarms I can't do shit about
[09:07:02] I think catching it in vcl_error may only handle varnish-generated 503s, not hhvm's
[09:07:24] this part in varnish is confusing, I remember reading that they were clearing it up a bit in 4.0
[09:10:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:10:22] yes: * 1395_ - End up in vcl_error also if fetch fails vcl_backend_response.
[09:10:33] https://www.varnish-cache.org/trac/ticket/1395 : vcl_error is not called if fetch failed on vcl_backend_fetch
[09:10:48] fixed in 4.0
[09:11:43] vcl_error is now vcl_backend_error
[09:11:43] To make a distinction between internally generated errors and VCL synthetic responses, vcl_backend_error will be called when varnish encounters an error when trying to fetch an object.
[09:12:25] what a fucking nightmare
[09:12:29] :)
[09:12:49] I don't think Google will give us the answer here
[09:13:03] it's either reading the source, or experimenting with it locally
[09:13:22] i am reading the source; the first line i pasted is from changes.rst
[09:13:53] but the whiff of condescension is noted :P
[09:13:54] the source of 3.0? or 4.0?
[09:14:45] 4.0, because you had just said it changed
[09:15:07] right
[09:15:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:15:36] we don't use 4.0 (yet)
[09:16:37] VCL is such a bad idea
[09:17:33] <_joe_> why?
[09:17:34] first, your VCL is merged with the default VCL in a way that is completely non-obvious
[09:17:51] and in case this didn't add enough mystery
[09:18:21] the flow of control between the vcl subroutines is deeper below the surface still
[09:18:35] and in case this didn't add enough mystery
[09:19:26] it's not just the call path that is mysterious, but also the set of objects you have available
[09:19:39] and in case this didn't add enough mystery
[09:19:43] so is their mutability
[09:20:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:20:42] on top of that we have erb, and on top of that puppet
[09:22:43] <_joe_> whenever I touch varnish, I keep https://www.varnish-cache.org/trac/wiki/VCLExampleDefault open
[09:23:15] it doesn't help that the whole thing was designed by someone who thinks you're being overly verbose if you include vowels in your words
[09:25:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:29:26] oh, there's also the fact that you have two programming modes: VCL, which can't do basic string manipulation, and in-line C
[09:29:51] it's like having a car that can go either 5kmh or 500kmh
[09:30:05] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:30:48] ok, so I tested this
[09:31:31] vcl_error won't catch a backend 503
[09:32:28] this all sounds familiar
[09:32:35] I'm pretty sure I've done all of this before :D
[09:33:16] <_joe_> :/
[09:33:16] anyway, the current retry503 logic is just retrying on completely failed backends
[09:33:18] waaah
[09:33:34] so if apache is down, the HHVM restart mechanism will work
[09:33:37] VCL can do wonderful string manipulation. with regexes
[09:33:55] if HHVM dies but Apache stays alive, varnish will just serve apache's 503 page
[09:34:30] <_joe_> :/
[09:35:06] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:35:50] the simplest way to fix that may be to have apache issue HTTP 500 instead of 503
[09:36:17] I don't think it's configurable?
[09:36:18] <_joe_> ori: so break apache to unbreak varnish, interesting :)
[09:36:23] <_joe_> paravoid: it's not
[09:36:39] <_joe_> mod_proxy_fastcgi is underwhelming
[09:36:44] we get 503s right now because of the way we put apache in front of hhvm, which is an implementation detail that users shouldn't care about
[09:37:03] i'm sure it's doable
[09:37:13] <_joe_> ori: I was not saying it's wrong
[09:37:19] <_joe_> it's just funny
[09:37:20] <_joe_> :)
[09:37:42] <_joe_> I'll take a look, but I don't think we can do it in a non-hackish way
[09:38:02] <_joe_> we could have a fallback backend in apache that spits 500s, or something like that
[09:38:12] <_joe_> I'll think about it
[09:38:16] why do all that?
[09:38:18] <_joe_> while I brew the new packages
[09:38:18] just handle it in VCL?
[09:38:47] <_joe_> paravoid: is it doable? from what I got from your discussion before I thought it wasn't
[09:39:01] of course it is
[09:39:08] we just have to catch the 503 in vcl_fetch instead of vcl_error
[09:39:17] <_joe_> ok
[09:39:29] but we have to be careful
[09:39:34] to only do it on the backend tier
[09:40:07] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:40:15] "If the ErrorDocument specifies a local redirect to a CGI script, the script should include a "Status:" header field in its output in order to ensure the propagation all the way back to the client of the error condition that caused it to be invoked."
[09:40:20] or else we could have the scenario where backends are unreachable, backend retries N times, fails, issues a 503, then frontend sees a 503 from its backend (so varnish backend) and retries again M times
[09:42:12] <_joe_> btw I hope the zend retry will go away in a couple of weeks ;)
[09:42:24] hopefully :)
[09:43:06] but fixing this will help us not feel as nervous about breaking the site
[09:43:46] at least help the people who do feel nervous about it :P
[09:43:53] <_joe_> yes
[09:44:23] yes, it's worth fixing
[09:44:43] I suppose this was the original intention as well, right
[09:44:46] I wasn't around at the time
[09:44:47] I still think having apache issue a 500 is better
[09:44:49] <_joe_> I'll take a shot at it in the afternoon
[09:44:50] if it's doable
[09:44:58] <_joe_> paravoid: it was in fact
[09:45:00] it's what our users are accustomed to
[09:45:07] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:45:13] <_joe_> but we never "experimented" it recently
[09:45:20] <_joe_> *until
[09:45:39] and it's the more relevant status anyway, the fact that we have a tiered infrastructure and that server errors are typically emitted by one of the outer layers upon failing to connect to a backend doesn't change that
[09:50:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:51:27] yeah I can see the point, although it'd be nice to also discriminate between "mediawiki said this is an error" and "hhvm crashed"
[09:51:42] and mediawiki emits 500s for certain operations IIRC
[09:53:23] so the outer layers failing to connect to a backend has traditionally been 503
[09:53:33] when one varnish is unable to connect to another or to an appserver
[09:54:00] so it really depends on whether you designate apache as one of the outer layers or part of the appserver stack
[09:55:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB)
[09:56:14] on that note, what is the intention of the original varnish logic?
[09:56:20] is it also to retry mediawiki-emitted 500s?
[09:56:32] <_joe_> yes
[09:56:36] or are there data consistency issues with that?
[09:57:02] <_joe_> because we didn't know at the time if hhvm would just emit errors on totally legit calls
[09:57:03] I suppose e.g. database transactions are rolled back when mediawiki emits a 500, right?
[09:57:46] so it's not possible that e.g. a POST would write an edit, emit a 500 because it failed somewhere later, then the whole request being restarted and that edit POSTed again
[09:58:59] I have to run, I'll be back in an hour or so
[09:59:12] grrrit-wm: are you alive?
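(A minimal sketch of the fix paravoid describes above: catch the backend-returned 503 in vcl_fetch, where beresp is in scope, rather than in vcl_error, which never sees it. Hypothetical VCL 3, not a deployed change; per the caution about tiers, it would go in the backend-tier VCL only, with the req.restarts cap preventing the N-times-M retry amplification between frontend and backend varnishes:)

    sub vcl_fetch {
        # beresp is the response actually received from the backend (apache),
        # so this matches an hhvm/apache-emitted 503, unlike a test in
        # vcl_error, which only fires for varnish-generated errors.
        if (beresp.status == 503 && req.restarts == 0) {
            return (restart);
        }
    }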
[09:59:16] ok, you are [10:00:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:00:33] _joe_: if that was the intention, though, then enabling retry5xx in the manifest should be enough, no VCL changes needed [10:01:35] (03PS1) 10Giuseppe Lavagetto: admin: revoke temporarily Yuvi's root [puppet] - 10https://gerrit.wikimedia.org/r/171811 [10:02:02] <_joe_> YuviPanda: can you +1 this? [10:02:29] (03CR) 10Yuvipanda: [C: 031] "I'm just being paranoid, things should be fine by Monday." [puppet] - 10https://gerrit.wikimedia.org/r/171811 (owner: 10Giuseppe Lavagetto) [10:02:29] <_joe_> we also need to revoke your +2 rights in gerrit? I'm not sure how that's done btw [10:02:43] _joe_: me neither, actually. [10:02:49] ldap [10:02:55] remove me from ops group? [10:02:58] yes [10:03:33] <_joe_> paravoid: ok I will take care of that, so I'll also learn something more about our ldap structure [10:03:41] (03CR) 10Giuseppe Lavagetto: [C: 032] admin: revoke temporarily Yuvi's root [puppet] - 10https://gerrit.wikimedia.org/r/171811 (owner: 10Giuseppe Lavagetto) [10:03:52] either way works for me [10:04:29] thanks [10:05:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:05:14] FTR, everything is super fine, the cops have left, but they did say that this group of people are 'under suspicion' and suddenly there's random people in the streets around the hackerspace I haven't seen before. I am *sure* I'm just being overtly paranoid, but just in case... [10:05:20] also we've moving out of here this Sunday. [10:05:40] (was pre-planned, not due to this) [10:10:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:15:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:20:12] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:25:05] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:30:06] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:35:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:40:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:45:17] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:50:06] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [10:55:14] (03CR) 10Filippo Giunchedi: "it seems require_package is what we'd need (see I155a36c8) but it'll do for now" [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [10:55:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [11:00:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 84% free (6407 MB out of 7627 MB) [11:05:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:10:15] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:15:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:20:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:20:40] mh, wanted to take a look at lutetium but the old/new root password on 
console doesn't seem to work [11:23:09] ACKNOWLEDGEMENT - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi sdi failed, https://rt.wikimedia.org/Ticket/Display.html?id=8821 [11:23:09] ACKNOWLEDGEMENT - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdi failed, https://rt.wikimedia.org/Ticket/Display.html?id=8821 [11:25:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:30:06] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:35:07] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:40:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:45:08] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:47:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] bacula: lint fixes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170476 (owner: 10John F. Lewis) [11:50:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [11:52:58] (03CR) 10Alexandros Kosiaris: [C: 032] add graphite-related CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/171525 (owner: 10Filippo Giunchedi) [11:55:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:00:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:03:54] (03CR) 10Alexandros Kosiaris: [C: 032] openstreetmap: Split expired tile list files [puppet] - 10https://gerrit.wikimedia.org/r/171211 (owner: 10Alexandros Kosiaris) [12:05:11] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:08:04] Jeff_Green: ^ (lutetium) ? [12:10:13] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:15:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:20:14] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:25:05] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6407 MB out of 7627 MB) [12:26:14] godog: thanks, looking [12:28:15] <_joe_> Jeff_Green: no ops on our timezone seems to have access... maybe we need to fix that? [12:28:34] any volunteers? [12:28:57] there's no longer a separate NDA so it's fair game [12:29:09] <_joe_> /part [12:29:15] <_joe_> ouch, damn whitespace [12:29:28] <_joe_> :) [12:30:15] lutetium is the qa/reporting beater box, it's probably swapping because someone ran horrible reporting queries [12:30:16] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6447 MB out of 7627 MB) [12:31:40] Jeff_Green: sure, I would have looked but don't know how where to get the root password from, I'm assuming it is different [12:32:34] ok, i'll get you set up and give you a tour [12:32:59] first thing--can you create a separate ssh keypair for fundraising? [12:33:11] and post the pub key on iron or bast1001 or something? [12:35:09] PROBLEM - check_swap on lutetium is CRITICAL: SWAP CRITICAL - 85% free (6447 MB out of 7627 MB) [12:35:22] for the love of.... [12:35:43] Jeff_Green: heh can we postpone that? 
my plate is quite full already :( [12:35:48] osmosis just died on me and one whole page of java errors could not help me understand what has happened [12:35:55] godog: sure [12:40:08] RECOVERY - check_swap on lutetium is OK: SWAP OK - 100% free (7596 MB out of 7627 MB) [13:10:59] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'bblack'); [13:11:02] heh [13:11:09] good enough reason ;) [13:15:58] hehe even having $USER as a default might be useful indeed [13:19:09] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail [13:38:30] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:17:49] (PS1) Ottomata: Add researchers to researchers group as given by Dario in RT 7105 [puppet] - https://gerrit.wikimedia.org/r/171828 [14:19:46] (PS2) Ottomata: Add researchers to researchers group as given by Dario in RT 7105 [puppet] - https://gerrit.wikimedia.org/r/171828 [14:20:07] (CR) Ottomata: [C: 2 V: 2] Add researchers to researchers group as given by Dario in RT 7105 [puppet] - https://gerrit.wikimedia.org/r/171828 (owner: Ottomata) [14:22:12] paravoid: what box was that? :) [14:22:29] cp1008 [14:22:33] oh right, yeah [14:23:03] it has local hacks for the ciphersuite thing [14:24:50] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Puppet last ran 3 days ago [14:25:34] (not anymore!) [14:25:50] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:46:40] <_joe_> !log installing hhvm package built with full debug symbols on mw1114 [14:46:50] Logged the message, Master [14:59:49] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [15:10:45] (CR) Ottomata: WIP: Add restbase role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/171741 (owner: GWicke) [15:14:00] (CR) Ottomata: "There was once a java module that managed multiple versions of java, but Faidon got rid of it because we decided that we only wanted to in" [puppet] - https://gerrit.wikimedia.org/r/170996 (owner: Filippo Giunchedi) [15:14:12] (CR) John F. Lewis: bacula: lint fixes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/170476 (owner: John F. Lewis) [15:15:43] the module was a very horrible convoluted thing that tried to also accommodate weird package names from Oracle's versions, if you recall :) [15:18:10] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:19:15] ouch [15:19:53] paravoid: context on the code review is some "generic" jvm-related stuff too, which doesn't quite fit in the elasticsearch module [15:20:23] I wouldn't mind that at all [15:21:05] https://gerrit.wikimedia.org/r/#/c/93956/1/modules/java/manifests/init.pp [15:21:08] is what I killed [15:21:28] and looking at that again, I'm not regretting it for a second! [15:21:39] ok no I had something less branchy in mind [15:21:43] :) [15:22:30] okay, I'll change that to have a java module instead cc: ottomata
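For the record, the "less branchy" java module being proposed here needn't be more than a few lines of Puppet. A minimal sketch under the constraints from the discussion (one JVM flavour, no Oracle package-name gymnastics); the class name follows the killed module above, but the parameter and default package are illustrative, not the code that eventually got merged:

```puppet
# Minimal java module: one parameter, one package, no per-vendor branching.
# The default package name is an assumption for illustration only.
class java($package = 'openjdk-7-jre-headless') {
    package { $package:
        ensure => present,
    }
}
```

A consumer such as the elasticsearch role would then simply `include java`, or declare `class { 'java': package => 'openjdk-7-jdk' }` where it genuinely needs a JDK.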
[15:22:44] <_joe_> paravoid: looking at the varnish config, I see that "retry503" implies if (obj.status == 503 ) { return(restart); } [15:22:57] under vcl_error though [15:23:05] <_joe_> right [15:23:43] <_joe_> so adding that to vcl_fetch should do the trick, without having to change how the text backend works [15:23:56] <_joe_> I'll try to do that in the least invasive way possible [15:24:54] <_joe_> maybe directly in the text backend config, and only if X-Use-HHVM is equal to 1 [15:25:57] pssh convoluted schmovoluted, it was only convoluted if you don't want to install many versions of a jvm [15:26:02] which uhhh, we don't [15:26:07] :) [15:31:09] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 (the identical CRITICAL followed between 15:31:11 and 15:31:56 for elastic1025, 1020, 1022, 1017, 1002, 1019, 1027, 1012, 1008, 1006 and 1011) [15:32:19] <^demon|away> Hoping to finish up those elastic rebuilds today so icinga will shut up. [15:34:49] RECOVERY - ElasticSearch health check on elastic1020 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 (matching RECOVERYs followed by 15:35:21 for elastic1022, 1002, 1017, 1027, 1012, 1011, 1008, 1006, 1019, 1025 and 1001) [15:36:25] (CR) Glaisher: "Seems fine now." [mediawiki-config] - https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: Glaisher) [15:53:00] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 [15:54:00] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 10: initializing_shards: 0: unassigned_shards: 0 [15:59:00] (PS1) Giuseppe Lavagetto: varnish/text: really retry on zend requests failing on HHVM [puppet] - https://gerrit.wikimedia.org/r/171839 [16:02:21] (CR) BBlack: [C: 1] varnish/text: really retry on zend requests failing on HHVM [puppet] - https://gerrit.wikimedia.org/r/171839 (owner: Giuseppe Lavagetto) [16:05:45] _joe_: just 503s? [16:05:57] not >= 500 for instance? [16:06:12] <_joe_> paravoid: no, I don't see a reason to retry anything else [16:06:35] <_joe_> I mean, we don't have weird bugs with "normal" requests anymore [16:06:57] <_joe_> so there is no case where zend would return a 2xx and hhvm returns 500 [16:07:12] <_joe_> apache doesn't use 502 with proxied requests
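For reference, the change under review boils down to a VCL fragment along these lines. This is a sketch, assuming Varnish 3 syntax and that X-Use-HHVM is the request header mentioned above; it is not the merged diff:

```vcl
sub vcl_fetch {
    # obj.status is only valid in vcl_error; in vcl_fetch the backend
    # response is beresp. The req.restarts guard retries at most once,
    # and the header check scopes the retry to HHVM-routed requests.
    if (beresp.status == 503 && req.http.X-Use-HHVM == "1" && req.restarts == 0) {
        return (restart);
    }
}
```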
[16:10:29] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2113: active_shards: 6355: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 (the identical CRITICAL followed at 16:10:30 for elastic1005, 1009, 1028, 1029 and 1004) [16:11:21] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2114: active_shards: 6358: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 (matching RECOVERYs followed by 16:11:29 for elastic1004, 1028, 1029, 1009 and 1030) [16:27:06] !log shut db1017 briefly for cmjohnson to look [16:27:12] Logged the message, Master [16:27:13] (PS1) Giuseppe Lavagetto: Revert "admin: revoke temporarily Yuvi's root" [puppet] - https://gerrit.wikimedia.org/r/171844 [16:27:45] (CR) Yuvipanda: [C: 1] Revert "admin: revoke temporarily Yuvi's root" [puppet] - https://gerrit.wikimedia.org/r/171844 (owner: Giuseppe Lavagetto) [16:27:58] (PS2) Giuseppe Lavagetto: Revert "admin: revoke temporarily Yuvi's root" [puppet] - https://gerrit.wikimedia.org/r/171844 [16:28:07] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Revert "admin: revoke temporarily Yuvi's root" [puppet] - https://gerrit.wikimedia.org/r/171844 (owner: Giuseppe Lavagetto) [16:33:01] (CR) Yuvipanda: [C: -1] "I'm not actually sure the js.erb file should use spaces. We use tabs for JS everywhere, we should here too." [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [16:33:38] (PS2) Yuvipanda: eventlogging: fix lint errors [puppet] - https://gerrit.wikimedia.org/r/171799 (owner: Dzahn) [16:37:12] (PS2) Yuvipanda: site.pp: lint fixes [puppet] - https://gerrit.wikimedia.org/r/171801 (owner: Dzahn) [16:40:38] (CR) Gage: [C: 2] logstash: Use conditional instead of deprecated grep filter [puppet] - https://gerrit.wikimedia.org/r/171790 (owner: BryanDavis) [16:45:40] (PS3) Yuvipanda: eventlogging: fix lint errors [puppet] - https://gerrit.wikimedia.org/r/171799 (owner: Dzahn) [16:45:51] (CR) Yuvipanda: [C: 2] eventlogging: fix lint errors [puppet] - https://gerrit.wikimedia.org/r/171799 (owner: Dzahn) [17:09:42] (PS3) Yuvipanda: site.pp: lint fixes [puppet] - https://gerrit.wikimedia.org/r/171801 (owner: Dzahn) [17:10:18] (CR) Yuvipanda: [C: 2] site.pp: lint fixes [puppet] - https://gerrit.wikimedia.org/r/171801 (owner: Dzahn) [17:11:29] RECOVERY - RAID on ms-be2003 is OK: OK: optimal, 13 logical, 13 physical [17:15:56] (PS2) Yuvipanda: authdns: lint fixes [puppet] - https://gerrit.wikimedia.org/r/170473 (owner: John F. Lewis) [17:17:01] (CR) Yuvipanda: [C: -1] "(see previous comment)" [puppet] - https://gerrit.wikimedia.org/r/170477 (owner: John F. Lewis) [17:17:09] (CR) Yuvipanda: [C: 2] authdns: lint fixes [puppet] - https://gerrit.wikimedia.org/r/170473 (owner: John F. Lewis) [17:20:25] mutante: are you planning on getting puppet lint to pass so we can make it voting? :) [17:20:46] (PS2) Yuvipanda: Kill ceph module [puppet] - https://gerrit.wikimedia.org/r/170974 [17:21:57] (CR) Yuvipanda: "Is list deletion a common enough activity for this?
I'm slightly worried about a compromised list admin password (I've seen how people kee" [puppet] - https://gerrit.wikimedia.org/r/170398 (owner: John F. Lewis) [17:30:53] why are the app servers so spikey? http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [17:31:33] brylcream? [17:31:47] ? [17:32:28] (PS8) Andrew Bogott: Add class and role for Openstack Horizon [puppet] - https://gerrit.wikimedia.org/r/170340 [17:32:49] ori: terrible 'spike' joke, ignore [17:34:51] (PS9) Andrew Bogott: Add class and role for Openstack Horizon [puppet] - https://gerrit.wikimedia.org/r/170340 [17:35:02] chasemp: hey! there are crons on all the phab hosts that seem to be failing [17:35:10] ConfigParser.NoSectionError: No section: 'general' [17:35:15] * YuviPanda is looking through cronspam for errors [17:35:18] YuviPanda: ah thanks [17:35:40] chasemp: I can forward you the traceback if you want [17:36:09] no I know why just thinking about how to handle it [17:36:17] chasemp: ah, ok :) [17:36:37] it's your basic "prod-only, but you don't want to do prod-only checks, but it's tied logically to things I want in labs" problem [17:37:29] akosiaris: cron errors from logrotate on the parsoid hosts. [17:37:30] error: skipping "/var/log/parsoid/parsoid.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation [17:37:45] gwicke: ^ [17:38:02] good work YuviPanda (honestly) [17:38:10] :D [17:39:51] spagewmf: hey! ee-flow has a cronjob to do git update, did you set that up? [17:40:07] spagewmf: oh yeah, it's on your user account :) can you dev-null it or redirect it to a log file? [17:43:13] Hey I am getting an ERR_EMPTY_RESPONSE [17:43:54] I'm on a MiFi, but other sites are loading very fast. [17:44:08] <_joe_> StevenW: which url? [17:44:14] any [17:44:25] <_joe_> wikimedia's? [17:44:25] en.wikipedia.org or http://en.wikipedia.org/wiki/Wikipedia:Spoiler which is what I'm trying for [17:45:04] <_joe_> do you know which IP you are connecting to? [17:45:10] <_joe_> ulsfo supposedly? [17:45:22] <_joe_> because I can see that page perfectly from here [17:45:50] <_joe_> also, error rates look good [17:45:51] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [17:46:17] <_joe_> but someone in SF can help you, I was going off now :) [17:46:44] thx! [17:47:07] StevenW: curl -vvv http://en.wikipedia.org/wiki/Wikipedia:Spoiler 2>&1 | pbcopy, then pastebin plz [17:47:09] also, hai [17:47:45] http://pastebin.com/5WfC7mqs [17:51:35] isn't that esams? huh [17:52:50] StevenW: what about traceroute en.wikipedia.org ? [17:53:05] For context I am on a Verizon MiFi [17:54:18] hi StevenW, can you try with ipv4? (add the -4 argument to curl) [17:55:48] With -4 I get the article contents from curl [17:55:56] hmmm ok that's interesting [17:56:24] i have a meeting in 5 mins but hopefully that will help whoever continues this troubleshooting [17:58:46] legoktm: bd808 csteipp heya! labs project sul-test has a cronjob that's been failing for a while. [17:58:50] Cron [ -x /usr/lib/php5/sessionclean ] && /usr/lib/php5/sessionclean [17:58:56] find: invalid argument `-delete' to `-cmin' [18:00:48] ottomata: heya! analytics1003 has a failing cron!
Cron /usr/sbin/ganglia-logtailer --classname PacketLossLogtailer --log_file /var/log/udp2log/packet-loss.log --mode cron [18:00:53] HTTPError = 403 (message repeated 20 times) File /var/log/udp2log/packet-loss.log cannot be read. [18:01:00] YuviPanda: blerg. that cron is mine [18:01:04] something is up [18:01:13] bd808: which one? sul-test? [18:01:15] paravoid: are you around? [18:01:19] YuviPanda: Yeah [18:01:31] YuviPanda: I know what broke it too so easy fix [18:01:40] bd808: yay :) [18:03:00] YuviPanda: do you know how to troubleshoot potential ipv6 connectivity issues? [18:03:06] YuviPanda: mw-vagrant changes out the session clean script that comes with php5.5 for one that isn't broken. But the labs puppet code would have upgraded php5 for a security vuln and mw-vagrant's fix wasn't reapplied [18:03:09] ori: sadly no. [18:03:15] bd808: aaah, I see. [18:03:25] there were a whole bunch of alerts yesterday for LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 [18:03:28] i wonder if that's related [18:03:40] bd808: so mathoid02, sul-test, math-preview, ogvjs-testing, mlp all have the same error. [18:04:26] YuviPanda: seems likely. Probably every trusty box that is using labs-vagrant and hasn't run `labs-vagrant provision` since the php package update [18:04:41] bd808: ah [18:04:59] StevenW: it looks like a network issue, not sure if it has to do with verizon or us or some intermediary. i'll keep poking opsen to investigate. [18:05:19] It's probably something weird to do with the Verizon Jetpack. [18:05:31] Thanks for the help [18:05:34] And hi :) [18:05:39] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:06:48] oh YuviPanda that can be deleted, will delete it [18:06:53] StevenW: running 'networksetup -setv6off Wi-Fi' will probably work around it [18:07:02] * Connected to en.wikipedia.org (::1) port 80 (#0) [18:07:10] Thanks [18:07:36] why is en.wp resolving to localhost... [18:07:42] and why the heck does your localhost reply on port 80 [18:10:28] ottomata: yay, thanks! :) Am just tracking down cronspam [18:11:20] kart_: hmm, you've a cronjob running a script in your homedir on cxserver. want to puppetize it properly? :) [18:17:47] gwicke: akosiaris I filed https://bugzilla.wikimedia.org/show_bug.cgi?id=73145 [18:23:21] mutante: I've a bug for you! https://bugzilla.wikimedia.org/show_bug.cgi?id=73146 [18:23:41] (CR) John F. Lewis: [C: -1] "Is it common, nope." [puppet] - https://gerrit.wikimedia.org/r/170398 (owner: John F. Lewis) [18:25:35] YuviPanda: thanks! /cc subbu [18:27:51] gwicke: :D yw! [18:27:59] chasemp: filed the bug at https://phabricator.wikimedia.org/T1151?workflow=create [18:29:04] YuviPanda: Labs? Yes. Otherwise, cxserver is pretty well on Beta. [18:29:24] kart_: the cxserver instance [18:29:35] YuviPanda: let's talk on Monday. [18:29:49] kart_: ah, it's late for you! :) nights! [18:29:55] YuviPanda: too late for me (have to wake up at 4 AM) [18:30:02] kart_: ah, right [18:30:06] (usual bike ride :)) [18:30:19] Night! [18:33:38] dr0ptp4kt: heya! more cronspam for you!
dr0ptp4kt: Cron /home/dr0ptp4kt/job.sh minor [18:33:54] rm: cannot remove `caps/*': No such file or directory [18:33:55] rm: cannot remove `unified-caps/*': No such file or directory [18:34:17] dr0ptp4kt: once an hour, I think [18:36:44] YuviPanda: yeah. how would you recommend telling the script to not complain about that? just redirect 2>/dev/null ? [18:36:57] dr0ptp4kt: yeah. also rm -f, but that can be dangerous [18:37:12] YuviPanda: in this case, i don't care if it succeeds or fails, just want the dir clean. i'll just set it to 2>/dev/null [18:37:16] (PS1) Chad: Logging for my least favorite variable [mediawiki-config] - https://gerrit.wikimedia.org/r/171865 [18:37:28] dr0ptp4kt: heh, ok :) [18:37:39] YuviPanda: rm -f scary [18:39:24] dr0ptp4kt: yeah [18:39:31] YuviPanda: was that it for cronspam? [18:39:41] dr0ptp4kt: for you, so far :) [18:39:47] YuviPanda: cool, thanks. [18:39:52] dr0ptp4kt: will notify more :) [18:39:57] I need to write a script that does something for me [18:43:49] YuviPanda: done with the job.sh update on zbdd. thx again [18:43:57] dr0ptp4kt: yw! [18:45:11] Krinkle: hi! I see you've backup cronjobs on cvn* instances. mind redirecting them to /dev/null (or a log)? [19:33:53] Reedy: hmm, php-fss seems to be Version: 1.0-2 [19:34:19] I wonder if it was only rebuilt for 14.04? [19:34:27] (CR) Chad: [C: 2] Logging for my least favorite variable [mediawiki-config] - https://gerrit.wikimedia.org/r/171865 (owner: Chad) [19:34:44] Reedy: actually, Installed: 1.0-1 [19:34:53] Reedy: yeah, that's possible. [19:34:58] Reedy: candidate is Candidate: 1.0-2 [19:35:11] Ok, now I'm slightly confused [19:35:17] I'm sure I was looking at a different changelog [19:35:24] php5-fss (1.0-2) precise-wikimedia; urgency=low [19:35:24] * PHP ini file used '#' for comments which is deprecated. [19:35:24] -- Antoine Musso Mon, 29 Jul 2013 14:41:18 +0000 [19:35:41] Reedy: hmm, so puppet has these set as ensure => present, not ensure => latest [19:35:46] Reedy: so they haven't been upgraded [19:35:51] Guess that explains it [19:35:56] (Merged) jenkins-bot: Logging for my least favorite variable [mediawiki-config] - https://gerrit.wikimedia.org/r/171865 (owner: Chad) [19:36:15] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 04s) [19:36:17] Logged the message, Master [19:36:40] !log upgraded php5-fss to 1.0-2 on virt1000 to prevent cronspam [19:36:42] Logged the message, Master [19:37:06] gj [19:37:15] dr0ptp4kt: btw, your zbdd script is still cronspamming :) [19:37:27] <^d> Gahhh, so many global titles! [19:37:31] <^d> So much log. [19:37:37] so much cronspam too [19:37:40] YuviPanda: hwhat!? [19:37:51] YuviPanda: is it the same error for rm? [19:37:59] dr0ptp4kt: nope, let me forward [19:38:11] dr0ptp4kt: sent [19:38:14] YuviPanda: thx [19:38:41] dr0ptp4kt: yw [19:39:07] subbu: re: ocg vs parsoid log rotate failures, I'm not getting any cron failure email from ocg, while parsoid arrives once an hour... [19:39:13] subbu: so I don't think that is the case. [19:39:33] ok, cscott ^ [19:44:33] Reedy: think we should set ensure => latest for the wmf php packages? [19:44:45] Reedy: luasandbox, wikidiff2 and fss? [19:44:49] ori: ^ [19:51:36] YuviPanda: no [19:51:39] we upgrade those manually [19:51:53] and we want to retain the ability to push new packages to one or two hosts before doing the rest [19:51:58] or forget to [19:51:59] hmmm [19:51:59] ok [19:53:06] Is virt* in some dsh group? maybe that's why it was missed.
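The policy ori is defending here, as a short Puppet sketch; the package list follows the names in the discussion, and the exact resource in the manifests may differ:

```puppet
# ensure => present: puppet installs the package once and never upgrades it,
# so new builds can be staged to a host or two by hand (see the salt
# one-liner below) before a wider rollout. ensure => latest would take away
# that control -- though, as virt1000 showed, present also means an upgrade
# can simply be forgotten.
package { ['php5-fss', 'php-luasandbox', 'php-wikidiff2']:
    ensure => present,
}
```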
[19:54:17] ori: hmm, how does pushing packages to multiple hosts usually happen? [19:54:18] dsh? [19:54:32] salt nowadays? ;) [19:55:02] aaah, right [19:55:02] damn [19:55:38] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:57] YuviPanda: something like: [19:56:17] salt -b 10% -G 'php:hhvm' cmd.run 'apt-get update ; apt-get install hhvm -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" --force-yes' [19:57:21] ori: cool. now most of the fss errors are coming from labs, tho. [19:58:31] YuviPanda: i wouldn't be opposed to making those ensure => latest if $realm == labs [19:58:45] well, would need to think about that, actually [19:58:48] yeah [19:58:49] why are there errors? [19:58:58] ori: not errors, just deprecation warnings. [19:59:11] meh [19:59:16] yeah, not a big deal [19:59:20] the new package doesn't [20:00:03] ^d: stuff like RequestContext::getTitle called by ContextSource::getTitle with no title set isn't very useful... [20:02:43] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57887 bytes in 0.446 second response time [20:03:37] !log restarted gitblit on antimony [20:03:42] Logged the message, Master [20:08:51] YuviPanda: anything more showing up for dr0ptp4kt cronspam? [20:08:58] dr0ptp4kt: nope :) [20:13:28] !log ori Synchronized php-1.25wmf7/includes/WebResponse.php: I569b2ebbc: Add WebResponse::getHeader() (duration: 00m 07s) [20:13:32] Logged the message, Master [20:20:57] (CR) Andrew Bogott: [C: 2] Minor changes for labs testing [puppet] - https://gerrit.wikimedia.org/r/169608 (owner: Andrew Bogott) [20:21:29] chasemp: hmm, the fab2 instance has 'reboot' in its cron [20:21:43] ^d did that :) [20:22:04] chasemp: only problem seems to be 'reboot' isn't a command :) [20:22:08] should've been shutdown -r now [20:22:23] chasemp: heh, I think I was with him when he did it. I might've even suggested it :) [20:22:24] fab2 I have no real relation to [20:22:27] and disavow all knowledge [20:22:30] should we kill it? [20:22:39] as ^d [20:22:53] chasemp: I asked about the instance itself, not the cron :) [20:23:09] I meant his call on the instance :) [20:23:14] it's in his project I think [20:23:20] oh, was it? [20:25:31] <^d> lol fab2 reboot. [20:25:33] <^d> :) [20:25:39] <^d> someone finally discovered it [20:27:53] (PS10) Andrew Bogott: Add class and role for Openstack Horizon [puppet] - https://gerrit.wikimedia.org/r/170340 [20:28:20] greg-g: we tightened up some input restrictions in this week's deploy, but it's causing exceptions on Special:Contributions (visible to users as missing contributions entries).
Would like to deploy https://gerrit.wikimedia.org/r/#/c/171755/ to relax the restriction so it's back to the pre-deploy state, and will fix it properly next week [20:33:32] !log ori Synchronized php-1.25wmf6/includes/WebResponse.php: I569b2ebbc: Add WebResponse::getHeader() (duration: 00m 09s) [20:33:35] Logged the message, Master [20:37:23] ebernhardson: word [20:37:41] (PS1) Chad: Add additional whitelisted classes for Translate support [puppet] - https://gerrit.wikimedia.org/r/171892 [20:38:17] greg-g: thanks [20:38:39] ebernhardson: fire at will [20:48:29] (CR) Manybubbles: [C: 1] Add additional whitelisted classes for Translate support [puppet] - https://gerrit.wikimedia.org/r/171892 (owner: Chad) [20:49:40] <^d> manybubbles: Now we just got to find a willing opsen :) [20:52:27] ^ what's the whitelisted classes thing about, exactly? [20:52:59] (basically, is this fixing a known issue in prod, or is it a new change + more risk on a Friday?) [20:53:20] !log ebernhardson Synchronized php-1.25wmf7/extensions/Flow: Bump flow submodule for bug 71858 (duration: 00m 08s) [20:53:25] Logged the message, Master [21:00:21] YuviPanda, reg. the logrotate bug .. is that something you can help fix? I imagine it will require a puppet patch? /cc gwicke [21:00:41] greg-g: all done [21:00:49] ebernhardson: good work, soldier [21:00:58] * greg-g stops with that stupid metaphor now [21:01:03] subbu: looking [21:01:07] :) [21:03:39] subbu: looks like a puppet patch, looking [21:05:08] k [21:05:13] <^d> bblack: It's prepping for moving Translate to Elastic next week. [21:05:21] <^d> (since we have to do a rolling restart to pick up the config) [21:05:32] <^d> What it does is whitelist specific classes for the groovy scripting sandbox. [21:05:52] is it going to break something before Monday if I push it through? :) [21:06:08] (PS1) Yuvipanda: parsoid: Log rotate as parsoid, not root [puppet] - https://gerrit.wikimedia.org/r/171902 (https://bugzilla.wikimedia.org/73145) [21:06:15] subbu: ^ [21:06:20] that *should* fix it [21:07:14] <^d> bblack: It shouldn't, it's just expanding the config on a feature we already use. [21:07:20] <^d> And nothing will start *using* it until next week [21:07:42] YuviPanda, awesome thanks .. that would have taken me much longer to go looking into the puppet repo. [21:08:09] subbu: :) [21:08:11] (PS2) Yuvipanda: parsoid: Log rotate as parsoid, not root [puppet] - https://gerrit.wikimedia.org/r/171902 (https://bugzilla.wikimedia.org/73145) [21:09:01] subbu: I'm going to wait for a +1 from someone else before merging tho [21:09:18] k [21:09:49] mutante: ^ wanna +1? [21:10:01] looks good to me, but i am not familiar with puppet syntax .. so .. better if someone else does it :) [21:10:30] subbu: yeah, was hoping mutante would :) [21:11:48] (CR) BBlack: [C: 2] Add additional whitelisted classes for Translate support [puppet] - https://gerrit.wikimedia.org/r/171892 (owner: Chad) [21:12:19] bblack: wanna +1 a trivial change as well? https://gerrit.wikimedia.org/r/171902 [21:14:09] (CR) GWicke: [C: 1] parsoid: Log rotate as parsoid, not root [puppet] - https://gerrit.wikimedia.org/r/171902 (https://bugzilla.wikimedia.org/73145) (owner: Yuvipanda) [21:15:03] <^d> bblack: thanks! [21:15:06] (PS3) Yuvipanda: parsoid: Log rotate as parsoid, not root [puppet] - https://gerrit.wikimedia.org/r/171902 (https://bugzilla.wikimedia.org/73145) [21:15:35] * YuviPanda merges anyway [21:16:32] (CR) Yuvipanda: [C: 2] parsoid: Log rotate as parsoid, not root [puppet] - https://gerrit.wikimedia.org/r/171902 (https://bugzilla.wikimedia.org/73145) (owner: Yuvipanda)
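The whole fix hinges on logrotate's su directive: with /var/log/parsoid writable by a non-root group, logrotate refuses to rotate unless told which user and group to run as. A hedged sketch of the merged idea, with the rotation schedule and file mode assumed rather than copied from the patch:

```puppet
# Ship the parsoid logrotate stanza with an explicit su directive so
# rotation runs as parsoid:parsoid instead of root, which is exactly what
# the "insecure permissions" cronspam was asking for.
file { '/etc/logrotate.d/parsoid':
    ensure  => file,
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    content => "/var/log/parsoid/parsoid.log {\n    su parsoid parsoid\n    daily\n    rotate 7\n    compress\n    missingok\n}\n",
}
```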
[21:19:02] subbu: cscott gwicke alright, things seem fine now! [21:19:45] \o/ [21:19:48] do you guys know which hosts have ocg? [21:19:54] I can take a look there as well, to verify this isn't a problem there [21:20:21] let me look. [21:20:53] ocg1003.eqiad.wmnet .. 1002 and 1001 [21:22:44] subbu: nah, that's fine. it has different problems of course (very weird partitioning), but not the same one [21:24:23] k [21:26:17] (CR) Yuvipanda: [C: -1] "Can't you split the dict into multiple lines instead? This looks fairly ugly, IMO." [puppet] - https://gerrit.wikimedia.org/r/170493 (owner: John F. Lewis) [21:32:27] (PS1) Ottomata: Add mforns to researchers group [puppet] - https://gerrit.wikimedia.org/r/171961 [21:33:17] (CR) Ottomata: [C: 2] Add mforns to researchers group [puppet] - https://gerrit.wikimedia.org/r/171961 (owner: Ottomata) [21:39:33] (PS1) Dzahn: add Erik to statistics user group [puppet] - https://gerrit.wikimedia.org/r/171963 [21:40:54] (PS2) Dzahn: add Erik to statistics-privatedata-users,researchers [puppet] - https://gerrit.wikimedia.org/r/171963 [21:42:33] (PS3) Dzahn: add Erik to statistics-privatedata-users,researchers [puppet] - https://gerrit.wikimedia.org/r/171963 [21:51:36] (CR) Dzahn: ""We use tabs for JS everywhere, we should here too."" [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [21:52:31] (Abandoned) Dzahn: ocg templates: retab [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [21:53:22] (CR) Dzahn: [C: 2] add Erik to statistics-privatedata-users,researchers [puppet] - https://gerrit.wikimedia.org/r/171963 (owner: Dzahn) [21:54:03] (CR) Yuvipanda: ""We" is 'https://www.mediawiki.org/wiki/Manual:Coding_conventions'" [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [21:54:39] (Restored) Yuvipanda: ocg templates: retab [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [21:56:06] (PS2) Yuvipanda: ocg templates: retab [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [21:56:09] mutante: ^ [21:56:33] (CR) Dzahn: [C: 2] "ok, true :) thx" [puppet] - https://gerrit.wikimedia.org/r/171797 (owner: Dzahn) [21:56:49] mutante: ty! [22:11:49] (PS1) Rush: phab ensure prod cron gets removed in labs [puppet] - https://gerrit.wikimedia.org/r/171967 [22:19:46] (CR) Yuvipanda: [C: 1] phab ensure prod cron gets removed in labs [puppet] - https://gerrit.wikimedia.org/r/171967 (owner: Rush) [22:19:49] chasemp: ty! [22:20:11] (CR) Rush: [C: 2] phab ensure prod cron gets removed in labs [puppet] - https://gerrit.wikimedia.org/r/171967 (owner: Rush)
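The shape of chasemp's fix, sketched in Puppet. The job name, command, and schedule are placeholders; the point is the $::realm guard, which keeps the cron in production and actively removes it from labs instances, where the config it reads (hence the missing 'general' section) doesn't exist:

```puppet
# Present in production, absent (and removed if already installed) in labs.
cron { 'phab_maintenance':
    ensure  => $::realm ? {
        'labs'  => absent,
        default => present,
    },
    command => '/usr/local/bin/phab_maintenance.sh',  # placeholder command
    user    => 'root',
    hour    => 4,
    minute  => 0,
}
```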
[22:44:16] (PS3) Yuvipanda: dataset: lint fixes [puppet] - https://gerrit.wikimedia.org/r/170492 (owner: John F. Lewis) [22:44:19] mutante: apergos ^ I updated that linting patch, let's see what the linter says. [22:45:43] YuviPanda: it likes nfs.pp, good! :) [22:45:54] mutante: both of 'em? [22:46:01] top-scope variable being used without an explicit namespace on line 30 is unrelated and left [22:46:18] well jenkins isn't going to say anything, we knew that already... [22:46:25] YuviPanda: yes, both [22:46:27] it's whatever linter john was using [22:46:40] it's the same thing, except the version maybe [22:46:49] apergos: it has a linter, I think. it's currently not voting, and fails all the time. [22:46:52] what i used just now, what John uses, and what jenkins uses [22:46:58] is all puppet-lint [22:47:13] the difference is just in the version and the options being used [22:47:26] "lenient" excludes more things than "strict" [22:47:34] * YuviPanda wonders if he should +1 a patch he contributed to [22:47:47] the point of making these patches is that one day it doesn't fail all the time :) [22:47:49] yeah but I just looked at that output, and it doesn't have those files marked in the bad list [22:48:09] also I've used puppet-lint on lines with (to my mind) the correct indentation and they have always been fine [22:48:47] wait, lenient _excludes_ more things than strict? [22:49:07] yea, strict checks everything, lenient excludes some checks [22:49:25] before his change, the nfs.pp would have this f.e. [22:49:27] (PS1) Wpmirrordev: Extend maximum allowed mediawiki version to 1.24 [dumps] (ariel) - https://gerrit.wikimedia.org/r/171976 [22:49:31] ERROR: two-space soft tabs not used on line 32 [22:49:47] oh, excludes some checks. ok [22:50:00] it has 2 levels of fail, WARNING and ERROR [22:50:02] sure, all the tabs and spaces things are fine to fix up [22:50:08] lenient is basically only the ERRORs [22:50:21] I thought you meant it would exclude more files, I couldn't see how that would fit with the naming scheme [22:50:43] so anyways nfs.pp, still not in either list, as expected [22:50:45] no i meant checks, if you do puppet-lint --help [22:50:56] you will see a bunch of options all starting with "--no" [22:51:05] ah, yeah I never use no- anything [22:51:12] I just rather have it whine [22:51:28] yea, well, except maybe --no-80chars-check [22:51:32] because i dont see how to fix them [22:51:41] heh I hate it but I do let it whine at me for that [22:51:43] \ as a line break isn't nice [22:51:47] no it isn't [22:52:02] 80chars is stupid [22:52:17] see, so that one should be ignored [22:52:28] really I wonder why people are so stuck on 80 chars, it's not like 'oh, but I can't see more than 80 chars on my mobile phone' dude you can't see 20 chars on your iphone [22:52:52] and i made a bug to ask for jenkins to do --no-autoloader_layout-check if a file is in ./manifests/role/ [22:52:55] but I let it and pylint both whine at me for 80 anyways [22:53:03] ah now that's good [22:53:04] the autoload module layout thing totally makes sense for ./modules/ [22:53:09] but it never will for ./role/ [22:53:12] yep [22:53:45] ok, almost all cronspam cleaned up \o/ [22:53:49] just a couple from spage left. [22:54:17] :) [22:58:04] (PS2) Wpmirrordev: Extend maximum allowed mediawiki version to 1.24 [dumps] (ariel) - https://gerrit.wikimedia.org/r/171976 (https://bugzilla.wikimedia.org/66663) [22:58:33] I just silenced them and emailed him [22:58:46] (PS12) GWicke: WIP: Add restbase role [puppet] - https://gerrit.wikimedia.org/r/171741 [23:03:57] mutante: talking about cronspam, saw the bug I filed against wikiviewstats?
[23:04:03] https://bugzilla.wikimedia.org/show_bug.cgi?id=73146 [23:05:39] * YuviPanda heads off to sleep [23:05:40] night [23:06:12] YuviPanda|zzz: i had not. i'll take it [23:06:15] cya [23:07:06] it's not an Analytics product, but it's extremely confusing because there is wikistats and wikistats [23:22:40] (CR) ArielGlenn: [C: 1] dataset: lint fixes [puppet] - https://gerrit.wikimedia.org/r/170492 (owner: John F. Lewis)
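To make the autoloader_layout point from the linting discussion concrete: the check wants a class's file path to mirror its name module-style, which role manifests by definition never do. The class names here are illustrative only:

```puppet
# modules/java/manifests/init.pp -- path matches the class name, so
# puppet-lint's autoloader_layout check passes:
class java {
}

# manifests/role/cache.pp -- role classes live outside modules/, so the
# same check can only ever flag them; hence the request for jenkins to run
# --no-autoloader_layout-check on files under ./manifests/role/.
class role::cache::text {
}
```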