[00:25:37] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [00:42:37] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:55:47] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:56:37] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:10:37] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:10:47] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:23:47] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [01:26:37] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [01:30:47] PROBLEM - puppet last run on aqs1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:31:47] PROBLEM - puppet last run on etcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:37:47] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:55:37] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:58:47] RECOVERY - puppet last run on aqs1008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [01:59:47] RECOVERY - puppet last run on etcd1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [02:18:13] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.16) (duration: 06m 44s) [02:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 20 02:23:38 UTC 2017 (duration 5m 25s) [02:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [02:24:27] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:24:37] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [02:28:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [02:28:27] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [02:32:37] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:35:47] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [02:58:37] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:00:37] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [03:21:27] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [03:21:27] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:25:37] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [03:28:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [03:28:27] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [03:28:34] Hey... [03:28:47] need some opinions, I think. [03:29:17] related to the backlog, on Commons, of transcodes... [03:30:18] they were divided into two queues… all of the uninitialized transcodes that would go into the ‘priority’ queue are now run. [03:30:55] There are about 25k transcodes that would go into the ‘normal’ queue... [03:31:25] If I keep running them a bit at a time, as it catches up, it will take ‘who knows how long’. [03:32:02] If I shove them in all at once, they will (as a high estimate) take 10-12 days to run. [03:32:58] During that time, for new uploads the low-res transcodes would run, but the high-res would be behind that queue. [03:33:17] brion: ^ thoughts? 
[03:33:44] brion: The ones left are all 720p or 1080p ogg [03:39:55] (03PS1) 10Reedy: [Beta] Enable CollaborationKit on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) [03:42:24] (03PS2) 10Reedy: Enable CollaborationKit on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) [03:45:02] 06Operations, 10Gerrit: Decide how to support polygerrit - https://phabricator.wikimedia.org/T158479#3113507 (10Dzahn) Very nice that we don't need the rewrites anymore :) [03:47:11] (03CR) 10Dzahn: [C: 031] "message says wikipedia/wikimedia, but both are wikipedias, change itself is good. both requests are approved." [dns] - 10https://gerrit.wikimedia.org/r/343572 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [03:47:27] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:47:37] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:15:27] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [04:15:37] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:16:54] !ops AntiSpamMeta rapes piglets [04:21:23] !cookie AlexZ [04:22:03] ty [04:22:05] Because trolling a bot makes ‘so’ much sense. [04:23:16] i think the point was to ping people with the stalkword [04:26:47] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [04:30:27] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:58:27] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [05:01:28] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:29:41] c: Yes, but he could have aimed it at someone who cares. :P [05:30:27] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:40:01] (03CR) 10Harej: [C: 031] Enable CollaborationKit on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) (owner: 10Reedy) [06:06:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [06:07:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [06:15:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [06:16:07] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:18:37] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:47:37] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:47:47] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [07:16:47] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:43:33] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3113658 (10elukey) [07:46:20] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3113672 (10elukey) [08:26:37] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.12 seconds [08:28:12] !log cirrus: refreshing all comp suggest indices in elastic@codfw [08:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:37] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [08:40:20] 06Operations, 06Office-IT, 15User-Urbanecm: Request for email address seniori@wikimedia.org - https://phabricator.wikimedia.org/T160400#3113693 (10Urbanecm) 05Open>03Resolved The alias works. Thanks anyone interested for your help! [08:47:34] !log Jenkins: depooling / deleting Precise instances. T158652 [08:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:41] T158652: Depool precise jenkins instances - https://phabricator.wikimedia.org/T158652 [08:53:33] (03PS1) 10Jcrespo: mariadb: update files for mariadb 10.0.30 and mariadb 10.1.22 [software] - 10https://gerrit.wikimedia.org/r/343602 [08:56:40] (03CR) 10Jcrespo: [C: 032] mariadb: update files for mariadb 10.0.30 and mariadb 10.1.22 [software] - 10https://gerrit.wikimedia.org/r/343602 (owner: 10Jcrespo) [09:02:15] (03CR) 10Hashar: [C: 031] "Cherry picked on the CI puppet master. I have run puppet on hosts one by one and this change is a noop." 
[puppet] - 10https://gerrit.wikimedia.org/r/343306 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [09:02:32] !log refreshing ttm documents in elastic@codfw [09:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] (03CR) 10Hashar: "IIRC we restored a list of PHP Precise packages in mediawiki::packages::legacy solely for CI usage (which includes mediawiki::packages)." [puppet] - 10https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [09:13:52] (03PS1) 10Alexandros Kosiaris: netops: Remove domain names from iciga definitions [puppet] - 10https://gerrit.wikimedia.org/r/343603 [09:18:16] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113734 (10Addshore) [09:18:52] (03PS2) 10Alexandros Kosiaris: Amend "Revert "nagios: Specify a parents host relationship"" [puppet] - 10https://gerrit.wikimedia.org/r/343036 [09:19:21] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113734 (10Addshore) p:05Triage>03High [09:23:43] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113734 (10elukey) Seems a perfect match with merge time of https://gerrit.wikimedia.org/r/#/c/341570 [09:31:24] * akosiaris breaking icinga (once again) [09:31:34] (03CR) 10Alexandros Kosiaris: [C: 032] netops: Remove domain names from iciga definitions [puppet] - 10https://gerrit.wikimedia.org/r/343603 (owner: 10Alexandros Kosiaris) [09:31:43] (03CR) 10Alexandros Kosiaris: [C: 032] Amend "Revert "nagios: Specify a parents host relationship"" [puppet] - 10https://gerrit.wikimedia.org/r/343036 (owner: 10Alexandros Kosiaris) 
[09:34:56] \o/ [09:35:05] finally... [09:37:22] !log restarting db1094 for upgrade [09:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:50] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3113759 (10elukey) [09:38:52] 06Operations, 06Labs: labtestcontrol2001: cron-spam from invoke-rc.d atop _cron - https://phabricator.wikimedia.org/T159532#3113756 (10elukey) 05Open>03Resolved a:03elukey Found `echo "Do the thing"` in /lib/lsb/init-functions, probably added manually. Just removed it. [09:42:10] !log swift bump ms-be2028 -> ms-be2039 weight - T158337 [09:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:17] T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337 [09:44:37] 06Operations, 10ops-eqiad, 10DBA: db1094 crash - https://phabricator.wikimedia.org/T160832#3113761 (10jcrespo) [09:49:26] (03CR) 10Addshore: "https://phabricator.wikimedia.org/T160888" [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:01:49] (03PS1) 10Jcrespo: mariadb: Repool db1094 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343607 (https://phabricator.wikimedia.org/T160832) [10:05:39] (03CR) 10Giuseppe Lavagetto: Add stages to manage maintenance (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/342806 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [10:06:22] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1094 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343607 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [10:06:26] (03PS11) 10Giuseppe Lavagetto: Add stages to manage maintenance [switchdc] - 10https://gerrit.wikimedia.org/r/342806 (https://phabricator.wikimedia.org/T160178) [10:06:28] (03PS4) 10Giuseppe Lavagetto: Switch old 
tasks to use remote.Remote [switchdc] - 10https://gerrit.wikimedia.org/r/343319 [10:06:30] (03PS2) 10Giuseppe Lavagetto: Add a stage to switch datacenter [switchdc] - 10https://gerrit.wikimedia.org/r/343537 [10:07:04] (03CR) 10Hashar: [C: 031] "Cherry picked on the CI puppet master for the integration-slave-trusty-*" [puppet] - 10https://gerrit.wikimedia.org/r/343223 (owner: 10Ejegg) [10:07:27] (03Merged) 10jenkins-bot: mariadb: Repool db1094 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343607 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [10:07:36] (03CR) 10jenkins-bot: mariadb: Repool db1094 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343607 (https://phabricator.wikimedia.org/T160832) (owner: 10Jcrespo) [10:09:16] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 after crash (duration: 00m 47s) [10:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:04] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:23] (03PS1) 10Filippo Giunchedi: graphite: cleanup eventstreams rdkafka stale data [puppet] - 10https://gerrit.wikimedia.org/r/343609 (https://phabricator.wikimedia.org/T160644) [10:14:39] (03CR) 10jerkins-bot: [V: 04-1] graphite: cleanup eventstreams rdkafka stale data [puppet] - 10https://gerrit.wikimedia.org/r/343609 (https://phabricator.wikimedia.org/T160644) (owner: 10Filippo Giunchedi) [10:14:49] (03CR) 10Volans: [V: 032 C: 032] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/342806 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [10:17:23] 06Operations, 10Graphite, 06Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3113816 (10fgiunchedi) IIRC grafana permissions are not puppetized :( @gilles you can delete alerts though? 
I see @krinkle is an admin on grafana so maybe that's... [10:18:35] (03CR) 10Volans: [C: 04-1] "Minor thing inline, looks good otherwise" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/343319 (owner: 10Giuseppe Lavagetto) [10:22:40] (03PS1) 10Jcrespo: osm: convert labsdb1006 into an osm slave of labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/343610 (https://phabricator.wikimedia.org/T157359) [10:24:55] (03PS5) 10Giuseppe Lavagetto: Switch old tasks to use remote.Remote [switchdc] - 10https://gerrit.wikimedia.org/r/343319 [10:24:57] (03PS3) 10Giuseppe Lavagetto: Add a stage to switch datacenter [switchdc] - 10https://gerrit.wikimedia.org/r/343537 [10:25:11] (03CR) 10Giuseppe Lavagetto: Switch old tasks to use remote.Remote (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/343319 (owner: 10Giuseppe Lavagetto) [10:25:16] (03CR) 10jerkins-bot: [V: 04-1] Add a stage to switch datacenter [switchdc] - 10https://gerrit.wikimedia.org/r/343537 (owner: 10Giuseppe Lavagetto) [10:25:19] (03CR) 10jerkins-bot: [V: 04-1] Switch old tasks to use remote.Remote [switchdc] - 10https://gerrit.wikimedia.org/r/343319 (owner: 10Giuseppe Lavagetto) [10:25:34] <_joe_> hu? [10:26:30] <_joe_> uhm why complaining just now? [10:31:23] _joe_: I've asked antoine to merge the add of tox checks ;) [10:33:58] (03CR) 10Hashar: [C: 031] Enable CollaborationKit on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) (owner: 10Reedy) [10:34:17] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:35:38] (03CR) 10Volans: [C: 032] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/343319 (owner: 10Giuseppe Lavagetto) [10:35:54] (03CR) 10jerkins-bot: [V: 04-1] Switch old tasks to use remote.Remote [switchdc] - 10https://gerrit.wikimedia.org/r/343319 (owner: 10Giuseppe Lavagetto) [10:36:03] (03CR) 10Hashar: [C: 031] [es5 upgrade] step 3: depool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [10:38:17] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:41:46] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113859 (10elukey) I checked on mwlog1001 and stat1001, the network connection seems fine and not blocked. For some reason files... [10:44:55] (03PS4) 10Gehel: wdqs - remove trebuchet based deployment [puppet] - 10https://gerrit.wikimedia.org/r/343302 [10:46:11] (03CR) 10Gehel: [C: 032] wdqs - remove trebuchet based deployment [puppet] - 10https://gerrit.wikimedia.org/r/343302 (owner: 10Gehel) [10:48:07] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:52:12] (03PS1) 10Gehel: wdqs - wdqs package has been replaced by scap configuration [puppet] - 10https://gerrit.wikimedia.org/r/343614 [10:54:27] (03CR) 10Gehel: [C: 032] wdqs - wdqs package has been replaced by scap configuration [puppet] - 10https://gerrit.wikimedia.org/r/343614 (owner: 10Gehel) [10:54:37] (03CR) 10Alexandros Kosiaris: [C: 031] osm: convert labsdb1006 into an osm slave of labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/343610 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [10:56:07] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:02:17] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:11:33] (03PS2) 10Jcrespo: osm: convert labsdb1006 into an osm slave of labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/343610 (https://phabricator.wikimedia.org/T157359) [11:13:01] (03CR) 10Jcrespo: [C: 032] osm: convert labsdb1006 into an osm slave of labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/343610 (https://phabricator.wikimedia.org/T157359) (owner: 10Jcrespo) [11:14:33] (03PS1) 10Alexandros Kosiaris: monitoring::host: Add the parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343618 [11:14:36] (03PS1) 10Alexandros Kosiaris: netops::check: Add a parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343619 [11:17:08] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:23:31] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3114061 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['labsdb1006.eqiad.wmnet'] ```... 
[11:25:01] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/5834/ says noop, let's see" [puppet] - 10https://gerrit.wikimedia.org/r/343618 (owner: 10Alexandros Kosiaris) [11:25:05] (03PS2) 10Alexandros Kosiaris: monitoring::host: Add the parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343618 [11:25:08] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] monitoring::host: Add the parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343618 (owner: 10Alexandros Kosiaris) [11:25:20] (03CR) 10Alexandros Kosiaris: [C: 032] contint: remove Precise related switches [puppet] - 10https://gerrit.wikimedia.org/r/343306 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [11:25:24] (03PS2) 10Alexandros Kosiaris: contint: remove Precise related switches [puppet] - 10https://gerrit.wikimedia.org/r/343306 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [11:25:27] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] contint: remove Precise related switches [puppet] - 10https://gerrit.wikimedia.org/r/343306 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [11:34:17] (03PS6) 10Giuseppe Lavagetto: Switch old tasks to use remote.Remote [switchdc] - 10https://gerrit.wikimedia.org/r/343319 [11:34:19] (03PS4) 10Giuseppe Lavagetto: Add a stage to switch datacenter [switchdc] - 10https://gerrit.wikimedia.org/r/343537 [11:34:21] (03PS1) 10Giuseppe Lavagetto: Fix tox.ini [switchdc] - 10https://gerrit.wikimedia.org/r/343621 [11:34:37] <_joe_> volans: ^^ this should've fixed tox [11:34:47] looking [11:36:13] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114094 (10elukey) This is probably the current error: ``` elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats /usr/bin/rsync -... 
[11:37:35] (03CR) 10Volans: [C: 032] Fix tox.ini [switchdc] - 10https://gerrit.wikimedia.org/r/343621 (owner: 10Giuseppe Lavagetto) [11:39:52] !log return rdb1007 client-output-buffer-limit config to initially configured value T159850 [11:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:58] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [11:41:47] PROBLEM - puppet last run on wtp1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:46:07] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [11:47:21] (03CR) 10Giuseppe Lavagetto: [V: 032] Fix tox.ini [switchdc] - 10https://gerrit.wikimedia.org/r/343621 (owner: 10Giuseppe Lavagetto) [11:52:26] 06Operations, 10Graphite, 06Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3114112 (10Gilles) I appear to be able to delete one I've just created. @peter which way are you trying to delete the alert and failing to do so? Can you take sc... [11:55:37] (03CR) 10Volans: [C: 04-1] "Looks good, few minor things inline." (034 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/343537 (owner: 10Giuseppe Lavagetto) [11:58:17] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:00:07] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114123 (10elukey) Restarted with one day of contimeout and --progress: ``` elukey@stat1002:/a/mw-log/archive/api$ sudo -u stats... 
[12:09:47] RECOVERY - puppet last run on wtp1018 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:10:17] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [12:10:45] (03PS1) 10Giuseppe Lavagetto: Add pypi classifiers [software/conftool] - 10https://gerrit.wikimedia.org/r/343622 [12:11:25] (03CR) 10jerkins-bot: [V: 04-1] Add pypi classifiers [software/conftool] - 10https://gerrit.wikimedia.org/r/343622 (owner: 10Giuseppe Lavagetto) [12:16:14] (03PS17) 10BBlack: [POC] DNS zones to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/342887 [12:20:56] (03PS1) 10Rush: labstore: keep archival copy of dynamic export.d contents [puppet] - 10https://gerrit.wikimedia.org/r/343623 [12:21:37] 06Operations, 10Graphite, 06Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3114153 (10Peter) Ah sorry it isn't actually called delete, it is called "Clear history". When I click that button and choose yes, I get "Problem Permission deni... [12:22:27] 06Operations, 10Graphite, 06Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3114154 (10Gilles) Possibly an upstream bug, then, that the auth level is so high for that feature? 
[12:25:00] (03PS1) 10Rush: labstore: apply exportd monitoring to secondary role [puppet] - 10https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) [12:25:07] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490012703 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3790610 keys, up 2 minutes 35 seconds - replication_delay is 1490012703 [12:26:18] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:27:00] 06Operations, 06Labs: Add monitoring for nfs-exportd on active labstore specifically - https://phabricator.wikimedia.org/T160838#3114170 (10chasemp) https://gerrit.wikimedia.org/r/#/c/343624/ [12:30:57] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3796830 keys, up 140 days 4 hours - replication_delay is 653 [12:34:07] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3789502 keys, up 11 minutes 35 seconds - replication_delay is 0 [12:34:20] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3114181 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labsdb1006.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['labsdb1006.... 
[12:34:57] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3788966 keys, up 140 days 4 hours - replication_delay is 0 [12:40:38] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2850801 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by filippo on neodymium.eqiad.wmnet for hosts: ``` ['prometheus1003.eqiad.wmnet'] ``` The log can b... [12:45:45] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3114216 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by filippo on neodymium.eqiad.wmnet for hosts: ``` ['prometheus1004.eqiad.wmnet'] ``` The log can b... [12:51:28] (03PS1) 10Volans: Check that core DBs replica is in sync [switchdc] - 10https://gerrit.wikimedia.org/r/343627 (https://phabricator.wikimedia.org/T160178) [12:54:06] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3114234 (10chasemp) 05Open>03Resolved This is still super important but the immediate issue of this task is resolved it seems and followed up by {T159721} [12:57:52] jouncebot: next [12:57:52] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T1300) [12:58:24] 06Operations, 06Labs: Instance creation stalls before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3114255 (10chasemp) [12:58:31] 06Operations, 06Labs: Instance creation stalls before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3114269 (10chasemp) p:05Triage>03High [12:59:10] 06Operations, 06Labs: Instance creation stalls before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3114255 (10chasemp) 
a:03Andrew [12:59:39] (03PS1) 10Volans: Add varnish text caches task [switchdc] - 10https://gerrit.wikimedia.org/r/343629 (https://phabricator.wikimedia.org/T160178) [12:59:53] 06Operations, 06Labs: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3114255 (10chasemp) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T1300). Please do the needful. [13:00:04] matt_flaschen, reedy, and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:43] o/ [13:00:47] Present [13:01:01] I can swat today [13:01:09] o/ [13:01:11] matt_flaschen: I have +2d your commit [13:01:41] Thanks [13:04:05] 06Operations, 06Labs: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3114282 (10chasemp) [13:06:21] 343433 is still waiting for gate-and-submit jobs to finish [13:07:17] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:07:46] matt_flaschen: is it important to sync one file before the other? or any order would do? [13:08:23] matt_flaschen: also, once it is at mwdebug1002, can you test there before full deployment? [13:08:43] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114299 (10Ottomata) > should we change the MAILTO option in the stats crontab Oo, does it email me? For sure let's change it. [13:09:22] zeljkof, doesn't matter, since one is only a test. [13:09:29] zeljkof, yes, I can test on mwdebug1002. 
[13:09:47] matt_flaschen: ok, will ping you as soon as it is at mwdebug1002 [13:09:56] (03CR) 10Ottomata: [C: 031] graphite: cleanup eventstreams rdkafka stale data [puppet] - 10https://gerrit.wikimedia.org/r/343609 (https://phabricator.wikimedia.org/T160644) (owner: 10Filippo Giunchedi) [13:10:58] RECOVERY - Disk space on prometheus1003 is OK: DISK OK [13:11:19] I'm here now [13:12:07] Reedy: is there anything that needs to be done with your patch? except +2? [13:12:26] should I fetch is on tin? [13:12:37] zeljkof: It still needs deploying from tin for consistency. But no testing as it's a beta patch :) [13:13:07] Reedy: ok, so a normal deploy, but no testing [13:13:13] exactly :) [13:13:27] can probably just sync-dir wmf-config for ease [13:13:37] will do, as soon as the first patch is done, your patch is the second [13:13:39] :) [13:14:24] Reedy: it's sync-file now even for directories :) [13:14:42] but good point, since all three files are in the same directory [13:19:24] matt_flaschen: 343433 is merged, deploying, it took a long time to merge [13:22:25] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3114320 (10Ottomata) > it seems that we should set broker.version.fallback=0.9.0.1 everywhere Hm, maybe, except that the burrow alarms we got are all for consumer proc... [13:22:37] matt_flaschen: 343433, should be at mwdebug1002, please take a look and let me know if I can continue with the deploy [13:22:45] Checking [13:23:11] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3114321 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['prometheus1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['prometheus1003.eqiad.wmnet... 
[13:24:24] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3114326 (10elukey) Nono everything auto-recovered by itself, I just reported the errors in the task/email. The only thing that died was mirrormaker but after the puppe... [13:24:46] (03PS3) 10Zfilipin: Enable CollaborationKit on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) (owner: 10Reedy) [13:25:24] matt_flaschen: there is a small chance that the commit is not at mwdebug, this is the second time I am deploying something in core, so mistakes are possible :) [13:25:49] (03PS1) 10Volans: Better handling of item return codes [switchdc] - 10https://gerrit.wikimedia.org/r/343633 (https://phabricator.wikimedia.org/T160178) [13:28:10] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3114329 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['prometheus1004.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['prometheus1004.eqiad.wmnet... [13:28:30] dcausse: you usually deploy your own commits, right? [13:29:25] zeljkof: sure, let me know when you're done [13:29:49] dcausse: will do, still on the first patch, the second one should be quick, then you [13:30:02] ok [13:31:08] zeljkof, working on mwdebug1002, thanks. 
[13:31:20] matt_flaschen: great, deploying then [13:32:08] (03PS1) 1020after4: Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 [13:33:19] (03CR) 10jerkins-bot: [V: 04-1] Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:34:04] !log zfilipin@tin Synchronized php-1.29.0-wmf.16/includes/specials/SpecialWatchlist.php: SWAT: [[gerrit:343433|Watchlist: Fix form and preference overriding (T160734)]] (duration: 01m 01s) [13:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:12] T160734: [Regression] Watchlist filters set in Preferences can't be disabled on Watchlist - https://phabricator.wikimedia.org/T160734 [13:35:09] !log zfilipin@tin Synchronized php-1.29.0-wmf.16/tests/phpunit/includes/specials/SpecialWatchlistTest.php: SWAT: [[gerrit:343433|Watchlist: Fix form and preference overriding (T160734)]] (duration: 00m 48s) [13:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] matt_flaschen: deployed, please check production [13:35:25] (03CR) 1020after4: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:35:36] (03CR) 10DCausse: [C: 031] Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:36:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) (owner: 10Reedy) [13:36:19] zeljkof, working in production, thanks. 
[13:36:20] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:36:27] Reedy: merging 343592 [13:36:35] Cheers [13:36:43] Jerkins should do most of the work getting it on beta [13:36:56] (03CR) 10jerkins-bot: [V: 04-1] Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:37:10] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490017025 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 0 databases (), up 29 seconds - replication_delay is 1490017025 [13:37:19] (03PS1) 10Rush: nova: nova-fullstack keep instance count for debugging [puppet] - 10https://gerrit.wikimedia.org/r/343636 [13:38:32] (03PS2) 10Alexandros Kosiaris: netops::check: Add a parents attribute [puppet] - 10https://gerrit.wikimedia.org/r/343619 [13:39:09] (03Merged) 10jenkins-bot: Enable CollaborationKit on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) (owner: 10Reedy) [13:39:19] (03CR) 10jenkins-bot: Enable CollaborationKit on beta enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343592 (https://phabricator.wikimedia.org/T138325) (owner: 10Reedy) [13:39:28] Reedy: 343592 merged, deploying [13:39:58] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114349 (10elukey) The above job failed, so I tried to rsync only the March files and I can see success: ``` elukey@stat1002:~$... 
[13:41:44] !log zfilipin@tin Synchronized wmf-config/: SWAT: [[gerrit:343592|Enable CollaborationKit on beta enwiki (T138325)]] (duration: 00m 44s) [13:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:50] T138325: Deploy CollaborationKit on Beta Wikipedia - https://phabricator.wikimedia.org/T138325 [13:42:13] Reedy: 343592 deployed, if there is anything to check, please do :) [13:42:20] dcausse: swat is all yours :) [13:42:27] zeljkof: thanks! [13:42:36] please log when you are done [13:42:51] I am around in case of emergency [13:43:01] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 613 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3792326 keys, up 140 days 5 hours - replication_delay is 613 [13:43:26] akosiaris: ---^ :) [13:43:37] zeljkof: thanks. Just need to wait for https://integration.wikimedia.org/ci/job/beta-scap-eqiad/147208/console to do it's stuff [13:43:43] zeljkof: thanks, one question: is it ok to scap pull on wasat? so I can test something mwscript? [13:44:05] dcausse: I don't know :/ [13:44:08] (03PS2) 10Gehel: Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:44:14] Reedy? ^ [13:44:22] (question from dcausse) [13:44:40] (03CR) 1020after4: [C: 031] Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:44:42] dcausse: Is that scap pull after merging something on tin? [13:45:23] Reedy: yes it's like a normal deploy but instead of scap pull on mwdebug it'll be on wasat to I can use mwscript for testing [13:45:43] (03CR) 10Mobrovac: [C: 031] RESTBase: add kbp. and khw.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/343584 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [13:45:56] Should be perfectly fine then :) [13:46:04] Reedy: thanks! 
[13:47:10] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3786717 keys, up 10 minutes 29 seconds - replication_delay is 30 [13:50:12] eu swat might be a bit late, still waiting for jenkins, sorry [13:53:44] (03CR) 10DCausse: [C: 031] Switch phabricator search to codfw [puppet] - 10https://gerrit.wikimedia.org/r/343635 (owner: 1020after4) [13:55:35] !log shutting down es2015 for maintenance T160242 [13:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:41] T160242: es2015 crashed on 2017-03-11 - https://phabricator.wikimedia.org/T160242 [13:57:04] (03CR) 10Mobrovac: [C: 031] Add pa.wikisource new wiki on RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/343552 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [13:59:00] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3786094 keys, up 140 days 5 hours - replication_delay is 0 [13:59:18] (03CR) 10Mobrovac: [C: 04-1] "Inden nit in-lined. And why does rb2001 fail to compile?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342903 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [14:00:58] zeljkof: php-1.29.0-wmf.16 is HEAD detached from 7972185bf8 is it normal? [14:02:46] security patches? [14:03:22] I can just git fetch/rebase as usual? [14:03:38] yup [14:03:40] ok [14:03:55] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3114387 (10elukey) >>! In T160886#3114320, @Ottomata wrote: > > This reminds me though...we upgraded librdkafka everywhere except for eventlog1001! It is still runni... [14:04:56] Reedy: hm... if I git log HEAD..origin/wmf/1.29.0-wmf.16 I se 2 commits [14:05:26] (after git fetch) [14:05:41] That's... 
curious [14:06:01] I think I see the one that was previously swatted and mine [14:06:05] Why is matt_flaschen's patch on top of the security ones? [14:07:40] can someone check the status of /srv/mediawiki-staging/php-1.29.0-wmf.16 on tin, I've just run git fetch after a +2 on the Translate extension [14:07:51] Good question. I don't know, zeljkof did SWAT. [14:07:55] o/ [14:08:05] I thought even "git pull" now automatically did rebase (though I still put in --rebase specifically) [14:09:09] I guess the patch is not merged in Gerrit? [14:09:36] https://gerrit.wikimedia.org/r/#/c/343433/ [14:09:40] it is [14:09:44] PROBLEM - MariaDB Slave IO: s1 on db1080 is CRITICAL: CRITICAL slave_io_state could not connect [14:10:16] if we just checkout .16 it shouldn't have lost the security patches [14:10:20] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:30] elukey: yeah I would be the reason for that thankfully [14:10:40] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:40] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:42] I am gonna merge the client-output-buffer increase btw [14:10:52] it has not been rebased [14:11:25] Hashar did I do something wrong during swat? [14:11:30] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:35] the HEAD i [14:11:40] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:11:40] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:11:43] the HEAD is in detached state for some reason [14:11:44] RECOVERY - MariaDB Slave IO: s1 on db1080 is OK: OK slave_io_state Slave_IO_Running: Yes [14:12:26] dcausse: matt_flaschen: I am going to rebase it properly [14:12:34] PROBLEM - MariaDB Slave IO: s1 on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:35] hashar: thanks [14:12:40] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:12:43] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:12:43] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:53] jynus: have you seen this db1080 alarm before and not db1089? [14:12:55] what's going on? [14:13:25] > git checkout wmf/1.29.0-wmf.16 && git checkout [14:13:28] that fixed it :] [14:13:40] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:13:40] jynus? [14:13:41] volans, I have not seen any alarm [14:13:47] several pages for DBs's, db1080 and db1089 so far as I know not expected [14:13:52] akosiaris: awesome thanks! [14:14:07] jynus: see the emails, I didn't get the pages [14:14:15] dcausse: I guess you can sync extensions/Translate now [14:14:21] load on db1080 is 43 [14:14:26] spikes on tendril for usage [14:14:30] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:14:30] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:14:41] hashar: ok thanks [14:14:41] dcausse: well have to first update the submodule then sync it. I can do it if you want [14:14:49] no it's ok thanks :) [14:15:07] zeljkof: not sure. 
But the patches were no longer applied [14:15:22] zeljkof: the HEAD of the repo lacked the local patches [14:15:22] mediawiki is taking down the database [14:15:24] apparently [14:15:30] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:30] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:15:35] all of enwiki [14:15:40] revert revert? [14:15:43] jynus: open connections went to 30k [14:15:50] on s1 [14:16:58] <_joe_> revert now [14:17:25] RECOVERY - MariaDB Slave IO: s1 on db1089 is OK: OK slave_io_state Slave_IO_Running: Yes [14:17:36] I think part of the problem being discussed above was issues with how git/deploy is working out, so "revert" may be complicated by that [14:17:40] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:40] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:40] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:17:49] but regardless, we need a functional revery of whatever recently changed [14:17:53] *revert [14:18:21] from swat we had SpecialWatchlist ~ 45 minutes ago [14:18:30] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:18:32] it is a special page [14:18:36] I did not scap Translate yet [14:18:36] what it is failing [14:18:40] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:18:40] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:18:45] <_joe_> so from what I see [14:18:47] enwiki is still down [14:19:00] enwiki wfm [14:19:02] 13:41 zfilipin@tin: Synchronized wmf-config/: SWAT: Enable CollaborationKit on beta enwiki (T138325) (duration: 00m 44s) [14:19:02] T138325: Deploy CollaborationKit on Beta Wikipedia - https://phabricator.wikimedia.org/T138325 [14:19:03] <_joe_> since 14:00 we have a huge increase in hhvm processing [14:19:05] 13:35 zfilipin@tin: Synchronized php-1.29.0-wmf.16/tests/phpunit/includes/specials/SpecialWatchlistTest.php: SWAT: Watchlist: Fix form and preference overriding (T160734) (duration: 00m 48s) [14:19:05] T160734: [Regression] Watchlist filters set in Preferences can't be disabled on Watchlist - https://phabricator.wikimedia.org/T160734 [14:19:08] <_joe_> the load of the hhvm servers [14:19:09] 13:34 zfilipin@tin: Synchronized php-1.29.0-wmf.16/includes/specials/SpecialWatchlist.php: SWAT: Watchlist: Fix form and preference overriding (T160734) (duration: 01m 01s) [14:19:28] <_joe_> so it could be due to something killing the db [14:19:29] ^ those are the last SALs for software deploy [14:19:31] <_joe_> and the db being slow [14:19:47] <_joe_> zeljkof, hashar can we revert to pre-SWAT state? 
[14:19:56] _joe_: the db has thousands of connections in login phase [14:20:06] Reedy, enwiki working for some queries doesn't mean it is up everywhere [14:20:26] _joe_: yeah I can revert the few patches [14:20:28] the problem is a single kind of request [14:20:40] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:40] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:20:49] <_joe_> hashar: I'd do that just to be sure [14:20:51] then the last sync is from 50 minutes ago [14:21:09] hashar: thanks [14:21:15] reverting reverting :D [14:21:21] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=9&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All [14:21:28] for me that is enwiki being down^ [14:21:34] PROBLEM - MariaDB Slave SQL: s1 on db1080 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:21:34] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.053 second response time [14:21:40] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:21:47] it started at 14:01 [14:21:54] PROBLEM - MariaDB Slave SQL: s1 on db1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:21:54] <_joe_> jynus: that doesn't really correspond with any release, though [14:21:55] the fact there is a buffer [14:22:20] reverting SpecialWatchlist.php change [14:22:20] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:22:25] is just that we allow 10K connections per db [14:22:51] !log hashar@tin Synchronized php-1.29.0-wmf.16/includes/specials/SpecialWatchlist.php: reverts commit SpecialWatchlist.php 0d675d29d925bda7b42958f99b5e3cac3f78c8a3 (duration: 00m 43s) [14:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:27] dcausse: lets hold the Translate change until above is resolved [14:23:30] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:37] hashar: sure [14:23:44] RECOVERY - MariaDB Slave SQL: s1 on db1055 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:23:44] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:44] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:50] _joe_: jynus Special:Watchlist reverted [14:24:25] PROBLEM - MariaDB Slave IO: s1 on db1089 is CRITICAL: CRITICAL slave_io_state could not connect [14:24:30] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:24:30] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:24:40] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:24:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:25:30] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:26:25] RECOVERY - MariaDB Slave SQL: s1 on db1080 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:26:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:26:40] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:26:44] PROBLEM - MariaDB Slave IO: s1 on db1083 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:27:04] !log Temporary hack for T160886 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention) [14:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:10] T160886: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886 [14:27:24] grr copy paste failure [14:27:25] RECOVERY - MariaDB Slave IO: s1 on db1089 is OK: OK slave_io_state Slave_IO_Running: Yes [14:27:35] RECOVERY - MariaDB Slave IO: s1 on db1083 is OK: OK slave_io_state Slave_IO_Running: Yes [14:27:40] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:28:01] !log (Correct one) Temporary hack for T160888 - moved /srv/mw-log/archive/api.log-20170224.gz to /srv/mw-log/archive/api_log_backup_elukey/ to avoid rsync timeouts to stat1002 (the file is big and close to being deleted for retention) [14:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:07] T160888: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888 [14:30:03] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114426 (10elukey) So after my tests we should see files syncing correctly during the next 
couple of days, and this will unblock... [14:30:16] (03PS3) 10Alexandros Kosiaris: redis: Increase the hard limit for slave output buffer [puppet] - 10https://gerrit.wikimedia.org/r/343027 (https://phabricator.wikimedia.org/T159850) [14:30:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] redis: Increase the hard limit for slave output buffer [puppet] - 10https://gerrit.wikimedia.org/r/343027 (https://phabricator.wikimedia.org/T159850) (owner: 10Alexandros Kosiaris) [14:30:49] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3114439 (10elukey) The prev comment is of course not related, sorry. [14:31:30] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:31:54] PROBLEM - MariaDB Slave IO: s1 on db1080 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:32:23] !log disable puppet on all rdb* nodes to shepherd https://gerrit.wikimedia.org/r/343027 into production. T159850 [14:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:30] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [14:32:44] RECOVERY - MariaDB Slave IO: s1 on db1080 is OK: OK slave_io_state Slave_IO_Running: Yes [14:32:44] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:04] can we please stop all deploys until we figure out what's causing this? [14:33:08] akosiaris: ^^ [14:33:13] ok [14:33:18] halting [14:33:27] elukey too and whoever else is deploying [14:33:30] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:33:40] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:34:34] PROBLEM - MariaDB Slave IO: s1 on db1089 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:35] RECOVERY - MariaDB Slave IO: s1 on db1089 is OK: OK slave_io_state Slave_IO_Running: Yes [14:35:40] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:35:42] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:42] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:42] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:50] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:25] !log Disabled [[Special:AllPages]] on all wikis, making it output a blank page instead. ( https://gerrit.wikimedia.org/r/#/c/343647/ ) [14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:40] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:36:40] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:40] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:53] !log hashar@tin Synchronized php-1.29.0-wmf.16/includes/specials/SpecialAllPages.php: Disable SpecialAllPages on all wikis.
Temporary workaround (duration: 01m 08s) [14:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:40] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:37:40] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:37:40] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:37:40] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:37:40] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:50] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:38:40] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:38:40] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:38:40] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:40:35] zeljkof: dcausse: so in short the rest of swat is on hold while whatever issue we have is being resolved :/ [14:40:48] I reverted the SpecialWatchlist change for good measure [14:40:57] did some nasty hack to SpecialAllPages [14:41:06] hashar: do we know what the problem is? [14:41:09] and the Translate extension bump is merged but not deployed [14:41:13] hashar: what should I do? is it ok to leave Translate unsynced on tin or should I revert? [14:41:16] see private channel [14:41:16] did you revert all commits from swat? [14:41:22] dcausse: yeah it is ok [14:41:27] ok [14:41:32] we will sync it once the cluster has recovered [14:41:37] ok [14:42:21] guess the whole idea is to not touch anything and try to figure out what happened [14:42:45] ok, going to log out from tin then :) [14:44:09] (PS2) Dereckson: DNS for khw. and kbp.
projects [dns] - 10https://gerrit.wikimedia.org/r/343572 (https://phabricator.wikimedia.org/T160865) [14:49:48] (03PS2) 10DCausse: [es5 upgrade] step 3: depool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) [14:50:53] (03PS1) 10Ema: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/343655 [14:52:28] 06Operations, 10Wikidata, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868#3114483 (10Dereckson) [14:54:00] (03PS2) 10DCausse: [es5 upgrade] step 4: repool eqiad and restore normal operations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342034 (https://phabricator.wikimedia.org/T157479) [14:55:25] (03PS1) 10ArielGlenn: fix typo on en wiki dump config [puppet] - 10https://gerrit.wikimedia.org/r/343656 [15:00:32] 06Operations, 10Wikidata, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868#3114521 (10Dereckson) Once DNS change is merged, this task should be moved to ''Blocked on development''. When that one + MediaWiki support for the language will be r... [15:03:50] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:05:45] dcausse: zeljkof: I am restoring the swat patches [15:06:26] hashar: ok, I'm here, let me know when I can scap Translate [15:06:42] restoring the watchlist thing first [15:06:59] o/ [15:07:09] hashar: let me know if there is anything I should do [15:08:44] !log hashar@tin Synchronized php-1.29.0-wmf.16/includes/specials/SpecialWatchlist.php: Restoring Watchlist: Fix form and preference overriding https://gerrit.wikimedia.org/r/#/c/343433/ (duration: 00m 51s) [15:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:48] Operations, Analytics, WMDE-Analytics-Engineering, User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113734 (Nuria) We have to fix this at this time but ... have we thought to publish this info via kafka instead of udp2log? medi... [15:13:50] dcausse: still around? I am going to push the Translate version bump [15:13:59] hashar: ok [15:14:27] I'll move my other patches to the next swat window [15:14:57] syncing [15:15:05] yeah bad luck :( [15:15:10] (CR) ArielGlenn: [C: 2] fix typo on en wiki dump config [puppet] - https://gerrit.wikimedia.org/r/343656 (owner: ArielGlenn) [15:15:48] !log hashar@tin Synchronized php-1.29.0-wmf.16/extensions/Translate: ElasticTTM: set the index when deleting docs (duration: 00m 53s) [15:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] zeljkof: looks good now :) [15:17:07] hashar: did swat cause the problem? [15:17:17] (CR) Hashar: [C: 1] [es5 upgrade] step 3: depool eqiad for writes [mediawiki-config] - https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) (owner: DCausse) [15:17:38] hashar: order matters in this patch [15:17:58] dcausse: well, guess you can handle it?
[15:17:58] I can swat [15:18:02] sure :) [15:18:10] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:18:13] I am sticking around though [15:19:21] RECOVERY - Disk space on prometheus1004 is OK: DISK OK [15:19:23] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:20:35] (03PS3) 10Ottomata: statistics/cruncher: Add reportupdater job for edit-beta-features [puppet] - 10https://gerrit.wikimedia.org/r/343246 (owner: 10Catrope) [15:22:36] (03Merged) 10jenkins-bot: [es5 upgrade] step 3: depool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:22:48] (03CR) 10jenkins-bot: [es5 upgrade] step 3: depool eqiad for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342033 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [15:23:35] (03PS1) 10Cmjohnson: Adding dns entries for new ms-be servers T160640 [dns] - 10https://gerrit.wikimedia.org/r/343660 [15:24:00] PROBLEM - Check systemd state on labsdb1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:24:10] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[pg_basebackup-labsdb1007.eqiad.wmnet] [15:24:19] labsdb1006 was me [15:24:25] timeouts I was fixing [15:24:34] but I got a bit distracted [15:24:49] I will fix them in a few minute's time [15:25:35] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for new ms-be servers T160640 [dns] - 10https://gerrit.wikimedia.org/r/343660 (owner: 10Cmjohnson) [15:27:10] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 06DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3114589 (10Ottomata) @Cmjohnson can we take care of this this week? [15:28:20] (03CR) 10Ottomata: [V: 032 C: 032] statistics/cruncher: Add reportupdater job for edit-beta-features [puppet] - 10https://gerrit.wikimedia.org/r/343246 (owner: 10Catrope) [15:31:50] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:34:38] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: T157479 [es5 upgrade] step 3: depool eqiad for writes (1/3) (duration: 00m 45s) [15:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:46] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [15:36:42] dcausse: all going fine? [15:36:49] hashar: yes [15:37:03] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T157479 [es5 upgrade] step 3: depool eqiad for writes (2/3) (duration: 00m 46s) [15:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:38:14] can we merge https://gerrit.wikimedia.org/r/#/c/343635/2 now too? 
[15:39:04] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: T157479 [es5 upgrade] step 3: depool eqiad for writes (3/3) (duration: 00m 41s) [15:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:24] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114646 (10Addshore) >>! In T160888#3114531, @Nuria wrote: > We have to fix this at this time but ... have we though to publish t... [15:40:47] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: Revert: T157479 [es5 upgrade] step 3: depool eqiad for writes (3/3) (duration: 00m 42s) [15:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:52] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [15:41:14] hashar: help :) [15:41:52] dcausse: what's the problem? [15:41:54] after my last scap I started to see a bunch of warnings in fatalmonitor [15:42:07] I did git revert HASH [15:42:13] and scap the faulty file [15:42:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:42:40] but that didn't fix it? [15:42:43] yes [15:42:57] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:43:06] completionsuggester.php line 388 and 389? 
[15:43:12] yes [15:44:59] dcausse: it looks ok to me [15:45:04] ok [15:45:12] but now I have a commit on tin [15:45:18] the errors stopped [15:45:27] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 603 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3815420 keys, up 2 hours 8 minutes - replication_delay is 603 [15:45:31] (03CR) 10Eevans: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342903 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [15:45:52] dcausse: submit the same change to gerrit? [15:45:57] (03PS4) 10Eevans: Enable encrypted client connections in RESTBase production [puppet] - 10https://gerrit.wikimedia.org/r/342903 (https://phabricator.wikimedia.org/T111113) [15:46:07] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:46:14] can I submit from tin directly? [15:46:17] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3815096 keys, up 140 days 7 hours - replication_delay is 650 [15:46:17] yes [15:46:44] dcausse: you have to use https authentication [15:46:48] I suppose I have to scap the 2 other files that were reverted even if I want to keep the changes? [15:46:51] or I can push it for you since I'm already set up [15:47:12] dcausse: if you don't then they will get reverted with the next full sync [15:47:33] hm.. cleaner to scap them now... [15:47:36] best to submit the correct change to gerrit with just the bad parts reverted [15:47:42] ah [15:47:53] then merge and pull that and sync [15:48:17] that's what I'd do anyway [15:48:18] well... 
I'm not going to develop on tin, let me scap the reverted file, I'll start a fresh patch [15:48:34] right that's what I meant [15:48:37] not dev on tin [15:49:05] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3114703 (10Dzahn) [15:49:33] I usually do `git reset --hard HEAD^` to revert the most recent commit on tin staging [15:49:35] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: Revert: T157479 [es5 upgrade] step 3: depool eqiad for writes (2/3) (duration: 00m 42s) [15:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:40] T157479: Put together a production migration plan for ES 2 -> ES 5 - https://phabricator.wikimedia.org/T157479 [15:49:42] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2204382 (10Dzahn) labsdb1006 is down (thanks jynus!) and helium is also down (thanks akosiaris!). current count: **1** (netmon1001) [15:49:43] rather than git revert [15:50:13] twentyafterfour: ah makes sense... [15:50:34] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: Revert: T157479 [es5 upgrade] step 3: depool eqiad for writes (1/3) (duration: 00m 42s) [15:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:51] twentyafterfour: if you're all set could you submit the patch on tin to gerrit? [15:51:02] dcausse: sure [15:51:12] should be Revert "[es5 upgrade] step 3: depool eqiad for writes" [15:51:42] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114712 (10Ottomata) Oh! Is all this API data from mw-log udp2log already in Kafka then? If so, we can use kafkatee now to writ... 
[15:52:09] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3114714 (10Dzahn) I see that labs1005/1006/1007 are all either re-installed or down. They don't show up as precise anymore when checking with salt.... [15:53:27] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3806426 keys, up 2 hours 16 minutes - replication_delay is 0 [15:54:17] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3806172 keys, up 140 days 7 hours - replication_delay is 0 [15:54:32] 06Operations, 10Analytics, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114716 (10Addshore) >>! In T160888#3114712, @Ottomata wrote: > Oh! Is all this API data from mw-log udp2log already in Kafka th... [15:54:36] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3114717 (10jcrespo) Dzahn- the "reinstall as jessie" part is done, but the setup of the passive replica is not 100% complete. It will take one commit... [15:55:51] (03PS1) 1020after4: Revert "[es5 upgrade] step 3: depool eqiad for writes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343663 [15:56:09] dcausse: https://gerrit.wikimedia.org/r/343663 [15:56:33] twentyafterfour: thanks! [15:56:53] so I just have to +2, should I git fetch/rebase on tin or is it useless? 
[15:57:29] it should be the identical commit so no need to fetch / rebase on tin afaik [15:57:40] ok [15:57:41] it's a noop so you can if you wanna be sure [15:57:58] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3114742 (10Dzahn) Got it, and thank you very much. [15:58:01] (03CR) 10DCausse: [C: 032] "sent from tin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343663 (owner: 1020after4) [15:58:27] papaul: ps1-c1-codfw [15:58:35] papaul: looks like it is unpingable, known? [15:59:09] (03Merged) 10jenkins-bot: Revert "[es5 upgrade] step 3: depool eqiad for writes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343663 (owner: 1020after4) [15:59:17] (03CR) 10jenkins-bot: Revert "[es5 upgrade] step 3: depool eqiad for writes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343663 (owner: 1020after4) [15:59:39] 06Operations, 06Analytics-Kanban, 06WMDE-Analytics-Engineering, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3114753 (10Nuria) [15:59:51] !log Special:AllPages being blank has a public task: https://phabricator.wikimedia.org/T160916 [15:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:17] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Read timed out. 
(read timeout=5) [16:09:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:11:17] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3806564 keys, up 140 days 7 hours - replication_delay is 653 [16:13:10] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3114820 (10Papaul) DIMM replacement complete, system is back up online. [16:13:40] papaul: thanks! [16:17:06] (03PS1) 10DCausse: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) [16:17:43] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3114852 (10elukey) Removed downtime on mw2256, let's see if it holds up without rebooting for a couple of days. Last step before closing - set mw2256 as "active" via co... 
[16:18:56] (03PS2) 10DCausse: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) [16:19:27] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 624 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3805114 keys, up 2 hours 42 minutes - replication_delay is 624 [16:19:54] (03Abandoned) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [16:23:27] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3804004 keys, up 2 hours 46 minutes - replication_delay is 0 [16:23:28] elukey: no problem [16:25:04] (03CR) 10EBernhardson: [C: 031] "looks right, all reads/writes to codfw and completion suggester disabled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) (owner: 10DCausse) [16:25:17] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3803675 keys, up 140 days 8 hours - replication_delay is 0 [16:25:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [16:26:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [16:29:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [16:33:16] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3114901 (10Addshore) [16:33:43] 06Operations, 10LDAP-Access-Requests, 
06WMDE-Analytics-Engineering, 10Wikidata: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3114913 (10Addshore) I believe this needs confirmation from @Abraham or perhaps @Tobi_WMDE_SW before it can move forward? [16:33:50] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3114916 (10Addshore) [16:35:20] 06Operations, 10ops-codfw, 10hardware-requests: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3114925 (10Papaul) disk wipe in progress. [16:37:28] (03PS4) 10Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) [16:44:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [16:48:00] !log restbase deploying e4c327b0 [16:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:10] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [16:48:10] RECOVERY - Restbase root url on restbase-dev1001 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.009 second response time [16:50:50] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:44] gerrit why you so slow? [16:52:47] Hmm....not just gerrit? 
[16:53:17] Meh, must be nothing [16:53:23] RainbowSprinkles works for me [16:53:23] Tubes were clogged I guess [16:53:42] Probably just Comcast being intermittently shitty [16:53:45] * RainbowSprinkles shrugs, moves on [16:54:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T1700). Please do the needful. [17:01:16] 06Operations, 10Ops-Access-Requests: Requesting access to perf-roots for gilles - https://phabricator.wikimedia.org/T160736#3109245 (10fgiunchedi) Discussion on today's ops meeting result: approved [17:07:38] !log gehel@tin Started deploy [wdqs/wdqs@e9e7c95]: (no justification provided) [17:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:02] (03PS1) 10Alexandros Kosiaris: Change the redis slave client_output_buffer_limit [puppet] - 10https://gerrit.wikimedia.org/r/343669 (https://phabricator.wikimedia.org/T159850) [17:08:51] (03PS1) 10Jcrespo: Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) [17:09:19] !log gehel@tin Finished deploy [wdqs/wdqs@e9e7c95]: (no justification provided) (duration: 01m 41s) [17:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:48] SMalyshev: wdqs deployment completed, tests are green [17:09:59] gehel: cool, thanks! 
[17:13:16] (03PS1) 10ArielGlenn: fix another casualty of the breakout of default config to a file [dumps] - 10https://gerrit.wikimedia.org/r/343671 [17:14:13] (03CR) 10Jcrespo: [C: 04-1] "The master's hiera change is ok, but the master parameter on the slave is wrong: https://puppet-compiler.wmflabs.org/5836/" [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) (owner: 10Jcrespo) [17:16:44] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.70 ms [17:18:44] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:18:49] (03PS2) 10Jcrespo: Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) [17:20:40] (03CR) 10ArielGlenn: [C: 032] fix another casualty of the breakout of default config to a file [dumps] - 10https://gerrit.wikimedia.org/r/343671 (owner: 10ArielGlenn) [17:21:08] `/go -tra [17:21:15] nope.mkv [17:21:23] (03PS2) 10Catrope: Enable RCFilters beta feature on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343436 [17:21:25] (03PS2) 10Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 [17:21:27] (03PS2) 10Catrope: Enable RCFilters beta feature on fawiki, nlwiki, ruwiki, trwiki, cswiki and Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343438 [17:21:29] (03PS2) 10Catrope: Enable RCFilters beta feature on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 [17:22:46] (03CR) 10Alexandros Kosiaris: [C: 031] Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) (owner: 10Jcrespo) [17:22:53] !log ariel@tin Started deploy [dumps/dumps@80d88cd]: fic buglet due to new default config file [17:22:55] !log ariel@tin Finished deploy 
[dumps/dumps@80d88cd]: fic buglet due to new default config file (duration: 00m 02s) [17:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:24] PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:24:41] (03CR) 10Alexandros Kosiaris: [C: 032] Change the redis slave client_output_buffer_limit [puppet] - 10https://gerrit.wikimedia.org/r/343669 (https://phabricator.wikimedia.org/T159850) (owner: 10Alexandros Kosiaris) [17:24:49] (03PS3) 10Jcrespo: Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) [17:30:07] (03PS1) 10Phuedx: pagePreviews: Enable by default on "stage 0" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) [17:30:59] (03PS4) 10Jcrespo: Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) [17:31:51] (03CR) 10Jcrespo: [C: 032] Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) (owner: 10Jcrespo) [17:32:11] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/5838/" [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) (owner: 10Jcrespo) [17:32:21] (03PS1) 10Filippo Giunchedi: site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/343677 (https://phabricator.wikimedia.org/T148408) [17:32:32] PROBLEM - Redis replication status tcp_6379 on rdb2004 is CRITICAL: CRITICAL: replication_delay is 1490031142 600 - REDIS 2.8.17 on 10.192.16.123:6379 has 1 databases (db0) with 4714154 keys, up 4 minutes 22 seconds - replication_delay is 1490031142 [17:32:51] known ^ [17:32:52] PROBLEM - Redis replication status tcp_6381 on 
rdb2004 is CRITICAL: CRITICAL: replication_delay is 1490031168 600 - REDIS 2.8.17 on 10.192.16.123:6381 has 1 databases (db0) with 4627131 keys, up 4 minutes 49 seconds - replication_delay is 1490031168 [17:33:28] (03PS2) 10Phuedx: pagePreviews: Enable by default on "stage 0" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) [17:34:32] RECOVERY - Redis replication status tcp_6379 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6379 has 1 databases (db0) with 4715012 keys, up 6 minutes 22 seconds - replication_delay is 0 [17:34:32] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1490031263 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3799998 keys, up 40 seconds - replication_delay is 1490031263 [17:34:52] RECOVERY - Redis replication status tcp_6381 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6381 has 1 databases (db0) with 4626747 keys, up 6 minutes 49 seconds - replication_delay is 0 [17:35:52] !log slow rolling restart of redis databases in codfw T159850 [17:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:58] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [17:36:06] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3115113 (10matmarex) I ran this query: P5085 to find out if any other wikis are affected – and yes, enwiki has the same problem. https://en.wikipedia.org... 
[17:36:22] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3808014 keys, up 140 days 9 hours - replication_delay is 633 [17:37:38] (03CR) 10Filippo Giunchedi: [C: 032] site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/343677 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [17:37:46] (03PS2) 10Filippo Giunchedi: site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/343677 (https://phabricator.wikimedia.org/T148408) [17:37:52] PROBLEM - Redis replication status tcp_6381 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6381 [17:38:31] (03PS2) 10Filippo Giunchedi: Consolidate Performance team's root access [puppet] - 10https://gerrit.wikimedia.org/r/343262 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:38:32] PROBLEM - Redis replication status tcp_6380 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6380 [17:38:37] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/5839/" [puppet] - 10https://gerrit.wikimedia.org/r/343262 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:39:22] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] site: add prometheus200[34] [puppet] - 10https://gerrit.wikimedia.org/r/343677 (https://phabricator.wikimedia.org/T148408) (owner: 10Filippo Giunchedi) [17:39:52] RECOVERY - Redis replication status tcp_6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6381 has 1 databases (db0) with 3796381 keys, up 6 minutes 1 seconds - replication_delay is 0 [17:40:33] (03CR) 10Filippo Giunchedi: [C: 032] Consolidate Performance team's root access [puppet] - 10https://gerrit.wikimedia.org/r/343262 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:40:36] (03PS3) 10Filippo Giunchedi: Consolidate Performance team's root 
access [puppet] - 10https://gerrit.wikimedia.org/r/343262 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:42:32] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3798378 keys, up 8 minutes 40 seconds - replication_delay is 58 [17:42:32] RECOVERY - Redis replication status tcp_6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6380 has 1 databases (db0) with 3801134 keys, up 8 minutes 39 seconds - replication_delay is 0 [17:43:14] (03CR) 10Jdlrobson: [C: 04-1] "Let's explicitly define PopupsAnonsEnabledSamplingRate as 0.9 here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: 10Phuedx) [17:43:25] (03PS5) 10Jcrespo: Change osm master to be labsdb1007 on configuration [puppet] - 10https://gerrit.wikimedia.org/r/343670 (https://phabricator.wikimedia.org/T123731) [17:44:10] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3115136 (10akosiaris) redis databases on codfw are picking up the change as we speak. I 'll be monitoring and doing EQIAD tomorrow European morning. 
[17:44:32] PROBLEM - Redis replication status tcp_6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.122 on port 6379 [17:47:05] (03PS1) 10Filippo Giunchedi: Add prometheus200[34] ipv6 [dns] - 10https://gerrit.wikimedia.org/r/343680 (https://phabricator.wikimedia.org/T148408) [17:47:22] (03PS3) 10Phuedx: pagePreviews: Enable by default on "stage 0" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) [17:49:09] (03PS4) 10Filippo Giunchedi: Consolidate Performance team's root access [puppet] - 10https://gerrit.wikimedia.org/r/343262 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:49:37] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Consolidate Performance team's root access [puppet] - 10https://gerrit.wikimedia.org/r/343262 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:50:32] PROBLEM - Redis replication status tcp_6379 on rdb2004 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS 2.8.17 on 10.192.16.123:6379 has 1 databases (db0) with 4715233 keys, up 22 minutes 23 seconds - replication_delay is 647 [17:51:52] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 36.52 ms [17:52:23] 06Operations, 10Ops-Access-Requests: Requesting access to perf-roots for gilles - https://phabricator.wikimedia.org/T160736#3115170 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Done [17:52:32] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3798378 keys, up 18 minutes 40 seconds - replication_delay is 658 [17:54:32] RECOVERY - Redis replication status tcp_6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6379 has 1 databases (db0) with 8507045 keys, up 14 minutes 48 seconds - replication_delay is 0 [17:55:32] PROBLEM - Redis replication status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 
1490032523 600 - REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 11281 keys, up 4 minutes 18 seconds - replication_delay is 1490032523 [17:55:32] PROBLEM - Redis replication status tcp_6379 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 1490032523 600 - REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 4714133 keys, up 4 minutes 16 seconds - replication_delay is 1490032523 [17:55:32] RECOVERY - Redis replication status tcp_6379 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6379 has 1 databases (db0) with 8506991 keys, up 27 minutes 23 seconds - replication_delay is 0 [17:55:32] PROBLEM - Redis replication status tcp_6381 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 1490032524 600 - REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 11365 keys, up 4 minutes 18 seconds - replication_delay is 1490032524 [17:55:51] all of these are known and expected ^ [17:56:42] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:56:44] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:59:32] (03PS1) 10Subramanya Sastry: Enable diffserver on ruthenium to view visualdiff images [puppet] - 10https://gerrit.wikimedia.org/r/343682 [18:00:02] (03CR) 10Jdlrobson: pagePreviews: Enable by default on "stage 0" wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: 10Phuedx) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T1800). Please do the needful. [18:00:04] phuedx: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [18:00:07] PROBLEM - NTP on elastic2020 is CRITICAL: NTP CRITICAL: Offset unknown [18:00:15] (03CR) 10Jdlrobson: [C: 04-1] pagePreviews: Enable by default on "stage 0" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: 10Phuedx) [18:00:21] I can SWAT today [18:00:27] RECOVERY - Redis replication status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 3799368 keys, up 9 minutes 21 seconds - replication_delay is 0 [18:00:47] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:00:47] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:00:58] (03CR) 10Subramanya Sastry: [C: 04-1] "Names used in role and site.pp and file name are inconsistent. Will fix." [puppet] - 10https://gerrit.wikimedia.org/r/343682 (owner: 10Subramanya Sastry) [18:01:27] PROBLEM - Redis replication status tcp_6381 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.0.119 on port 6381 [18:01:47] PROBLEM - Redis replication status tcp_6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1490032900 600 - REDIS 2.8.17 on 10.192.0.119:6379 has 1 databases (db0) with 4714668 keys, up 4 minutes 31 seconds - replication_delay is 1490032900 [18:02:40] (03CR) 10jerkins-bot: [V: 04-1] Enable diffserver on ruthenium to view visualdiff images [puppet] - 10https://gerrit.wikimedia.org/r/343682 (owner: 10Subramanya Sastry) [18:03:09] (03CR) 10Jdlrobson: [C: 031] "I talked with Sam about this and to be defensive during roll out we'd like to keep this as 0." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: 10Phuedx) [18:03:50] thcipriani: o/ [18:04:03] phuedx: hello :) [18:04:13] hey! how're you? 
[18:04:17] (03PS2) 10Subramanya Sastry: Enable diffserver on ruthenium to view visualdiff images [puppet] - 10https://gerrit.wikimedia.org/r/343682 [18:04:47] RECOVERY - Redis replication status tcp_6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6379 has 1 databases (db0) with 8496356 keys, up 7 minutes 32 seconds - replication_delay is 0 [18:04:47] (03PS1) 10ArielGlenn: stomp on a a few more default config stragglers [dumps] - 10https://gerrit.wikimedia.org/r/343683 [18:04:49] (03PS1) 10ArielGlenn: add type for the flagged revs tables [dumps] - 10https://gerrit.wikimedia.org/r/343684 [18:05:02] (03PS1) 10Thcipriani: Revert "Revert "Turn off patrolling for FlaggedRevs in bswiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343685 [18:05:07] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.24 ms [18:05:21] phuedx: doing well, just adding a quick patch to swat too :P [18:05:45] srdjan_m: I made https://gerrit.wikimedia.org/r/#/c/343685/ for today [18:05:57] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:06:09] thcipriani: thanks [18:06:27] RECOVERY - Redis replication status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 8494054 keys, up 15 minutes 19 seconds - replication_delay is 0 [18:06:27] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3796183 keys, up 32 minutes 43 seconds - replication_delay is 13 [18:06:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: 10Phuedx) [18:06:58] thcipriani: could you whack it on mwdebugfoo? 
[18:07:27] RECOVERY - Redis replication status tcp_6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6381 has 1 databases (db0) with 3793864 keys, up 10 minutes 18 seconds - replication_delay is 0 [18:07:58] (03Abandoned) 10Aaron Schulz: Switch to LoadMonitorMySQL instead of the generic one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz) [18:08:19] phuedx: heh, I will fetch it down to mwdebug1002 once it merges if that's what you mean :) [18:08:27] (03Merged) 10jenkins-bot: pagePreviews: Enable by default on "stage 0" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: 10Phuedx) [18:08:32] thcipriani: you know what i mean [18:08:37] RECOVERY - Redis replication status tcp_6381 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 3793814 keys, up 17 minutes 24 seconds - replication_delay is 0 [18:09:16] (03PS2) 10Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) [18:09:18] i meant "pull" ;) [18:09:34] phuedx: your change is now pulled over on mwdebug1002, check please [18:10:18] thcipriani: got a few folk taking a look, ta [18:10:26] okie doke [18:11:15] (03CR) 10Subramanya Sastry: "I think setting up hieradata for the images directory can be a followup patch." [puppet] - 10https://gerrit.wikimedia.org/r/343682 (owner: 10Subramanya Sastry) [18:13:03] (03CR) 10Aaron Schulz: "Can be deployed anytime IMO." 
[puppet] - https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: Aaron Schulz) [18:14:02] (PS1) BryanDavis: openstack::clientlib: require ::openstack::repo [puppet] - https://gerrit.wikimedia.org/r/343688 [18:16:01] 18:15:26 olliv: any chance you could come out of the meeting ;) [18:16:01] 18:15:43 i can enable/disable popups as an anon on ru and it wiki [18:16:01] 18:15:47 checking the logged in flow now [18:16:07] ^ testing [18:16:27] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 613 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3796183 keys, up 42 minutes 43 seconds - replication_delay is 613 [18:17:06] :) ping me when you're happy that all's working and I'll push everywhere. [18:17:14] (CR) ArielGlenn: [C: 2] stomp on a a few more default config stragglers [dumps] - https://gerrit.wikimedia.org/r/343683 (owner: ArielGlenn) [18:17:22] phuedx: tested logged-in settings, looks good [18:17:35] (CR) ArielGlenn: [C: 2] add type for the flagged revs tables [dumps] - https://gerrit.wikimedia.org/r/343684 (owner: ArielGlenn) [18:17:51] olliv: special:preferences stuff doesn't appear to be fully translated? [18:18:05] thcipriani: trying to be as thorough as possible [18:18:15] thanks for the patience [18:18:25] !log ariel@tin Started deploy [dumps/dumps@91d3215]: more default config fixes, flagged rev table config fix [18:18:27] !log ariel@tin Finished deploy [dumps/dumps@91d3215]: more default config fixes, flagged rev table config fix (duration: 00m 02s) [18:18:28] no problem, I appreciate the testing.
[18:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:20] 18:15:57 logged out setting look good [18:20:03] (CR) Filippo Giunchedi: [C: 2] Add prometheus200[34] ipv6 [dns] - https://gerrit.wikimedia.org/r/343680 (https://phabricator.wikimedia.org/T148408) (owner: Filippo Giunchedi) [18:20:07] (PS2) Filippo Giunchedi: Add prometheus200[34] ipv6 [dns] - https://gerrit.wikimedia.org/r/343680 (https://phabricator.wikimedia.org/T148408) [18:20:18] (CR) Filippo Giunchedi: [V: 2 C: 2] Add prometheus200[34] ipv6 [dns] - https://gerrit.wikimedia.org/r/343680 (https://phabricator.wikimedia.org/T148408) (owner: Filippo Giunchedi) [18:22:02] Operations, Analytics, Analytics-Cluster, Research-and-Data, Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3115297 (Halfak) a:ellery>Halfak [18:22:45] olliv: go/no go? [18:22:47] (PS2) Rush: labstore: keep archival copy of dynamic export.d contents [puppet] - https://gerrit.wikimedia.org/r/343623 [18:23:01] the translations aren't complete yet which looks a little wonky [18:23:01] (CR) Filippo Giunchedi: [C: 1] "Is there a preview of metrics that will be pushed ?"
[puppet] - https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: Aaron Schulz) [18:23:14] but the flows are working for me [18:23:18] (CR) Jcrespo: "Will do it tomorrow- addit to puppetswat or I am sure I will forget O:-)" [puppet] - https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: Aaron Schulz) [18:23:20] (CR) Andrew Bogott: [C: 2] openstack::clientlib: require ::openstack::repo [puppet] - https://gerrit.wikimedia.org/r/343688 (owner: BryanDavis) [18:23:45] (CR) Madhuvishy: [C: 2] labstore: apply exportd monitoring to secondary role [puppet] - https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) (owner: Rush) [18:24:07] PROBLEM - puppet last run on mc1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:24:29] phuedx: checking 1-2 more small things, sorry for the delay [18:24:36] ^ thcipriani [18:24:44] * thcipriani nods [18:25:27] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3795737 keys, up 51 minutes 43 seconds - replication_delay is 9 [18:26:11] phuedx, thcipriani - ok, it's a go [18:26:14] (CR) Madhuvishy: "One question, otherwise +1!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/343623 (owner: Rush) [18:26:31] olliv: phuedx okie doke, going live everywhere [18:28:15] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:343675|pagePreviews: Enable by default on "stage 0" wikis]] T136602 (duration: 00m 42s) [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:21] T136602: Graduate the Page Previews beta feature on stage 0 wikis - https://phabricator.wikimedia.org/T136602 [18:28:22] ^ phuedx olliv live now [18:29:15] (PS1) Filippo Giunchedi: hieradata: add prometheus200[34] [puppet] - https://gerrit.wikimedia.org/r/343693 (https://phabricator.wikimedia.org/T148408) [18:29:46] (PS2) Thcipriani: Revert "Revert "Turn off patrolling for FlaggedRevs in bswiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/343685 [18:30:07] RECOVERY - NTP on elastic2020 is OK: NTP OK: Offset -0.0009852051735 secs [18:30:11] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/343685 (owner: Thcipriani) [18:31:03] thanks thcipriani [18:31:06] (CR) jenkins-bot: pagePreviews: Enable by default on "stage 0" wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/343675 (https://phabricator.wikimedia.org/T136602) (owner: Phuedx) [18:31:20] phuedx: thanks for the thorough testing and the communication, appreciated.
[18:31:41] thcipriani: this is a big change for us and we wanted to get it as right as we could, thanks for bearing with us [18:31:42] (Merged) jenkins-bot: Revert "Revert "Turn off patrolling for FlaggedRevs in bswiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/343685 (owner: Thcipriani) [18:31:45] <3 [18:31:51] (CR) jenkins-bot: Revert "Revert "Turn off patrolling for FlaggedRevs in bswiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/343685 (owner: Thcipriani) [18:31:55] no problem at all :) [18:33:41] DatGuy: srdjan_m I've fetched down https://gerrit.wikimedia.org/r/#/c/343685/2 to mwdebug1002, is there anything you'd like to check there? [18:33:56] godog: Do you think we could do the symlink stuff for scap today (https://gerrit.wikimedia.org/r/#/c/342788/) or should I put it on puppetswat for tomorrow/ [18:33:57] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:33:57] *? [18:33:59] thcipriani: i'm already checking, hold on just a sec [18:34:09] thank you [18:35:08] RainbowSprinkles: I can't now :( tomorrow puppetswat sounds good tho [18:35:18] Okie dokie, will do [18:35:27] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3795568 keys, up 6 minutes 16 seconds - replication_delay is 0 [18:36:50] thcipriani: everything looks good.
[18:37:02] srdjan_m: thank you for checking, going live now [18:38:38] (Abandoned) BBlack: linting: remove config-geo-test [dns] - https://gerrit.wikimedia.org/r/341573 (https://phabricator.wikimedia.org/T156100) (owner: BBlack) [18:39:39] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:343685|Revert "Revert "Turn off patrolling for FlaggedRevs in bswiki""]] T158662 (duration: 00m 44s) [18:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:45] T158662: Turn off patrolling on bs.wiki and configure FlaggedRevs - https://phabricator.wikimedia.org/T158662 [18:39:51] ^ srdjan_m should be live everywhere now [18:40:17] thank you for checking in on that change :) [18:40:23] (PS2) BBlack: add first discovery records + mock lint data [dns] - https://gerrit.wikimedia.org/r/341574 (https://phabricator.wikimedia.org/T156100) [18:41:01] (PS1) Jforrester: Enable wgCiteResponsiveReferences on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343694 (https://phabricator.wikimedia.org/T160609) [18:41:02] (PS1) Jforrester: Enable wgCiteResponsiveReferences on lawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343695 (https://phabricator.wikimedia.org/T160844) [18:41:05] (PS1) Jforrester: Enable wgCiteResponsiveReferences on nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343696 (https://phabricator.wikimedia.org/T160362) [18:41:30] (PS1) Reedy: Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) [18:41:38] (CR) BBlack: [C: 2] add first discovery records + mock lint data [dns] - https://gerrit.wikimedia.org/r/341574 (https://phabricator.wikimedia.org/T156100) (owner: BBlack) [18:43:38] (PS3) Rush: labstore: keep archival copy of dynamic export.d contents [puppet] - https://gerrit.wikimedia.org/r/343623 [18:43:45] (PS2) Rush: nova:
nova-fullstack keep instance count for debugging [puppet] - https://gerrit.wikimedia.org/r/343636 [18:44:06] (PS4) Rush: labstore: keep archival copy of dynamic export.d contents [puppet] - https://gerrit.wikimedia.org/r/343623 [18:44:08] (PS2) Rush: labstore: apply exportd monitoring to secondary role [puppet] - https://gerrit.wikimedia.org/r/343624 (https://phabricator.wikimedia.org/T160838) [18:46:01] (CR) Harej: [C: 1] Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) (owner: Reedy) [18:46:28] (PS3) BBlack: dns: add 10/8 to geo map [dns] - https://gerrit.wikimedia.org/r/341615 [18:46:58] (CR) BBlack: [C: 2] dns: add 10/8 to geo map [dns] - https://gerrit.wikimedia.org/r/341615 (owner: BBlack) [18:47:11] (PS4) BBlack: authdns: add 10/8 to geo map [puppet] - https://gerrit.wikimedia.org/r/341616 [18:48:44] (CR) BBlack: [V: 2 C: 2] authdns: add 10/8 to geo map [puppet] - https://gerrit.wikimedia.org/r/341616 (owner: BBlack) [18:49:20] (CR) Rush: labstore: keep archival copy of dynamic export.d contents (1 comment) [puppet] - https://gerrit.wikimedia.org/r/343623 (owner: Rush) [18:50:10] (PS3) Rush: nova: nova-fullstack keep instance count for debugging [puppet] - https://gerrit.wikimedia.org/r/343636 [18:51:33] (PS5) Rush: labstore: keep archival copy of dynamic export.d contents [puppet] - https://gerrit.wikimedia.org/r/343623 [18:52:07] RECOVERY - puppet last run on mc1016 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:57:22] (CR) Madhuvishy: [C: 1] labstore: keep archival copy of dynamic export.d contents [puppet] - https://gerrit.wikimedia.org/r/343623 (owner: Rush) [18:57:44] (PS1) Jforrester: Enable wgCiteResponsiveReferences on enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/343700
(https://phabricator.wikimedia.org/T160933) [19:01:20] (PS1) BBlack: discovery: add rest of near-term disc services [dns] - https://gerrit.wikimedia.org/r/343704 [19:01:43] (CR) BBlack: [C: 2] discovery: add rest of near-term disc services [dns] - https://gerrit.wikimedia.org/r/343704 (owner: BBlack) [19:02:36] (PS1) BBlack: Revert "discovery: add rest of near-term disc services" [dns] - https://gerrit.wikimedia.org/r/343705 [19:02:51] (CR) BBlack: [V: 2 C: 2] Revert "discovery: add rest of near-term disc services" [dns] - https://gerrit.wikimedia.org/r/343705 (owner: BBlack) [19:02:57] Operations, ops-ulsfo, procurement: ulsfo: lvs system refresh - https://phabricator.wikimedia.org/T160936#3115592 (RobH) [19:03:17] bah. [19:03:19] fixed. [19:03:29] (it had no data yet so was placeholder anyhow) [19:13:42] (PS1) Yuvipanda: tools: Use profile::kubernetes::node for worker role [puppet] - https://gerrit.wikimedia.org/r/343708 (https://phabricator.wikimedia.org/T158452) [19:23:58] (PS1) Jforrester: Enable wgCiteResponsiveReferences on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343714 (https://phabricator.wikimedia.org/T160932) [19:25:52] (PS1) BBlack: discovery: add other service names to hieradata [puppet] - https://gerrit.wikimedia.org/r/343720 [19:28:57] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:29:09] (CR) Chad: [C: -1] "Then we can abandon?"
[puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: Paladox) [19:29:21] (CR) BBlack: [V: 2] discovery: add other service names to hieradata [puppet] - https://gerrit.wikimedia.org/r/343720 (owner: BBlack) [19:29:23] (CR) BBlack: [V: 2 C: 2] discovery: add other service names to hieradata [puppet] - https://gerrit.wikimedia.org/r/343720 (owner: BBlack) [19:31:25] jouncebot: next [19:31:25] In 0 hour(s) and 28 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T2000) [19:31:29] (PS1) BBlack: discovery: fix hyphen typo in data [puppet] - https://gerrit.wikimedia.org/r/343724 [19:31:37] (CR) BBlack: [V: 2 C: 2] discovery: fix hyphen typo in data [puppet] - https://gerrit.wikimedia.org/r/343724 (owner: BBlack) [19:32:00] Nothing for ORES today [19:32:27] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:32:37] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:33:27] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:33:37] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:35:57] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:38:43] (PS1) BBlack: Revert "Revert "discovery: add rest of near-term disc services"" [dns] - https://gerrit.wikimedia.org/r/343728 [19:38:51] (CR) BBlack: [V: 2 C: 2] Revert "Revert "discovery: add rest of near-term disc services"" [dns] - https://gerrit.wikimedia.org/r/343728 (owner: BBlack) [19:41:18] !log mobrovac@tin Started deploy [changeprop/deploy@decb6a1]: (no justification provided) [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:14] !log mobrovac@tin Finished deploy [changeprop/deploy@decb6a1]: (no justification provided) (duration: 00m 56s) [19:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:02] (PS2) Chad: WIP: Create hourly backup schedule, modeled on weekly [puppet] - https://gerrit.wikimedia.org/r/341371 [19:43:09] (CR) Chad: WIP: Create hourly backup schedule, modeled on weekly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341371 (owner: Chad) [19:45:33] (PS3) Chad: Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - https://gerrit.wikimedia.org/r/341371 [19:45:34] !log lists: disabled wikimediaro-l due to inactivity (disabling lists is easy nowadays and also revertable): fermium: sudo /usr/local/sbin/disable_list | (T146563) [19:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:41] T146563: Wikimediaro-l appears to lack list owners, and spam sent to the list owners was forwarded to a non-list owner - https://phabricator.wikimedia.org/T146563 [19:46:04] (CR) Chad: "PS2 adds the second run definition. PS3 adds the backup::set change you mentioned...I think I got that right?"
[puppet] - https://gerrit.wikimedia.org/r/341371 (owner: Chad) [19:47:47] (PS1) Jdlrobson: Restrict page images to lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/343733 (https://phabricator.wikimedia.org/T152115) [19:54:47] (Abandoned) Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: Paladox) [19:55:22] yay [19:55:41] @the fact that this could be abandoned due to better fixes [19:56:16] +10 [19:56:22] (Draft1) Paladox: Gerrit: Add a new polyGerritBaseUrl config [puppet] - https://gerrit.wikimedia.org/r/343736 [19:56:24] (PS2) Paladox: Gerrit: Add a new polyGerritBaseUrl config [puppet] - https://gerrit.wikimedia.org/r/343736 [19:57:54] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [19:59:11] (CR) Chad: "We can wait until those land (mostly to make sure it doesn't change names or something prior to then). Then we can go ahead and set this p" [puppet] - https://gerrit.wikimedia.org/r/343736 (owner: Paladox) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T2000). Please do the needful.
[20:00:16] no parsoid deploy today [20:00:26] no ores [20:01:30] mutante yeh, Im running the fix on https://gerrit-new.wmflabs.org/r/ :) [20:03:59] mutante RainbowSprinkles i've just disabled rewrites on ^^, polygerrit workin https://gerrit-new.wmflabs.org/r/?polygerrit=1 [20:04:00] the speed cherry pick works is instantly for me now :) [20:04:24] paladox: great :) [20:04:49] mutante also the neat thing about polygerrit is it has a cc and reviewer state now [20:04:54] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:04:59] so you can be cc'ed but not do any reviews :) [20:05:45] (PS18) BBlack: [POC] DNS zones to puppet repo [puppet] - https://gerrit.wikimedia.org/r/342887 [20:07:13] But some how cc isen't showing for me unless thats a NoteDB specfic feature [20:08:02] Or it's just something they're still missing :p [20:08:06] (which is likely, lots of stuff is missing still) [20:09:01] Reedy: Everything merged in wmf.16. You wanna do the honors or shall I? [20:09:10] looks up NoteDB [20:09:17] replaces reviewdb? [20:09:20] Yep [20:09:30] I'll do it :) [20:09:40] okie dokie [20:11:43] paladox: you know what would be great, if the cc/reviewer difference would also mean that it's easier for people to filter just the right email.
so not get all the gerrit notifications in inbox BUT still get the ones where people actually want their review hope for a response [20:12:15] !log bsitzmann@tin Started deploy [mobileapps/deploy@815ebb5]: Update mobileapps to c0ab01d [20:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:21] !log reedy@tin Synchronized php-1.29.0-wmf.16/includes/specials/SpecialAllPages.php: Re-enable Special:AllPages, disable redirect filter if MiserMode T160916 (duration: 00m 42s) [20:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:27] T160916: Special:AllPages disabled due to performance issues - https://phabricator.wikimedia.org/T160916 [20:12:29] currently it's often all or nothing, which either means "spam" or "no reviews without asking separately" [20:14:12] !log reedy@tin Synchronized php-1.29.0-wmf.16/includes/api/ApiQueryAllPages.php: Limit query=allpages filterredir if MiserMode T160916 (duration: 00m 42s) [20:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:35] https://en.wikipedia.org/wiki/Special:AllPages is working. Looks as expectd [20:14:38] Yep [20:19:46] !log bsitzmann@tin Finished deploy [mobileapps/deploy@815ebb5]: Update mobileapps to c0ab01d (duration: 07m 31s) [20:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:10] (PS3) Dzahn: DNS for khw. and kbp. projects [dns] - https://gerrit.wikimedia.org/r/343572 (https://phabricator.wikimedia.org/T160865) (owner: Dereckson) [20:25:23] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:26:56] (PS1) Andrew Bogott: Keystone policy: Make get/list users permissive [puppet] - https://gerrit.wikimedia.org/r/343743 [20:27:35] (CR) Dzahn: [C: 2] DNS for khw. and kbp.
projects [dns] - https://gerrit.wikimedia.org/r/343572 (https://phabricator.wikimedia.org/T160865) (owner: Dereckson) [20:28:11] (PS2) Andrew Bogott: nfs-exportd: Refresh service if script or .yaml changes. [puppet] - https://gerrit.wikimedia.org/r/343459 [20:28:29] (CR) BryanDavis: [C: 1] "Looks generally correct to me. Untested." [puppet] - https://gerrit.wikimedia.org/r/343743 (owner: Andrew Bogott) [20:29:24] (PS1) BryanDavis: labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) [20:31:42] (CR) jerkins-bot: [V: -1] [POC] DNS zones to puppet repo [puppet] - https://gerrit.wikimedia.org/r/342887 (owner: BBlack) [20:35:38] (PS19) BBlack: [POC] DNS zones to puppet repo [puppet] - https://gerrit.wikimedia.org/r/342887 [20:35:43] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:37:54] (PS1) Hashar: (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 [20:38:19] (CR) Andrew Bogott: [C: 2] Keystone policy: Make get/list users permissive [puppet] - https://gerrit.wikimedia.org/r/343743 (owner: Andrew Bogott) [20:38:47] (PS3) Andrew Bogott: nfs-exportd: Refresh service if script or .yaml changes.
[puppet] - https://gerrit.wikimedia.org/r/343459 [20:43:07] (PS1) Aklapper: Link to Code of Conduct from Phabricator's footer [puppet] - https://gerrit.wikimedia.org/r/343749 [20:47:00] !log DNS - ns2 - authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones to create new WP languages 'khw' and 'kbp' [20:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:18] (CR) jerkins-bot: [V: -1] (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 (owner: Hashar) [20:52:24] !log DNS - new Wikipedias "khw" (Khowar) and "kbp" (Kabiye) created (T160868) (T160865) ( on ns0/ns1: authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones to trigger template recreation after edit to langs.tmpl) [20:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:31] T160868: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868 [20:52:31] T160865: Create Wikipedia Khowar - https://phabricator.wikimedia.org/T160865 [20:53:23] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:53:54] Operations, Wikidata, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Khowar - https://phabricator.wikimedia.org/T160865#3113140 (Dzahn) ``` dig khw.wikipedia.org ;; QUESTION SECTION: ;khw.wikipedia.org. IN A ;; ANSWER SECTION: khw.wikipedia.org. 600... [20:54:45] Operations, Wikidata, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Kabiye - https://phabricator.wikimedia.org/T160868#3113180 (Dzahn) ``` [radon:~] $ dig kbp.wikipedia.org ;; QUESTION SECTION: ;kbp.wikipedia.org. IN A ;; ANSWER SECTION: kbp.wikipedia.org. 600 IN A 208.80.154.224...
[20:59:57] (CR) Dzahn: "technically, following the latest puppet coding guide, there should be only one role that is the role of ruthenium itself, so like "parsoi" [puppet] - https://gerrit.wikimedia.org/r/343682 (owner: Subramanya Sastry) [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T2100). [21:04:43] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [21:05:29] (CR) Dzahn: [C: 2] "compiles fine on ruthenium. diff is just nginx config as expected. http://puppet-compiler.wmflabs.org/5840/" [puppet] - https://gerrit.wikimedia.org/r/343682 (owner: Subramanya Sastry) [21:05:38] (PS3) Dzahn: Enable diffserver on ruthenium to view visualdiff images [puppet] - https://gerrit.wikimedia.org/r/343682 (owner: Subramanya Sastry) [21:09:50] (CR) Hashar: "So bundle is the ruby world utility to download dependencies from the internet and then execute a program under that environement / set of" [puppet] - https://gerrit.wikimedia.org/r/343747 (owner: Hashar) [21:12:09] (PS1) Ottomata: Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) [21:15:18] (CR) Dzahn: "@Paladox, could you add me when the 2 upstream things have merged?" [puppet] - https://gerrit.wikimedia.org/r/343736 (owner: Paladox) [21:15:52] (CR) Chad: "This API work for the namespace maps is pretty simple and not resource intensive.
You could likely do it more often if you're wanting more" [puppet] - https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: Ottomata) [21:16:02] (PS2) Dzahn: Enable mcrypt extension on CI slaves [puppet] - https://gerrit.wikimedia.org/r/343223 (owner: Ejegg) [21:16:31] (CR) Dzahn: [C: 2] "thanks hashar, per "already cherry-picked"" [puppet] - https://gerrit.wikimedia.org/r/343223 (owner: Ejegg) [21:18:27] mutante: danke [21:18:31] (CR) Ottomata: "Aye, k. We'll ask joal, but I think this is just so that there is something to refer to for a monthly job." [puppet] - https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: Ottomata) [21:19:00] (CR) Dzahn: [C: -1] "@Paladox: per "Hold this until gerrit is switched to systemd..doubt this hack would be needed any longer" let's not worry about it and jus" [debs/gerrit] - https://gerrit.wikimedia.org/r/342082 (owner: Paladox) [21:20:39] (CR) Dzahn: [C: -1] "per "there is no need for CI and it's not in jessie and we won't upload it soon"" [puppet] - https://gerrit.wikimedia.org/r/343209 (owner: Paladox) [21:21:32] (CR) Ejegg: "Thanks dzahn and hashar!" [puppet] - https://gerrit.wikimedia.org/r/343223 (owner: Ejegg) [21:21:37] (CR) Dzahn: "maybe change this to just leave a prominent message that it's not maintained anymore" [puppet] - https://gerrit.wikimedia.org/r/340164 (owner: Dduvall) [21:21:48] mutante yeh NoteDB is replacing ReviewDB [21:22:00] (CR) Dzahn: [V: 2 C: 2] Enable mcrypt extension on CI slaves [puppet] - https://gerrit.wikimedia.org/r/343223 (owner: Ejegg) [21:23:09] So it stores things in a repo.
[21:23:11] paladox: ok, yep [21:23:14] Which is bad as it can be slow [21:23:21] paladox: on https://gerrit.wikimedia.org/r/#/c/342276/6/modules/role/manifests/phabricator/main.pp see lines 146 / 147 on the right [21:23:41] _read / _write but they both get the "read" value from Hiera [21:24:03] (PS7) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276 [21:24:06] (CR) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342276 (owner: Paladox) [21:24:14] (PS8) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276 [21:25:11] Operations, Graphite, Performance-Team: Increase Grafana user rights for Performance team members - https://phabricator.wikimedia.org/T160738#3116166 (Krinkle) Open>Resolved Per , Ori and I were Grafana admins already. I enabled admin rights f... [21:26:41] (CR) jerkins-bot: [V: -1] Load wiki project namespace map into HDFS weekly, sqoop mediawiki monthly [puppet] - https://gerrit.wikimedia.org/r/343753 (https://phabricator.wikimedia.org/T160083) (owner: Ottomata) [21:31:25] (PS2) Hashar: (WIP) compile authdns::config [puppet] - https://gerrit.wikimedia.org/r/343747 [21:33:45] (PS1) Aklapper: Monthly Phabricator stats email: Fix output for zero open UBN!
tasks [puppet] - https://gerrit.wikimedia.org/r/343766 (https://phabricator.wikimedia.org/T159314) [21:38:43] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [21:39:33] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#2921133 (RobH) @chasemp: Is there a specific existing server that meets this requirement to base a new spec off of? Also for 1TB is that 1TB of space post raid10? So jus... [21:42:01] Operations, ORES, Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3116226 (mobrovac) How about replicating the precaching redis instance across DCs? Would that be feasible? It seems slightly less... [21:42:39] Operations, Labs, hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3029754 (RobH) Is there a specific cpu seed we have to stick to? 24 cores without HT is dual 12 core CPUs. Anything between 2-2.6 ok? [21:43:33] Operations, Labs, hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3029672 (RobH) Is there a specific cpu seed we have to stick to? 24 cores without HT is dual 12 core CPUs. Anything between 2-2.6 ok? Disks: 100G means only need 10... [21:46:57] Operations, ORES, Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3116244 (Pchelolo) > How about replicating the precaching redis instance across DCs? Would that be feasible? It seems slightly les... [22:02:14] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:08:09] Operations, ORES, Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3116308 (mobrovac) >>! In T159615#3116244, @Pchelolo wrote: > Having the mirror-maker we can configure the ORES rule in #changepro... [22:08:12] (CR) jerkins-bot: [V: -1] Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276 (owner: Paladox) [22:09:25] (PS9) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276 [22:09:28] Operations, ORES, Revision-Scoring-As-A-Service-Backlog: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3116331 (Pchelolo) > Hm, that would activate CP for the whole ensemble of messages for the other DC. We can make the `consume_dc... [22:13:29] (CR) Andrew Bogott: [C: 2] labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [22:13:34] (PS2) Andrew Bogott: labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [22:13:41] (CR) Andrew Bogott: [C: 2] nfs-exportd: Refresh service if script or .yaml changes.
[puppet] - https://gerrit.wikimedia.org/r/343459 (owner: Andrew Bogott) [22:14:11] (PS1) Ladsgroup: service: add log-route in uwsgi config [puppet] - https://gerrit.wikimedia.org/r/343772 (https://phabricator.wikimedia.org/T149010) [22:25:31] (PS1) Ladsgroup: Enable ORES review tool in etwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343777 (https://phabricator.wikimedia.org/T159609) [22:29:36] (PS4) Dzahn: Enable diffserver on ruthenium to view visualdiff images [puppet] - https://gerrit.wikimedia.org/r/343682 (owner: Subramanya Sastry) [22:30:12] (CR) Dzahn: [V: 2 C: 2] Enable diffserver on ruthenium to view visualdiff images [puppet] - https://gerrit.wikimedia.org/r/343682 (owner: Subramanya Sastry) [22:30:30] !log ppchelko@tin Started deploy [trending-edits/deploy@5d3eb7f]: Do not purge articles that have trended T160127 [22:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:37] T160127: Re-evaluate purging strategies - https://phabricator.wikimedia.org/T160127 [22:30:42] (CR) Dzahn: "was already verified and compiled earlier" [puppet] - https://gerrit.wikimedia.org/r/343682 (owner: Subramanya Sastry) [22:31:13] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:31:59] !log ruthenium - gerrit:343682 applied - puppet: OK nginx: OK diffserver service refresh: failed @ssastry [22:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:56] subbu: hello, i merged your change, see comment above [22:33:30] mutante, \o/ thanks .. is it on ruthenium now? [22:33:39] oh, something failed there? [22:33:42] subbu: yes, and puppet is ok, nginx is ok.. just the service [22:33:48] the diffserver service [22:33:53] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/lib/systemd/system/diffserver.service] [22:33:56] ok. must be some config .. will debug that. [22:34:03] "failed dependencies" [22:34:23] maybe one more puppet run [22:34:32] trying that [22:35:06] Dependency File[/lib/systemd/system/diffserver.service] has failures: true [22:35:32] looking [22:36:51] (CR) jerkins-bot: [V: -1] Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276 (owner: Paladox) [22:38:02] Loaded: not-found (Reason: No such file or directory) [22:38:04] mutante, oh .. i think i forgot to add that systemd file :) [22:38:09] there is just no ... [22:38:11] unit file [22:38:22] right, i forgot to add that in my patch. [22:38:28] !log ppchelko@tin Finished deploy [trending-edits/deploy@5d3eb7f]: Do not purge articles that have trended T160127 (duration: 07m 57s) [22:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:34] T160127: Re-evaluate purging strategies - https://phabricator.wikimedia.org/T160127 [22:38:38] let me submit a new patch. [22:38:46] (Abandoned) BryanDavis: Install libicu52 on python & python2 base images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/337603 (https://phabricator.wikimedia.org/T157744) (owner: Zhuyifei1999) [22:38:51] yep,ok [22:42:25] mutante, so, i just create a visualdiff/initscripts/diffserver.systemd.erb file and it is added automatically or does it need hooking up somewhere?
[22:42:44] i mean visualdiff/templates/initscripts/diffserver.systemd.erb [22:44:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [22:46:04] (PS1) Catrope: Test ORES migration on ruwiki beta too [mediawiki-config] - https://gerrit.wikimedia.org/r/343781 [22:46:19] (CR) Catrope: [C: 2] Test ORES migration on ruwiki beta too [mediawiki-config] - https://gerrit.wikimedia.org/r/343781 (owner: Catrope) [22:46:23] (PS1) Subramanya Sastry: Add missing diffserver.systemd.service file [puppet] - https://gerrit.wikimedia.org/r/343782 [22:46:52] subbu: it needs a file{} section like in https://gerrit.wikimedia.org/r/#/c/343682/4/modules/visualdiff/manifests/server.pp 18/19 to install it into /etc/systemd/system/diffserver.service [22:47:26] ok. let me amend that patch [22:47:37] content => template("visualdiff/initscripts/diffserver.systemd.erb") [22:47:55] !log ppchelko@tin Started deploy [trending-edits/deploy@e4fa9b8]: Config: Set up 'trends_at' property T160127 [22:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:02] T160127: Re-evaluate purging strategies - https://phabricator.wikimedia.org/T160127 [22:48:16] mutante, it is there already .. file { "/lib/systemd/system/${instance_name}.service":\n ..... source => "puppet:///modules/visualdiff/${instance_name}.systemd.service", [22:48:28] (PS3) BryanDavis: tools: Automount useful credentials onto containers [puppet] - https://gerrit.wikimedia.org/r/327235 (owner: Yuvipanda) [22:48:30] and instance_name is diffserver.
[22:49:04] (Merged) jenkins-bot: Test ORES migration on ruwiki beta too [mediawiki-config] - https://gerrit.wikimedia.org/r/343781 (owner: Catrope) [22:49:14] (CR) jenkins-bot: Test ORES migration on ruwiki beta too [mediawiki-config] - https://gerrit.wikimedia.org/r/343781 (owner: Catrope) [22:49:29] so, i was just missing the file for the source there .. which presumably comes from modules/visualdiff/files/diffserver.systemd.service ? [22:50:10] subbu: somehow it does not get realized on ruthenium [22:50:35] mutante, i uploaded a new patch. that should fix it. [22:51:17] subbu: yes, right [22:51:23] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.297 second response time [22:52:12] (PS3) Andrew Bogott: labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [22:52:14] yes, except i would have used /etc/systemd/system instead of /lib/systemd/system but that's minor, some are there and some are symlinks from one to another [22:52:18] (CR) jerkins-bot: [V: -1] tools: Automount useful credentials onto containers [puppet] - https://gerrit.wikimedia.org/r/327235 (owner: Yuvipanda) [22:52:23] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.309 second response time [22:52:51] next we could use "base::service_unit" to handle those details [22:53:07] (PS2) Yuvipanda: tools: Use profile::kubernetes::node for worker role [puppet] - https://gerrit.wikimedia.org/r/343708 (https://phabricator.wikimedia.org/T158452) [22:53:15] (CR) Yuvipanda: [V: 2 C: 2] tools: Use profile::kubernetes::node for worker role [puppet] - https://gerrit.wikimedia.org/r/343708
(https://phabricator.wikimedia.org/T158452) (owner: Yuvipanda) [22:54:02] (PS2) Dzahn: Add missing diffserver.systemd.service file [puppet] - https://gerrit.wikimedia.org/r/343782 (owner: Subramanya Sastry) [22:54:16] !log ppchelko@tin Finished deploy [trending-edits/deploy@e4fa9b8]: Config: Set up 'trends_at' property T160127 (duration: 06m 20s) [22:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:21] T160127: Re-evaluate purging strategies - https://phabricator.wikimedia.org/T160127 [22:54:35] (PS3) Dzahn: parsoid tests: Add missing diffserver.systemd.service file [puppet] - https://gerrit.wikimedia.org/r/343782 (owner: Subramanya Sastry) [22:54:53] (CR) Dzahn: [V: 2 C: 2] parsoid tests: Add missing diffserver.systemd.service file [puppet] - https://gerrit.wikimedia.org/r/343782 (owner: Subramanya Sastry) [22:56:23] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:57:17] !log ruthenium: running puppet after gerrit:343782 added missing diffserver unit file. puppet run looked good: Visualdiff::Server[diffserver]/Service[diffserver]/ensure: ensure changed 'stopped' to 'running', systemctl status says failed though [22:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:37] subbu: so from puppet view it looked all good, like it restarted the service [22:57:53] subbu: only when i do "systemctl" status manually it does not seem to agree [22:58:13] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:58:22] yes, there is a config issue. checking. [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come.
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170320T2300). [23:00:04] James_F, Jdlrobson, ebernhardson, and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:17] \0 but not available for next 15 [23:00:23] o/ [23:00:26] Yeah. [23:01:12] \o [23:02:13] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [23:03:04] Anyone SWATing? [23:03:49] I can SWAT, yes. [23:03:53] (PS1) Subramanya Sastry: Fix buggy diffserver config file [puppet] - https://gerrit.wikimedia.org/r/343784 [23:03:59] Thanks Dereckson. [23:04:23] mutante, hold on with merging that. [23:04:54] (CR) Subramanya Sastry: [C: -1] "there are other config issues. let me debug everything at once and upload a new PS." [puppet] - https://gerrit.wikimedia.org/r/343784 (owner: Subramanya Sastry) [23:05:56] (CR) Dzahn: "should that be a change in ./hieradata/ instead of .pp? or is it really about permanently changing the defaults (vs. just a regular switch" [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:06:00] subbu: ok [23:07:42] (CR) Dzahn: "i assume this needs some coordination between multiple people when it gets merged" [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:08:13] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:10:17] Operations, User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3116461 (RobH) [23:10:20] Operations, hardware-requests: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#3116460 (RobH) stalled>Resolved [23:10:51] (PS4) Andrew Bogott: labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [23:11:00] Operations, hardware-requests: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#2962228 (RobH) Lead time for the replacement is not until June, so an in place upgrade of netmon1001 will be required. Once the system arrives, its setup will be tracked on T159756. [23:13:55] (CR) 20after4: [C: 1] "we can set it in hiera if you prefer. This doesn't need coordination it just needs to be updated asap." [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:15:24] (PS2) Subramanya Sastry: Unbreak diffserver by fixing configs [puppet] - https://gerrit.wikimedia.org/r/343784 [23:15:59] mutante, ^ [23:17:52] (PS3) Subramanya Sastry: Unbreak diffserver by fixing configs [puppet] - https://gerrit.wikimedia.org/r/343784 [23:18:00] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343694 (https://phabricator.wikimedia.org/T160609) (owner: Jforrester) [23:18:34] (PS1) Yuvipanda: k8s: Make kubernetes master profile flexible enough for tools use [puppet] - https://gerrit.wikimedia.org/r/343787 (https://phabricator.wikimedia.org/T158452) [23:18:50] James_F: I merge them all, you test all on mwdebug1002 or you prefer one by one? [23:19:11] available when necessary [23:19:25] Dereckson: All at once is fine.
[23:19:36] ok [23:19:48] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343694 (https://phabricator.wikimedia.org/T160609) (owner: Jforrester) [23:19:57] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343694 (https://phabricator.wikimedia.org/T160609) (owner: Jforrester) [23:20:11] (CR) Dzahn: [C: 2] Unbreak diffserver by fixing configs [puppet] - https://gerrit.wikimedia.org/r/343784 (owner: Subramanya Sastry) [23:21:17] (PS2) Dereckson: Enable wgCiteResponsiveReferences on lawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343695 (https://phabricator.wikimedia.org/T160844) (owner: Jforrester) [23:21:29] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on lawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343695 (https://phabricator.wikimedia.org/T160844) (owner: Jforrester) [23:21:38] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343714 (https://phabricator.wikimedia.org/T160932) (owner: Jforrester) [23:21:40] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343696 (https://phabricator.wikimedia.org/T160362) (owner: Jforrester) [23:21:42] (CR) Dereckson: [C: 2] Enable wgCiteResponsiveReferences on enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/343700 (https://phabricator.wikimedia.org/T160933) (owner: Jforrester) [23:22:07] (PS2) Yuvipanda: k8s: Make kubernetes master profile flexible enough for tools use [puppet] - https://gerrit.wikimedia.org/r/343787 (https://phabricator.wikimedia.org/T158452) [23:22:15] (CR) Yuvipanda: [V: 2 C: 2] k8s: Make kubernetes master profile flexible enough for tools use [puppet] - https://gerrit.wikimedia.org/r/343787
(https://phabricator.wikimedia.org/T158452) (owner: Yuvipanda) [23:23:37] ebernhardson: do you need 343665 (config) merged before 343744, 343754 (wmf16)? [23:24:05] subbu: i'm on it, just kind of waiting for jenkins [23:24:24] oh well, we are fixing something anyways [23:24:28] (CR) Dzahn: [V: 2 C: 2] Unbreak diffserver by fixing configs [puppet] - https://gerrit.wikimedia.org/r/343784 (owner: Subramanya Sastry) [23:24:30] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on lawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343695 (https://phabricator.wikimedia.org/T160844) (owner: Jforrester) [23:24:34] (PS4) Dzahn: Unbreak diffserver by fixing configs [puppet] - https://gerrit.wikimedia.org/r/343784 (owner: Subramanya Sastry) [23:24:43] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on lawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343695 (https://phabricator.wikimedia.org/T160844) (owner: Jforrester) [23:24:59] Dereckson: they are unrelated [23:25:13] ok [23:25:53] (PS2) Dereckson: Enable wgCiteResponsiveReferences on nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343696 (https://phabricator.wikimedia.org/T160362) (owner: Jforrester) [23:26:01] (CR) Dereckson: [C: 2] "SWAT, take two" [mediawiki-config] - https://gerrit.wikimedia.org/r/343696 (https://phabricator.wikimedia.org/T160362) (owner: Jforrester) [23:26:07] (CR) Dzahn: [V: 2 C: 2] Unbreak diffserver by fixing configs [puppet] - https://gerrit.wikimedia.org/r/343784 (owner: Subramanya Sastry) [23:27:27] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343696 (https://phabricator.wikimedia.org/T160362) (owner: Jforrester) [23:27:36] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on nowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343696 (https://phabricator.wikimedia.org/T160362) (owner:
Jforrester) [23:28:11] (CR) 20after4: [C: 1] "I'll rework this to put the whole data structure in yaml" [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:28:45] (PS2) Dereckson: Enable wgCiteResponsiveReferences on enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/343700 (https://phabricator.wikimedia.org/T160933) (owner: Jforrester) [23:28:52] (PS3) 20after4: Switch phabricator search to codfw [puppet] - https://gerrit.wikimedia.org/r/343635 [23:28:58] (CR) Dereckson: [C: 2] "SWAT, take two" [mediawiki-config] - https://gerrit.wikimedia.org/r/343700 (https://phabricator.wikimedia.org/T160933) (owner: Jforrester) [23:29:30] subbu: applied on ruthenium [23:30:31] thanks. [23:31:06] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/343700 (https://phabricator.wikimedia.org/T160933) (owner: Jforrester) [23:31:13] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [23:31:15] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/343700 (https://phabricator.wikimedia.org/T160933) (owner: Jforrester) [23:32:34] (PS2) Dereckson: Enable wgCiteResponsiveReferences on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343714 (https://phabricator.wikimedia.org/T160932) (owner: Jforrester) [23:32:51] (CR) Dereckson: [C: 2] "SWAT, take two" [mediawiki-config] - https://gerrit.wikimedia.org/r/343714 (https://phabricator.wikimedia.org/T160932) (owner: Jforrester) [23:34:53] (Merged) jenkins-bot: Enable wgCiteResponsiveReferences on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343714 (https://phabricator.wikimedia.org/T160932) (owner: Jforrester) [23:35:02] (CR) jenkins-bot: Enable wgCiteResponsiveReferences on itwiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/343714 (https://phabricator.wikimedia.org/T160932) (owner: Jforrester) [23:35:34] Here we're. [23:36:20] Ready? [23:36:39] RoanKattouw: you didn't sync Test ORES migration on ruwiki beta too on prod by the way [23:36:43] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [23:36:54] Dereckson: Sorry, I had forgotten [23:37:03] I meant to sync it but then I got distracted [23:37:06] James_F: it's on mwdebug1002, yes [23:37:13] (PS1) Yuvipanda: tools: Use k8s master profile in role [puppet] - https://gerrit.wikimedia.org/r/343790 (https://phabricator.wikimedia.org/T158452) [23:37:20] Testing. [23:37:31] bd808: ^ I've rolled the ldap.yaml and .conf patch into this one, btw. will let you know once I merge [23:37:51] (CR) Yuvipanda: [V: 2] labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [23:38:08] (PS5) Yuvipanda: labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [23:38:10] (CR) Yuvipanda: [V: 2] labs: Add openstack::observerenv to all hosts [puppet] - https://gerrit.wikimedia.org/r/343745 (https://phabricator.wikimedia.org/T160929) (owner: BryanDavis) [23:38:51] andrewbogott: ^ the patch I'm testing depends on this, so I've merged it [23:39:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[Gerrit:343781]] Test ORES migration on ruwiki beta too (labs only, no-op in prod) (duration: 00m 42s) [23:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:44] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [23:40:07] Dereckson: Did the lawiki one merge?
frwiki and itwiki look good, checking the others. [23:40:12] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/5845/" [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:40:16] (PS4) Dzahn: Switch phabricator search to codfw [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:40:29] https://gerrit.wikimedia.org/r/#/c/343695/ is merged [23:40:40] twentyafterfour: doing that now, i saw the "asap" part [23:40:48] and thanks for moving it to hiera [23:41:00] here to confirm search, right? [23:42:03] ebernhardson: Allow completion suggester to work with titles that look like integers live on mwdebug1002 [23:42:29] Dereckson: Ah, yes, working now. Good to go. [23:42:34] Dereckson: Thank you. :-) [23:42:49] Dereckson: nothing to test there, it is only part of the maintenance scripts that build the indices [23:42:55] Dereckson: me next? [23:43:11] (PS1) Subramanya Sastry: Fix bad regexp in parsoid-vd settings file [puppet] - https://gerrit.wikimedia.org/r/343791 [23:44:38] ebernhardson: ack'ed [23:44:55] jdlrobson: in 3 to 5 minutes, yes [23:44:59] kkkkk [23:45:01] jdlrobson: syncing [23:45:14] James_F: ^ [23:45:44] (PS2) Dzahn: testreduce: Fix bad regexp in parsoid-vd settings file [puppet] - https://gerrit.wikimedia.org/r/343791 (owner: Subramanya Sastry) [23:45:51] (CR) Dzahn: [V: 2 C: 2] testreduce: Fix bad regexp in parsoid-vd settings file [puppet] - https://gerrit.wikimedia.org/r/343791 (owner: Subramanya Sastry) [23:46:23] * James_F waits. [23:46:47] subbu: you got it on ruthenium [23:47:06] (PS4) Andrew Bogott: nfs-exportd: Refresh service if script or .yaml changes. [puppet] - https://gerrit.wikimedia.org/r/343459 [23:47:07] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable wgCiteResponsiveReferences on fr. en. it. la.
no.wp + en.wikt (duration: 00m 46s) [23:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:26] (PS2) Yuvipanda: tools: Use k8s master profile in role [puppet] - https://gerrit.wikimedia.org/r/343790 (https://phabricator.wikimedia.org/T158452) [23:48:35] !log dereckson@tin Synchronized php-1.29.0-wmf.16/extensions/CirrusSearch/includes/BuildDocument/Completion/SuggestBuilder.php: [[Gerrit:343754]] Allow completion suggester to work with titles that look like integers (duration: 00m 45s) [23:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:45] Dereckson: Thank you! [23:48:48] (PS2) Dereckson: Restrict page images to lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/343733 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson) [23:48:52] You're welcome. [23:48:53] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [23:48:54] mutante, thanks .. last thing to figure out is why nginx is 404ing https://parsoid-vd-tests.wikimedia.org/diffs/enwiki/Gravity.diff.png .. .but that can be for tomorrow. [23:49:12] (CR) Dereckson: [C: 2] Restrict page images to lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/343733 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson) [23:49:31] mutante, thanks for your help. i am signing off now.
[23:49:43] 16 server-side clickmaps (-map) are deprecated in 2.40 in favor of client-side (-csmap) [23:49:53] wonderful new message from ploticus [23:49:58] subbu: ok, yes, icinga is happy :) cu later, yw [23:51:53] (PS3) Yuvipanda: tools: Use k8s master profile in role [puppet] - https://gerrit.wikimedia.org/r/343790 (https://phabricator.wikimedia.org/T158452) [23:52:01] (Merged) jenkins-bot: Restrict page images to lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/343733 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson) [23:52:21] (CR) jenkins-bot: Restrict page images to lead section [mediawiki-config] - https://gerrit.wikimedia.org/r/343733 (https://phabricator.wikimedia.org/T152115) (owner: Jdlrobson) [23:52:40] jdlrobson: live on mwdebug1002 [23:53:21] Dereckson: you can proceed to everywhere [23:54:10] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Restrict page images to lead section (T152115) (duration: 00m 43s) [23:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:16] T152115: [Config] Restricted lead images to lead section - https://phabricator.wikimedia.org/T152115 [23:55:07] (PS4) Yuvipanda: tools: Use k8s master profile in role [puppet] - https://gerrit.wikimedia.org/r/343790 (https://phabricator.wikimedia.org/T158452) [23:55:18] (CR) Yuvipanda: [V: 2 C: 2] tools: Use k8s master profile in role [puppet] - https://gerrit.wikimedia.org/r/343790 (https://phabricator.wikimedia.org/T158452) (owner: Yuvipanda) [23:56:23] ebernhardson: Don't pass null suggest queries to elasticsearch [23:56:26] live on mwdebug1002 [23:56:59] Dereckson: same, its not live code currently [23:57:10] ok [23:57:13] (PS5) Dzahn: Switch phabricator search to codfw [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:57:22] (CR) Dzahn: [V: 2] Switch phabricator search to codfw [puppet] - https://gerrit.wikimedia.org/r/343635
(owner: 20after4) [23:57:44] (CR) Dzahn: [V: 2 C: 2] Switch phabricator search to codfw [puppet] - https://gerrit.wikimedia.org/r/343635 (owner: 20after4) [23:58:40] !log dereckson@tin Synchronized php-1.29.0-wmf.16/extensions/CirrusSearch/includes/CompletionSuggester.php: Don't pass null suggest queries to elasticsearch (T160896) (duration: 00m 42s) [23:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:45] T160896: Suggestions for articles written in Cyrillic don't show up in search when typing in Latin and vice versa on the Serbian Wikipedia - https://phabricator.wikimedia.org/T160896 [23:59:15] (PS3) Dereckson: [es5 upgrade] step 3: depool eqiad for writes (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/343665 (https://phabricator.wikimedia.org/T157479) (owner: DCausse) [23:59:21] !log phab2001 / iridium - running puppet after gerrit:343635 - switches phab search to codfw [23:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:59] dcausse: ^