[00:01:48] (PS4) Dereckson: Add urdu logo to mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/367946 (https://phabricator.wikimedia.org/T171769) (owner: محمد شعیب)
[00:01:51] (CR) jerkins-bot: [V: -1] Add urdu logo to mobile site [mediawiki-config] - https://gerrit.wikimedia.org/r/367946 (https://phabricator.wikimedia.org/T171769) (owner: محمد شعیب)
[00:27:08] !log test vm.zone_reclaim_mode=0 on elastic1017
[00:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:34] (CR) Bearloga: [C: 1] "@Otto: please review this patch as soon as possible. It's more and more urgent to get the problem taken care of and backfill the metrics. " [puppet] - https://gerrit.wikimedia.org/r/370530 (https://phabricator.wikimedia.org/T172740) (owner: Gehel)
[01:20:07] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused
[02:12:31] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3511056 (Jayprakash12345) [http://tools.wmflabs.org/siteviews/?platform=all-access&source=pageviews&agent=user&start=2017-08-03&end=2017-08-08&sites=hi.wikivers...
[02:28:19] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 08m 38s)
[02:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:12] !log restart elasticsearch1017 with niofs store
[02:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:33] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.12) (duration: 07m 45s)
[02:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:46] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:52] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.13) (duration: 07m 33s)
[03:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:56] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 812.93 seconds
[03:50:01] (CR) Jcrespo: [C: 1] labservices: remove ferm rule that opens mysql to all internal hosts [puppet] - https://gerrit.wikimedia.org/r/370689 (https://phabricator.wikimedia.org/T169075) (owner: Andrew Bogott)
[03:55:16] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3511130 (Reedy) Why would it work for July 2017? The wiki wasn't created till August?
[04:08:54] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3511142 (jcrespo) p:Low>Normal `[44534.817426] B` I would say kernel panic again based on the above output, but who knows.
[04:09:25] !log powercycling mw2256
[04:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:56] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[04:15:16] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:23:20] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3511146 (jcrespo) Sadly, I cannot find relevant software or hardware logs.
[04:33:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 250.72 seconds
[04:35:24] (PS2) Jcrespo: mariadb: temporarely depooling db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370640 (https://phabricator.wikimedia.org/T166546)
[04:43:27] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:48:31] !log restarting pdfrender on scb1001, unresponsive
[04:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:07] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[04:51:08] (CR) Jcrespo: [C: 2] mariadb: temporarely depooling db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370640 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[04:52:36] (Merged) jenkins-bot: mariadb: temporarely depooling db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370640 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[04:52:50] (CR) jenkins-bot: mariadb: temporarely depooling db1060 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370640 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[04:54:54] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 (duration: 00m 58s)
[04:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:03] !log stopping db1069 (s2) and db1060 replication in sync
[04:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:14] (CR) 20after4: [C: 1] Phabricator: Fix phab dump script to use variable $app_user and $app_pass [puppet] - https://gerrit.wikimedia.org/r/370762 (owner: Paladox)
[05:13:13] (PS1) Jcrespo: Revert "mariadb: temporarely depooling db1060 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370779
[05:21:04] (PS4) 20after4: PHAB: deployment scripts to be called by scap [puppet] - https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847)
[05:23:37] (CR) Jcrespo: [C: 2] Revert "mariadb: temporarely depooling db1060 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370779 (owner: Jcrespo)
[05:25:04] (Merged) jenkins-bot: Revert "mariadb: temporarely depooling db1060 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370779 (owner: Jcrespo)
[05:26:10] (CR) jenkins-bot: Revert "mariadb: temporarely depooling db1060 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370779 (owner: Jcrespo)
[05:26:13] (PS1) Jcrespo: mariadb: Depool db1085 temporarilly for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370780 (https://phabricator.wikimedia.org/T166546)
[05:29:25] (CR) Jcrespo: [C: 2] mariadb: Depool db1085 temporarilly for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370780 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[05:30:51] (Merged) jenkins-bot: mariadb: Depool db1085 temporarilly for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370780 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[05:31:04] (CR) jenkins-bot: mariadb: Depool db1085 temporarilly for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370780 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[05:32:29] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1060, depool db1085 (duration: 00m 50s)
[05:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:37] !log stopping db1069 (s6) and db1085 replication in sync
[05:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:31] (PS1) Jcrespo: Revert "mariadb: Depool db1085 temporarilly for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370781
[05:57:17] (PS2) Jcrespo: Revert "mariadb: Depool db1085 temporarilly for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370781
[05:57:19] (PS1) Jcrespo: mariadb: Depool db1079 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370782 (https://phabricator.wikimedia.org/T166546)
[05:58:19] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1085 temporarilly for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370781 (owner: Jcrespo)
[05:59:43] (Merged) jenkins-bot: Revert "mariadb: Depool db1085 temporarilly for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370781 (owner: Jcrespo)
[05:59:48] (CR) Jcrespo: [C: 2] mariadb: Depool db1079 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370782 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[05:59:53] (CR) jenkins-bot: Revert "mariadb: Depool db1085 temporarilly for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370781 (owner: Jcrespo)
[06:01:13] (Merged) jenkins-bot: mariadb: Depool db1079 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370782 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[06:02:25] (CR) jenkins-bot: mariadb: Depool db1079 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370782 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[06:03:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1085, depool db1079 (duration: 00m 51s)
[06:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:11] !log stopping db1069 (s7) and db1079 replication in sync
[06:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:40] (PS1) Jcrespo: Revert "mariadb: Depool db1079 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370783
[06:21:31] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1079 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370783 (owner: Jcrespo)
[06:22:56] (Merged) jenkins-bot: Revert "mariadb: Depool db1079 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370783 (owner: Jcrespo)
[06:23:05] (CR) jenkins-bot: Revert "mariadb: Depool db1079 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370783 (owner: Jcrespo)
[06:26:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 (duration: 00m 51s)
[06:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:43] is ores going down on codfw
[06:26:44] ?
[06:26:54] I don't know
[06:27:27] https://grafana.wikimedia.org/dashboard/db/ores?orgId=1
[06:27:31] It looks healthy
[06:27:39] it seems it recovered
[06:27:49] maybe just a one-time failure
[06:28:38] okay, please ping me if it starts to appear again
[06:29:37] see the increase on 5xx in the last minutes?
[06:30:28] not worrying, but maybe there was a temporary bleh
[06:31:15] It's still ongoing according to the graph
[06:31:25] but the rate is 3% and it's eqiad
[06:31:35] I keep an eye on it
[06:31:52] yeah, different from the http errors I saw on codfw
[06:35:47] Operations, puppet-compiler, Patch-For-Review, User-Joe: puppet compiler fails with modules using puppetdb - https://phabricator.wikimedia.org/T150456#3511263 (Joe) I wrote a first version of the script that can be used to populate puppetdb; I'll upload it via puppet to all the compiler machines...
[06:43:41] (PS1) Jcrespo: mariadb: Depool db1065, db1064, db1070 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370784 (https://phabricator.wikimedia.org/T166546)
[06:45:54] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3511265 (gh87) In case that links don't work: [[http://tools.wmflabs.org/siteviews/?platform=all-access&source=pageviews&agent=user&start=2017-08-03&end=2017-08...
[06:48:19] Operations, Analytics, Analytics-Wikistats, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3511266 (Liuxinyu970226) >>! In T168765#3511265, @gh87 wrote: > In case that links don't work: [[http://tools.wmflabs.org/siteviews/?platform=all-access&source=...
[06:51:42] (CR) Jcrespo: [C: 2] mariadb: Depool db1065, db1064, db1070 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370784 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[06:53:01] Operations, puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3511270 (Joe) As can be seen here https://puppet-compiler.wmflabs.org/compiler02/7354/ All of these nodes now compile correctly if ran on a node that is puppetdb-enabled. Resolving...
[06:53:10] (Merged) jenkins-bot: mariadb: Depool db1065, db1064, db1070 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370784 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[06:53:21] Operations, puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3511272 (Joe) Open>Resolved a:Joe
[06:53:24] (CR) jenkins-bot: mariadb: Depool db1065, db1064, db1070 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/370784 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[07:07:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1065, db1064, db1070 (duration: 00m 47s)
[07:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:13] !log stopping replication in sync between db1069 and db1065, db1044, db1064, db1070
[07:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:43] (CR) Dereckson: "@محمد شعیب Do you wish to update the size or are you happy with the current configuration? (Meanwhile someone added the logo, but with oth" [mediawiki-config] - https://gerrit.wikimedia.org/r/367946 (https://phabricator.wikimedia.org/T171769) (owner: محمد شعیب)
[07:36:19] (PS1) Jcrespo: Revert "mariadb: Depool db1065, db1064, db1070 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370785
[07:42:04] Operations, Puppet, puppet-compiler, Prometheus-metrics-monitoring: A few hosts never get clean puppet compiler runs - https://phabricator.wikimedia.org/T157496#3007289 (Joe) Yes, this is a duplicate of T150456
[07:42:18] Operations, Puppet, puppet-compiler, Prometheus-metrics-monitoring: A few hosts never get clean puppet compiler runs - https://phabricator.wikimedia.org/T157496#3511300 (Joe) Open>Resolved a:Joe
[07:42:59] Operations, puppet-compiler: puppet compiler error on catalog with non-ascii output - https://phabricator.wikimedia.org/T133979#2251021 (Joe) This is resolved now that we use our own differ.
[07:43:09] Operations, puppet-compiler: puppet compiler error on catalog with non-ascii output - https://phabricator.wikimedia.org/T133979#3511308 (Joe) Open>Resolved a:Joe
[07:43:16] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1065, db1064, db1070 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370785 (owner: Jcrespo)
[07:44:44] (Merged) jenkins-bot: Revert "mariadb: Depool db1065, db1064, db1070 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370785 (owner: Jcrespo)
[07:46:25] (CR) jenkins-bot: Revert "mariadb: Depool db1065, db1064, db1070 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/370785 (owner: Jcrespo)
[07:47:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1065, db1064, db1070 (duration: 01m 04s)
[07:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:10] (PS1) Elukey: statistics::package: add missing require_package [puppet] - https://gerrit.wikimedia.org/r/370786
[07:54:23] (PS2) Elukey: statistics::package: add missing require_package [puppet] - https://gerrit.wikimedia.org/r/370786
[07:56:07] (CR) Giuseppe Lavagetto: [C: -1] statistics::package: add missing require_package (1 comment) [puppet] - https://gerrit.wikimedia.org/r/370786 (owner: Elukey)
[08:02:40] (CR) MarcoAurelio: "> This change grants the right to ADD users to the group, but not to" [mediawiki-config] - https://gerrit.wikimedia.org/r/368939 (https://phabricator.wikimedia.org/T101983) (owner: MarcoAurelio)
[08:07:59] (Draft2) MarcoAurelio: Follow-up Ia62ad0f4 [mediawiki-config] - https://gerrit.wikimedia.org/r/370787 (https://phabricator.wikimedia.org/T101983)
[08:10:17] (PS1) Jcrespo: install_server: Allow reimage of db1069, dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/370788 (https://phabricator.wikimedia.org/T166546)
[08:14:12] (CR) MarcoAurelio: "@Dereckson Feel free to deploy this so we fix the previous one. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/370787 (https://phabricator.wikimedia.org/T101983) (owner: MarcoAurelio)
[08:15:32] (PS3) Giuseppe Lavagetto: role::mediawiki::jobrunner/videoscaler: switch to the future parser [puppet] - https://gerrit.wikimedia.org/r/368621 (https://phabricator.wikimedia.org/T171704)
[08:15:38] (PS2) Jcrespo: install_server: Allow reimage of db1069, dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/370788 (https://phabricator.wikimedia.org/T166546)
[08:19:37] (PS3) Jcrespo: install_server: Allow reimage of db1069, dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/370788 (https://phabricator.wikimedia.org/T166546)
[08:20:50] Operations, puppet-compiler, Patch-For-Review, User-Joe: puppet compiler fails with modules using puppetdb - https://phabricator.wikimedia.org/T150456#3511340 (Joe) So one additional complication: we need to refresh the facts timestamp for every pcc run, as we don't want to incur in a case of htt...
[08:22:26] (CR) Jcrespo: [C: 2] install_server: Allow reimage of db1069, dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/370788 (https://phabricator.wikimedia.org/T166546) (owner: Jcrespo)
[08:27:16] (PS1) Jcrespo: dbstore: Migrate dbstore2001 to dbstore_multiinstance role [puppet] - https://gerrit.wikimedia.org/r/370789 (https://phabricator.wikimedia.org/T168409)
[08:39:51] !log disable puppet on dbstore2001, about to be reimaged
[08:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:07] (CR) Jcrespo: [C: 2] dbstore: Migrate dbstore2001 to dbstore_multiinstance role [puppet] - https://gerrit.wikimedia.org/r/370789 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[08:40:28] (PS3) Elukey: statistics::package: add missing require_package [puppet] - https://gerrit.wikimedia.org/r/370786 (https://phabricator.wikimedia.org/T171924)
[08:40:50] (CR) jerkins-bot: [V: -1] statistics::package: add missing require_package [puppet] - https://gerrit.wikimedia.org/r/370786 (https://phabricator.wikimedia.org/T171924) (owner: Elukey)
[08:41:20] (PS4) Elukey: statistics::package: add missing package depencency [puppet] - https://gerrit.wikimedia.org/r/370786 (https://phabricator.wikimedia.org/T171924)
[08:41:44] (CR) jerkins-bot: [V: -1] statistics::package: add missing package depencency [puppet] - https://gerrit.wikimedia.org/r/370786 (https://phabricator.wikimedia.org/T171924) (owner: Elukey)
[08:42:28] super lucky to have sent a amended commit msg right when Jenkins was sending me a -1 :D
[08:44:37] (PS5) Elukey: statistics::package: add missing package depencency [puppet] - https://gerrit.wikimedia.org/r/370786 (https://phabricator.wikimedia.org/T171924)
[08:52:24] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:54:16] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[09:13:34] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:15:01] elukey@mw1169:~$ hhvmadm check-health
[09:15:01] { "load":20
[09:15:02] , "queued":6
[09:15:24] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[09:19:21] Operations, Analytics-Kanban, Patch-For-Review, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3511453 (elukey) I finally managed to find a graphite query that gives me useful data from EventStreams (the grafana da...
[09:28:35] <_joe_> elukey: that means it's fully booked, uhm
[09:31:40] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:31:42] _joe_ indeed https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=All
[09:34:20] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:34:40] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[09:35:47] I am trying to open some of the urls issuing 503s and it takes ages
[09:37:15] it seems on its way to recover though
[09:38:34] a good url to check thumb or general upload issues or slowdowns is: https://commons.wikimedia.org/w/index.php?title=Special:NewFiles&offset=&limit=500
[09:38:35] top 20 503s are mostly for */thumb/*
[09:39:08] they are new files, so in most cases not cached on your browser
[09:39:26] thanks!
[09:39:46] not perfect, though, because it could be dc-specific issue or other stuff
[09:42:00] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:43:02] <_joe_> uhm again
[09:43:06] (PS1) TerraCodes: Update InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[09:43:13] but on another host
[09:43:23] let me check the queue
[09:43:24] <_joe_> yes
[09:43:42] <_joe_> that means the transcode queue is full (we still have one host that's offline for now)
[09:44:21] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[09:44:41] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:45:21] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&var-jobType=webVideoTranscode&panelId=7&fullscreen&from=now-24h&to=now
[09:46:43] does it create so much load that they timeout?
[09:47:07] <_joe_> no, it happens that we have more ongoing jobs than hhvm workers available
[09:47:15] oh
[09:47:19] <_joe_> it should not happen, but I guess it has to do with timeouts etc
[09:47:50] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[09:50:55] and we descreased a lot jobrunner workers the last time
[09:51:07] it should not reach to that level of queueing
[09:53:21] mw1182 wants to fail, too?
[09:58:34] (PS1) Elukey: eventstreams: set Kafka API version to 0.9.0.1 [puppet] - https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681)
[09:58:54] (CR) jerkins-bot: [V: -1] eventstreams: set Kafka API version to 0.9.0.1 [puppet] - https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[10:08:28] the space after : sigh
[10:08:32] (PS2) Elukey: eventstreams: set Kafka API version to 0.9.0.1 [puppet] - https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681)
[10:10:45] (CR) Elukey: "pcc diff: https://puppet-compiler.wmflabs.org/compiler02/7360/scb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681) (owner: Elukey)
[10:20:12] !log stopping dbstore2002's mysqls and cloning them to dbstore2001
[10:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:10] Operations, ops-codfw, DBA, Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3511599 (Marostegui) @Papaul did you get in contact with Dell's support? Thanks!
[11:49:10] the transcode lag keeps going up
[11:49:51] is it ok? https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&var-jobType=webVideoTranscode&panelId=7&fullscreen&from=1499687380301&to=1502279380302
[11:53:37] too much uploading at commons going on?
[11:53:41] heavy files?
[12:11:16] (PS4) Elukey: role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790)
[12:15:45] (PS1) Jcrespo: prometheus_mysqld_exporter: Adding new mysql instances at dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/370796 (https://phabricator.wikimedia.org/T168409)
[12:16:01] (PS5) Elukey: role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790)
[12:20:36] (PS1) Jcrespo: dblists: Add new 5 instances from dbstore2001 to the list of mysqls [software] - https://gerrit.wikimedia.org/r/370797 (https://phabricator.wikimedia.org/T168409)
[12:20:59] (CR) Jcrespo: [C: 2] prometheus_mysqld_exporter: Adding new mysql instances at dbstore2001 [puppet] - https://gerrit.wikimedia.org/r/370796 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[12:22:45] (CR) Jcrespo: [C: 2] dblists: Add new 5 instances from dbstore2001 to the list of mysqls [software] - https://gerrit.wikimedia.org/r/370797 (https://phabricator.wikimedia.org/T168409) (owner: Jcrespo)
[12:31:52] I do not know that it is, but there is a spike every day at 12:30 in enwiki api request load
[12:32:02] at least in the last 5 days
[12:32:46] (PS6) Elukey: role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790)
[12:32:51] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=5&fullscreen&orgId=1&from=1501677155043&to=1502281955044&var-dc=eqiad%20prometheus%2Fops&var-server=db1051&var-port=9104
[12:32:51] a mirror doing their db synchronisation, perhaps ?
[12:33:27] mirrors/syncronization doesn't touch the database
[12:33:36] but yeah, it could be some bot
[12:33:58] *it most likely is some bot
[12:34:12] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[12:40:18] (CR) Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/7362/" [puppet] - https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[12:41:12] !log restarting updater on wdqs1001 (real fix coming up soon
[12:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:01] <_joe_> jynus: there is not much we can do about the transcode times
[12:42:12] <_joe_> besides re-imaging mw1260 with trusty
[12:43:09] (CR) Elukey: [C: 2] role::an_cluster::hadoop::client: moving to profiles (first part) [puppet] - https://gerrit.wikimedia.org/r/370187 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[12:57:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[12:59:42] ACKNOWLEDGEMENT - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel wdqs-updater restarted, lag should come down
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170809T1300). Please do the needful.
[13:00:04] TabbyCat: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:13] meow
[13:01:54] * TabbyCat eyes Amir1 :D
[13:02:02] yuvikey with you this time?
[13:02:48] TabbyCat: yah
[13:02:57] let me see what's going on
[13:03:20] yup
[13:05:06] got a 2FA issue from one of our en.wp admins. https://phabricator.wikimedia.org/T172878. DoRD has done a CU check already. I suspect, from speaking to them, they've disabled and re-enabled 2FA without saving the scratch codes or updating their Google Authenticator account with the new secret key.
[13:05:33] TabbyCat: the first patch looks scary but it's fine. Can you test it?
[13:06:10] TabbyCat: yes I can test it on some wikis I've got +crat
[13:06:21] Dereckson merged yesterday the first part
[13:06:38] I forgot to add it to wgRemoveGroups when patching it first
[13:06:44] as usual :S
[13:07:18] okay, if there isn't anyone today, I guess I can do the SWAT today
[13:08:14] the other patch is a merge on a extension which is already +1'd by a i18n/L10n expert
[13:08:23] sure thing
[13:09:10] !log gehel@tin Started deploy [wdqs/wdqs@6620e0f]: (no justification provided)
[13:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:55] (PS1) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[13:10:21] (CR) jerkins-bot: [V: -1] role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[13:10:42] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[13:10:44] !log gehel@tin Finished deploy [wdqs/wdqs@6620e0f]: (no justification provided) (duration: 01m 34s)
[13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:40] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/370787 (https://phabricator.wikimedia.org/T101983) (owner: MarcoAurelio)
[13:12:50] :D
[13:13:26] (PS2) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[13:14:13] (Merged) jenkins-bot: Follow-up Ia62ad0f4 [mediawiki-config] - https://gerrit.wikimedia.org/r/370787 (https://phabricator.wikimedia.org/T101983) (owner: MarcoAurelio)
[13:14:49] * TabbyCat x-wikimedia-debugs
[13:15:20] TabbyCat: your first patch is in mwdebug1002
[13:15:51] Amir1: looks good to me
[13:16:09] on special:listgrouprights bureaucrats are displayed as being able to add/remove confirmed
[13:16:14] as expected
[13:16:17] (CR) jenkins-bot: Follow-up Ia62ad0f4 [mediawiki-config] - https://gerrit.wikimedia.org/r/370787 (https://phabricator.wikimedia.org/T101983) (owner: MarcoAurelio)
[13:16:41] okay
[13:18:58] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Allow bureaucrats remove confirmed user group (T101983) (duration: 00m 51s)
[13:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:09] T101983: Allow admins adding confirmed user group on all WMF wikis - https://phabricator.wikimedia.org/T101983
[13:19:14] TabbyCat: ^ your patch is live everywhere
[13:19:25] thank you
[13:19:38] Tell me if it looks okay so I move on
[13:19:45] rechecking
[13:20:16] L.G.T.M.
[13:20:29] TGIF
[13:20:34] Even though it's not
[13:20:38] let's go
[13:21:02] heh
[13:21:16] TG it's Wikimania :D
[13:21:20] !log Stop replication on db2075 for maintenance - T170662
[13:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:32] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[13:21:43] TabbyCat: the patch you gave me is for master
[13:21:50] yep
[13:21:57] it's not for backporting?
[13:22:04] nope
[13:22:27] I don't think WikimediaMessages use master
[13:22:39] it could be though but given that 'extendedmover' it's only used on two sites
[13:22:59] they tend to go live on each mediawiki train
[13:23:01] well, you need to back port for both wmf.13 and wmf.12
[13:23:32] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[13:23:35] Amir1: nah, don't bother, it'll get merged some day and after that it'll display after the mediawiki train upgrade
[13:24:25] TabbyCat: I just merged it, using my developer +2 right as it looks good and had +1 from i18n people
[13:24:39] so it's just not live
[13:25:00] Amir1: that's what I wanted :)
[13:25:12] it'll be live on the beta cluster though
[13:25:17] where I can check it
[13:25:18] for that you need to use #wikimedia-codereview
[13:25:31] oh, new channel?
[13:25:32] puppet is for deployment :D
[13:25:35] *SWAT
[13:27:49] TabbyCat: If it's a big problem, after checking in beta cluster, make two backports and schedule it for the next SWAT
[13:28:03] if not, wait until the next week
[13:28:08] SWAT is done now
[13:28:38] Amir1: no worries, I think that usergroup is used only on enwiki and they have the local overrides already
[13:28:45] thank you for the swat
[13:28:54] :)
[13:29:57] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db2075 to s5 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/370801 (https://phabricator.wikimedia.org/T170662)
[13:33:21] (PS1) Giuseppe Lavagetto: Better support puppetdb [software/puppet-compiler] - https://gerrit.wikimedia.org/r/370802 (https://phabricator.wikimedia.org/T150456)
[13:33:23] (PS1) Giuseppe Lavagetto: Add script to populate PuppetDB [software/puppet-compiler] - https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456)
[13:33:50] (CR) jerkins-bot: [V: -1] Add script to populate PuppetDB [software/puppet-compiler] - https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456) (owner: Giuseppe Lavagetto)
[13:37:57] (PS3) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[13:41:37] (PS4) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[13:42:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:45:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[13:46:17] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] Gehel known issue, second patch is on its way...
[13:48:23] (PS1) Marostegui: mariadb: Add db2076 as s6 slave [puppet] - https://gerrit.wikimedia.org/r/370805 (https://phabricator.wikimedia.org/T170662)
[13:48:49] (PS5) Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790)
[13:50:55] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler03/7366/" [puppet] - https://gerrit.wikimedia.org/r/370805 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[13:53:20] (CR) Jcrespo: [C: 1] mariadb: Add db2076 as s6 slave [puppet] - https://gerrit.wikimedia.org/r/370805 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[13:55:07] (PS2) Giuseppe Lavagetto: Add script to populate PuppetDB [software/puppet-compiler] - https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456)
[13:55:56] !log Reboot db2076 for maintenance - T170662
[13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:07] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[13:59:33] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3512185 (elukey)
[13:59:55] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db2075 to s5 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/370801 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[14:01:24] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db2075 to s5 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/370801 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[14:01:38] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db2075 to s5 shard [mediawiki-config] - https://gerrit.wikimedia.org/r/370801 (https://phabricator.wikimedia.org/T170662) (owner: Marostegui)
[14:03:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db2075 to s5 - T170662 (duration: 00m 51s)
[14:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:39] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[14:04:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db2075 to s5 - T170662 (duration: 00m 50s)
[14:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:44] (PS2) Andrew Bogott: labservices: remove ferm rule that opens mysql to all internal hosts [puppet] - https://gerrit.wikimedia.org/r/370689 (https://phabricator.wikimedia.org/T169075)
[14:08:42] !log purging req.urls with "^/resources" from varnish cluster, to fix redirect with cached 404
[14:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:59] (CR) Andrew Bogott: [C: 2] labservices: remove ferm rule that opens
mysql to all internal hosts [puppet] - 10https://gerrit.wikimedia.org/r/370689 (https://phabricator.wikimedia.org/T169075) (owner: 10Andrew Bogott) [14:12:43] 10Operations, 10Discovery, 10Maps-Sprint, 10Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3512220 (10debt) Thanks, @MaxSem, we'll see if we can figure out how to test it on our maps-test cluster with @Gehel and @Pnorman [14:16:23] (03PS2) 10Giuseppe Lavagetto: Better support puppetdb [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370802 (https://phabricator.wikimedia.org/T150456) [14:16:24] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create wikimedia.org/resources redirect for Wikimedia Resource Center - https://phabricator.wikimedia.org/T172417#3497968 (10Dzahn) This is working now, after: 07:08 < mutante> !log purging req.urls with "^/resources" f... [14:16:26] (03PS3) 10Giuseppe Lavagetto: Add script to populate PuppetDB [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456) [14:16:51] (03CR) 10jerkins-bot: [V: 04-1] Better support puppetdb [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370802 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [14:16:53] (03CR) 10jerkins-bot: [V: 04-1] Add script to populate PuppetDB [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [14:17:52] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create wikimedia.org/resources redirect for Wikimedia Resource Center - https://phabricator.wikimedia.org/T172417#3512244 (10Dzahn) 05Open>03Resolved a:03Dzahn I had to follow the commands described on https://wiki... 
[14:20:16] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10GWicke) [14:28:31] (03PS6) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) [14:29:05] (03CR) 10jerkins-bot: [V: 04-1] role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:30:46] 10Operations, 10Ops-Access-Requests, 10Research: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512337 (10leila) [14:31:36] 10Operations, 10Ops-Access-Requests, 10Research: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512352 (10leila) [14:32:34] (03PS7) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) [14:33:08] (03CR) 10jerkins-bot: [V: 04-1] role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:33:10] (03CR) 10Elukey: [C: 04-1] "Still broken due to missing hiera config but the bulk of the work should be done :)" [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [14:34:12] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:36:21] (03PS8) 10Elukey: role::analytics_cluster::hadoop::client: move to profiles (p2) [puppet] - 10https://gerrit.wikimedia.org/r/370798 (https://phabricator.wikimedia.org/T167790) [14:36:31] 10Operations, 
10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3512370 (10GWicke) [14:40:36] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10jcrespo) I just reported the issue on the other ticket, interested on a f... [14:41:00] (03CR) 10Giuseppe Lavagetto: [C: 031] discovery-stats user should be a member of wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/370530 (https://phabricator.wikimedia.org/T172740) (owner: 10Gehel) [14:41:38] (03PS1) 10Marostegui: db-codfw.php: Depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370813 (https://phabricator.wikimedia.org/T170662) [14:42:28] 10Operations, 10Ops-Access-Requests, 10Research: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512391 (10diego) L3 signed public key here: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDq3m0IyfeKVGQSTOU19CXfP2yxqosqXjxyp8wDU92UxrhR7u1Dk01z6gSOxjsM90RGDe1KIzFp7ce3EH7sNMg... [14:43:33] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 40.58 seconds [14:43:46] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3512393 (10Reedy) 05Open>03Resolved The second link is never gonna work. 
The wiki didn't exist in July, so there are not going to be any stats to use [14:50:53] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2110079 [14:53:50] (03PS2) 10Gehel: discovery-stats user should be a member of wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/370530 (https://phabricator.wikimedia.org/T172740) [14:54:30] (03CR) 10Gehel: [C: 032] discovery-stats user should be a member of wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/370530 (https://phabricator.wikimedia.org/T172740) (owner: 10Gehel) [14:54:41] bearloga: ^ [14:56:22] (03PS3) 10Giuseppe Lavagetto: Better support puppetdb [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370802 (https://phabricator.wikimedia.org/T150456) [14:56:24] (03PS4) 10Giuseppe Lavagetto: Add script to populate PuppetDB [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456) [14:58:55] 10Operations, 10Ops-Access-Requests, 10Research: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512496 (10diego) just FYI my username in wikitech is diego, but my 'Instance shell account name:' is dsaez (diego was not available) [15:01:36] (03CR) 10Giuseppe Lavagetto: [C: 032] Better support puppetdb [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370802 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [15:02:59] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3512537 (10Papaul) @Marostegui wrote "For the record, after manually forcing the re-learn we got it back to healthy - let's see how long until it fails again:" this was your last comment so... 
[15:03:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Add script to populate PuppetDB [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [15:03:37] (03Merged) 10jenkins-bot: Add script to populate PuppetDB [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/370803 (https://phabricator.wikimedia.org/T150456) (owner: 10Giuseppe Lavagetto) [15:04:29] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3512541 (10Marostegui) >>! In T172265#3512537, @Papaul wrote: > @Marostegui wrote "For the record, after manually forcing the re-learn we got it back to healthy - let's see how long until it... [15:05:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3512542 (10Papaul) p:05Triage>03Normal [15:05:10] (03PS1) 10Urbanecm: Enable NewUserMessage on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370815 (https://phabricator.wikimedia.org/T172894) [15:07:13] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3512548 (10Papaul) @jcrespo that is not good it doesn't help me but I will contact Dell and have their opinion. Thanks. 
[15:12:27] (03CR) 10Ottomata: [C: 031] eventstreams: set Kafka API version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681) (owner: 10Elukey) [15:18:50] (03CR) 10Giuseppe Lavagetto: [C: 031] Define network infra ranges and allow them to send syslog to logstash [puppet] - 10https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) (owner: 10Ayounsi) [15:19:45] !log removing old pfw related config from cr1-codfw - T171970 [15:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:56] T171970: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970 [15:23:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Excellent job - just a couple minor nits, but LGTM in general." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/369682 (owner: 10Gehel) [15:23:31] mutante: Are you in Montreal? [15:23:51] kaldari: he is! I saw him in the big room a bit ago... now not sure [15:24:16] oh yeah, there he is, front left corner as you're looking at the stage [15:24:36] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3512641 (10ayounsi) [15:25:37] kaldari: Yeah, he's sat next to me [15:27:33] Reedy, mutante: I'm at BANQ with Marc Graham from the Internet Archive. Currently Internet Archive is blocked in all of India and they have no idea why. Internet Archive has a tiny ops staff and they don't have experience with this problem. Is there anything we could do to help them? This is obviously also an issue for Wikipedia users in India since we rely on Internet Archive heavily for references. [15:27:55] Blocked at the country level? 
[15:28:02] As far as they know [15:28:29] Reedy: https://www.theverge.com/2017/8/9/16117578/wayback-machine-blocked-india-internet-archive [15:28:31] I suspect, there's not much anyone can do bar petitioning the government agency/similar that's responsible [15:29:24] Yeah, they're working on that. Just wondering if we knew any good tricks to get around it. [15:29:35] on their end [15:29:35] Proxying or VPN is probably the only way [15:29:42] Depends what's blocke [15:29:44] d [15:29:45] Just the domain? [15:29:47] Their IPs? [15:29:49] Both? [15:29:54] I'll see if I can find out [15:30:06] It's been known for hostnames to be blocked, but IPs not [15:30:13] Does DNS wok? [15:30:15] *work? [15:30:24] Does switching to non Indian DNS servers help? [15:30:48] 10Operations, 10Analytics, 10Research: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#3512670 (10Ottomata) Correct me if I'm wrong, but a multi-instance MySQL will likely not be very useful for many of the analysis use cases. Folks want to join cross wiki, no... [15:30:56] Depending on the answers... Probably varies the potential solutions :) [15:31:05] (03PS5) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 [15:31:14] (03CR) 10Gehel: wdqs - moving to role / profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369682 (owner: 10Gehel) [15:34:44] Reedy: Got it. 
I'll see if I can get any more info from them [15:35:45] (03CR) 10Gehel: wdqs - moving to role / profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369682 (owner: 10Gehel) [15:36:58] (03CR) 10Ayounsi: [C: 032] Define network infra ranges and allow them to send syslog to logstash [puppet] - 10https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) (owner: 10Ayounsi) [15:37:10] (03PS3) 10Ayounsi: Define network infra ranges and allow them to send syslog to logstash [puppet] - 10https://gerrit.wikimedia.org/r/369697 (https://phabricator.wikimedia.org/T166126) [15:38:47] (03PS6) 10Gehel: wdqs - moving to role / profiles [puppet] - 10https://gerrit.wikimedia.org/r/369682 [15:45:11] (03PS2) 10Urbanecm: Enable NewUserMessage on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370815 (https://phabricator.wikimedia.org/T172894) [15:50:15] kaldari: what's up [15:50:45] oh.. Archive is blocked in India.. hmm [15:51:10] why are they blocked? [15:51:20] because they mirror blocked sites? [15:51:40] probably [15:51:52] i am not sure what we could do to help them.. ehmm [15:52:14] well, maybe advice for individuals how to bypass the block [15:52:42] the problem with the advice is it's kind of an arms race. if you find an easy option for bypassing and publicize it widely, it will get shut down. [15:53:24] kaldari: the best thing to do would be to hook him up with our Legal team, as they work on the non-technical front of these issues all the team. Maybe Zhou would be a good initial contact? [15:53:28] yea... 
so i guess it would be just like "use VPN" [15:53:30] (03PS4) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [15:53:32] (03PS2) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [15:53:33] s/all the team/all the time/ [15:53:43] bblack: thanks! [15:55:09] (03CR) 10Mobrovac: "LGTM but we need to truncate the prod tables first" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [15:55:11] (03PS5) 10Rush: openstack: clean up openstack::repo [puppet] - 10https://gerrit.wikimedia.org/r/370092 (https://phabricator.wikimedia.org/T171494) [15:55:46] (03PS3) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [15:57:47] 10Operations, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3512836 (10GWicke) There are quite a few headless chrome wrappers listed as dependents of chrome-launcher: https://www.npm... 
[15:57:54] (03CR) 10Ema: [C: 032] Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415 (owner: 10Mark Bergsma) [15:58:10] (03CR) 10Ema: [V: 032 C: 032] Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415 (owner: 10Mark Bergsma) [15:59:09] (03CR) 10Mobrovac: [C: 031] eventstreams: set Kafka API version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681) (owner: 10Elukey) [15:59:18] (03PS4) 10Elukey: eventstreams: set Kafka API version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681) [15:59:38] heh [16:00:48] (03PS1) 10EBernhardson: Switch elastic1017-1031 to niofs [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) [16:03:18] mobrovac: o/ - I am planning to merge the eventstreams change, run puppet, then depool/restart/pool every node [16:03:35] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370813 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [16:03:48] kk elukey, just start with the canary in codfw [16:04:05] mobrovac: ack! [16:04:15] elukey: specifically, after you run puppet on scb2001, check /etc/eventstreams/config* to make sure they are ok [16:04:22] before deploying [16:04:53] mobrovac: s/deploying/restarting right? [16:05:06] yup :) [16:05:11] ok :) [16:06:15] (03CR) 10Elukey: [C: 032] eventstreams: set Kafka API version to 0.9.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/370793 (https://phabricator.wikimedia.org/T172681) (owner: 10Elukey) [16:07:13] (03PS2) 10Gehel: Some requests may have no client IP defined. 
[puppet] - 10https://gerrit.wikimedia.org/r/370511 (https://phabricator.wikimedia.org/T172713) (owner: 10Smalyshev) [16:07:42] (03PS5) 10Mark Bergsma: Add basic unit tests for protocol BGP send methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 [16:08:04] !log config update on logstash, a few logs might be lost during restart - T172713 [16:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:16] T172713: WDQS logstash parser does not parse some requests - https://phabricator.wikimedia.org/T172713 [16:08:25] (03CR) 10Gehel: [C: 032] Some requests may have no client IP defined. [puppet] - 10https://gerrit.wikimedia.org/r/370511 (https://phabricator.wikimedia.org/T172713) (owner: 10Smalyshev) [16:09:10] (03PS2) 10EBernhardson: Switch elastic1017-1031 to niofs [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) [16:09:46] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370813 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [16:09:56] (03CR) 10jenkins-bot: db-codfw.php: Depool db2046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370813 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [16:10:14] Dereckson, hey are you at Wikimania? 
[16:10:51] (03PS2) 10Marostegui: mariadb: Add db2076 as s6 slave [puppet] - 10https://gerrit.wikimedia.org/r/370805 (https://phabricator.wikimedia.org/T170662) [16:10:59] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2046 - T170662 (duration: 00m 50s) [16:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [16:12:21] (03CR) 10Marostegui: [C: 032] mariadb: Add db2076 as s6 slave [puppet] - 10https://gerrit.wikimedia.org/r/370805 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [16:14:12] !log rolling restart of eventstream on scb hosts to deploy https://gerrit.wikimedia.org/r/370793 [16:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:52] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3513024 (10Papaul) p:05High>03Normal [16:15:19] !log Stop mysql on db2046 to copy its content to db2076 - T170662 [16:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:31] mobrovac: the only weird thing that I can see in config is api.version.request: False (I expected false lowercase) [16:15:58] (/etc/eventstreams/config.yaml) [16:16:12] elukey: looking, gimme a sec [16:16:26] (do not restart please) [16:17:30] elukey: this is a problem [16:18:01] mobrovac: ehmr I already did it, realized it afterwards :( [16:18:19] is codfw serving eventstreams traffic? 
I don't see any on its port [16:18:26] elukey: T145510 is the reason, but we need to work around it, and in this case is really hard to do it :/ [16:18:27] T145510: Scap config management: Jinja2 fills templates with Pythonic values - https://phabricator.wikimedia.org/T145510 [16:18:38] elukey: no, codfw is not serving anything [16:18:44] ahhh what a nice bug [16:18:57] elukey: but please disale puppet in eqiad [16:19:04] until we find a way around this [16:19:50] <_joe_> heya [16:19:55] <_joe_> need help? [16:19:59] <_joe_> I was about to leave [16:20:07] no no _joe_ nothing super huge [16:20:12] thanks :) [16:20:38] mobrovac: disabled, might have already been running though (but no restart so we are fine) [16:20:53] (03CR) 10Dzahn: [C: 04-1] "I don't see why this one password should be a special case compared to all the other passwords. Just use labs/private and Hiera like for e" [puppet] - 10https://gerrit.wikimedia.org/r/370762 (owner: 10Paladox) [16:21:11] (03Abandoned) 10Paladox: Phabricator: Fix phab dump script to use variable $app_user and $app_pass [puppet] - 10https://gerrit.wikimedia.org/r/370762 (owner: 10Paladox) [16:21:12] <_joe_> mobrovac: I think the fix for that scap issue is like one line of jinja [16:21:14] (03CR) 10Ema: [C: 032] Add bgp.ip unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355425 (owner: 10Mark Bergsma) [16:21:35] ema: o/ [16:21:51] <_joe_> but tbh, I'm not really going to fix this now. but you just need to define a filter [16:22:03] _joe_: right, but in scap [16:22:23] <_joe_> yes [16:22:32] (03CR) 10Ema: "recheck" [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 (owner: 10Mark Bergsma) [16:22:39] so we still need to find a work-around for now [16:22:42] <_joe_> scap has jinja templates, right? [16:22:51] <_joe_> or do you define your own templates? 
[16:23:40] <_joe_> because we can hotfix this in your own template, in case [16:23:46] 10Operations, 10Ops-Access-Requests, 10Research: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512337 (10RobH) Please note the following must be met, some already have, and I've checked those off: [x] - user provided ssh key, wikitech username, and shell username... [16:24:04] <_joe_> I'll be back later anyways [16:24:24] _joe_: afaik, from the template itself you can only use jinja constructs available in the env passed by python [16:24:59] <_joe_> mobrovac: you can define a macro though [16:25:12] <_joe_> I know it's horrible, but it might work [16:25:18] let me do an ugly fix now, and we can bike-shed later [16:25:27] etoobusytoday [16:25:39] <_joe_> the correct way to fix this is to add a finalize function to the environment [16:25:59] (03PS3) 10EBernhardson: Switch elastic1017-1031 to niofs [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) [16:26:17] <_joe_> but that must be done in scap [16:26:41] 10Operations, 10Ops-Access-Requests, 10Research: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3513052 (10DarTar) @robh Diego reports to me and this request is approved on my end. The credentials should be tied to his staff account indeed. [16:26:52] correct _joe_ [16:27:04] * mobrovac has been waiting on that to happen, but no luck so far [16:27:47] mobrovac: ruined your schedule sorry :( [16:27:56] elukey: fix coming in a few [16:28:22] (03PS4) 10Mark Bergsma: Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 [16:28:22] FYI: I'll be deploying RESTBase and it will be creating a ton of keyspaces, so there might be some intermittent alerts. [16:28:28] elukey: o/ [16:29:11] (03CR) 10Dzahn: "any way to test this that isn't deploy in prod?" 
[puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [16:29:49] !log T172384: Upgrading Cassandra in RESTBase dev to 3.11.0-wmf2 (patched to disable use of FastThreadLocal) [16:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:01] T172384: OOM exceptions in dev environment - https://phabricator.wikimedia.org/T172384 [16:30:11] Krenair: I'm not [16:30:19] (03CR) 10EBernhardson: "puppet compiler looks reasonable: http://puppet-compiler.wmflabs.org/7374/" [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [16:30:45] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation [16:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:11] (03PS1) 10RobH: New shell user dsaez [puppet] - 10https://gerrit.wikimedia.org/r/370841 (https://phabricator.wikimedia.org/T172891) [16:31:46] (03CR) 10Gehel: [C: 04-1] Switch elastic1017-1031 to niofs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [16:32:00] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3513071 (10RobH) 05Open>03stalled a:03RobH >>! In T172891#3513052, @DarTar wrote: > @robh Diego reports to me and this request is approved on... [16:32:24] (03CR) 10Dzahn: PHAB: deployment scripts to be called by scap (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [16:32:33] hey robh. 
one question about T172891 [16:32:33] T172891: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891 [16:32:43] (03PS1) 10Mobrovac: [UGLY-FIX] EventStreams: Double-quote `false` to avoid `False` [puppet] - 10https://gerrit.wikimedia.org/r/370842 (https://phabricator.wikimedia.org/T172681) [16:32:50] Dereckson, :( [16:32:57] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3513075 (10diego) @RobH: I would prefer to have just one account, with 'diego' as username. I can delete the personal one. [16:33:26] robh: is there a way to merge to the two accounts for Diego? He wants to keep diego as the shell access and username if possible. I see that you're moving fast and I figured I message here. ;) [16:33:42] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [16:33:43] (03PS2) 10Mobrovac: [UGLY-FIX] EventStreams: Double-quote `false` to avoid `False` [puppet] - 10https://gerrit.wikimedia.org/r/370842 (https://phabricator.wikimedia.org/T172681) [16:33:59] leila: im not sure what the question is? he wants to delete his personal one? [16:34:01] Topic estreams/arg-ugly-fix [16:34:02] lol [16:34:17] he made his personal account with diego [16:34:45] elukey: running pcc now [16:34:59] robh: the only thing he wants is for "diego" to be the shell access account. if that means we have to delete the volunteer account, it's fine with him. 
[16:35:31] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation (duration: 04m 46s) [16:35:35] ok, we can make diego his login name [16:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:41] (03CR) 10Filippo Giunchedi: [C: 04-1] "One host missing, plus bikeshedding, I haven't run PCC yet on the latest PS yet though" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [16:35:45] he listed dsaez [16:35:47] not me. [16:35:59] just FYI my username in wikitech is diego, but my 'Instance shell account name:' is dsaez (diego was not available) [16:36:36] leila: I have no idea how feasible it is to merge his two wikitech accounts, but also not sure why you are asking that it happen? [16:36:47] also, fwiw, you can have a volunteer acccount and a non-volunteer account at the same time, using different email addresses. [16:36:49] We can make his shell login name 'diego' or whatever he wants as long as it isnt taken [16:36:56] robh: he said that because he initially thought diego is taken by someone else. Now he knows diego was his own account, only with his volunteer hat on. [16:37:11] leila: his wikitech name and shell name do not need to match [16:37:15] i can make his login diego [16:37:34] and tie it to his dsaez wiktiech which is his work wikitech no problem [16:37:49] so i think this is easier to fix than you both may realize? (no need to merge wikitech accounts) [16:37:54] sure. making his login as diego sounds good to him. thanks, robh [16:37:57] in fact i advise he not merge his volunteer and staff accounts. [16:38:10] sure. [16:38:13] as someday he may be a volunteer again, and trying to unsort things at that point is terrible ;D [16:38:23] hopefully not too son. ;P [16:38:24] but yeah, ill modify patchset so diego will be his shell username [16:38:35] thank you, robh. 
[16:38:38] welcome =] [16:38:40] elukey: the pcc output is not useful because scap deploy-local needs to run for this, could you +2 and run puppet on scb2001 only so that we can see if it works? [16:38:41] * leila is happy and goes to grab lunch. [16:39:10] (03PS2) 10RobH: New shell user diego [puppet] - 10https://gerrit.wikimedia.org/r/370841 (https://phabricator.wikimedia.org/T172891) [16:39:43] (03CR) 10Elukey: [C: 032] [UGLY-FIX] EventStreams: Double-quote `false` to avoid `False` [puppet] - 10https://gerrit.wikimedia.org/r/370842 (https://phabricator.wikimedia.org/T172681) (owner: 10Mobrovac) [16:40:51] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3513095 (10RobH) @diego: So I think there is a bit of confusion, which I think I can clear up. We have to make a shell account, and when we do so,... [16:42:14] mobrovac: + api.version.request: "\x22false\x22" - not really better [16:42:32] (03PS5) 10Mark Bergsma: Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 [16:42:34] (03PS4) 10Mark Bergsma: Fix IPPrefix value comparisons with different packed paddings [debs/pybal] - 10https://gerrit.wikimedia.org/r/356611 [16:42:38] (03PS4) 10Mark Bergsma: Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 [16:42:40] (03PS3) 10Mark Bergsma: Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 [16:42:47] (03CR) 10Ema: [C: 032] Add basic unit tests for protocol BGP send methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 (owner: 10Mark Bergsma) [16:42:48] but in the config.yaml api.version.request: "false" [16:42:56] mmmm should be ok no? 
[16:43:12] (03PS6) 10Mark Bergsma: Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 [16:43:24] elukey: merda, no :/ [16:43:33] we need it as a bool, not a string [16:43:35] damn [16:43:53] (03CR) 10Filippo Giunchedi: "What's broken exactly btw? I tried launching exceptionmonitor on mwlog and it seemed to be working?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/368522 (owner: 10MaxSem) [16:44:27] * urandom waves at mobrovac [16:44:34] mobrovac: really? It is a thing that will go straight to librdkafka no? [16:44:34] heh [16:44:38] where from urandom? [16:44:48] the cafeteria [16:45:03] where i am deploying a patched build of cassandra [16:45:23] :/ [16:45:25] mobrovac: elukey I'm working on a few rounds of scap fixes now. I will prioritize the jinja2 pythonic values fix. Meantime a macro may be your best bet. [16:45:40] elukey: can you check node-rdkafka's source code to see how is the config relayed to rdkafka please? [16:46:58] mobrovac: sure! [16:47:10] thcipriani: any hint in what is the best way to fix this? 
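The exchange above turns on a Python boolean ending up in a config file as `False` where a lowercase `false` boolean was wanted, and on the fact that quoting it produces a string instead. A minimal sketch of that general behaviour follows, using only the standard library. It is not scap's code or the EventStreams template: `render_scalar` is a hypothetical helper and `client.id` is an illustrative key, both assumptions made for the example.

```python
import json


def render_scalar(value):
    """Hypothetical helper: render a Python value as a YAML-style scalar."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return str(value)
    # json.dumps gives a double-quoted string, which stays a string in YAML.
    return json.dumps(str(value))


# Assumed sample settings; only api.version.request comes from the discussion.
config = {"api.version.request": False, "client.id": "eventstreams"}

# Naive interpolation relies on str(), so the boolean comes out Python-style.
naive = [f"{key}: {value}" for key, value in config.items()]

# Explicit rendering keeps the value a lowercase, unquoted boolean.
explicit = [f"{key}: {render_scalar(value)}" for key, value in config.items()]

print(naive[0])
print(explicit[0])
print(render_scalar("false"))
```

The last call shows the other half of the problem raised in the channel: passing the string "false" yields a quoted scalar, which a consumer expecting a boolean would read as a string.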
[16:51:38] mobrovac: mmm from the examples we might need a bool - https://github.com/Blizzard/node-rdkafka [16:52:31] there is an example of var producer = new Kafka.Producer({ in which one of the options is 'dr_cb': true [16:55:01] and from https://github.com/edenhill/librdkafka/blob/0.9.4.x/CONFIGURATION.md - api.version.request -> Type: boolean [16:59:53] elukey: mobrovac one very ugly way to fix it might be to use one of the builtin tests in the template for eventstreams http://jinja.pocoo.org/docs/2.9/templates/#list-of-builtin-tests [17:01:52] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3513178 (10RobH) [17:02:11] something like: <% if val is equalto "False" %><%= key %>: false<% else %><%= key %>: <%= val %><% endif %> [17:03:11] thcipriani: elukey: yeah, iterate over the hash and check i guess [17:03:12] :( [17:03:36] (03CR) 10Gehel: Switch elastic1017-1031 to niofs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [17:03:57] yeah, pretty gross :( [17:04:22] I'll get a proper fix in scap shortly [17:05:14] mobrovac: I am reading https://github.com/edenhill/librdkafka/wiki/Broker-version-compatibility and 'api.version.request=false # default, not need to set (yet)' seems already good for 0.9 brokers.. [17:05:31] and https://github.com/edenhill/librdkafka/blob/0.9.4.x/CONFIGURATION.md (that should be what we have on scb nodes) lists false as default for that option [17:05:35] sooo we might just remove it [17:05:44] even if I wanted that to be there [17:05:48] (03PS1) 10Matěj Suchánek: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370846 [17:06:12] elukey: that's awesome! 
yes, let's do that please [17:12:57] (03PS1) 10Elukey: role:eventstreams: unset api.version.request due to a scap bug [puppet] - 10https://gerrit.wikimedia.org/r/370850 (https://phabricator.wikimedia.org/T172681) [17:13:01] mobrovac: ---^ [17:13:37] (03Abandoned) 10Thcipriani: Jobrunner: create dsh groups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/368476 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [17:13:39] (03CR) 10Elukey: [C: 032] role:eventstreams: unset api.version.request due to a scap bug [puppet] - 10https://gerrit.wikimedia.org/r/370850 (https://phabricator.wikimedia.org/T172681) (owner: 10Elukey) [17:19:09] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3513283 (10Jgreen) Let's do some initial testing using mintaka and frdb2001. It should be fine to connect the second interface on each of those anytime, since... [17:19:26] mobrovac: ok proceeding with the restarts in codfw then [17:23:23] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Rollback on canary [17:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:50] 10Operations, 10monitoring: Nrpe command_timeout and "Service Check Timed Out" errors - https://phabricator.wikimedia.org/T172921#3513290 (10herron) [17:24:50] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Rollback on canary (duration: 01m 27s) [17:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:08] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15639 bytes in 0.082 second response time [17:27:54] 10Operations, 10monitoring: Nrpe command_timeout and "Service Check Timed Out" errors - https://phabricator.wikimedia.org/T172921#3513337 (10herron) Temporarily setting command_timeout=90 in /etc/nagios/nrpe_local.cfg on the monitored system fixes it, so will submit a patch for 
this. einsteinium:~# /usr/lib... [17:29:43] elukey: looking good! (sorry, was eating lunch) [17:29:46] (03CR) 10EBernhardson: Switch elastic1017-1031 to niofs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [17:29:53] ok mobrovac depooling scb1001 (only eventstreams), then apply/restart, repool [17:30:20] (03PS4) 10EBernhardson: Switch elastic1017-1031 to niofs [puppet] - 10https://gerrit.wikimedia.org/r/370834 (https://phabricator.wikimedia.org/T169498) [17:30:57] elukey: \o/ [17:31:16] !log changed mediawiki/vendor to fast forward only [17:31:19] James_F: ^^ [17:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513349 (10Papaul) @Marostegui since th BBU is on the raid controller, we will have to replace the whole controller . I will receive the new controller tomorrow. Please see below for case n... [17:32:41] 10Operations, 10monitoring: Nrpe command_timeout and "Service Check Timed Out" errors - https://phabricator.wikimedia.org/T172921#3513290 (10fgiunchedi) Thanks @herron ! Indeed the check is slow when the raid controller is busy and the machines have lots of traffic [17:36:00] mobrovac: any idea where to look for logs/errors? I don't see any on the host [17:36:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513360 (10Marostegui) @Papaul Thanks a lot! That was fast! :-) If you receive it tomorrow we can try to replace it next week if you like (as with Wikimania going on, my schedule is a bit err... [17:36:56] https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1&from=now-5m&to=now is not loading even for the last 5 mins, too many metrics.. 
[17:37:00] ottomata: --^ [17:37:11] (03CR) 10Ema: [C: 032] Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 (owner: 10Mark Bergsma) [17:37:14] elukey: on each node, you have the tail-eventstreams command [17:37:20] (03CR) 10Ema: [V: 032 C: 032] Add BGP.parseOpen unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355795 (owner: 10Mark Bergsma) [17:37:43] elukey: logs are also sent to logstash, but i dunno if ottomata created a dashboard for it [17:37:48] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [17:38:48] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.005 second response time [17:39:41] this was probably me trying to load metrics --^ [17:39:45] (03CR) 10MaxSem: "Current output:" [puppet] - 10https://gerrit.wikimedia.org/r/368522 (owner: 10MaxSem) [17:40:23] (03PS2) 10MaxSem: logging: Remove exceptionmonitor [puppet] - 10https://gerrit.wikimedia.org/r/368522 [17:40:59] tail-eventstreams is interesting.. [17:41:01] !log Compress innodb on s6 on db2076 - T170662 [17:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:13] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [17:44:50] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3492537 (10RobH) Please note that the old controllers configuration needs to be exported and imported into the new controller, or the system will need to be reimaged. [17:48:26] mobrovac: done, sadly still seeing the error on kafka [17:48:47] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513429 (10Marostegui) Thanks for pointing that out @RobH! @Papaul do you need us to shutdown the host for you to export the config? 
As a side note, given that codfw is passive, if data is lo... [17:48:58] elukey: *sigh* [17:49:05] (03PS1) 10Herron: Add 90s command_timeout override to nrpe_local.cfg [puppet] - 10https://gerrit.wikimedia.org/r/370858 (https://phabricator.wikimedia.org/T172921) [17:49:10] is ottomata around to help with this? [17:50:46] !log Add db2076 to tendril - T170662 [17:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:58] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [17:51:01] hi, kinda i can help [17:51:07] am at wikimania, but am at compy now [17:51:13] mobrovac: not really super urgent, we have been seeing it from the 28th, it can wait [17:51:16] elukey: what's up? [17:51:21] trying the api version deploy? [17:51:32] ottomata: already did but seems not helping much :( [17:51:56] aye [17:52:07] 10Operations, 10Ops-Access-Requests: Requesting access to wikimedia-tech-channel-op for Luke081515 - https://phabricator.wikimedia.org/T172793#3513453 (10RobH) We don't have an ops meeting next week, so I've forwarded this approval for team review via email. FWIW: I support this request, since @Luke081515's... [17:52:12] elukey: i think adding partitions will be easy and mostly just work.... [17:52:22] but it isn't very tested with eventstreams, since the offsets are sent to consumers [17:52:25] pretty sure it would just work [17:52:26] haha [17:52:26] buuuut [17:52:42] :D :D :D [17:52:42] 10Operations, 10Ops-Access-Requests: Requesting access to wikimedia-tech-channel-op for Luke081515 - https://phabricator.wikimedia.org/T172793#3513454 (10RobH) a:03RobH [17:53:37] ottomata: it is also weird that 26 connected streams are generating imbalance in kafka brokers [17:53:51] but maybe one partition is not enough.. 
[17:54:56] elukey: ottomata: ok, i'm out of ES for now, gotta attend other problems sorry [17:55:08] mobrovac: all good, thanks for the help :) [17:55:31] 10Operations, 10Ops-Access-Requests: Requesting access to wikimedia-tech-channel-op for Luke081515 - https://phabricator.wikimedia.org/T172793#3509528 (10Dzahn) I think since we already said yes for -operations, -tech shouldn't be a problem is kind of implied. [17:55:48] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [17:56:05] legoktm: Nice, thanks. [17:56:20] hmm, yeah, elukey should only be around 600 per second [17:56:25] total for 30ish consumers [17:56:59] Upload 50x already recovered, /thumb/ related afaics [17:57:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:58:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:58:48] (03PS1) 10Mobrovac: Add certificates for restbase101[678] [labs/private] - 10https://gerrit.wikimedia.org/r/370859 [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170809T1800). Please do the needful. [18:00:04] RoanKattouw and Pchelolo: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:22] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513472 (10RobH) So, I'm working off recollection here, and @papaul will have to confirm if it is how he would do this. With the hardware raid controller, it stores the config on the disks.... 
[18:00:58] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 7255 [18:01:40] ottomata: the other problem is that https://grafana-admin.wikimedia.org/dashboard/db/eventstreams is not loading [18:01:43] too many metrics [18:01:45] :( [18:01:49] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3513481 (10Papaul) @elukey @jcrespo I open a case with Dell and we are working on the issue. The Dell engineer asked to generate the sosreport the SupportAssist Collection and the RAID Controller Log. He is reviewin... [18:02:30] I'm here for SWAT [18:02:47] (03CR) 10Filippo Giunchedi: [C: 031] Add certificates for restbase101[678] [labs/private] - 10https://gerrit.wikimedia.org/r/370859 (owner: 10Mobrovac) [18:02:58] yeah elukey :/ [18:03:06] i guess we should stop producing consumer lag [18:03:09] too many different consumers [18:03:47] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Add certificates for restbase101[678] [labs/private] - 10https://gerrit.wikimedia.org/r/370859 (owner: 10Mobrovac) [18:05:39] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:06:42] I can SWAT [18:06:56] (03CR) 10Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [18:07:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513483 (10Marostegui) Thanks Rob! Let me know if I can help in anyway (the host is depooled by the way). So we can shut it down when needed, we just need to stop MySQL first. 
[18:07:20] (03PS11) 10Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [18:09:53] thcipriani: a bit early of a comment? [18:10:13] !log restart es on elastic1018 to test interleaved numa [18:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:24] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 306 bytes in 0.035 second response time [18:10:36] greg-g: swat is now, just got pinged :) [18:10:37] thcipriani: ignore me [18:10:42] thcipriani: I can't math [18:10:45] I'm taking a look at thumbor [18:10:45] #TimezoneThings [18:10:56] godog: thx lemme know if you need me to do anything [18:10:58] sorry about the page [18:11:02] robh: will do! thanks [18:11:26] currently I'm trying to remember any sort of info about computed dblists for https://gerrit.wikimedia.org/r/#/c/370064/ [18:11:33] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time [18:11:41] <_joe_> godog: are you looking into it? [18:11:48] _joe_: I am [18:12:01] group1 seems valid...but I half-remember something like we don't want to calculate dblists on the fly [18:12:22] <_joe_> ok, I was at dinner, I'll checkin later to see if you need help then [18:13:39] thcipriani: never heard of it, but if we don't I can amend to list the wikis explicitly [18:13:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:15:14] there's a bunch of other places where we have similar usage of dblists, page previews enabling for example [18:16:10] Although they're not dynamic.. [18:17:49] Pchelolo: yeah, I think I'm making this up. Will go ahead with change as is. 
[18:18:14] (03PS4) 10Thcipriani: JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:18:15] lvs1009 may alert, that's us, you can ignore [18:18:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:19:57] (03Merged) 10jenkins-bot: JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:20:11] (03CR) 10jenkins-bot: JobQueueEventBus: Enable on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370064 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [18:20:59] Pchelolo: ^ live on mwdebug1002, check please [18:21:08] checking thcipriani [18:22:08] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:23:58] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:24:19] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:24:28] seems correct thcipriani [18:24:42] Pchelolo: also a heads-up that wikidatawiki is on wmf.11, but is also in group1...still fine? 
[18:25:19] thcipriani: lemme make sure [18:25:22] thanks [18:25:38] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:06] !log Restarting Cassandra, restbase2001-a.codfw.wmnet (schema jankiness) [18:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513527 (10Papaul) Once in the RAID controller BIOS we should have an option to import foreign config under controller 0 F2/ foreign config/import @Marostegui next week works for me. [18:29:59] hm actually thcipriani it's not ok, wmf.11 doesn't have this apparently: https://gerrit.wikimedia.org/r/#/c/368264/1 [18:30:09] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:30:29] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [18:31:09] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-07-19 10:52:04 +0000 (expires in 343 days) [18:31:19] Pchelolo: hrm, I could backport that, or revert for now. twentyafterfour what's the blocker for getting wikidatawiki up to wmf.13? [18:31:28] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [18:32:40] sorry didn't realise wikidatawiki was not up to date.. 
[18:33:32] 10Operations, 10Thumbor: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3513559 (10fgiunchedi) [18:33:34] (I think the blocking task is https://phabricator.wikimedia.org/T171370 but it's twentyafterfour who's been working on it/knows things :)) [18:34:46] Pchelolo: no worries, it's definitely not SOP for the train. Do you just want me to revert the config change for now? [18:34:57] hm thcipriani I think let's revert for now, I'll monitor that task and either backport or exclude wikidatawiki and put it on tomorrow SWAT [18:35:07] * thcipriani does [18:35:13] Thank you [18:36:05] !log Restarting Cassandra, restbase2001-b.codfw.wmnet (schema jankiness) [18:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:23] (03PS1) 10Thcipriani: Revert "JobQueueEventBus: Enable on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370862 (https://phabricator.wikimedia.org/T163380) [18:36:34] (03CR) 10Krinkle: [C: 031] "@Filippo: It works in so far that it essentially just echoes the last few entries in exception.log (with a "1" prepended). 
Using 'cat' and" [puppet] - 10https://gerrit.wikimedia.org/r/368522 (owner: 10MaxSem) [18:37:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370862 (https://phabricator.wikimedia.org/T163380) (owner: 10Thcipriani) [18:37:15] Pchelolo: sorry for all the confusion :( [18:37:23] (03PS1) 10Marostegui: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370863 (https://phabricator.wikimedia.org/T151029) [18:38:08] PROBLEM - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:38:25] (03Merged) 10jenkins-bot: Revert "JobQueueEventBus: Enable on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370862 (https://phabricator.wikimedia.org/T163380) (owner: 10Thcipriani) [18:38:35] (03CR) 10jenkins-bot: Revert "JobQueueEventBus: Enable on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370862 (https://phabricator.wikimedia.org/T163380) (owner: 10Thcipriani) [18:39:08] RECOVERY - cassandra-b SSL 10.192.16.163:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-b valid until 2018-07-19 10:52:06 +0000 (expires in 343 days) [18:39:40] !log Stop replication on db2045 to fix duplicate keys - T151029 [18:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:51] T151029: duplicate key problems - https://phabricator.wikimedia.org/T151029 [18:41:00] (03PS2) 10Marostegui: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370863 (https://phabricator.wikimedia.org/T151029) [18:44:16] mutante, hey. you at wikimania? 
[18:44:26] He is [18:46:09] (03PS1) 10Ottomata: Allow ANALYTICS_NETWORKS to talk to druid zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/370865 [18:55:56] 10Operations, 10Thumbor: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3513695 (10fgiunchedi) [18:57:24] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation. Attempt 2 [18:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:44] (03PS2) 10Ottomata: Allow ANALYTICS_NETWORKS to talk to druid zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/370865 (https://phabricator.wikimedia.org/T168550) [19:00:00] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170809T1900). Please do the needful. [19:01:01] ... 
[19:01:02] 10Operations, 10Thumbor: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3513727 (10fgiunchedi) AFAICT from the thumbor dashboard at the time of the outage it is the ghostscript engine (and thus PDF processing) spiking up in its request time {F8999951} [19:01:11] (03PS5) 10Mark Bergsma: Fix IPPrefix value comparisons with different packed paddings [debs/pybal] - 10https://gerrit.wikimedia.org/r/356611 [19:01:13] (03PS5) 10Mark Bergsma: Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 [19:01:14] (03PS4) 10Mark Bergsma: Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 [19:03:10] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.180 second response time [19:03:24] (03CR) 10Ema: [C: 032] Fix IPPrefix value comparisons with different packed paddings [debs/pybal] - 10https://gerrit.wikimedia.org/r/356611 (owner: 10Mark Bergsma) [19:05:26] thcipriani: Pchelolo: it's not T171370 blocking. The blockers for wikidata are T164173, T172320 [19:05:26] T172320: Error in Wikibase/client/includes/Changes/InjectRCRecordsJob.php line 120: Bad value for parameter $params: $params['change'] not set. - https://phabricator.wikimedia.org/T172320 [19:05:26] T171370: ERROR: "LBFactory::getEmptyTransactionTicket: WikiPageUpdater::injectRCRecords does not have outer scope" - https://phabricator.wikimedia.org/T171370 [19:05:26] T164173: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173 [19:05:50] Thank you! [19:16:55] (03CR) 10Mobrovac: [C: 04-1] "PCC - https://puppet-compiler.wmflabs.org/compiler02/7376/ no bueno. 
We need to keep Puppet's $cluster variable as 'restbase', since other" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [19:17:47] (03PS6) 10Mark Bergsma: Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 [19:17:49] (03PS5) 10Mark Bergsma: Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 [19:18:40] thcipriani: is swat done? so I can push: https://gerrit.wikimedia.org/r/#/c/370863/ ? [19:18:48] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3513808 (10Fjalapeno) 05Open>03Resolved To follow up here, we... [19:18:55] marostegui: yes, swat is complete [19:19:02] \o/ thanks! [19:19:05] (03CR) 10Ema: [V: 032 C: 032] Add basic BGP.parseUpdate test case [debs/pybal] - 10https://gerrit.wikimedia.org/r/356612 (owner: 10Mark Bergsma) [19:19:09] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370863 (https://phabricator.wikimedia.org/T151029) (owner: 10Marostegui) [19:20:38] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3444393 (10Fjalapeno) [19:20:43] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370863 (https://phabricator.wikimedia.org/T151029) (owner: 10Marostegui) [19:20:53] (03CR) 10jenkins-bot: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370863 (https://phabricator.wikimedia.org/T151029) (owner: 10Marostegui) [19:21:53] 10Operations, 10Android-app-feature-Compilations, 
10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3513823 (10Fjalapeno) [19:22:04] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2045 - T151029 (duration: 00m 51s) [19:23:06] (03PS6) 10Mark Bergsma: Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 [19:23:10] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:24:42] (03CR) 10Ema: [C: 032] Add BGP.parse{KeepAlive,Notification} test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/356620 (owner: 10Mark Bergsma) [19:26:12] (03PS2) 10TerraCodes: Update InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [19:26:22] (03CR) 10TerraCodes: [C: 031] Update InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [19:30:21] !logj moving forward with wmf.13 but leaving wikidata behind once again. [19:33:50] twentyafterfour wrong log :) [19:34:04] anyways seems stashbot has quit [19:34:11] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [19:34:23] wrong log? how so? [19:34:40] deployments get logged in -operations [19:34:41] twentyafterfour !logj [19:34:45] oh lol [19:34:54] yep [19:34:55] java log? [19:34:58] heheh [19:35:00] lol [19:36:43] hi twentyafterfour [19:36:49] aude: hi [19:37:11] we didn't get around to having a new deployment bulid for wikidata that fixes the issues [19:37:24] so keep wikidata where it is seems best [19:37:34] I see that. Ok thank you! 
[19:37:50] hopefully while we are at the hackathon, we can get the issues resolved [19:38:15] we might want to fix https://phabricator.wikimedia.org/T172592 in swat sometime though [19:38:27] probably tomorrow [19:39:46] ok thank you aude! [19:39:51] sure [19:40:01] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation. Attempt 2 (duration: 42m 38s) [19:40:15] I can deploy T172592 for you if you'd like [19:40:26] we don't have a patch for it yet [19:40:41] and has to be carefully tested etc [19:41:00] ok let me know if I can help [19:41:04] sure [19:41:25] I tried to restart stashbot, getting an error I'm not sure how to address [19:41:52] hmm [19:43:09] see -cloud [19:43:26] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Rollback on canary [19:43:28] bryan et al are doing their "what is cloud services" session at the hackathon right now (I'm there) I'll get bryan on it after [19:44:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513915 (10RobH) >>! In T172265#3513483, @Marostegui wrote: > Thanks Rob! Let me know if I can help in anyway (the host is depooled by the way). So we can shut it down when needed, we just ne... [19:44:55] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Rollback on canary (duration: 01m 30s) [19:45:11] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15639 bytes in 0.081 second response time [19:49:53] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: es2013 faulty BBU - https://phabricator.wikimedia.org/T172265#3513934 (10Marostegui) >>! In T172265#3513915, @RobH wrote: >>>! In T172265#3513483, @Marostegui wrote: >> Thanks Rob! Let me know if I can help in anyway (the host is depooled by the way). S... 
[19:55:30] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 327 bytes in 1.001 second response time [19:57:20] (-cloud is already investigating about stashbot) [19:59:08] (03PS1) 1020after4: All group1 wikis except wikidata to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370882 (https://phabricator.wikimedia.org/T170631) [19:59:10] (03CR) 1020after4: [C: 032] All group1 wikis except wikidata to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370882 (https://phabricator.wikimedia.org/T170631) (owner: 1020after4) [19:59:40] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4851 bytes in 0.017 second response time [20:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170809T2000). Please do the needful. [20:00:51] Nothing for ORES today [20:00:53] Hackathon! 
[20:01:05] same for parsoid [20:01:49] (03PS1) 10Mark Bergsma: Allow BGP socket to listen on specific IPs only [debs/pybal] - 10https://gerrit.wikimedia.org/r/370894 (https://phabricator.wikimedia.org/T103882) [20:03:01] (03Merged) 10jenkins-bot: All group1 wikis except wikidata to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370882 (https://phabricator.wikimedia.org/T170631) (owner: 1020after4) [20:03:14] (03CR) 10jenkins-bot: All group1 wikis except wikidata to 1.30.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370882 (https://phabricator.wikimedia.org/T170631) (owner: 1020after4) [20:04:44] (03PS2) 10Mark Bergsma: Allow BGP socket to listen on specific IPs only [debs/pybal] - 10https://gerrit.wikimedia.org/r/370894 (https://phabricator.wikimedia.org/T103882) [20:05:55] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: All group1 except wikidata to 1.30.0-wmf.13 refs T170631 [20:11:05] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: Actually sync all group1 wikis (except wikidata) to 1.30.0-wmf.13 refs T170631 [20:11:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [20:12:40] RoanKattouw: Salon 7, Level 3 [20:13:07] twentyafterfour: maybe you need to relog later, when stashbot is fully available again? [20:13:58] Sagan: that wasn't just for the sake of logging, those were scap actions on tin [20:14:51] Krinkle: Coming, I was in another meeting that ran late [20:14:51] this is not even from the new branch but I see a couple of random instances of this error: 'Variable 'wgCirrusSearchInterleaveConfig' is not set.' 
in /srv/mediawiki/php-1.30.0-wmf.12/maintenance/getConfiguration.php:105 [20:15:41] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.117 second response time [20:20:55] stashbot works again now :) [20:20:55] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [20:22:12] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: Actually sync all group1 wikis (except wikidata) to 1.30.0-wmf.13 refs T170631 [20:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:28] T170631: 1.30.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T170631 [20:23:42] (03PS1) 10Gilles: Upgrade to 1.2 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/370907 (https://phabricator.wikimedia.org/T161719) [20:25:14] twentyafterfour: thats odd, wgCirrusSearchInterleaveConfig is definatly set in the main CirrusSearch.php file in wmf.12 [20:25:29] yeah it's strange [20:25:45] it seems it's getting called out of context from a mwscript somewhere [20:26:00] but it's probably totally isolated ( I didn't see it pop up more than twice ) [20:26:39] yea it's probably not a big deal. The crazy mwscript thing there is because if you ask mediawiki wgConf to give you the config of another wiki, it shells out :S We have a work in progress to get that info via api instead but it's not rolled out [20:26:54] (it's used when you type say, russian into french wikipedia, and we find nothing in french so search russian instead) [20:27:10] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#3487493 (10Kelson) Why not using a standard like OPDS? The WMF is granting developmen... 
[20:32:33] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370919 [20:32:37] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370919 [20:34:46] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370919 (owner: 10Marostegui) [20:35:18] ebernhardson: There is a way to get $wgConf stuff for another wiki but it's tricky and annoying [20:35:18] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation. Attempt 3 [20:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:33] We do it in Echo for the API URL I think [20:36:13] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370919 (owner: 10Marostegui) [20:37:09] (03PS1) 10Jdlrobson: Correct config - commonswiki not commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370924 (https://phabricator.wikimedia.org/T170687) [20:37:17] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370919 (owner: 10Marostegui) [20:37:18] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2046 - T170662 (duration: 00m 49s) [20:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:31] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662 [20:37:34] 10Operations, 10Thumbor: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3514134 (10fgiunchedi) Below there's a list of top 20 files that failed to get converted today, unsurprisingly lots of pdfs there. Checking the first file it seems ghostscript hangs... 
[20:38:30] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [20:38:57] (03CR) 10Eevans: "> PCC - https://puppet-compiler.wmflabs.org/compiler02/7376/ no" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [20:39:11] (03PS12) 10Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [20:43:23] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation. Attempt 3 (duration: 08m 05s) [20:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:30] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15639 bytes in 0.109 second response time [20:46:41] (03CR) 10Eevans: "New [PC output](http://puppet-compiler.wmflabs.org/7377)" [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [20:47:31] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation. 
All tables created, finish deploy [20:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:34] (03CR) 10Filippo Giunchedi: [C: 04-1] Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [20:49:57] gah, the -1 wasn't supposed to be there (cc urandom) [20:51:38] !log Remove m3 replication from dbstore1002 - T156758 [20:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:51] T156758: Drop m3 from dbstore servers - https://phabricator.wikimedia.org/T156758 [20:54:31] PROBLEM - Disk space on graphite2002 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 74655 MB (3% inode=97%) [20:54:43] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Temporary fallback to the new storage buckets before truncation. All tables created, finish deploy (duration: 07m 12s) [20:54:48] (03PS3) 10Mark Bergsma: Allow BGP socket to listen on specific IPs and port [debs/pybal] - 10https://gerrit.wikimedia.org/r/370894 (https://phabricator.wikimedia.org/T103882) [20:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:03] !log ppchelko@tin Started deploy [restbase/deploy@f97beeb]: Rollback on canary [20:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:43] the graphite disk alerts are sort of expected, that's the cassandra machines. I'll add 100G from the vg cc mobrovac urandom [21:00:04] MaxSem and Niharika: Respected human, time to deploy Deploying CodeMirror (take two) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170809T2100). Please do the needful. [21:00:12] weeeeeee [21:00:17] o/ [21:00:55] MaxSem: You're in charge. I'll help test when you get it deployed. [21:01:06] Tell me if you need me to do anything. 
:) [21:01:30] RECOVERY - Disk space on graphite2002 is OK: DISK OK [21:01:43] !log add 100G to carbon lv on graphite1003 and graphite2002 [21:01:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [21:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:27] Niharika, so we can't test the CM patch in prod because testwiki is already on wmf.13 [21:05:25] MaxSem: Aren't we deploying for wmf.13 and wmf.12 both? [21:05:33] !log ppchelko@tin Finished deploy [restbase/deploy@f97beeb]: Rollback on canary (duration: 07m 30s) [21:05:40] Or just wmf.13? [21:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:48] yeah, we need to patch wmf.12 only [21:06:27] Okay then. [21:06:27] Niharika: hi, wrt to the repo setup request you left at Phab, I left you a message for when you've got time. Regards. [21:07:10] TabbyCat: Which request? [21:07:29] Niharika: sorry, SamanthaNguyen's one [21:07:40] not sure why I always confuse you too [21:07:43] *two [21:07:46] meh, just pushin' it [21:08:19] MaxSem: We can test on group0 and group1 wikis though? 
[21:08:42] yep, you can test on testwiki right now:) [21:10:17] (03PS13) 10Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) [21:10:40] (03CR) 10Eevans: Reshape RESTBase Cassandra production cluster; Provision new 3.x cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [21:11:04] (03PS1) 10MaxSem: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370943 [21:11:08] !log maxsem@tin Synchronized php-1.30.0-wmf.12/extensions/CodeMirror: https://gerrit.wikimedia.org/r/#/c/370904/ (duration: 00m 51s) [21:11:10] (03PS2) 10MaxSem: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370943 [21:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:31] (03CR) 10MaxSem: [C: 032] Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370943 (owner: 10MaxSem) [21:13:10] (03Merged) 10jenkins-bot: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370943 (owner: 10MaxSem) [21:13:23] (03CR) 10jenkins-bot: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370943 (owner: 10MaxSem) [21:15:17] Niharika, live on mwdebug1002 [21:15:26] MaxSem: Checking. [21:17:38] MaxSem: Looks good to me. [21:19:08] okie [21:20:53] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: Deploy CodeMirror https://gerrit.wikimedia.org/r/#/c/370943/ (duration: 00m 50s) [21:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:48] MaxSem: Thanks! 
:D [21:22:30] gogogogtesttesttest [21:23:01] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [21:25:36] MaxSem: Dang it. Console errors for Danny and Leon. :( [21:25:40] Works for me. [21:25:44] What about you? [21:25:55] Not breaking anything but not loading either. [21:26:36] MaxSem: Never mind, it was a cache problem for Leon. [21:26:41] Danny testing now. [21:26:43] 10Operations, 10Pybal, 10Traffic, 10monitoring: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3514364 (10ema) [21:26:54] let's wait 15 minutes [21:27:01] Works for them both now. [21:28:56] MaxSem: https://en.wikipedia.org/wiki/Carty?veaction=editsource :( [21:29:44] doesn't edit it, but no console spam [21:30:09] MaxSem: With the new wikitext editor? [21:30:21] It doesn't work. Breaks the NWE. [21:30:31] Console spam with CodeMirror in it. [21:31:36] riiight, I didn't enable NWE [21:31:51] edsanders: I wonder if you can check what's wrong?^^^^ [21:32:24] I want to patch it if we can instead of another rollback. [21:32:46] hmm, loaded for me [21:33:36] MaxSem: Really? No console errors? [21:34:06] clicked random, now see one [21:34:30] https://www.irccloud.com/pastebin/khwPH1Q7/cm-nwe-boom [21:34:46] ffffff [21:35:13] MaxSem: You know what's wrong there? Can't make head or tail of that. [21:36:15] mwConfig is null [21:36:50] Why is that so? [21:37:13] who of us ia a JS dev? :P [21:37:20] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2032168 [21:37:29] It looked like a server config issue [21:37:32] because it works locally [21:37:47] MaxSem: JS dev? That is an insult! :P [21:38:14] mwConfig in CodeMirror is community tech thing [21:40:57] MaxSem: Should we rollback then? It's probably related to the patch for https://phabricator.wikimedia.org/T172458 [21:42:27] edsanders: True. 
It only breaks for NWE though and I wondered if you could make more sense of the error than I could. :) Thanks anyway. [21:43:14] The JS is fine, the data just isn't making it through [21:44:47] Niharika, https://gerrit.wikimedia.org/r/370950 [21:46:09] MaxSem: Deploy on mwdebug1002. [21:46:20] You're awesome, by the way. [21:46:25] :O [21:47:28] seconded! [21:48:24] Niharika, pulled for wmf.12 only [21:48:42] MaxSem: Checking... [21:48:43] (03CR) 10Mobrovac: "LGTM, and PCC concurs - https://puppet-compiler.wmflabs.org/compiler02/7379/ . Note that Puppet pulls in v4 of the metrics collector which" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [21:49:53] okayy, got it [21:50:05] it's a module load dependency [21:50:20] Yeah. [21:50:23] I see it. [21:50:38] ext.CodeMirror.mode.mediawiki needs to depend on ext.CodeMirror [21:50:48] that's why reloading helps [21:50:51] Ah. [21:52:38] https://gerrit.wikimedia.org/r/370954 [21:54:38] (Y) [21:56:18] VM4677:749 Uncaught ReferenceError: CodeMirror is not defined [21:56:23] mehhh [21:56:36] gotta freaking test more in prod [21:56:43] :( [21:56:53] Niharika, I feel we need to revert? [21:57:23] I think we can extend the deploy window since the next one is empty if you think we can fix it? [21:58:08] I'm out of ideas right now - maybe, you understand better what these modules do and how they interact? [21:58:39] MaxSem: I haven't spent enough time with them and Ryan's not around. Well, lets revert then. 
[21:59:20] (03PS1) 10MaxSem: Revert "Redo "Enable CodeMirror everywhere but RTL wikis and wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370957 [21:59:29] (03CR) 10MaxSem: [C: 032] Revert "Redo "Enable CodeMirror everywhere but RTL wikis and wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370957 (owner: 10MaxSem) [21:59:34] :( [21:59:50] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.20 seconds [21:59:50] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.86 seconds [22:00:11] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.59 seconds [22:00:11] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.92 seconds [22:01:11] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 47.74 seconds [22:01:11] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 48.11 seconds [22:01:50] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:01:51] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [22:03:56] (03Merged) 10jenkins-bot: Revert "Redo "Enable CodeMirror everywhere but RTL wikis and wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370957 (owner: 10MaxSem) [22:04:13] (03PS1) 10Marostegui: db2045.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/370958 (https://phabricator.wikimedia.org/T148507) [22:05:54] (03CR) 10jenkins-bot: Revert "Redo "Enable CodeMirror everywhere but RTL wikis and wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370957 (owner: 10MaxSem) [22:06:32] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: Get baaack you broken CodeMirror! 
(duration: 00m 51s) [22:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:50] Niharika, game over [22:06:52] :P [22:07:01] Until next time. [22:08:31] (03PS1) 10Marostegui: s6.host: Add db2076 to s6 [software] - 10https://gerrit.wikimedia.org/r/370959 (https://phabricator.wikimedia.org/T170662) [22:08:48] Niharika: third time will be the charm? :) [22:09:19] Third deployment will be. Third rollback, hell no! :P [22:12:02] :) [22:14:24] (03CR) 10Marostegui: [C: 032] s6.host: Add db2076 to s6 [software] - 10https://gerrit.wikimedia.org/r/370959 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [22:14:46] (03PS1) 10Ema: WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) [22:15:25] (03Merged) 10jenkins-bot: s6.host: Add db2076 to s6 [software] - 10https://gerrit.wikimedia.org/r/370959 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [22:17:40] 10Operations, 10Analytics, 10Research: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#3514474 (10Tbayer) >>! In T172410#3512670, @Ottomata wrote: > Correct me if I'm wrong, but a multi-instance MySQL will likely not be very useful for many of the analysis use... 
[22:17:52] (03CR) 10jerkins-bot: [V: 04-1] WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) (owner: 10Ema) [22:27:20] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 7801 [22:29:25] (03PS2) 10Mark Bergsma: WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) (owner: 10Ema) [22:30:41] (03CR) 10jerkins-bot: [V: 04-1] WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) (owner: 10Ema) [22:37:56] (03CR) 10Eevans: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370098 (https://phabricator.wikimedia.org/T169939) (owner: 10Eevans) [22:48:26] (03PS1) 10Filippo Giunchedi: admin: set devscripts variables for filippo [puppet] - 10https://gerrit.wikimedia.org/r/370964 [22:48:58] 10Operations, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3510411 (10mobrovac) +1 on this, but let's wait the (final) outcome related to {T150871} before moving on this. >>! In T1... [22:49:10] (03CR) 10Filippo Giunchedi: [C: 032] admin: set devscripts variables for filippo [puppet] - 10https://gerrit.wikimedia.org/r/370964 (owner: 10Filippo Giunchedi) [22:50:47] (03PS1) 10Daniel Kinzler: Enable injection of RC records on wikidata org. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370966 [22:53:54] (03PS1) 10Filippo Giunchedi: mediawiki: switch gothic/mincho fonts to new package names [puppet] - 10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170809T2300). [23:05:21] (03PS2) 10Filippo Giunchedi: mediawiki: clean up deprecated fonts packages [puppet] - 10https://gerrit.wikimedia.org/r/370969 (https://phabricator.wikimedia.org/T170817) [23:08:33] (03PS3) 10Mark Bergsma: WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) (owner: 10Ema) [23:24:40] (03CR) 10jerkins-bot: [V: 04-1] WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) (owner: 10Ema) [23:27:17] 10Operations, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3514625 (10GWicke) > IMHO, it would be better to launch headless Chrome processes from a service-runner-enabled service, s... [23:31:11] (03PS4) 10Mark Bergsma: WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) (owner: 10Ema) [23:31:28] (03CR) 10Filippo Giunchedi: "See comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) (owner: 10Mobrovac) [23:38:51] (03PS5) 10Ema: WIP: add prometheus metrics [debs/pybal] - 10https://gerrit.wikimedia.org/r/370962 (https://phabricator.wikimedia.org/T171710) [23:55:03] (03PS1) 10Ppchelko: JobQueueEventBus: Enable group1 - wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) [23:56:23] (03CR) 10jerkins-bot: [V: 04-1] JobQueueEventBus: Enable group1 - wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370975 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko)