[00:06:20] RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[00:06:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[00:15:38] (PS1) BryanDavis: logstash: Seed Elasticsearch cluster host [puppet] - https://gerrit.wikimedia.org/r/208576 (https://phabricator.wikimedia.org/T97645)
[01:01:40] PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 6.67% of data above the critical threshold [500.0]
[01:02:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[01:13:00] RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[01:13:07] !log Started logstash cluster relocating indices off of logstash100[1-3] to logstash100[4-6]
[01:13:17] Logged the message, Master
[01:13:40] PROBLEM - puppet last run on mw2194 is CRITICAL puppet fail
[01:16:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[01:17:50] PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 6.67% of data above the critical threshold [500.0]
[01:29:19] RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[01:29:50] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:31:30] RECOVERY - puppet last run on mw2194 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:22:17] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 08m 58s)
[02:22:35] Logged the message, Master
[02:26:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[02:27:04] !log LocalisationUpdate completed (1.26wmf3) at 2015-05-04 02:26:00+00:00
[02:27:12] Logged the message, Master
[02:27:40] PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 6.67% of data above the critical threshold [500.0]
[02:37:20] RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[02:38:00] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[02:44:55] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 07m 33s)
[02:45:03] Logged the message, Master
[02:49:20] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-04 02:48:16+00:00
[02:49:25] Logged the message, Master
[03:30:59] PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:31:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[03:42:20] RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[03:42:59] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:46:53] operations, Architecture, MediaWiki-RfCs, RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1255947 (BBlack) >>! In T97204#1253154, @GWicke wrote: >> this need for retries is limited to Parsoid > > From an overload prevention strategy perspective...
[04:01:45] operations, MediaWiki-extensions-SecurePoll, Elections, I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1255951 (tstarling) SecurePoll determines the user's language as part of session transfer from the jump page. The remote wiki (votewiki) calls bac...
[04:11:41] (CR) Santhosh: [C: -1] CX: Use RESTBase API for page fetch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/207378 (owner: KartikMistry)
[04:22:05] (CR) 20after4: [C: 1] "This change seems fine. I won't deploy it on a sunday though." [mediawiki-config] - https://gerrit.wikimedia.org/r/208262 (owner: Ori.livneh)
[04:27:03] operations, MediaWiki-extensions-SecurePoll, Elections, I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1255965 (Jalexander) Thanks Tim, two quick questions. >>! In T97923#1255951, @tstarling wrote: > SecurePoll determines the user's language as par...
[04:33:10] PROBLEM - puppet last run on ms-be1012 is CRITICAL Puppet has 1 failures
[04:49:10] PROBLEM - puppet last run on mw1205 is CRITICAL Puppet has 1 failures
[04:50:50] RECOVERY - puppet last run on ms-be1012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:03:55] operations, MediaWiki-extensions-SecurePoll, Elections, I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1255972 (tstarling) >>! In T97923#1255965, @Jalexander wrote: > Thanks Tim, two quick questions. > >>>! In T97923#1255951, @tstarling wrote: >> S...
[05:05:12] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[05:12:47] (PS5) KartikMistry: CX: Use RESTBase API for page fetch [puppet] - https://gerrit.wikimedia.org/r/207378
[05:25:40] PROBLEM - puppet last run on db2051 is CRITICAL Puppet has 1 failures
[05:36:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon May 4 05:35:29 UTC 2015 (duration 35m 28s)
[05:36:43] Logged the message, Master
[05:41:49] RECOVERY - puppet last run on db2051 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:43:46] operations, MediaWiki-extensions-SecurePoll, Elections, I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1256027 (Jalexander) >>! In T97923#1255972, @tstarling wrote: >>>! In T97923#1255965, @Jalexander wrote: >> Thanks Tim, two quick questions. >> >...
[05:52:39] !log tstarling Synchronized php-1.26wmf3/extensions/SecurePoll: Iae874c0403a8362929362ca645f4aca18feb0269 (duration: 00m 22s)
[05:52:46] Logged the message, Master
[05:53:17] !log tstarling Synchronized php-1.26wmf4/extensions/SecurePoll: Iae874c0403a8362929362ca645f4aca18feb0269 (duration: 00m 19s)
[05:53:21] Logged the message, Master
[05:56:03] !log on terbium: running populateEditCount-fixup.php on all wikis
[05:56:08] Logged the message, Master
[06:24:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[06:29:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0]
[06:29:19] PROBLEM - puppet last run on ms-fe2003 is CRITICAL puppet fail
[06:30:39] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 1 failures
[06:31:00] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 1 failures
[06:34:39] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:34:49] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures
[06:34:49] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:35:10] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures
[06:35:10] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures
[06:35:10] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:38:59] PROBLEM - puppet last run on mw1215 is CRITICAL Puppet has 1 failures
[06:46:29] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:50] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:30] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:31] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:39] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:47:59] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:59] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:54:49] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:58:36] (PS2) Filippo Giunchedi: gdash: adjust jobq dashboard [puppet] - https://gerrit.wikimedia.org/r/207786 (https://phabricator.wikimedia.org/T87594)
[06:58:46] (CR) Filippo Giunchedi: [C: 2 V: 2] gdash: adjust jobq dashboard [puppet] - https://gerrit.wikimedia.org/r/207786 (https://phabricator.wikimedia.org/T87594) (owner: Filippo Giunchedi)
[07:13:55] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1256106 (Nemo_bis)
[07:18:08] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1256114 (Nemo_bis) (Re-added a blocking task.) There is no hurry to kill bugzilla, please stop boycotting public discussion of the matter (namely, nominations o...
[07:20:41] (CR) Dereckson: [C: 1] Modify AbuseFilter block configuration on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/206510 (https://phabricator.wikimedia.org/T96669) (owner: Glaisher)
[07:22:13] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1256126 (Joe) @Nemo_bis removing a bug tracker that is not used for actual development anymore - or did I miss something? - from production means reducing the...
[07:29:40] operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1256128 (Joe) Since no one really complained about my evaluation, we'll go on with etcd for now.
[07:29:47] operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1256129 (Joe) Open>Resolved
[07:29:47] operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1256131 (Joe)
[07:29:49] operations, Traffic: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1256130 (Joe)
[07:31:10] PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL 7.69% of data above the critical threshold [500.0]
[07:31:41] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[07:34:31] operations, Traffic: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1256139 (Joe) NEW
[07:36:41] operations, Traffic: Be able to deploy confd either as a deb or via trebuchet - https://phabricator.wikimedia.org/T97971#1256145 (Joe) NEW
[07:39:04] operations, Traffic: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1256151 (Joe) NEW
[07:41:15] Given so much unused memory in the job runners, perhaps there should be more runners per machine https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Jobrunners+eqiad&h=&tab=m&vn=&hide-hf=false&m=mem_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[07:41:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[07:41:45] IIRC the issue used to be that not all machines in the job runners cluster had the same amount of memory, which made settings a bit harder
[07:42:29] RECOVERY - HTTP 5xx req/min on graphite1002 is OK Less than 1.00% above the threshold [250.0]
[07:42:31] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256157 (Joe) NEW
[07:45:48] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256164 (Joe)
[07:45:49] operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1256165 (Joe)
[07:45:59] operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1197052 (Joe)
[07:46:00] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256157 (Joe)
[07:46:09] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256157 (Joe)
[07:46:13] operations, Traffic: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1256168 (Joe)
[07:46:34] operations, Traffic: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1256170 (Joe)
[07:46:34] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256157 (Joe)
[07:49:26] Puppet, operations, Traffic: Create a confd puppet module - https://phabricator.wikimedia.org/T97974#1256173 (Joe) NEW
[07:52:02] operations, Traffic: Integrate confd into the varnish configuration to generate the list of active backends - https://phabricator.wikimedia.org/T97975#1256182 (Joe) NEW
[07:53:22] operations, Traffic: Figure out a data layout for etcd that can work for both varnish backends lists and for pybal pools - https://phabricator.wikimedia.org/T97976#1256188 (Joe) NEW
[07:55:43] operations, Traffic: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1256209 (Joe) NEW
[07:56:32] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256217 (Joe)
[07:56:33] operations, Traffic: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1256219 (Joe)
[07:56:35] operations, Traffic: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1256218 (Joe)
[07:56:37] operations, Traffic: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1256209 (Joe)
[08:06:19] (PS1) Muehlenhoff: Update to 3.19.4 (Bug: T97411) [debs/linux] - https://gerrit.wikimedia.org/r/208601
[08:12:58] (PS1) Muehlenhoff: Update to 3.19.5 (Bug: T97441) [debs/linux] - https://gerrit.wikimedia.org/r/208602
[08:24:20] (PS4) MaxSem: WIP: Hierator puppetization [puppet] - https://gerrit.wikimedia.org/r/202743
[08:29:09] (PS1) Muehlenhoff: Update to 3.19.6 (Bug: T97441) [debs/linux] - https://gerrit.wikimedia.org/r/208603
[08:35:18] Puppet, operations, Traffic: Create a confd puppet module - https://phabricator.wikimedia.org/T97974#1256252 (Joe) a:Joe
[08:35:36] Puppet, operations, Traffic: Create a confd puppet module - https://phabricator.wikimedia.org/T97974#1256173 (Joe) p:Low>High
[08:36:09] operations, Traffic: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1256256 (Joe) p:Low>High a:Joe
[08:48:12] operations, MediaWiki-extensions-SecurePoll, Elections, I18n, and 2 others: Cannot select language on votewiki - https://phabricator.wikimedia.org/T97923#1256277 (Jalexander) >>! In T97923#1256027, @Jalexander wrote: > I'm discussing with a couple members of the committee some short and medium term...
[08:56:39] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[08:59:59] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:06:29] (Abandoned) Filippo Giunchedi: ganglia_new: bandaid cleanup /tmp [puppet] - https://gerrit.wikimedia.org/r/207759 (https://phabricator.wikimedia.org/T97637) (owner: Filippo Giunchedi)
[09:48:03] (PS1) Jdlrobson: Enable Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488)
[09:59:09] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[10:00:25] (PS2) Alex Monk: Enable Gather on the English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: Jdlrobson)
[10:00:35] (CR) Alex Monk: [C: -1] "Why just the English one?" [mediawiki-config] - https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: Jdlrobson)
[10:02:29] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0]
[10:11:59] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[10:14:27] (CR) Nemo bis: "Krenair, probably because there was no discussion elsewhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: Jdlrobson)
[10:16:40] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0]
[10:29:21] PROBLEM - puppet last run on cp3008 is CRITICAL puppet fail
[10:29:34] operations, Commons, Multimedia, HHVM, Patch-For-Review: Create an HHVM 3.6.0 package, adding Tim's streaming patch - https://phabricator.wikimedia.org/T93194#1256428 (Joe) The change in jit size had a positive effect in beta, where it prevented hhvm from crashing.
[10:34:20] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[10:35:19] Puppet, operations: puppet masters are maxed out - https://phabricator.wikimedia.org/T97989#1256432 (fgiunchedi) NEW
[10:39:09] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[10:42:29] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[10:47:09] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures
[10:54:03] operations, Patch-For-Review, Swift: swift eqiad capacity planning - https://phabricator.wikimedia.org/T1268#1256456 (fgiunchedi) new machines fully in service at weight 3000, old machines are still freeing up space ``` [2015-05-04 10:14:34] Checking disk usage now Distribution Graph: 7% 2 ** 8%...
[10:56:00] operations, Traffic: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1256457 (Joe) I created the package from the HEAD of the unstable branch of the debian repository http://anonscm.debian.org/cgit/pkg-go/packages/etcd.git/?h=unstable Also, I created the pa...
[10:58:30] (CR) Alexandros Kosiaris: [C: -1] "Various inline comments" (13 comments) [puppet] - https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: coren)
[11:03:29] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[11:09:50] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[11:16:20] PROBLEM - High load for whatever reason on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0]
[11:19:39] RECOVERY - High load for whatever reason on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[11:51:33] (PS1) Alexandros Kosiaris: Assign the parsoid::production role to codfw wtps [puppet] - https://gerrit.wikimedia.org/r/208623 (https://phabricator.wikimedia.org/T90271)
[11:58:59] operations, Traffic: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256533 (Joe) For clusterization I want to use the DNS discovery, which needs SRV records to be defined, as described here: https://github.com/coreos/etcd/blob/v2.0.10/Docum...
[12:02:58] operations, Patch-For-Review: improve cisco boxes raid monitoring - https://phabricator.wikimedia.org/T85529#1256550 (fgiunchedi) Open>stalled stalled, I believe @jgreen has an improved raid monitoring in frack
[12:03:16] PROBLEM - puppet last run on wtp2004 is CRITICAL Puppet has 1 failures
[12:03:45] PROBLEM - puppet last run on wtp2002 is CRITICAL Puppet has 3 failures
[12:07:16] operations, Elasticsearch: unattended elasticsearch restarts - https://phabricator.wikimedia.org/T89845#1256560 (fgiunchedi) restart is easily handled by `es-tool` for single node (see https://gerrit.wikimedia.org/r/#/c/164401/) orchestration across hosts is still missing and can be handled by salt for exa...
[12:14:56] RECOVERY - puppet last run on wtp2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:17:36] RECOVERY - puppet last run on wtp2004 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[12:21:39] (PS1) Filippo Giunchedi: graphite: mirror traffic to codfw [puppet] - https://gerrit.wikimedia.org/r/208626 (https://phabricator.wikimedia.org/T85908)
[12:29:57] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail
[12:31:33] (CR) Alexandros Kosiaris: [C: 2] Assign the parsoid::production role to codfw wtps [puppet] - https://gerrit.wikimedia.org/r/208623 (https://phabricator.wikimedia.org/T90271) (owner: Alexandros Kosiaris)
[12:31:45] operations: Add RAID monitoring for Cisco servers - https://phabricator.wikimedia.org/T85529#1256575 (faidon) p:Normal>Low
[12:32:23] operations, Monitoring: Add RAID monitoring for Cisco servers - https://phabricator.wikimedia.org/T85529#948943 (faidon)
[12:35:41] operations, Monitoring: Add RAID monitoring for Cisco servers - https://phabricator.wikimedia.org/T85529#1256583 (Jgreen) >>! In T85529#1256550, @fgiunchedi wrote: > stalled, I believe @jgreen has an improved raid monitoring in frack The check-raid.py we use in frack does a better job with combination RAI...
[12:35:47] operations, Wikimedia-Bugzilla: analyze Bugzilla access logs - https://phabricator.wikimedia.org/T86859#1256585 (JohnLewis) @dzahn comment on this?
[12:36:25] operations, Monitoring: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1256586 (faidon) NEW
[12:36:53] operations, Phabricator, Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1256595 (JohnLewis) Still pending an approval form @slaporte (or anyone else from legal who deals with data release).
[12:40:53] (CR) John F. Lewis: [C: 1] Change BZ references to Phabricator tickets in MediaWiki module [puppet] - https://gerrit.wikimedia.org/r/207355 (https://phabricator.wikimedia.org/T96431) (owner: Alex Monk)
[12:41:23] bblack: who is taking over ops duty this week?
[12:43:27] PROBLEM - puppet last run on wtp2002 is CRITICAL Puppet has 1 failures
[12:43:27] PROBLEM - puppet last run on wtp2003 is CRITICAL Puppet has 1 failures
[12:43:36] JohnFLewis: ^
[12:43:40] akosiaris: thanks ;)
[12:44:18] (PS1) Alexandros Kosiaris: Parsoid LVS codfw records [dns] - https://gerrit.wikimedia.org/r/208627 (https://phabricator.wikimedia.org/T90271)
[12:46:47] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[12:47:46] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:47:57] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors, please check!
[12:51:04] godog:
[12:51:04] # Monitor for high load consistently, is a 'catchall'
[12:51:04] monitoring::graphite_threshold { 'high_load':
[12:51:05] description => 'High load for whatever reason',
[12:51:05] metric => "servers.${::hostname}.cpu.total.iowait",
[12:51:12] copy/paste fail?
[12:51:21] the check right above is "high_iowait_stalling"
[12:51:36] paravoid: I don't think I'm the author :)
[12:51:46] heh I realized this right after I wrote this
[12:52:08] you're the author of a728a4b8 which touched this line, but not of the check
[12:55:31] (PS1) Alexandros Kosiaris: hieraize nrpe [puppet] - https://gerrit.wikimedia.org/r/208630
[12:56:16] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[12:58:04] (Abandoned) Alexandros Kosiaris: parsoid: add role::parsoid::prod to codfw nodes [puppet] - https://gerrit.wikimedia.org/r/206479 (https://phabricator.wikimedia.org/T90271) (owner: Dzahn)
[13:03:12] ah yeah, the statsite renaming
[13:03:16] (PS2) Alexandros Kosiaris: Parsoid LVS codfw records [dns] - https://gerrit.wikimedia.org/r/208627 (https://phabricator.wikimedia.org/T90271)
[13:07:24] (CR) BBlack: [C: 1] Switch to a non-trunk build, using abi=1 for our first build. [debs/linux] - https://gerrit.wikimedia.org/r/207751 (owner: Muehlenhoff)
[13:08:41] (CR) BBlack: [C: 1] * Amend older changelog entries with security issues fixed in 3.19.x so that we properly keep track [debs/linux] - https://gerrit.wikimedia.org/r/207755 (owner: Muehlenhoff)
[13:09:48] (PS1) Faidon Liambotis: Adjust various Nagios check descriptions [puppet] - https://gerrit.wikimedia.org/r/208631
[13:09:50] (PS1) Faidon Liambotis: labsnfs: fix loadavg check from copy/paste fail [puppet] - https://gerrit.wikimedia.org/r/208632
[13:10:14] (CR) Faidon Liambotis: [C: 2] Adjust various Nagios check descriptions [puppet] - https://gerrit.wikimedia.org/r/208631 (owner: Faidon Liambotis)
[13:10:27] (CR) jenkins-bot: [V: -1] Adjust various Nagios check descriptions [puppet] - https://gerrit.wikimedia.org/r/208631 (owner: Faidon Liambotis)
[13:11:20] that V-1 is something with modules/mesos
[13:11:30] https://integration.wikimedia.org/ci/job/operations-puppet-typos/32246/console
[13:11:49] which I can't see on my tree?
[13:12:08] ?!
[13:12:25] (CR) Faidon Liambotis: "recheck" [puppet] - https://gerrit.wikimedia.org/r/208631 (owner: Faidon Liambotis)
[13:12:59] hasharOut: ping?
[13:13:14] paravoid: good afternoon
[13:13:14] hi!
[13:13:22] could you shed some light on https://integration.wikimedia.org/ci/job/operations-puppet-typos/32246/console ?
[13:13:30] sounds like submodule/jenkins fail
[13:14:04] ahazeouu
[13:14:26] we don't have a modules/mesos submodule in the tree at all, which is why this job is failing
[13:14:46] yeah that is cumbersome
[13:14:47] some change (that hasn't been merged yet) (probably) added such a submodule
[13:15:01] I guess some puppet patch introduced that module
[13:15:05] and the workspace is not clean
[13:15:07] right
[13:15:33] although the job is supposed to clean the workspace :/
[13:18:13] operations, Wikimedia-Site-requests, I18n, Varnish: Anonymous users can't pick language on WMF wikis ($wgULSAnonCanChangeLanguage is set to false) - https://phabricator.wikimedia.org/T58464#1256725 (faidon) >>! In T58464#994780, @Nikerabbit wrote: > Caching. Either the language cookie would be ignor...
[13:19:08] (CR) BBlack: [C: 1] Update to 3.19.4 (Bug: T97411) [debs/linux] - https://gerrit.wikimedia.org/r/208601 (owner: Muehlenhoff)
[13:19:46] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/208631 (owner: Faidon Liambotis)
[13:20:18] (CR) BBlack: [C: 1] "(other than ticket# typo)" [debs/linux] - https://gerrit.wikimedia.org/r/208602 (owner: Muehlenhoff)
[13:20:26] (CR) BBlack: "(other than ticket# typo)" [debs/linux] - https://gerrit.wikimedia.org/r/208603 (owner: Muehlenhoff)
[13:20:32] (CR) BBlack: [C: 1] Update to 3.19.6 (Bug: T97441) [debs/linux] - https://gerrit.wikimedia.org/r/208603 (owner: Muehlenhoff)
[13:24:32] paravoid: I have disabled submodule processing on that job :]
[13:24:42] that might happen on other jobs though :(
[13:24:43] what do you mean?
[13:25:00] this job == this change or this job == puppet-typos?
[13:27:09] (PS1) Dereckson: Add medialib.naturalis.nl to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/208634 (https://phabricator.wikimedia.org/T97995)
[13:35:44] paravoid: sorry the puppet-typos job
[13:35:47] I have fixed it
[13:35:55] that's not really a fix is it
[13:36:00] so it completely wipes artifact of the previous run
[13:36:02] oh
[13:36:07] is the workspace still unclean?
[13:36:24] it seems the Jenkins git cleanup command does not unregister / cleanup git submodules that are no more registered :(
[13:36:52] I have changed it to a system that wipe the workspace entirely and use a shallow clone of the repo.
[13:43:16] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[13:43:30] (CR) coren: [C: 2] "Oh duh, that makes that check pretty worthless." [puppet] - https://gerrit.wikimedia.org/r/208632 (owner: Faidon Liambotis)
[13:44:20] (PS1) Filippo Giunchedi: statsite: decommission class [puppet] - https://gerrit.wikimedia.org/r/208635 (https://phabricator.wikimedia.org/T95687)
[13:44:58] (CR) jenkins-bot: [V: -1] statsite: decommission class [puppet] - https://gerrit.wikimedia.org/r/208635 (https://phabricator.wikimedia.org/T95687) (owner: Filippo Giunchedi)
[13:45:46] operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1256787 (BBlack) NEW
[13:48:28] operations, OpenStreetMap, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1248778 (MaxSem)
[13:49:18] operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1256809 (faidon) a:faidon I need to do at least the packaging part for Debian anyway (and the libmaxminddb part has been done already, needs an upload). Hopefully RSN :)
[13:49:53] !log draining all traffic from the Giglinx/Zayo link to ulsfo
[13:50:01] Logged the message, Master
[13:52:42] (CR) Andrew Bogott: [C: 2] puppetsigner: Clean up certs and salt keys for instances we can't find in ldap [puppet] - https://gerrit.wikimedia.org/r/205897 (owner: Andrew Bogott)
[13:57:56] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[14:02:07] (PS1) Andrew Bogott: Clean up old certs with 'puppet cert clean' [puppet] - https://gerrit.wikimedia.org/r/208638
[14:03:12] (CR) Andrew Bogott: [C: 2] Clean up old certs with 'puppet cert clean' [puppet] - https://gerrit.wikimedia.org/r/208638 (owner: Andrew Bogott)
[14:03:28] Hello. Asking for authorization from ops people to perform a bigdelete at enwiki for a page with +10,000 revids.
[14:04:48] (PS2) Muehlenhoff: Update to 3.19.5 (Bug: T97411) [debs/linux] - https://gerrit.wikimedia.org/r/208602
[14:06:50] TimStarling: ^ is it safe?
[14:09:16] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[14:09:54] paravoid ^
[14:10:31] ignore that
[14:10:56] paravoid: I meant the mafk thing. he wants an okay to do a bigdelete from ops. s1 looks fine so
[14:11:08] thanks JohnLewis
[14:13:28] mafk: which page is it anyway?
[14:13:39] 2014 and beyond in film
[14:13:42] @enwiki
[14:13:54] https://en.wikipedia.org/wiki/2014_and_beyond_in_film
[14:14:11] request at: https://meta.wikimedia.org/w/index.php?oldid=12110013#.222014_and_beyond_in_film.22_redirect_on_en.Wikipedia
[14:14:20] 10k revisions? geez
[14:14:52] needs a steward to bigdelete, but we were told by vvv that it'd be good if we could pass by here and ask for an ok, since that may disrupt a bit the DB
[14:15:03] so here I am :)
[14:16:58] bblack; you may around as well actually ^ :)
[14:17:59] (PS1) Andrew Bogott: Better handling for invalid cert names: [puppet] - https://gerrit.wikimedia.org/r/208640
[14:20:37] (CR) Andrew Bogott: [C: 2] Better handling for invalid cert names: [puppet] - https://gerrit.wikimedia.org/r/208640 (owner: Andrew Bogott)
[14:21:54] JohnLewis: I guess there are no objections heh
[14:23:43] mafk: probably no one wants to comment :)
[14:23:56] well, so doing
[14:23:57] operations, Traffic: Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006#1256936 (faidon) NEW
[14:24:04] what's up?
[14:24:38] mafk: brandon has appeared
[14:24:38] bblack: asking for an ok from ops for a bigdelete of +10k
[14:24:46] (on enwiki, so s1 db cluster)
[14:24:54] yeah
[14:25:27] I don't have any fundamental reason to object I guess. Honestly, I have no idea how we evaluate whether it's ok. I guess just whether we have ongoing other db load issues?
[14:26:04] bblack: I'd assume: load and lag. Both seem fine to me so I'd say it's okay but your call :)
[14:26:13] yeah seems ok to me then.
[14:26:42] mafk: enjoy
[14:27:23] thanks guys
[14:27:31] I'm just following procedure :)
[14:28:29] operations, Traffic, discovery-system: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1256958 (Joe)
[14:28:32] deleting...
[14:28:44] operations, Traffic, discovery-system: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1256959 (Joe)
[14:29:01] "2014 and beyond in film" has been deleted (view | salt). See the deletion log for a record of recent deletions.
[14:29:14] operations, Traffic, discovery-system: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1256961 (Joe)
[14:29:26] operations, Traffic, discovery-system: Integrate confd into the varnish configuration to generate the list of active backends - https://phabricator.wikimedia.org/T97975#1256963 (Joe)
[14:29:45] Puppet, operations, Traffic, discovery-system: Create a confd puppet module - https://phabricator.wikimedia.org/T97974#1256965 (Joe)
[14:30:06] operations, Traffic, discovery-system: Create an etcd puppet module + find suitable servers for deployment - https://phabricator.wikimedia.org/T97973#1256966 (Joe)
[14:30:18] operations, Traffic, discovery-system: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1256968 (Joe)
[14:31:13] operations, Traffic, discovery-system: Be able to deploy confd either as a deb or via trebuchet - https://phabricator.wikimedia.org/T97971#1256978 (Joe)
[14:32:59] mafk: wanna delete talk page as well?
[14:33:17] the request was solely for the main page [14:33:34] so I guess no for now [14:33:37] mafk: oh I thought you were a local admin as well :) [14:33:52] nope, just steward :) [14:33:57] okay :) [14:34:58] operations, Traffic, discovery-system: Be able to deploy confd either as a deb or via trebuchet - https://phabricator.wikimedia.org/T97971#1256986 (Joe) I counted at least 10 unsatisfied dependencies here, so I think I'll just get the binary and I will distribute it via trebuchet. [14:39:36] operations, ops-eqiad: Failed disk db1004 - https://phabricator.wikimedia.org/T97814#1256997 (Cmjohnson) The disk at slot 9 is online now but the disk at slot 10 has now failed. Replacing this disk [14:41:41] operations, ops-eqiad: Failed disk db1003 - https://phabricator.wikimedia.org/T97815#1257007 (Cmjohnson) Open>Resolved Fixed...all disks back online [14:44:56] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [14:48:06] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [14:48:30] (PS1) Alexandros Kosiaris: Add graphoid stanza for cache in labs [puppet] - https://gerrit.wikimedia.org/r/208644 [14:49:10] operations, Labs, Labs-Infrastructure, discovery-system: Allow creation of SRV records in labs. 
- https://phabricator.wikimedia.org/T98009#1257018 (10Joe) 3NEW [14:49:20] (03PS1) 10Faidon Liambotis: Depool ulsfo, network troubles [dns] - 10https://gerrit.wikimedia.org/r/208645 [14:49:43] (03PS3) 1020after4: phab stage tags for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [14:49:49] (03CR) 10Faidon Liambotis: [C: 032] Depool ulsfo, network troubles [dns] - 10https://gerrit.wikimedia.org/r/208645 (owner: 10Faidon Liambotis) [14:50:29] kart_: Ping for SWAT in 10 minutes [14:50:31] !log draining ulsfo, network troubles (internal network packet loss) [14:50:39] Logged the message, Master [14:50:49] ^d, thcipriani, marktraceur: Who wants to SWAT this morning? [14:51:00] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1257028 (10fgiunchedi) ``` *-disk:0 description: ATA Disk product: SAMSUNG HE502HJ... [14:51:10] anomie: I can take it [14:51:21] thcipriani: ok! [14:51:29] !log halt fluorine to fix console and swap sda [14:51:35] Logged the message, Master [14:51:45] kart_: ping for SWAT in ~10min [14:52:03] (03PS1) 10Ottomata: Revert 7c3af0d. This caused memory reservation errors in the mediacounts job [puppet] - 10https://gerrit.wikimedia.org/r/208646 (https://phabricator.wikimedia.org/T97753) [14:52:05] (03CR) 1020after4: "need a +2 on this for wednesday deployment." [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [14:52:15] thcipriani: ack [14:52:35] (03PS2) 10Ottomata: Revert 7c3af0d. 
This caused memory reservation errors in the mediacounts job [puppet] - 10https://gerrit.wikimedia.org/r/208646 (https://phabricator.wikimedia.org/T97753) [14:53:26] (03CR) 10Alexandros Kosiaris: [C: 032] Add graphoid stanza for cache in labs [puppet] - 10https://gerrit.wikimedia.org/r/208644 (owner: 10Alexandros Kosiaris) [14:54:00] 6operations, 10Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1257036 (10BBlack) [14:54:34] <^d> godog: How long will fluorine be down? [14:54:43] <^d> thcipriani: Might want to wait if we don't have logs. [14:55:11] ^d: gah didn't realize it'd impact SWAP, should be back shortly cc cmjohnson1 [14:55:16] SWAT even [14:59:07] (03PS3) 10Ottomata: Revert 7c3af0d. This caused memory reservation errors in the mediacounts job [puppet] - 10https://gerrit.wikimedia.org/r/208646 (https://phabricator.wikimedia.org/T97753) [14:59:14] (03CR) 10Ottomata: [C: 032 V: 032] Revert 7c3af0d. This caused memory reservation errors in the mediacounts job [puppet] - 10https://gerrit.wikimedia.org/r/208646 (https://phabricator.wikimedia.org/T97753) (owner: 10Ottomata) [14:59:16] maybe test. and test2. schould have robots NOINDEX by default. results popping up in google [14:59:16] PROBLEM - puppet last run on lvs2002 is CRITICAL puppet fail [14:59:37] ottomata: all cp* servers are warning about 3 varnishncsa processes running [15:00:05] manybubbles, anomie, ^d, thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150504T1500). [15:00:43] will start merges now, hold for deployment until we have flourine back [15:00:46] <^d> Steinsplitter: How can I search for test2 on google then? :p [15:00:48] (03CR) 10Jdlrobson: "Hi Alex, I'm not sure why you have -1ed. 
This is purposely only for English Wikivoyage as we've not rolled out to non-English projects yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [15:01:23] ^d: do we need to search stuff on test2 :? [15:01:43] <^d> Search alllllll the things [15:01:48] :-D [15:02:20] (03PS1) 10Yurik: LABS: Enable Graphoid fallback for graph ext [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208647 [15:02:36] any objections to me deploying this now ^ [15:02:38] ACKNOWLEDGEMENT - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott Down while we fix memory errors [15:02:41] MaxSem ^^ [15:02:44] <^d> yurik: Yes, it's the middle of swat [15:02:50] ^d, thx [15:03:01] ^d, can swat add it pls? its a labs only patch [15:03:11] <^d> Ask thcipriani, he's doing swat :) [15:03:40] fluorine should be booting, looking [15:04:44] if it's a labs patch, then it doesn't need to be swatted, right? [15:04:59] !log halting virt1011 pending its rename to labvirt1007 [15:05:05] Logged the message, Master [15:05:10] <^d> thcipriani: Well it's still a mw-config merge during your swat [15:05:12] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1257074 (10Cmjohnson) first disk swapped. [15:05:15] <^d> So you'd be like "WHAT IS THIS?!?" [15:05:17] <^d> :) [15:05:28] thcipriani, it is still needed to be synced :) [15:06:14] ^d: fair enough [15:06:16] thcipriani, although i could just +2 it - and it will go into betalabs automatically :) [15:06:51] yurik: yeah, go ahead and +2 it, jenkins should just deploy it automatically, I'm aware of the change and will fetch it down to live tin. 
[15:07:28] * yurik will get killed by ^d for making prod & mediawiki-config differ [15:08:27] (03CR) 10Yurik: [C: 032] "per discussion in #ops, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208647 (owner: 10Yurik) [15:08:35] (on fluorine's console, looking) [15:09:40] (03Merged) 10jenkins-bot: LABS: Enable Graphoid fallback for graph ext [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208647 (owner: 10Yurik) [15:10:07] godog: ssh client still thinking about logging into fluorine... [15:10:49] thcipriani: indeed, I'm on the console looking why it isn't booting back up [15:10:59] ah, gotcha. [15:11:12] thcipriani: another patch too :) [15:11:40] kart_: yup, merging :) [15:15:41] hmm, fetch on 1.26wmf3 brought down a bunch of revisiondelete changes :\ [15:15:43] 6operations, 5Patch-For-Review, 5wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1257091 (10akosiaris) wtp2001-wtp2020 have been deployed and parsoid seems to be running normally. Next up: LVS - IP assignment. https://gerrit.wikimedia.org/r/#/c/208627/ - LVS configuration. [15:15:48] thcipriani: we're back [15:15:56] sorry about the bad timing [15:16:32] (03PS1) 10Andrew Bogott: Rename virt1011 to labvirt1007 [dns] - 10https://gerrit.wikimedia.org/r/208648 [15:16:42] godog: awesome. I'm in, thanks for the quick work :) [15:17:07] RECOVERY - puppet last run on lvs2002 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:17:12] np [15:17:13] (03PS1) 10Andrew Bogott: Rename virt1011 to labvirt1007, switch to Trusty [puppet] - 10https://gerrit.wikimedia.org/r/208649 [15:17:24] robh: I could use a hand with ^ if you are up and working. 
[15:17:31] will wait for swat to finish before continuing [15:17:33] Just in case I’m forgetting anything [15:17:36] !log starting upgrade of Analytics Cluster to CDH 5.4: https://phabricator.wikimedia.org/T97453 [15:17:42] Logged the message, Master [15:19:47] (03PS2) 10Andrew Bogott: Rename virt1011 to labvirt1007, switch to Trusty [puppet] - 10https://gerrit.wikimedia.org/r/208649 [15:22:09] (03PS4) 10Andrew Bogott: For cert names, use the fqdn instead of the ec2id if use_dnsmasq is lowered. [puppet] - 10https://gerrit.wikimedia.org/r/202924 [15:23:11] !log thcipriani Synchronized php-1.26wmf3/extensions/ContentTranslation/modules/tools/ext.cx.tools.formatter.js: Update ContentTranslation to 6f81619 [[gerrit:208605]] (duration: 00m 25s) [15:23:17] Logged the message, Master [15:23:26] ^ kart_ there's wmf3 [15:23:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:24:45] thcipriani: ack. [15:25:10] kart_: lmk if it is ok [15:25:26] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [15:26:56] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:06] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:27:09] thcipriani: still not updated? [15:27:21] 6operations, 10Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1257132 (10akosiaris) Most of those services look very good candidates for virtualization indeed. A couple of notes: - Timekeeping has been known to have a bad history with virtualization. http://www.vmware.... [15:27:27] thcipriani: https://git.wikimedia.org/tree/mediawiki%2Fextensions%2FContentTranslation.git/53ab07e8ae91c9378bb9ccdc618b48e1bbde47db - here only. 
[15:28:04] 6operations, 7Graphite, 5Patch-For-Review: use graphite1002 to test dm-cache - https://phabricator.wikimedia.org/T88994#1257134 (10fgiunchedi) machine is up and working, however graphite is not fully jessie-ready and we need it to properly use dm-cache [15:28:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [15:28:59] * thcipriani looking [15:30:27] thcipriani: see, https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Version - Content Translation [15:30:44] thcipriani: do we need to use sync-dir for extension? [15:31:15] probably to see it updated there. Blerg. k, doing. [15:32:40] (03PS1) 10Alexandros Kosiaris: Add parsoid_codfw icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/208653 [15:32:54] !log thcipriani Synchronized php-1.26wmf3/extensions/ContentTranslation: Sync-dir for ContentTranslation to 6f81619 [[gerrit:208605]] (duration: 00m 18s) [15:32:57] Logged the message, Master [15:34:25] thcipriani: pretty strange, but fix is deployed and version yet not updated. [15:34:33] ^d: any idea? ^^ [15:34:48] thcipriani: go ahead for wmf4 [15:34:58] kk [15:38:37] RECOVERY - Disk space on stat1002 is OK: DISK OK [15:38:41] (03PS1) 10JanZerebecki: Enable Graph extension on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208654 (https://phabricator.wikimedia.org/T97993) [15:39:47] 7Puppet, 6Multimedia, 6Reading-Infrastructure-Team, 6Release-Engineering, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1257170 (10faidon) >>! In T84956#1166762, @hashar wrote: > @Gilles and I had a quick conf call yesterday. Seems the Debian packaging is goi... 
[15:40:33] (03PS1) 10Aude: Update Wikibase site id and group for test2wiki and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) [15:40:43] !log thcipriani Synchronized php-1.26wmf4/extensions/ContentTranslation: Update ContentTranslation to 0bd91b6 [[gerrit:208607]] (duration: 00m 30s) [15:40:49] Logged the message, Master [15:41:24] thcipriani: thanks! [15:41:52] kart_: yw [15:41:58] SWAT complete [15:42:15] evening [15:43:01] (03PS1) 10KartikMistry: Beta: Enable ContentTranslation for 20150507 deployment [puppet] - 10https://gerrit.wikimedia.org/r/208656 (https://phabricator.wikimedia.org/T97966) [15:44:37] Nikerabbit: evening [15:45:31] (03CR) 10Aude: "this is somewhat important to be able to evaluate that subscription tracking is working correctly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [15:45:50] (03PS1) 10Yurik: LABS: Enable graph extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208657 [15:46:22] 7Puppet, 6Multimedia, 6Reading-Infrastructure-Team, 6Release-Engineering, 5Patch-For-Review: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1257199 (10bd808) [15:47:23] 6operations, 10ops-eqiad: fluorine console not working - https://phabricator.wikimedia.org/T94554#1257204 (10fgiunchedi) 5Open>3Resolved resolving, console fixed by @cmjohnson [15:47:31] thcipriani, are you done deploing? I would like to +2 ^^^ [15:47:42] its a labs-only again [15:47:45] yurik: I am done deploying [15:47:58] ^d, any objections to +2 labs only without sync? 
[15:48:07] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim tungsten as spare - https://phabricator.wikimedia.org/T97274#1257212 (10Cmjohnson) 5Open>3Resolved Tungsten has been wiped and added to server spares [15:49:00] yurik: if you don't sync, icinga will generate errors and future deployers will likely revert the patch [15:49:21] 7Puppet, 6Multimedia, 6Reading-Infrastructure-Team, 6Release-Engineering, 5Patch-For-Review: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1257228 (10bd808) I removed the scrum of scrums and blocked on releng tags. I'll work with @tgr to come up with a basic plan on how t... [15:49:30] JohnLewis, sigh, ok, will sync, unless anyone is deploying now [15:49:41] usually labs only is tolerated :) [15:49:50] yurik: I can pull down the patch [15:49:57] go for the +2 [15:49:59] 6operations, 10Analytics-EventLogging, 5Patch-For-Review: Reclaim vanadium, move to spares - https://phabricator.wikimedia.org/T95566#1257241 (10Cmjohnson) 5Open>3Resolved Disk has been wiped and server added to spares list. [15:50:00] thcipriani, thx! https://gerrit.wikimedia.org/r/#/c/208657/1 [15:50:12] (03CR) 10Yurik: [C: 032] LABS: Enable graph extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208657 (owner: 10Yurik) [15:50:59] <^d> yurik: labs only is not tolerated. [15:51:11] (03Merged) 10jenkins-bot: LABS: Enable graph extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208657 (owner: 10Yurik) [15:51:14] thcipriani, ^ [15:51:22] <^d> You should still pull (or icinga will yell) and syncing is good practice since you're already on tin. [15:51:32] ^d, has there ever been any discussion to have a separate repo that would mix-in somehow for labs? [15:51:55] <^d> No. 
And actually I want to /reduce/ the delta between labs/prod in mw-config [15:51:57] i am always uneasy about changing prod settings [15:52:00] <^d> So I'd veto such a move :P [15:52:24] i totally agree that they should be one and the same, but the "extra" settings could benefit from being separate [15:52:35] e.g. *-labs [15:52:47] this way you don't accidently break production [15:53:00] and you have a clear vision of how labs is different at any given point [15:53:14] <^d> Ugh, then I'd have to grep 2 repos to figure out why a config setting is something on labs. [15:53:15] <^d> No [15:53:17] <^d> Not doing it [15:53:48] the complex and myriad ways in which labs is different...:( [15:55:11] ^d, how about a subdir that doesn't have to be in sync with prod. I am unhappy about just one thing -- changing settings in the *-labs need to be synced to production servers. I think this is wrong [15:56:05] beta cluster is not a free for all. If you want a change there that you are afraid to see merged then you should seek review/feedback and become confident [15:56:58] bd808, its not about not being confident. The less you touch something, the less chance you have of breaking it. Just because you are confident does not mean you won't accidently break it :) [15:57:15] thus, if you can avoid it, why not? [15:57:23] just because greping is easier? 
[15:57:29] a bit of a weak argument :) [15:58:12] multiple people from multiple teams have been working on reducing the delta between those two environments [15:58:29] which shouldn't have been there on the first place, I should say [15:58:40] so I don't think you'll find many fans of that idea, yurik [15:58:45] <^d> That ^ [15:58:55] godog: akosiaris if around, please merge, https://gerrit.wikimedia.org/r/#/c/208656/ [15:59:01] paravoid, i agree that they should be the same, EXCEPT when we enable a certain feature -- read deployment steps for the new extension - it MUST be enabled on beta BEFORE in production [15:59:15] Is test2 on mw1017? [15:59:25] <^d> Glaisher: No, it's a normal load balanced wiki [15:59:31] ok [15:59:32] <^d> Just test is tied to a specific node [15:59:36] so unless we say that we enable everytihng in beta and in prod at the same time, we have to have two different configuration [15:59:42] (03CR) 10Alexandros Kosiaris: [C: 031] hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 (https://phabricator.wikimedia.org/T93776) (owner: 10Giuseppe Lavagetto) [15:59:50] (tiny difference, but still a diference) [15:59:51] reducing the repository delta != both environments are configured exactly the same [16:00:03] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: Enable ContentTranslation for 20150507 deployment [puppet] - 10https://gerrit.wikimedia.org/r/208656 (https://phabricator.wikimedia.org/T97966) (owner: 10KartikMistry) [16:00:09] we deploy features gradually in production as well, so I don't see your point [16:00:13] false argument yurik. It ignores that we have multiple wikis in the farm that have differing configuration [16:00:16] (03CR) 10Steinsplitter: [C: 031] Add medialib.naturalis.nl to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208634 (https://phabricator.wikimedia.org/T97995) (owner: 10Dereckson) [16:00:28] hmm... perhaps.. 
[16:00:33] * yurik is going to rethink it [16:00:50] you're not going to deploy every wiki in prod at the same time, are you? [16:01:10] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1257265 (10Cmjohnson) The current version of asw-c4-eqiad is 11.4 R6.5...the download on the juniper site is 11.4 R6.6. [16:01:15] deploy to* [16:01:39] kart_: has the LE team thought about migrating cxserver to service-runner ? [16:02:15] it would probably help with logging, configuration changes deploys, standardize across the various services and probably more [16:03:21] akosiaris: yes. [16:03:57] akosiaris: we will plan that out sometime in next week. [16:04:15] kart_: great! [16:04:16] (not actual implementation, but plan for sure) [16:04:21] :-) [16:06:00] (03PS1) 10BBlack: refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 [16:06:39] (03PS3) 10Gage: ipsec-global: fix bug in non-verbose mode, exit if not root [puppet] - 10https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) [16:06:41] (03CR) 10jenkins-bot: [V: 04-1] refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 (owner: 10BBlack) [16:07:24] (03CR) 10Alexandros Kosiaris: [C: 032] Add parsoid_codfw icinga groups [puppet] - 10https://gerrit.wikimedia.org/r/208653 (owner: 10Alexandros Kosiaris) [16:09:54] (03PS2) 10BBlack: refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 [16:10:11] (03PS2) 10Gage: IPsec: Icinga monitor for Strongswan connections [puppet] - 10https://gerrit.wikimedia.org/r/199787 [16:10:38] (03CR) 10jenkins-bot: [V: 04-1] refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 (owner: 10BBlack) [16:12:05] (03CR) 10Alex Monk: "While I appreciate that you went to ask the local wiki community about installing it, we should not have a habit of 
deploying to English w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [16:12:28] (03PS3) 10BBlack: refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 [16:14:09] (03PS4) 10BBlack: refactor varnish::logging for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/208659 [16:14:28] (03Abandoned) 10Alexandros Kosiaris: start on icinga iptables to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/101842 (owner: 10ArielGlenn) [16:15:51] PROBLEM - Parsoid on wtp2003 is CRITICAL: Connection refused [16:17:02] PROBLEM - Parsoid on wtp2002 is CRITICAL: Connection refused [16:17:08] ^ parsoid? [16:17:15] anyone working on that? [16:18:49] cscott: ping! [16:19:12] oh wait not you :) [16:19:37] also, that's codfw I guess, so not really critical [16:20:45] (03PS1) 10Muehlenhoff: Update to 3.19.6 (Bug: T97411) [debs/linux] - 10https://gerrit.wikimedia.org/r/208662 [16:20:52] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [16:23:04] <_joe_> akosiaris: ^^ \o/ [16:23:09] <_joe_> thanks a lot [16:23:20] <_joe_> bblack: yes akosiaris is [16:23:39] <_joe_> bblack: he's imaging the codfw parsoid servers [16:23:59] <_joe_> now if we only had parsoid-cache too :P [16:25:20] :P [16:36:08] (03CR) 10RobH: [C: 031] Rename virt1011 to labvirt1007, switch to Trusty [puppet] - 10https://gerrit.wikimedia.org/r/208649 (owner: 10Andrew Bogott) [16:37:01] (03CR) 10RobH: [C: 031] Rename virt1011 to labvirt1007 [dns] - 10https://gerrit.wikimedia.org/r/208648 (owner: 10Andrew Bogott) [16:39:11] robh: thanks! That’s all I need, right? If I pxe-boot that box it will rename and reimage? 
[16:39:39] (PS2) Gage: logstash: Seed Elasticsearch cluster host [puppet] - https://gerrit.wikimedia.org/r/208576 (https://phabricator.wikimedia.org/T97645) (owner: BryanDavis) [16:40:17] you need to revoke the old certs/keys/puppetstoreddb but yep [16:40:41] wmf-reimage should be able to take care of that fwiw [16:40:59] godog: it'll handle hostname changes? [16:41:12] i have been told about said script, but have yet to use it =] [16:42:24] robh: it should afaik, ymmv, omglolbbq [16:42:45] godog: where do I run that? [16:42:49] * andrewbogott should read the docs [16:42:53] andrewbogott: palladium! [16:42:59] (PS2) John F. Lewis: Add ebernhardson to researchers group [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) [16:43:05] (CR) jenkins-bot: [V: -1] Add ebernhardson to researchers group [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) (owner: John F. Lewis) [16:48:40] godog: /is/ there a doc page for wmf-reimage? Wikitech search shows nothing [16:49:36] andrewbogott: mhh then no, there isn't [16:49:42] …because it doesn’t seem like the sort of thing I should just run and see what happens [16:50:19] What does it do? [16:51:27] it is a bit modal, but essentially a combination of unsigning/signing puppet/salt keys and rebooting [16:51:50] operations, Traffic, discovery-system: Be able to deploy confd either as a deb or via trebuchet - https://phabricator.wikimedia.org/T97971#1257536 (Joe) After a quick discussion on irc we came to the conclusion that a binary debian package, although ugly and needing a fix, is a good stopgap solution f... [16:51:50] rebooting w/a change in boot order? Or just rebooting? 
[16:51:51] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [16:52:15] operations, Traffic, discovery-system: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1257538 (Joe) [16:52:16] operations, Traffic, discovery-system: Be able to deploy confd either as a deb or via trebuchet - https://phabricator.wikimedia.org/T97971#1257539 (Joe) [16:52:31] no it'll DTRT andrewbogott, pxe boot once [16:52:48] it is shell tho, fairly easy to understand [16:52:55] godog: think it’ll work with an HP server? [16:53:04] * andrewbogott tries it [16:53:31] operations, Traffic, discovery-system: Properly package confd and its dependencies - https://phabricator.wikimedia.org/T97971#1257540 (Joe) a:Joe [16:53:49] andrewbogott: good question, don't know if the pxe-boot only once will work, perhaps _joe_ has run it on hp [16:54:11] ok, I’m going to just do this by hand. I encourage someone to document this tool someday :) [16:54:56] (CR) Andrew Bogott: [C: 2] Rename virt1011 to labvirt1007 [dns] - https://gerrit.wikimedia.org/r/208648 (owner: Andrew Bogott) [16:55:09] (CR) Andrew Bogott: [C: 2] Rename virt1011 to labvirt1007, switch to Trusty [puppet] - https://gerrit.wikimedia.org/r/208649 (owner: Andrew Bogott) [16:55:30] (PS3) John F. Lewis: Add ebernhardson to researchers group [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) [16:56:20] bblack ^ if you have a few spare moments, mind a merge of that ar? :) [16:58:26] !log reimaging/renaming virt1011 -> labvirt1007 [16:58:36] Logged the message, Master [17:00:13] greg-g: do you think parsoid could do an early deploy today? [17:00:57] greg-g: subbu has a flight to catch at the end of our normally-scheduled window, it would be better if we had some more space before that hard stop. 
[17:01:26] greg-g: we're thinking about doing the deploy in an hour (1100 PDT) instead of our normal 1300 PDT window. [17:05:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:05:26] cscott: that should be fine (cc subbu ) [17:06:06] (PS4) BBlack: Add ebernhardson to researchers group [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) (owner: John F. Lewis) [17:06:38] (CR) BBlack: [C: 2 V: 2] Add ebernhardson to researchers group [puppet] - https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) (owner: John F. Lewis) [17:07:24] bblack: awesome :) ebernhardson ^^ [17:07:53] Ops-Access-Requests, operations, Patch-For-Review: Add ebernhardson to 'stats' group for query access to eventlogging data on stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97332#1257602 (BBlack) Open>Resolved merged, could take half an hour to take effect everywhere (and then you'll n... [17:10:35] operations, Architecture, MediaWiki-RfCs, RESTBase, and 5 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1257608 (Arlolra) > I don't think I actually ever got any confirmation about whether merging this change had any intended effe... [17:12:57] What does it mean when Carbon says “DHCPDISCOVER from 40:a8:f0:38:1a:40 via 10.64.20.2: network 10.64.20.0/24: no free leases” and then my netboot fails? [17:13:42] PROBLEM - puppet last run on cp3035 is CRITICAL puppet fail [17:13:48] andrewbogott: that you probably haven't added the MAC address to linux-hosts-... [17:14:09] (Abandoned) BBlack: include favicons in static hashing stuff [puppet] - https://gerrit.wikimedia.org/r/208078 (owner: BBlack) [17:14:25] (PS1) Aude: Enable use of subscription tracking on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/208675 [17:14:57] paravoid: isn’t that this? 
https://gerrit.wikimedia.org/r/#/c/208649/2/modules/install-server/files/dhcpd/linux-host-entries.ttyS1-115200 [17:17:05] andrewbogott: yes, but this says "fixed-address virt1011.eqiad.wmnet" and that hostname does not resolve [17:18:37] paravoid: Sorry, what does ‘this’ refer to in the above? I’m renaming from virt1011 to labvirt1007. [17:18:45] So virt1011 is no longer defined in dns, that’s on purpose. [17:19:21] oh sorry, just saw the gerrit change [17:20:32] labvirt1007/labvirt1007.mgmt are the other way around [17:20:53] $ dig +short labvirt1007.mgmt.eqiad.wmnet [17:20:54] 10.64.20.16 [17:20:57] $ dig +short labvirt1007.eqiad.wmnet [17:20:57] 10.65.3.228 [17:21:13] it should be flipped [17:21:26] labvirt1007.mgmt.eqiad.wmnet = 10.65.3.228, labvirt1007.eqiad.wmnet = 10.64.20.16 [17:21:47] The related patch is this: https://gerrit.wikimedia.org/r/#/c/208648/1 [17:21:50] so what happens is that the DHCP server resolves labvirt1007.eqiad.wmnet to 10.65.3.228, but sees the request in a subnet that is defined as 10.64.20.0/24 [17:21:55] So… “How did this ever work” [17:22:11] you flipped them while renaming it [17:22:23] oh, so I did. [17:22:26] hm [17:22:29] ok, thanks [17:22:40] (PS1) Andrew Bogott: Revert "Rename virt1011 to labvirt1007" [dns] - https://gerrit.wikimedia.org/r/208680 [17:22:42] (PS1) Yurik: Graphoid configuration - add protocol [puppet] - https://gerrit.wikimedia.org/r/208679 [17:22:51] see how they're all 10.64.. 10.64.. 10.64.. 
and then there's a 10.65 there [17:22:54] (Abandoned) Andrew Bogott: Revert "Rename virt1011 to labvirt1007" [dns] - https://gerrit.wikimedia.org/r/208680 (owner: Andrew Bogott) [17:23:13] akosiaris, hi, could you take a look at ^^^ [17:25:22] operations, wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1257669 (Joe) Still needed and only in eqiad: [ ] Parsoid cluster [ ] CirrusSearch [ ] Varnishes (all of them) [ ] OCG [ ] All services that are on SCA [ ] restbase... [17:26:15] _joe_: add them to the description so that it can be editable :) [17:26:26] (one can only edit the task description, not others' comments) [17:26:48] (I have more) [17:28:20] _joe_: a codfw tin exists iirc (mira) [17:28:28] <_joe_> paravoid: yeah I was just starting to add things there [17:28:36] <_joe_> JohnLewis: ok thanks [17:28:44] (PS1) Andrew Bogott: unswap mgmt and standard ip for labvirt1007 [dns] - https://gerrit.wikimedia.org/r/208681 [17:29:12] operations, wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1257708 (Joe) [17:29:17] (CR) Andrew Bogott: [C: 2] unswap mgmt and standard ip for labvirt1007 [dns] - https://gerrit.wikimedia.org/r/208681 (owner: Andrew Bogott) [17:29:37] yurik: labs do support HTTPS. Although I am not sure what exactly it is you want to do [17:29:41] <_joe_> JohnLewis: where did you get that info? [17:29:42] care to elaborate ? 
[17:30:10] akosiaris, graphoid on beta labs tries to access labs' api, and it fails [17:30:11] <_joe_> JohnLewis: I can't find it in puppet [17:30:13] _joe_: well robh allocated it for deployment and all the onsite work was done by papaul and is exists in dns just not puppet [17:30:18] let me find the ticket for you :) [17:30:44] <_joe_> JohnLewis: so we have the hardware, we did not set it up :) [17:30:47] beta doesn't have https [17:30:54] _joe_: https://phabricator.wikimedia.org/T95436 needs to be deployed as a deployment server :) [17:31:01] there's an ongoing thread/ticket about it, it's complicated [17:31:02] akosiaris, let me fix another bug that i found in graphoid, and will try it again. Do I need to simply update the /deploy project for it to sync up with labs? [17:31:17] <_joe_> JohnLewis: ok I'll link it to this phab ticket then [17:31:33] akosiaris, as bblack said - no https for labs :( [17:31:34] bblack: that's not entirely true. https://graphoid-beta.wmflabs.org does work for example [17:31:42] okay, at least you know you don't have to requets hardware now ;) [17:31:54] * yurik steps away from the debate and gets popcorn [17:31:55] that being said, there are parts where beta does not indeed have https [17:32:02] uhm [17:32:03] how? :) [17:32:13] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1257740 (10Joe) [17:32:13] RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:32:14] 6operations: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1257741 (10Joe) [17:32:16] dynamicproxy termination [17:32:28] oh using rapidssl star cert [17:32:30] right [17:32:38] bblack, akosiaris, could it be that betalabs in multilevel nested, and the cert only allows one level? 
[17:32:46] in the general case, that shouldn't work [17:32:57] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1257742 (10faidon) [17:33:08] (it should be going through the beta equivalent of parsoidcache varnish + nginx, which would lack the cert...) [17:33:51] (and have the appropriate hostname, would would be the same as the prod hostname, but s/.org$/.beta.wmflabs.org$/) [17:34:00] bblack: and it is not the only one that should be doing that. But no service is doing that at this point [17:34:14] right [17:34:29] in any case, text/mobile/etc are broken in that way [17:34:33] beta is more like "let's prove it works so it can get in production" [17:34:38] (right hostname layout, no keys/ssl) [17:34:43] yup [17:34:55] akosiaris, https://en.wikipedia.beta.wmflabs.org/w/api.php fails - that's the issue i had [17:34:57] we want beta to mirror prod as closely as possible [17:35:05] but, it's an ongoing issue [17:35:10] _joe_: mind if I add a basic mira def to site.pp giving it standard and ipv6 for now? [17:35:19] yurik: yeah, that's to be expected [17:35:32] <_joe_> JohnLewis: I'm not sure that would be enough [17:35:40] akosiaris, would my patch fix that? [17:35:49] <_joe_> JohnLewis: but please, be my guest [17:36:17] _joe_: its just for listing it in site.pp&co. service implementation can be dealt with later :) [17:36:21] <_joe_> also, I'm not going to work on that, so you should coordinate with the owner of that ticket [17:36:22] _joe_: is there some special plan for parsercache or should that be on the list too? [17:36:36] yurik: is the graphoid code is using defaultProtocol to construct the api url calls, yes [17:36:43] s/is/if/ [17:36:46] <_joe_> parsercache? [17:36:48] akosiaris, yep [17:36:54] parsoidcache [17:37:02] no, parser cache [17:37:05] oh? 
[17:37:13] pc100N [17:37:18] <_joe_> I forgot about it :P [17:37:28] I've never even heard of those [17:37:30] <_joe_> paravoid: of course not, we should add it [17:37:40] _joe_: no one is but I can find someone :) [17:37:54] <_joe_> bblack: I did, but I wrote that ticket while in a meeting [17:38:16] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1257801 (10faidon) [17:38:18] done [17:38:23] big list isn't it [17:38:42] <_joe_> paravoid: also, we miss some slaves but I'd let springle do that [17:39:06] (03PS2) 10Alexandros Kosiaris: Graphoid configuration - add protocol [puppet] - 10https://gerrit.wikimedia.org/r/208679 (owner: 10Yurik) [17:39:11] <_joe_> paravoid: think how long it becomes when we add all the things needed to be able to switch between DCs :) [17:40:33] (03PS1) 10John F. Lewis: add basic mira def to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/208687 [17:40:34] akosiaris, i am not sure where the parameters are passed into the init - how is that parameter matching happens? [17:40:47] (03CR) 10Alexandros Kosiaris: [C: 032] Graphoid configuration - add protocol [puppet] - 10https://gerrit.wikimedia.org/r/208679 (owner: 10Yurik) [17:40:54] it might be simpler to first create a quantum cloning scanner that can deconstruct and replicate our cages in eqiad, and then add a sed script to it to rewrite the network prefixes in flight. [17:41:12] akosiaris, is it purely based on the fact that graphoid::protocol has the same name as init(protocol) ? [17:41:19] yurik: it's an automatic hiera lookup done by puppet [17:41:50] it search for graphoid::protocol based on 2 hiera backend written by _joe_ [17:41:53] it searches* [17:43:12] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1257830 (10faidon) 11.4R6.5 for both EX4200 and EX4500 are now available on our server. 
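The answer to yurik's question above (17:41) is that Puppet looks up a key named `<class>::<parameter>` in Hiera when a class parameter is not passed explicitly. A toy Python model of that lookup order, as a sketch only: `resolve_params` is a hypothetical name, the hierarchy is reduced to a list of dicts, and the `"https"` and `"http"` values are made up for illustration since the log does not show the real data.

```python
def resolve_params(class_name, defaults, hierarchy, explicit=None):
    """Resolve class parameters the way automatic lookup is described above:
    an explicitly passed value wins, then the first hierarchy level that
    defines '<class>::<param>', then the default from the class signature."""
    explicit = explicit or {}
    resolved = {}
    for name, default in defaults.items():
        if name in explicit:
            resolved[name] = explicit[name]
            continue
        key = f"{class_name}::{name}"
        for level in hierarchy:  # most specific level first
            if key in level:
                resolved[name] = level[key]
                break
        else:
            # No level defined the key, fall back to the default.
            resolved[name] = default
    return resolved


# Hypothetical data: the first level says nothing, the second sets the key.
hierarchy = [{}, {"graphoid::protocol": "https"}]

print(resolve_params("graphoid", {"protocol": "http"}, hierarchy))
# {'protocol': 'https'}
```

The match is purely by name, which is why a key called `graphoid::protocol` ends up in the `protocol` parameter of the `graphoid` class without any explicit wiring.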
[17:43:47] 6operations, 10Traffic, 10discovery-system: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1257833 (10chasemp) I have a fair interest in helping with this and I guess, depending on timeline, taking this on. I created someth... [17:45:11] 6operations: puppet-compiler has strange problems with some facts - https://phabricator.wikimedia.org/T96802#1257838 (10BBlack) Now I'm getting new failures on all of what used to be my working canary hosts, can't test anything :/ http://puppet-compiler.wmflabs.org/760/change/208659/compiled/puppet_catalogs_3_p... [17:52:19] (03CR) 10coren: "Inline notes, and some questions. Changeset with the fixes coming shortly." (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [17:52:24] 6operations, 5Interdatacenter-IPsec: IPsec: roll-out plan - https://phabricator.wikimedia.org/T92604#1257885 (10Gage) [17:52:26] 6operations, 5Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1257886 (10Gage) To summarize remaining work: * Strongswan 5.3.0 is needed but is currently only in Experimental. It won't be coming to Jessie so it needs to be imported to... 
[17:52:36] (03CR) 10Alex Monk: "While my comment above stands, and the next deployment should not be to an English wiki, I don't want to block this deployment specificall" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [17:53:01] !log running delete-wmf-tags (https://phabricator.wikimedia.org/P531) on all extension repos [17:53:07] Logged the message, Master [17:55:08] (03PS9) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [17:57:46] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1257902 (10GWicke) RESTBase expansion incl. codfw is discussed in T93790. [18:01:26] 6operations, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1257908 (10bd808) 3NEW a:3bd808 [18:01:54] 6operations, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1257921 (10bd808) a:5bd808>3None [18:02:55] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review, 15User-Bd808-Test: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1257927 (10bd808) [18:02:56] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1257926 (10bd808) 5Open>3Resolved [18:03:08] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review, 15User-Bd808-Test: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1248972 (10bd808) [18:07:46] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review, 15User-Bd808-Test: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1257972 (10bd808) [18:07:47] 6operations, 
10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1257973 (10bd808) [18:11:39] 7Puppet, 6operations: puppet masters are maxed out - https://phabricator.wikimedia.org/T97989#1258050 (10akosiaris) a:3akosiaris [18:17:46] greg-g: ok, starting parsoid deploy. [18:18:26] twentyafterfour: do you mind if i push out https://gerrit.wikimedia.org/r/#/c/208262/ ? [18:19:38] ori: no I don't mind at all I would have +2'd it but I figured I'd wait until tomorrow's deployment... anyway go for it I tested pretty thoroughly and it works well [18:20:00] thanks [18:20:05] appreciate the reviews! [18:20:12] (03PS2) 10Ori.livneh: Use MWWikiversions::readDbListFile to read dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208262 [18:20:15] greg-g: FYI -- https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=157401 [18:20:27] (03CR) 10Ori.livneh: [C: 032] Use MWWikiversions::readDbListFile to read dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208262 (owner: 10Ori.livneh) [18:20:31] (03Merged) 10jenkins-bot: Use MWWikiversions::readDbListFile to read dblist files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208262 (owner: 10Ori.livneh) [18:20:37] (03PS4) 10Ori.livneh: Allow computed dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208263 [18:20:42] (03CR) 10Ori.livneh: [C: 032] Allow computed dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208263 (owner: 10Ori.livneh) [18:20:46] bd808: /me nods [18:20:48] (03Merged) 10jenkins-bot: Allow computed dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208263 (owner: 10Ori.livneh) [18:20:49] greg-g: In it's own window because it needs a scap for l10n [18:20:58] (03PS4) 10Ori.livneh: Add group1.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208264 [18:21:04] (03CR) 10Ori.livneh: [C: 032] Add 
group1.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208264 (owner: 10Ori.livneh) [18:21:09] (03Merged) 10jenkins-bot: Add group1.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208264 (owner: 10Ori.livneh) [18:26:56] !log ori Synchronized wmf-config: I81df3a614, I02b06f8e2, I366561a0f: Use MWWikiversions::readDbListFile to read dblist files; Allow computed dblist expressions; Add group1.dblist (duration: 00m 14s) [18:27:07] Logged the message, Master [18:29:15] (03PS2) 10Matanya: access: Remove Erik Moeller's Production Shell Access [puppet] - 10https://gerrit.wikimedia.org/r/208566 [18:34:52] howdy, i'm getting a 503 from https://meta.wikimedia.org/w/index.php?title=Schema:MobileWikiAppSavedPages. [18:35:13] mholloway: looking [18:35:21] cool, thanks! [18:35:30] 2015-05-04 18:35:04 mw1017 metawiki fatal INFO: [36863940] /w/index.php?title=Schema:MobileWikiAppSavedPages ErrorException from line 264 of /srv/mediawiki/php-1.26wmf3/includes/exception/MWExceptionHandler.php: Fatal Error: Call to undefined method FormatJson::parse() [18:35:48] milimetric: ^ [18:35:51] before you fix that [18:35:58] why is that a 503 instead of 500? [18:36:39] Fatals are served as 503s, but I'm not sure why. [18:37:25] yeah let's debug that [18:38:04] I've seen this in the past where varnish was retrying N times and failing with a 503 in the end [18:38:13] I just saw it yesterday [18:38:21] with the KKK thing [18:39:08] $ curl -I -H 'host: meta.wikimedia.org' http://localhost/wiki/Schema:MobileWikiAppSavedPages [18:39:08] HTTP/1.1 500 Internal Server Error [18:39:14] dbe483b31a42da607714b5805566277ad9055b5c [18:39:22] on mw1041 [18:39:23] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [18:39:48] ottomata, milimetric: do you know who is maintaining the PHP code of EventLogging right now? 
[18:40:01] I am not up-to-speed on recent changes [18:40:54] ori: catching up [18:41:16] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1258151 (10RobH) a:3RobH [18:41:22] PROBLEM - Parsoid on wtp1020 is CRITICAL - Socket timeout after 10 seconds [18:41:42] gwicke: ^ [18:41:50] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146174 (10RobH) I'll snag this and get the updated quotes put onto the RT ticket later today. [18:42:17] ori: theoretically we are, but we have no php people on the team since Nuria's on leave [18:42:18] PHP code, no i don't know [18:42:18] FormatJson::parse() is very much a thing [18:42:25] so I don't know why it's fataling [18:42:44] i can't find an obviously culpable commit in EL's commit log [18:42:53] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1086 bytes in 0.020 second response time [18:43:10] ori: could it be related to the change that made the schemas return "required": "" instead of "required": true? [18:43:24] this was fixed as a hack through the API but I'm not sure what's happening behind the scenes [18:43:25] yuvipanda: cscott and subbu are on it [18:43:46] milimetric: why is it returning that, and why was it fixed with a hack? [18:44:00] !log updated Parsoid to version b53a7272 [18:44:08] Logged the message, Master [18:44:47] ori: i'll get you the bug from the debacle last week [18:44:48] yuvipanda, gwicke: yeah, i had to manually restart wtp1020 after the deploy. 
[18:44:56] jdlrobson: 138 Fatal error: Call to undefined method FormatJson::parse() in /srv/mediawiki/php-1.26wmf3/extensions/Gather/includes/api/ApiEditList.php on line 686 [18:46:00] ori: https://gerrit.wikimedia.org/r/#/c/207297/ [18:46:03] milimetric: I don't need to know the history; it just needs to be fixed properly [18:46:34] well, my point is, I only know the history and I don't understand the root cause [18:46:48] it was a change in core is all I was told [18:47:14] legoktm: https://gerrit.wikimedia.org/r/#/c/208262/ [18:47:22] i think that's the culrpit; there's a FormatJson in multiversion [18:48:55] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1258196 (10RobH) p:5Normal>3High [18:49:14] (03PS1) 10Ori.livneh: Update FormatJson to 532337e6ff from mediawiki/core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208705 [18:49:39] (03CR) 10Ori.livneh: [C: 032] Update FormatJson to 532337e6ff from mediawiki/core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208705 (owner: 10Ori.livneh) [18:49:44] (03Merged) 10jenkins-bot: Update FormatJson to 532337e6ff from mediawiki/core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208705 (owner: 10Ori.livneh) [18:50:56] !log ori Synchronized multiversion/FormatJson.php: Ice8f1796c: Update FormatJson to 532337e6ff from mediawiki/core (duration: 00m 12s) [18:51:04] Logged the message, Master [18:51:47] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1258201 (10GWicke) Just discussed this in the ops meeting. The plan is to: 1) order three of the smaller instances now (~r320 from https://phabricator.wikimedia.org/T93790#1251095), and... 
[19:05:13] (03PS1) 10Ori.livneh: Remove FormatJson from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208711 [19:05:59] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:07:08] legoktm: ^ [19:08:20] ori: +1'd (bot just died?) [19:08:42] i guess so [19:09:10] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:09:31] akosiaris, what's the procedure to push graphoid service into prod? [19:09:43] (03CR) 10Ori.livneh: [C: 032] Remove FormatJson from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208711 (https://phabricator.wikimedia.org/T98051) (owner: 10Ori.livneh) [19:09:50] (03Merged) 10jenkins-bot: Remove FormatJson from mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208711 (https://phabricator.wikimedia.org/T98051) (owner: 10Ori.livneh) [19:10:06] akosiaris, sorry, i meant - update the version [19:10:22] i have already updated the /depl repo [19:12:10] !log ori Synchronized multiversion: I2d93ede75: Remove FormatJson from mediawiki-config (duration: 00m 13s) [19:12:17] Logged the message, Master [19:14:26] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1258277 (10bd808) When we are ready to do this we should also move 2 of the three boxes to new racks. Today all three are in the same rack behind the same switch which can lead to catastrophic down time ra... 
[19:16:00] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1258279 (10faidon) (preferrably, different rows too for even more protection) [19:20:46] paravoid: if you want to debug the 500 -> 503 alchemy, http://test.wikipedia.org/w/5xx.php [19:20:55] i put it on mw1017 only [19:20:59] which has a different varnish config [19:22:04] it's a 500 from HHVM and it remains a 500 through Varnish and nginx (when requesting over https://). Since the nginx configuration doesn't special-case mw1017 / testwiki, it must be that the Varnish config is causing it to be a 503. [19:22:37] that page is a 500 [19:22:51] yes, i know [19:22:52] not that it's much better, template-wise :P [19:23:10] i'm saying: the HHVM config isn't special, the apache config isn't special, and the nginx config isn't special [19:23:12] and it's a 500 [19:23:16] yeah [19:24:14] !log ori Synchronized w/5xx.php: (no message) (duration: 00m 14s) [19:24:22] Logged the message, Master [19:24:44] huh, weird [19:24:47] it's 500 still [19:24:53] en.wikipedia.org/w/5xx.php [19:25:23] varnish does have code to generate its own 503 from within varnish [19:25:47] which is part of what that whole ticket revolves around, too: https://phabricator.wikimedia.org/T97206 [19:25:47] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1258324 (10Cmjohnson) Upgrade has been completed --- JUNOS 11.4R6.5 built 2012-11-28 20:02:31 UTC S/N is Item Version Part number Serial number Description Chas... 
[19:27:24] <% if @vcl_config.fetch("retry503", "0") != "0" -%> if (obj.status == 503 && req.restarts < <%= @vcl_config["retry503"].to_i %>) { [19:27:32] modules/role/manifests/cache/bits.pp: 'retry503' => 4, [19:27:35] modules/role/manifests/cache/misc.pp: 'retry503' => 4, [19:27:38] modules/role/manifests/cache/mobile.pp: 'retry503' => 4, [19:27:41] modules/role/manifests/cache/parsoid.pp: 'retry503' => 4, [19:27:46] paravoid: there's a summary of those in the ticket linked above [19:27:48] that won't work unless max_restarts is also adjusted [19:28:03] and upon reaching max_restarts, varnish emits a 503 itself [19:28:12] the retry503 vs retry5xx currently deployed is nonsensical [19:28:54] (I think retry5xx blocks retry503 behavior) [19:29:08] still, the case above was for text [19:29:14] well, text doesn't have retry5xx [19:29:15] (03CR) 10Negative24: [C: 031] phab stage tags for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [19:29:22] https://meta.wikimedia.org/w/index.php?title=Schema:MobileWikiAppSavedPages that is [19:29:37] so text is the only one with a working retry503 block in practice [19:29:38] it was 500ing before in appservers, 503'ing in varnish [19:29:56] (I haven't touched any of this recently) [19:29:58] (03CR) 10Negative24: "Sometimes I like to think that my contributions actually do something. :(" [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [19:30:18] mholloway: are you sure it was a 503? [19:30:29] I am [19:30:30] 10Ops-Access-Requests, 6operations: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1258344 (10coren) I need approval verbiage from @kevinator, please. 
[19:30:31] I saw it [19:30:52] ori: yep [19:31:04] Request: GET http://meta.wikimedia.org/w/index.php?title=Schema:MobileWikiAppSavedPages, from 10.20.0.176 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 2433167918 [19:31:08] ori: Error: 503, Service Unavailable at Mon, 04 May 2015 18:31:27 GMT [19:31:16] yurik: https://wikitech.wikimedia.org/wiki/Trebuchet#Deploying [19:31:17] Forwarded for: , 10.20.0.165, 10.20.0.165, 10.20.0.176 [19:31:17] Error: 503, Service Unavailable at Mon, 04 May 2015 18:35:44 GMT [19:31:31] yes, that makes sense on text [19:31:32] yurik: in tin:/srv/deployment/graphoid/deploy [19:31:33] 10Ops-Access-Requests, 6operations: Add Tilman to "researchers" group on stat1003 - https://phabricator.wikimedia.org/T97916#1258351 (10coren) p:5Triage>3Normal [19:31:38] it does? [19:31:40] how? [19:31:47] akosiaris, don't i need access rights to the actual service box? [19:32:08] paravoid: text has no retry5xx, and retry503 that it does have only applies to the applayer sending varnish a 503. So it's just doing whatever default varnish restarts until it hits limit and does a varnish-generated 503 [19:32:14] akosiaris, gwicke said he also can't do it and needs to poke you [19:32:41] (03PS4) 10Rush: phab stage tags for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/205723 [19:32:57] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team, 5Patch-For-Review: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1258365 (10coren) @tfinc: Can I get approval language, please? 
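The interaction between the `retry503` VCL block and `max_restarts` described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Varnish code: it assumes, as stated in the channel, that the VCL restarts a request only when the backend answers 503 and `req.restarts` is below `retry503`, and that Varnish emits its own 503 once `max_restarts` is reached. The function name `simulate` and the parameter values are hypothetical.

```python
def simulate(backend_statuses, retry503, max_restarts):
    """Return (status, source) for one client request.

    backend_statuses lists what the backend answers on each attempt;
    the last entry repeats if more attempts are made than entries exist.
    """
    restarts = 0
    while True:
        index = min(restarts, len(backend_statuses) - 1)
        status = backend_statuses[index]
        wants_restart = status == 503 and restarts < retry503
        if not wants_restart:
            # Anything other than a retryable 503 is passed to the client.
            return status, "backend"
        if restarts >= max_restarts:
            # Restart budget exhausted: the proxy answers with its own 503.
            return 503, "varnish"
        restarts += 1


print(simulate([500], retry503=4, max_restarts=4))       # (500, 'backend')
print(simulate([503, 200], retry503=4, max_restarts=4))  # (200, 'backend')
print(simulate([503], retry503=4, max_restarts=2))       # (503, 'varnish')
```

In this model a backend 500 is never turned into a 503, which matches the observation that the application servers were answering 500 and that the retry logic alone does not explain the 503 seen by clients.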
[19:33:18] (03CR) 10Rush: [C: 032] "I don't think phab_update_tag hits the extension repo fyi but puppet will notify on mismatch for any managed repo's" [puppet] - 10https://gerrit.wikimedia.org/r/205723 (owner: 10Rush) [19:33:20] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team, 5Patch-For-Review: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1258366 (10Tfinc) Approved [19:33:24] bblack: I don't think varnish restarts by default (with no VCL) [19:33:37] oh, right [19:33:37] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1258368 (10coren) @greg: Can I get approval language for this request, please? [19:34:06] by default with no restarts then, would it pass on the 500 or s/500/503/? [19:34:49] pass on the 500, as it does with https://meta.wikimedia.org/w/5xx.php right now [19:35:08] right, ok [19:35:33] so, something triggered a restart then, I guess, but I'm at a loss as to what [19:39:05] another possible avenue to explore there is that, for some reason for that request, it thought all backends were unhealthy? [19:39:25] well, all it could reach via retries on the hash [19:41:21] additional confirmation that this was done in varnish: I found Faidon's request from earlier in the apache access log on mw1071, and it was indeed served as a 500 [19:41:29] ok [19:41:43] do we have full headers for what came out of varnish somewhere? [19:42:09] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [19:43:36] (03PS1) 10John F. 
Lewis: [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) [19:43:48] (03PS2) 10coren: add jdouglas to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/207643 (https://phabricator.wikimedia.org/T97559) (owner: 10Dzahn) [19:43:49] oh, wait, it only has one backend for LVS, right? [19:43:58] hmmm [19:44:23] yurik: technically speaking, no access, is not needed. That being said, you should have access to the sca cluster to allow restarting the service and assume the uid the service runs under, but not deploying [19:45:01] (03CR) 10jenkins-bot: [V: 04-1] [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. Lewis) [19:45:03] (03PS3) 10coren: add jdouglas to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/207643 (https://phabricator.wikimedia.org/T97559) (owner: 10Dzahn) [19:46:30] 6operations, 10Citoid, 10Graphoid, 6Mobile-Apps, and 2 others: SCA services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#1245380 (10Mvolz) [19:46:55] (03CR) 10coren: [C: 032] "Approved." [puppet] - 10https://gerrit.wikimedia.org/r/207643 (https://phabricator.wikimedia.org/T97559) (owner: 10Dzahn) [19:47:53] 10Ops-Access-Requests, 6operations, 10CirrusSearch, 6Search-Team: James Douglas sudo on Elasticsearch machines - https://phabricator.wikimedia.org/T97559#1258470 (10coren) 5Open>3Resolved [19:48:54] (03PS1) 10Ori.livneh: wmgUseBits: false for nl and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208726 [19:48:57] bblack: ^ [19:50:17] akosiaris, thx! should i create a phab ticket to request sca cluster access? Also, will you have time later to watch over me updating the service? [19:50:48] yeah I'm lost on the 503 thing. 
I don't see how that's possible unless varnish was 503ing because it thought all backends had failed multiple healthchecks [19:50:59] (all being one in the LVS case) [19:51:08] but then we'd be seeing that for all cache misses at that point [19:51:40] I'm not sure if it does that anyways, or just considers them all up if they're all dead (the latter is what I would do if I were implementing it) [19:52:25] (03CR) 10BBlack: [C: 031] wmgUseBits: false for nl and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208726 (owner: 10Ori.livneh) [19:53:06] (03PS1) 10coren: Add neilpquinn-wmf to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/208732 (https://phabricator.wikimedia.org/T97746) [19:53:23] (03PS2) 10Ori.livneh: wmgUseBits: false for nl and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208726 [19:53:39] (03CR) 10Ori.livneh: [C: 032] wmgUseBits: false for nl and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208726 (owner: 10Ori.livneh) [19:54:25] (03Merged) 10jenkins-bot: wmgUseBits: false for nl and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208726 (owner: 10Ori.livneh) [19:54:27] (03CR) 10jenkins-bot: [V: 04-1] Add neilpquinn-wmf to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/208732 (https://phabricator.wikimedia.org/T97746) (owner: 10coren) [19:54:30] oh, varnish will also 503 if it's can't parse the backend status in the response [19:54:43] i.e. if the 500 in this case came out garbled from varnish's point of view [19:56:02] or probably if we ran out of connections, too [19:56:08] !log ori Synchronized wmf-config/InitialiseSettings.php: I62dffd271: wmgUseBits: false for nl and dewiki (duration: 00m 11s) [19:56:15] Logged the message, Master [19:56:37] i.e.: backend ipv4_10_2_2_1 { ... ; .max_connections = 1000; } [19:57:38] de and nl look good [19:58:17] what's up with the buffer drop here around the same time as the 503? 
http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=cp1065.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1430769445&g=mem_report&z=large&c=Text%20caches%20eqiad [20:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150504T2000). [20:01:00] sorry not buffer, actual alloc [20:01:31] varnish 8093 30246 83 18:41 ? 01:06:31 /usr/sbin/varnishd [20:01:32] bblack: https://phabricator.wikimedia.org/P604 [20:01:34] actual varnish restart [20:01:46] (03CR) 10Dereckson: "When there is a bug triggering a config change, please add the relevant task number near the change (here "// T97488"), that eases traceab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [20:02:19] May 4 18:41:47 cp1065 frontend[30246]: Child (7569) Panic message: Assert error in VRT_time_string(), cache_vrt.c line 347:#012 Condition((p [20:02:22] .... [20:02:32] yep [20:03:11] Hm. Can anyone understand why jenkins-bot threw a fit over https://gerrit.wikimedia.org/r/#/c/208732/ ? [20:03:12] CHECK_OBJ_NOTNULL(req, REQ_MAGIC); [20:03:24] I can't make heads or tails of the report. [20:03:44] Coren: modules/admin/data/data_test.py failed [20:03:46] see https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27/872/console [20:04:12] Ah! I was looking at the wrong test result! [20:04:14] Thanks, ori! [20:04:20] np [20:08:52] (03PS1) 10Ottomata: Set yarn.app.mapreduce.am.env to work around MAPREDUCE-5799 [puppet] - 10https://gerrit.wikimedia.org/r/208740 [20:09:31] (03CR) 10Ottomata: [C: 032 V: 032] Set yarn.app.mapreduce.am.env to work around MAPREDUCE-5799 [puppet] - 10https://gerrit.wikimedia.org/r/208740 (owner: 10Ottomata) [20:10:11] (03PS2) 10John F. 
Lewis: [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) [20:10:54] (03CR) 10jenkins-bot: [V: 04-1] [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. Lewis) [20:11:17] (03CR) 10Dzahn: [C: 032] add basic mira def to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/208687 (owner: 10John F. Lewis) [20:11:32] !log deployed restbase v0.6.0 / 76583a07 [20:11:40] Logged the message, Master [20:17:06] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Give Neil Quinn access to stats1003.eqiad.wmnet - https://phabricator.wikimedia.org/T97746#1258605 (10coren) In addition, I will need a SSH key with which you will be accessing production servers. Please do //not// use the same key which you use on Labs... [20:17:36] (03PS2) 10coren: Add neilpquinn-wmf to researchers, statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/208732 (https://phabricator.wikimedia.org/T97746) [20:18:13] (03PS3) 10John F. Lewis: [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) [20:18:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) (owner: 10John F. Lewis) [20:20:32] (03PS4) 10John F. Lewis: [WIP] deploy mira as codfw deployment server [puppet] - 10https://gerrit.wikimedia.org/r/208723 (https://phabricator.wikimedia.org/T95436) [20:20:42] is something wrong with wikitech wiki? [20:20:51] * aude can't login [20:21:07] (03PS1) 10Andrew Bogott: Don't purge the puppet key for the puppetmaster itself! [puppet] - 10https://gerrit.wikimedia.org/r/208747 [20:21:14] yuvipanda: a good one ^ [20:21:16] Not that I can see; though I'm already logged in. andrewbogott, you playing with keystone? 
[20:21:25] Coren: nope [20:21:29] I’ll check [20:21:30] andrewbogott: wow how... [20:21:53] (03PS1) 10Ottomata: Set absolute path of yarn.app.mapreduce.am.env [puppet] - 10https://gerrit.wikimedia.org/r/208749 [20:21:55] yuvipanda: apparently ‘puppet cert list’ lists the server cert as well as the client certs… [20:22:00] ... [20:22:06] (03PS2) 10Ottomata: Set absolute path of yarn.app.mapreduce.am.env [puppet] - 10https://gerrit.wikimedia.org/r/208749 [20:22:12] (03CR) 10Ottomata: [C: 032 V: 032] Set absolute path of yarn.app.mapreduce.am.env [puppet] - 10https://gerrit.wikimedia.org/r/208749 (owner: 10Ottomata) [20:22:29] (03CR) 10Yuvipanda: Don't purge the puppet key for the puppetmaster itself! (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/208747 (owner: 10Andrew Bogott) [20:22:41] Coren: what time do you want to do the nfs switchover? [20:22:49] and the sidebar of wikitech wiki is weird https://phabricator.wikimedia.org/F160466 [20:22:54] e.g. the default, on some pages [20:23:10] seems like a memcached issue or it's just me [20:23:11] aude: I logged out and in, no problems. [20:23:14] ooooh [20:23:37] yuvipanda: I already put it on the ticket 19h UTC same as last time. [20:23:42] now it works :o [20:23:46] aude: if you’re not logged in the sidebar should probably look like that... [20:23:50] Coren: ah, cool. sorry missed that [20:23:59] yuvipanda: No worries. [20:24:03] andrewbogott: i see [20:24:11] Coren: wait isn’t it already past 19h UTC? [20:24:14] andrewbogott: actually some pages, it shows SAL [20:24:16] some not [20:24:33] yuvipanda: ... *headdesk* It is. I got busy working on access-requests and didn't see the time pass by. [20:24:46] I guess "now" is a good alternative time. :-) [20:25:26] We're still in the window. [20:25:44] yuvipanda: when a client refers to a local puppetmaster, what does it call it? [20:25:49] fqdn, right? [20:26:30] Coren: yup [20:26:41] andrewbogott: ‘refers’ as in? 
[20:26:57] yuvipanda: as in… if instance b is using instance a as its puppetmaster... [20:27:03] what does it have as the puppetmaster in puppet.conf? [20:27:06] andrewbogott: ah yes, fqdn [20:27:13] !log Starting NFS server switch - graceful labstore1001 shutdown. [20:27:17] ok, should be fine then. [20:27:22] Logged the message, Master [20:55:28] (03CR) 10Dzahn: "@andre__ @coren we need to go through the steps here to make a shell account. create a ssh key please and paste it on office wiki or phab" [puppet] - 10https://gerrit.wikimedia.org/r/207846 (https://phabricator.wikimedia.org/T97642) (owner: 10Dzahn) [20:55:28] !log rebooting analytics1037 [20:55:29] (03CR) 10Andrew Bogott: "It should be moot. As far as I know, servers are always referred to with fqdn; if it's referred to with its ec2 name, then that is alread" [puppet] - 10https://gerrit.wikimedia.org/r/208747 (owner: 10Andrew Bogott) [20:55:29] greg-g: please paste approval guy meme https://phabricator.wikimedia.org/T97642 [20:55:31] (03CR) 10Yuvipanda: [C: 031] Don't purge the puppet key for the puppetmaster itself! [puppet] - 10https://gerrit.wikimedia.org/r/208747 (owner: 10Andrew Bogott) [20:55:34] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:34] Coren: ^ I’m going to acknowledge [20:55:37] !log checking raid consistency from labstore1002 [20:55:38] (03CR) 10Andrew Bogott: [C: 032] Don't purge the puppet key for the puppetmaster itself! [puppet] - 10https://gerrit.wikimedia.org/r/208747 (owner: 10Andrew Bogott) [20:55:38] ACKNOWLEDGEMENT - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% Yuvi Panda Coren doing switchover. [20:55:39] (03PS1) 10John F. Lewis: move scap::master to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/208801 [20:55:40] yuvipanda, could I ask you for a review? [20:55:42] a small RB config change: https://gerrit.wikimedia.org/r/#/c/208313/ [20:55:43] or mutante? 
^^ [20:55:43] * gwicke notes that Coren is on duty [20:55:44] yuvipanda: The thin_check takes longer than I hoped. Paranoia has a cost. about 5 min left [20:55:44] * aude keeps getting logged otu of wikitech [20:55:44] Coren: :) [20:55:44] gwicke: we’re doing an NFS switchover, so probably not for the next 24hours? :) [20:55:44] (03PS1) 10Dzahn: admin: create user for aklapper [puppet] - 10https://gerrit.wikimedia.org/r/208802 (https://phabricator.wikimedia.org/T97642) [20:55:44] uhh, okay; good luck with that! [20:55:44] * gwicke looks around for opsens not involved in labs NFS [20:55:45] aude, you think there are issues with wikitech? [20:55:48] yuvipanda: There seems to be an issue with thin volume version mismatch. :-( Aborting and rolling back. [20:55:48] Krenair: i do [20:55:48] Coren: ok! [20:55:48] It'll need research; I see tools to convert the metadata. [20:55:48] i have trouble logging in sometimes, when i do get logged in then get logged out, and sidebar is sometimes strange [20:55:49] all symptoms of memcached issues in my opinion [20:55:49] How long ago did this start? [20:55:49] but don't have time to look [20:55:49] idk when it started, but at least the past hour maybe [20:55:49] maybe it's just me [20:55:49] (03PS2) 10GWicke: Add x-host-basePath config for /api/rest_v1/ entry point [puppet] - 10https://gerrit.wikimedia.org/r/208313 [20:55:49] <^demon|lunch> ori: { "message": "Class undefined: MissingObject",#012 "file": "/srv/mediawiki/w/5xx.php",#012 "line": 2,#012 "context": [],#012 "backtrace": [#012 {#012 "file": "/srv/mediawiki/w/5xx.php",#012 "line": 2#012 }#012 ]#012} [20:55:49] have problems in firefox and chromium [20:55:49] Coren: and now we rebooted labstore1001! 
[20:55:49] aude, to be honest the last time I saw a session issue with wikitech, restarting apache there did the trick [20:55:49] * yuvipanda wonders what gremlins are going to come out of the woodwork now [20:55:49] Krenair: could be the solutin [20:55:49] yuvipanda: Did you already power it up? [20:55:49] Coren no no [20:55:49] aude, well it's not exactly a "solution", but... :) [20:55:49] * aude is going out for a couple hours [20:55:49] can't try or investigate [20:55:50] ok [20:55:50] Coren: I’m not doing anything :) but by rolling back we are going to bring it back to labstore1001 [20:55:50] PROBLEM - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:50] I guess [20:55:50] so [20:55:50] Ah, yes indeed. [20:55:50] !log Powering labstore1001 back up [20:55:50] 1001 on its way back up. [20:55:50] (03CR) 10GWicke: "The code that uses this is now deployed. Currently the documentation at the main domain returns broken example URLs, so it would be great " [puppet] - 10https://gerrit.wikimedia.org/r/208313 (owner: 10GWicke) [20:55:50] yuvipanda: I'll need to do a thin_dump on 1001, that exports the metadata to xml which can then be reimported on 1002 with the newer tools. [20:55:51] Coren: are we going to do that today? I guess not [20:55:52] yuvipanda: I'd rather not; I've never done it before so I want to test it in codfw first. [20:55:52] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/208802/" [puppet] - 10https://gerrit.wikimedia.org/r/207846 (https://phabricator.wikimedia.org/T97642) (owner: 10Dzahn) [20:55:52] Coren: yup, +1 [20:55:52] yuvipanda: Especially since that seems... very one-way. [20:55:53] Oh, gawd. Like the POSt wasn't long enough on these things I have to do it twice. [20:55:53] On the plus side, the RAID and non-thin volumes were detected and worked without issue. [20:55:54] !log starting nfs on labstore1001 [20:55:54] RECOVERY - Host labstore1001 is UPING OK - Packet loss = 0%, RTA = 1.46 ms [20:55:54] Blah. 
MMP protection delay. [20:55:54] !log NFS done starting on labstore1001 [20:55:54] At least we had a clean rollback path. :-) [20:56:27] !log NFS service active and working [20:56:54] (03PS3) 10Jdlrobson: Enable Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) [20:59:56] Coren: hmm, I still can’t ssh anywhere [20:59:56] yuvipanda: I just got into bastion - it'll take some time to recover interactive performance since every instance is trying at once. [21:00:16] yuvipanda: I'm seeing a lot of connection from clients to the server, and increasing performance as dust settles. [21:00:42] I just got onto tools-bastion too; sluggish but working. [21:01:52] (03CR) 10BryanDavis: "I have all of the shards moved off of logstash100[1-3] to logstash100[4-6] now, so this is ready to go. At the moment I have used transien" [puppet] - 10https://gerrit.wikimedia.org/r/205971 (https://phabricator.wikimedia.org/T96814) (owner: 10BryanDavis) [21:02:27] Coren: ok! [21:03:33] yuvipanda: I'm seeing reasonable performance everywhere. Are you still seeing issues? [21:03:42] jouncebot: next [21:03:42] In 1 hour(s) and 56 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150504T2300) [21:03:43] Coren: nope, am good too [21:03:58] jouncebot: y u no announce my window? [21:04:13] Coren: can you update ticket + email labs-announce / ops@? [21:04:40] yuvipanda: Will do so shortly, I'm still keeping a very close eye on this. [21:04:48] Coren: +1 thanks [21:05:41] I'm going to deploy https://gerrit.wikimedia.org/r/#/c/208702/ now. It needs a full scap [21:08:08] enjoy [21:12:38] 6operations, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1258826 (10GWicke) So, should we a) get a new box with more / bigger SSDs (most 2.5" cases have space for 8 SSDs), or b) replace the existing SSDs with bigger ones? 
[21:18:16] !log bd808 Started scap: Update 1.26wmf4 ContactPage and WikimediaMessages for AffCom contact form [21:18:22] Logged the message, Master [21:21:08] (03CR) 10PleaseStand: Remove FormatJson from mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208711 (https://phabricator.wikimedia.org/T98051) (owner: 10Ori.livneh) [21:23:35] uhm [21:24:27] (03PS1) 10BBlack: lower http_req_size from def 32K... [puppet] - 10https://gerrit.wikimedia.org/r/208814 [21:24:41] I think Tyler has a point there [21:24:59] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [21:25:18] That code fails on tin (5.3) but works on mw1001 (hhvm) [21:25:31] (03CR) 10BBlack: [C: 032] lower http_req_size from def 32K... [puppet] - 10https://gerrit.wikimedia.org/r/208814 (owner: 10BBlack) [21:26:37] Although I don't know if that code is actually used to be honest [21:26:46] Do you use it thcipriani? [21:27:39] no, hang on, who was doing train deploys... twentyafterfour? [21:27:50] Krenair: yes [21:27:54] sorry, reading scrollback [21:28:06] (03CR) 10BryanDavis: Remove FormatJson from mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208711 (https://phabricator.wikimedia.org/T98051) (owner: 10Ori.livneh) [21:28:10] yes, it would break next time twentyafterfour tried to push out a new version [21:28:29] Krenair: yes in response to the question about who does traindeploys [21:28:32] we really should upgrade tin [21:28:39] So are we going to fix it by finishing the HHVM migration? [21:28:43] Or by reverting ori's patch? [21:28:57] ori, and all the other hosts still on 5.3... [21:28:58] we shouldn't revert it anyway, we can just make it output non-pretty json [21:29:03] We don't have hhvm on all of the image scalers yet either do we? [21:29:15] yeah, but we only change wikiversions from tin [21:29:19] Nor on silver. 
[21:29:24] it's in public static function writeWikiVersionsFile( $path, $wikis ) { [21:29:30] ah. true [21:29:31] But tin is the host we deploy from, so... [21:29:32] we don't run that anywhere else [21:29:53] Are we actually able to deploy from anywhere else? terbium perhaps? [21:30:25] nope. and terbium is still 5.3 as well in any event [21:30:37] It looks like we have a /srv/mediawiki-staging directory on terbium... [21:30:51] I mean, we definitely do have that directory. But I don't know what it's used for. [21:31:00] ok, let's drop the pretty-printing flags [21:31:09] * ori writes a patch [21:31:33] I'm mid scap so don't merge it yet plz [21:31:56] sure [21:33:31] 10Ops-Access-Requests, 6operations: Requesting access to analytics-privatedata-users for Guillaume Paumier - https://phabricator.wikimedia.org/T98077#1258864 (10gpaumier) 3NEW [21:34:05] !log cr{1,2}-{eqiad,ulsfo}: swapping metrics for ulsfo's transport links [21:34:13] Logged the message, Master [21:35:26] dropping pretty print on that json blob is going to make reviewing it "fun" [21:35:39] (03CR) 10Dzahn: [C: 032] Add my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/208400 (owner: 10Alex Monk) [21:37:20] shell out to python to pretty print? ;) [21:37:34] (03PS10) 10Alex Monk: Add my script for generating meta:System_administrators#List [puppet] - 10https://gerrit.wikimedia.org/r/208395 [21:39:04] 6operations, 10Traffic, 7discovery-system: Integrate confd into the varnish configuration to generate the list of active backends - https://phabricator.wikimedia.org/T97975#1258889 (10GWicke) @joe: One consideration is that the deployment system will need synchronous confirmation that a service is in fact de... 
[21:39:45] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1258893 (10RobH) RT ticket for Dell quote: https://rt.wikimedia.org/Ticket/Display.html?id=9337 [21:40:28] !log bd808 Finished scap: Update 1.26wmf4 ContactPage and WikimediaMessages for AffCom contact form (duration: 22m 11s) [21:40:35] Logged the message, Master [21:40:59] (03PS1) 10Ori.livneh: Make json_encode() options backward-compatible with PHP 5.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208821 [21:41:45] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1258898 (10Multichill) Someone uploaded this really nice video at https://commons.wikimedia.org/wiki/File:Snowdonia_by_drone.webm . Transcoding failed: ``` '/u... [21:42:52] * bd808 is done with his deploy window [21:45:52] (03PS3) 10BBlack: Add x-host-basePath config for /api/rest_v1/ entry point [puppet] - 10https://gerrit.wikimedia.org/r/208313 (owner: 10GWicke) [21:46:29] (03CR) 10BBlack: [C: 032 V: 032] Add x-host-basePath config for /api/rest_v1/ entry point [puppet] - 10https://gerrit.wikimedia.org/r/208313 (owner: 10GWicke) [21:48:22] bd808 / Krenair: https://gerrit.wikimedia.org/r/#/c/208821/ ? [21:49:40] PROBLEM - puppet last run on stat1002 is CRITICAL Puppet last ran 6 hours ago [21:50:51] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1218173 (10yuvipanda) [21:50:53] (03CR) 10BryanDavis: [C: 031] "This does run without error on tin (5.3.10-1ubuntu3.18+wmf1). 
The resulting output will not be pretty printed which will make reviewing th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208821 (owner: 10Ori.livneh) [21:51:11] (03CR) 10Ori.livneh: [C: 032] Make json_encode() options backward-compatible with PHP 5.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208821 (owner: 10Ori.livneh) [21:51:16] (03Merged) 10jenkins-bot: Make json_encode() options backward-compatible with PHP 5.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208821 (owner: 10Ori.livneh) [21:51:27] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1218173 (10yuvipanda) This was reported by @legoktm just now again: ```tools.extreg-wos@tools-bastion-01:~/src/extensions$ git pull error: inflate: data stream error (incorrect header check) fatal: loose object... [21:53:28] (03PS1) 10Ori.livneh: wmgUseBits: false for ru and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208822 [21:53:31] bblack: ^ [21:53:39] PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100% [21:53:45] i suggest the next step after that is to set it to false by default [21:53:45] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1258932 (10yuvipanda) This could be due to interaction between git / NFS, or just git, or just NFS. I hope it's just git. [21:54:14] (03CR) 10Mobrovac: "Should we let commons fall into the default storage group or create another one for known exceptions?" [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [21:54:17] ori: sounds fair, maybe exclude en until last at that point? [21:54:42] (03CR) 10BBlack: [C: 032] wmgUseBits: false for ru and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208822 (owner: 10Ori.livneh) [21:54:59] oh that was meant to be +1 [21:55:03] does it automerge? 
:) [21:55:06] Yes [21:55:09] yep [21:55:12] It already has [21:55:15] fantastic lol [21:55:31] !log ori Synchronized wmf-config/InitialiseSettings.php: Id56e33263: wmgUseBits: false for ru and eswiki (duration: 00m 12s) [21:55:37] Logged the message, Master [21:55:40] wait what it was merged by you before Jenkins got to it? :O [21:55:42] wat [21:55:44] (03CR) 10GWicke: "@Mobrovac, I was wondering the same. We don't want too many storage groups, but maybe commons is large / important enough to warrant its o" [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [21:55:47] autodeploy too [21:55:49] 😇 [21:56:17] (03PS3) 10GWicke: Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) [21:56:19] I guess that's just Gerrit's history reordering bug [21:56:24] automerge seems dangerous :P [21:56:31] at least, for people like me! [21:57:08] ori: ok let's leave that for today, try default (minus maybe en) tomorrow? [21:57:54] (03PS3) 10Dzahn: doc.wikimedia.org: fix DirectorySlash https->http [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) [21:58:14] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1258945 (10Legoktm) I moved the corrupt folder to `/data/project/extreg-wos/src/extensions-corrupt` so I can unbreak my tool. 
[21:58:52] (03CR) 10Dzahn: [C: 032] doc.wikimedia.org: fix DirectorySlash https->http [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [21:59:30] RECOVERY - Host analytics1037 is UPING OK - Packet loss = 0%, RTA = 3.70 ms [21:59:35] bblack: itwiki, ruwiki, eswiki, nlwiki and dewiki collectively get over 4m PVs per hour, compared to enwiki's 8, so i think there wouldn't be much to gain from having two additional steps; going default (including en) tomorrow seems right [21:59:48] 6operations, 10ops-eqiad: /dev/sdm not loading on analytics1037 - https://phabricator.wikimedia.org/T98081#1258967 (10Ottomata) 3NEW a:3Cmjohnson [21:59:54] (in other words, I don't see much value in an all-but-enwiki step) [22:00:27] or we could do all-but-en today [22:00:34] well, up to you [22:00:37] cmjohnson: yt? [22:00:38] https://phabricator.wikimedia.org/T98081 [22:00:48] (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/206832 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [22:01:17] ottomata: yep [22:01:20] danke. [22:01:55] ori: fair enough, all tomorrow then [22:02:03] cool [22:02:11] (03PS4) 10GWicke: Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) [22:02:38] (03CR) 10GWicke: "@Mobrovac, added a storage group for *.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [22:02:51] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1258992 (10Dzahn) fixed on doc.wm: before: curl -vv https://doc.wikimedia.org/mediawiki-core The document has moved
[22:14:10] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:22:01] (03PS1) 10Yuvipanda: tools: Backup user crontabs on to NFS [puppet] - 10https://gerrit.wikimedia.org/r/208831 (https://phabricator.wikimedia.org/T95798) [22:22:02] valhallasw`nuage: ^ [22:23:06] yuvipanda: that's dirty :P [22:23:14] valhallasw`nuage: but effective :P [22:23:27] valhallasw`nuage: I also don’t want to add a cron that backs up crons. that seems a bit eugh for some reason :) [22:23:43] yeah, this has the advantage we have monitoring automagically [22:23:58] maybe we should just have a backup.pp for various backup purposes? [22:24:11] (03CR) 10Merlijn van Deen: [C: 031] tools: Backup user crontabs on to NFS [puppet] - 10https://gerrit.wikimedia.org/r/208831 (https://phabricator.wikimedia.org/T95798) (owner: 10Yuvipanda) [22:24:18] then it's clear where it comes from [22:24:21] yuvipanda: oh, idea [22:24:37] valhallasw`nuage: nah, the manifests should get rid of the need for this. [22:24:46] mmm right [22:25:12] valhallasw`nuage: so basically manifests need to have this functionality, and then I’ll just write another crontab script that just modifies service.manifest :) [22:25:46] (03CR) 10Hoo man: [C: 031] "This will break all existing pages on test2?wiki that make use of the Item linked...
but I guess we have to do this at some point :/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [22:25:53] (03PS2) 10Yuvipanda: tools: Backup user crontabs on to NFS [puppet] - 10https://gerrit.wikimedia.org/r/208831 (https://phabricator.wikimedia.org/T95798) [22:26:06] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Backup user crontabs on to NFS [puppet] - 10https://gerrit.wikimedia.org/r/208831 (https://phabricator.wikimedia.org/T95798) (owner: 10Yuvipanda) [22:26:40] !log on terbium: running voterList.php again, with corrected edit counts [22:26:48] Logged the message, Master [22:28:04] (03CR) 10Hoo man: [C: 04-1] "The ticket asks for beta, this is testwikidata (in production)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208654 (https://phabricator.wikimedia.org/T97993) (owner: 10JanZerebecki) [22:30:24] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259115 (10coren) It's not immediately clear what could have happened to those files - the pattern does not match any form of usual corruption I've ever seen NFS do when in breaks badly. One pattern that may be o... [22:37:05] !log silver: apache2ctl restart for T98084 [22:37:13] Logged the message, Master [22:39:31] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1259164 (10Krenair) [22:39:47] aude, ^ [22:42:05] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1259175 (10csteipp) We have two security goals for the whitelist, * prevent users from exploiting client-side issues on the cluster * keep the cluster from attacking other websites In this case... 
[22:43:43] 6operations, 6Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1259177 (10yuvipanda) Such whitelisting should also be signed off by labs ops, I'd guess - it's fairly trivial to accidentally take down all of labs from prod :) [22:45:48] yuvipanda, what should we really do about this wikitech sessions issue? :/ [22:45:50] andrewbogott, ^ [22:53:32] Krenair: thanks [22:55:25] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259267 (10scfc) Could these be related to one of the moving instances between virtual servers? I. e., process A gets frozen, meanwhile process B updates the repository, process A gets thawed and is confused abou... [22:56:51] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259268 (10coren) After correlation in time, it turns out that the file timestamp exactly match the period where the NFS server had to be forcibly rebooted without clean unmounts; it is almost certain that the cor... [22:57:15] (03CR) 10Aude: "i realize this will break existing connections to test.wikidata, but is inevitable and needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [23:00:04] RoanKattouw, ^d, RoanKattouw, Legoktm: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150504T2300). Please do the needful. [23:00:33] o/ [23:02:07] who sets the message on jouncebot? 
[23:04:19] Negative24: it just reads the [[Deployments]] page [23:05:05] not the "Respected human" part [23:05:28] (03PS1) 10Hashar: diamond: collectors require python-diamond [puppet] - 10https://gerrit.wikimedia.org/r/208840 [23:05:39] oh that's in the repo [23:06:49] its been changing [23:11:35] Yeah it randomly chooses from a few options [23:11:43] (03PS2) 10Aude: Enable use of subscription tracking on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 [23:12:23] (03CR) 10Aude: "want to run updateSubscriptions for wikis with usage tracking, before enabling this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude) [23:12:36] (03CR) 10Aude: [C: 04-1] Enable use of subscription tracking on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude) [23:13:32] !log catrope Synchronized php-1.26wmf4/includes/skins/SkinTemplate.php: SWAT (duration: 00m 11s) [23:13:38] Logged the message, Master [23:14:04] !log catrope Synchronized php-1.26wmf4/extensions/VisualEditor/: SWAT (duration: 00m 12s) [23:14:07] Logged the message, Master [23:14:41] !log catrope Synchronized php-1.26wmf4/extensions/MassMessage/: SWAT (duration: 00m 12s) [23:14:46] Logged the message, Master [23:15:30] !log catrope Synchronized php-1.26wmf3/extensions/MassMessage/: SWAT (duration: 00m 12s) [23:15:34] Logged the message, Master [23:15:40] ...and that concludes our regularly scheduled programming [23:15:57] SWAT will be on again tomorrow, same time, same channel [23:16:48] Krenair: I’m still not having any trouble with wikitech. Can you tell me exactly what you’re seeing? [23:17:04] andrewbogott, there shouldn't be any issues with wikitech right this moment [23:17:13] apache was restarted there not too long ago [23:17:52] And that fixed it? [23:17:55] Temporarily? [23:18:05] but this is not the first time people have run into session issues on wikitech, solved by someone simply restarting apache [23:18:26] It looked like it. 
Tried to edit, got a session error. Restarted apache. Tried to edit again, worked. [23:20:50] 6operations, 10Wikimedia-Mailing-lists: move analytics-internal list to analytics-wmf - https://phabricator.wikimedia.org/T97618#1259348 (10kevinator) @JohnLewis no it is not really necessary. We can live with keeping analytics-internal as it is. Feel free to decline this task. [23:28:23] (03CR) 10Hoo man: [C: 031] "Ok with me, need to follow the dispatchers closely." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208675 (owner: 10Aude) [23:33:51] (03PS4) 10Dzahn: integration: Apache turn DirectorySlash Off [puppet] - 10https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) [23:34:38] !log catrope Started scap: (no message) [23:34:48] Logged the message, Master [23:35:04] (03CR) 10Dzahn: [C: 032] integration: Apache turn DirectorySlash Off [puppet] - 10https://gerrit.wikimedia.org/r/206460 (https://phabricator.wikimedia.org/T95164) (owner: 10Dzahn) [23:45:44] (03PS1) 10Dzahn: Revert "integration: Apache turn DirectorySlash Off" [puppet] - 10https://gerrit.wikimedia.org/r/208844 [23:47:41] (03CR) 10Dzahn: "same config works for doc.wm but leads to errors on integration.wm:" [puppet] - 10https://gerrit.wikimedia.org/r/208844 (owner: 10Dzahn) [23:48:05] (03CR) 10Dzahn: [C: 032] Revert "integration: Apache turn DirectorySlash Off" [puppet] - 10https://gerrit.wikimedia.org/r/208844 (owner: 10Dzahn) [23:49:04] 6operations, 6Labs, 10Tool-Labs: NFS file corruption - https://phabricator.wikimedia.org/T96488#1259449 (10yuvipanda) p:5High>3Normal @Coren should we check for other files that might be corrupted, or just let things be? [23:53:11] 6operations, 10Wikimedia-Apache-configuration, 5Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#1259451 (10Dzahn) reverted on integration: same config that works on doc.wm , but here: after:

Forbidden You don't have perm... [23:59:13] !log catrope Finished scap: (no message) (duration: 24m 34s) [23:59:22] Logged the message, Master