[00:00:16] * Krinkle doesn't know [00:00:32] AndyRussG: \o/ [00:00:38] (03PS4) 10Andrew Bogott: Grants for labspuppet (user for the labspuppetbackend tool) [puppet] - 10https://gerrit.wikimedia.org/r/309414 [00:00:47] But not until we first try it in beta and in a canary in prod without errors, so it's blocked on CentralNotice as of now :) [00:00:55] I'm guessing CN JS is via RL, and therefore after several minutes or so we're not serving new copies of the old versions, basically [00:01:08] Krinkle: K gotcha [00:01:11] bblack: yep :) [00:03:15] (03PS1) 10Andrew Bogott: Move hiera settings again [puppet] - 10https://gerrit.wikimedia.org/r/309491 [00:04:12] bblack: I'm getting the new code live on prod... Try this in the console: mw.geoIP.getPromise().done( function ( a ) { console.log ( a ); } ); [00:04:42] That should print out the geo object only if the new code is live and working [00:04:59] (03CR) 10Andrew Bogott: [C: 032] Move hiera settings again [puppet] - 10https://gerrit.wikimedia.org/r/309491 (owner: 10Andrew Bogott) [00:06:34] !log hoo@tin Synchronized php-1.28.0-wmf.18/extensions/Wikidata: Don't use multiple return values (T145138) (duration: 02m 24s) [00:06:35] T145138: Wikibase Lua API have breaking change - https://phabricator.wikimedia.org/T145138 [00:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:58] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:10:17] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:10:18] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:12:53] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:12:54] yuvipanda: https://gerrit.wikimedia.org/r/#/c/309491/ worked for some reason [00:13:21] andrewbogott: yay [00:14:13] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:14:25] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:15:44] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:16:25] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:16:28] (03CR) 10BBlack: [C: 032] Remove geoiplookup DNS entries [dns] - 10https://gerrit.wikimedia.org/r/305422 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [00:16:33] (03PS6) 10BBlack: Remove geoiplookup DNS entries [dns] - 10https://gerrit.wikimedia.org/r/305422 (https://phabricator.wikimedia.org/T100902) [00:16:38] (03CR) 10BBlack: [V: 032] Remove geoiplookup DNS entries [dns] - 10https://gerrit.wikimedia.org/r/305422 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [00:17:13] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:17:41] what's with the db/pc puppetfails? 
[00:18:32] whoa suddenly I'm getting extreme slowness on enwiki [00:19:21] Hmmm hopefully it's just my ISP [00:19:45] (03PS10) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [00:19:47] (03PS10) 10BBlack: text VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [00:20:27] andrewbogott: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item labspuppetbackend::mysql_password in any Hiera data file and no default supplied at /etc/puppet/manifests/role/mariadb.pp:50 on node db1047.eqiad.wmnet [00:20:46] ^ it's the labspuppetbackend stuff killing puppet on db/pc hosts [00:21:17] grrrrrrrrrrr [00:21:45] ok hiera, I guess you won't let me put that password in just one place so I'll just put it in two places. [00:21:51] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:06] bblack: better? [00:22:10] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:13] 06Operations, 10ops-eqiad: system WMF3096 lacking details in racktables - https://phabricator.wikimedia.org/T145156#2621647 (10RobH) [00:22:24] AndyRussG: I don't think it's a global problem [00:23:28] andrewbogott: still failing the same way on db1047 [00:23:41] bblack: K just thought I'd say something... seems much faster if I log out btw [00:24:19] oh I'm usually logged out and usually mostly looking at those stats, hmmm [00:24:20] can't find it in any Hiera data file [00:24:26] and yet there it is in eqiad.yaml [00:24:49] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:26:07] (03PS1) 10Andrew Bogott: Remove labspuppet user creation from production-grants-m5.sql.erb [puppet] - 10https://gerrit.wikimedia.org/r/309494 [00:26:27] bblack: ^ [00:27:21] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:27:59] I have no idea about that :) [00:28:21] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:28:30] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:28:56] (03CR) 10Andrew Bogott: [C: 032] Remove labspuppet user creation from production-grants-m5.sql.erb [puppet] - 10https://gerrit.wikimedia.org/r/309494 (owner: 10Andrew Bogott) [00:29:31] 06Operations, 10ops-eqiad: many items in rack 'z1' are lacking info - https://phabricator.wikimedia.org/T145158#2621680 (10RobH) [00:29:42] (03PS11) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [00:29:56] (03CR) 10BBlack: [C: 032 V: 032] Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [00:30:10] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:30:26] bblack: puppet should clear up now [00:32:37] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:32:58] andrewbogott: did for db1047 [00:33:40] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:34:21] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:48] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:58] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:35:50] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:35:59] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:37:19] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:38:20] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [00:39:38] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [00:39:39] 06Operations, 10ops-eqiad: many items in rack 'z1' are lacking info - https://phabricator.wikimedia.org/T145158#2621736 (10RobH) [00:42:18] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:43:18] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:49:21] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:52:01] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:53:40] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 53 
seconds ago with 0 failures [00:53:47] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:54:59] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:57:27] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:58:28] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:02:50] !log aaron@tin Synchronized php-1.28.0-wmf.18/extensions/SpamBlacklist: 56effa952c48725a2665dec72782bc8f7c7915a2 (duration: 00m 49s) [01:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:18] (03PS4) 10Aaron Schulz: Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 [01:05:24] (03CR) 10Aaron Schulz: [C: 032] Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 (owner: 10Aaron Schulz) [01:05:51] (03Merged) 10jenkins-bot: Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 (owner: 10Aaron Schulz) [01:07:49] !log aaron@tin Synchronized tests/Defines.php: (no message) (duration: 00m 46s) [01:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:08:56] (03PS11) 10BBlack: text VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [01:29:11] (03PS1) 10Dereckson: Enable Flow on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309499 (https://phabricator.wikimedia.org/T127792) [01:34:17] (03PS1) 10Aaron Schulz: Increate "descriptionCacheExpiry" as this uses page_touched in the key now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309500 [01:34:38] (03PS2) 10Aaron Schulz: Increate "descriptionCacheExpiry" as this uses page_touched in the key now [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/309500 [01:34:43] (03CR) 10Aaron Schulz: [C: 032] Increate "descriptionCacheExpiry" as this uses page_touched in the key now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309500 (owner: 10Aaron Schulz) [01:35:12] (03Merged) 10jenkins-bot: Increate "descriptionCacheExpiry" as this uses page_touched in the key now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309500 (owner: 10Aaron Schulz) [01:38:19] !log aaron@tin Synchronized wmf-config/filebackend-production.php: Bump description text expiry for files (duration: 00m 46s) [01:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:43:20] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:43:30] (03CR) 10BBlack: [C: 032] text VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [01:46:45] (03PS1) 10BBlack: Trivial bugfix followup to 4a586c3f6 [puppet] - 10https://gerrit.wikimedia.org/r/309501 (https://phabricator.wikimedia.org/T100902) [01:46:59] (03CR) 10BBlack: [C: 032 V: 032] Trivial bugfix followup to 4a586c3f6 [puppet] - 10https://gerrit.wikimedia.org/r/309501 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [01:50:13] PROBLEM - Postgres Replication Lag on maps-test2004 is CRITICAL: CRITICAL - Rep Delay is: 1814.790397 Seconds [01:52:43] RECOVERY - Postgres Replication Lag on maps-test2004 is OK: OK - Rep Delay is: 118.201586 Seconds [01:54:54] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2621906 (10BBlack) [01:54:56] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - 
https://phabricator.wikimedia.org/T143271#2621905 (10BBlack) 05Open>03Resolved [01:55:09] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323297 (10BBlack) 05Open>03Resolved a:03BBlack [02:11:08] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:13:18] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:23:01] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [02:24:57] PROBLEM - Hadoop DataNode on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [02:37:50] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2621981 (10Ottomata) [02:38:38] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:39:36] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 17m 37s) [02:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:42] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2621999 (10Ottomata) Also, megacli shows: ``` $sudo megacli -PDList -aAll ... Enclosure Device ID: 32 Slot Number: 3 ... 
Firmware state: Failed ``` [02:42:35] ACKNOWLEDGEMENT - Hadoop DataNode on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode ottomata https://phabricator.wikimedia.org/T145170 [02:42:39] ACKNOWLEDGEMENT - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) ottomata https://phabricator.wikimedia.org/T145170 [02:45:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Sep 9 02:45:47 UTC 2016 (duration 6m 11s) [02:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:48:48] PROBLEM - Disk space on analytics1032 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%) [02:58:07] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[hadoop-hdfs-datanode] [03:10:45] (03PS2) 10Alex Monk: Remove upload7 references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [03:14:07] 06Operations, 10MediaWiki-extensions-VipsScaler, 10Wikimedia-Site-requests: VIPS scaled thumbnails don't have a comment with a link to the file description page - https://phabricator.wikimedia.org/T71336#2622080 (10Dereckson) @Bawolff So next step is to install pyexiv2 on mediawiki::packages::multimedia? [03:20:40] (03PS1) 10Aaron Schulz: Avoid $wmfMasterDatacenter notices from noc files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309503 (https://bugzilla.wikimedia.org/143785) [03:23:20] 06Operations, 10MediaWiki-extensions-VipsScaler, 10Wikimedia-Site-requests: VIPS scaled thumbnails don't have a comment with a link to the file description page - https://phabricator.wikimedia.org/T71336#2622087 (10Dereckson) a:03Dereckson [ Taking this bug, as I can sheperd it, but I'll mainly rely on @ba... 
[03:23:27] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [03:23:29] (03PS1) 10Dereckson: Sort by alphabetical order mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309504 [03:23:31] (03PS1) 10Dereckson: Install exiv2 to mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) [03:25:50] (03CR) 10Alex Monk: [C: 031] "Scheduled for Monday morning SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [03:27:27] AaronSchulz, that bug line seems wrong? [03:27:42] (03CR) 10Niedzielski: "@hashar, just a heads up that I'm currently working on this again as part of T133183. I swear I had this patch working before! IIRC I even" [puppet] - 10https://gerrit.wikimedia.org/r/264303 (owner: 10Niedzielski) [03:28:07] (03PS2) 10Aaron Schulz: Avoid $wmfMasterDatacenter notices from noc files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309503 (https://phabricator.wikimedia.org/T143785) [03:28:36] (03PS3) 10Aaron Schulz: Avoid $wmfMasterDatacenter notices from noc files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309503 (https://phabricator.wikimedia.org/T143784) [03:28:40] (03PS4) 10Aaron Schulz: Avoid $wmfMasterDatacenter notices from noc files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309503 (https://phabricator.wikimedia.org/T143784) [03:28:43] (03CR) 10jenkins-bot: [V: 04-1] Install exiv2 to mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) [03:28:54] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:29:28] (03CR) 10Aaron Schulz: [C: 032] Avoid $wmfMasterDatacenter notices from noc files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309503 (https://phabricator.wikimedia.org/T143784) (owner: 10Aaron Schulz) [03:29:56] (03Merged) 10jenkins-bot: Avoid $wmfMasterDatacenter notices from noc files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309503 (https://phabricator.wikimedia.org/T143784) (owner: 10Aaron Schulz) [03:31:16] !log aaron@tin Synchronized docroot/noc/db.php: Avoid $wmfMasterDatacenter notices from noc files (duration: 00m 48s) [03:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:32:46] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: Avoid $wmfMasterDatacenter notices from noc files (duration: 00m 46s) [03:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:49:23] (03PS2) 10Dereckson: Install exiv2 to mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) [03:51:09] RECOVERY - Hadoop DataNode on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [03:54:11] RECOVERY - puppet last run on elastic2021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [03:58:49] PROBLEM - Hadoop DataNode on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [03:58:58] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[hadoop-hdfs-datanode] [04:52:20] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [04:55:01] (03CR) 10Brian Wolff: "Multimedia was at one point going to put exiftool on the image scalars in order to use TinyRGB as the colour profile. I'm not sure if they" [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) [05:06:42] (03PS1) 10Urbanecm: Lift of IP cap - WomenInSience [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) [05:06:48] (03PS1) 10Giuseppe Lavagetto: admin: fix documentation of hashuser [puppet] - 10https://gerrit.wikimedia.org/r/309512 [05:07:16] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: fix documentation of hashuser [puppet] - 10https://gerrit.wikimedia.org/r/309512 (owner: 10Giuseppe Lavagetto) [05:11:00] (03PS1) 10Giuseppe Lavagetto: puppetmaster: offline temporarily puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/309513 [05:12:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: offline temporarily puppetmaster2002 [puppet] - 10https://gerrit.wikimedia.org/r/309513 (owner: 10Giuseppe Lavagetto) [05:12:52] (03PS1) 10Urbanecm: Fix ilegal wgFlaggedRevsWhitelist for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309514 (https://phabricator.wikimedia.org/T144673) [05:13:32] (03PS2) 10Urbanecm: Fix ilegal wgFlaggedRevsWhitelist for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309514 (https://phabricator.wikimedia.org/T144673) [05:22:00] (03PS3) 10Urbanecm: Fix illegal wgFlaggedRevsWhitelist for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309514 (https://phabricator.wikimedia.org/T144673) [05:22:33] !log deploying schema change on s5 hosts T139090 [05:22:35] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 
[05:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:23:49] (03CR) 10Dereckson: "Check if the actual page name on wiki contains or not this invalid character." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309514 (https://phabricator.wikimedia.org/T144673) (owner: 10Urbanecm) [05:26:53] (03CR) 10Dereckson: [C: 031] "Okay, الصفحة_الرئيسية matches the name of the Main page on ar." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309514 (https://phabricator.wikimedia.org/T144673) (owner: 10Urbanecm) [05:27:03] (03CR) 10Dereckson: [C: 031] Fix illegal wgFlaggedRevsWhitelist for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309514 (https://phabricator.wikimedia.org/T144673) (owner: 10Urbanecm) [05:28:00] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hadoop/data/e/yarn/logs] [05:34:54] (03CR) 10Dereckson: "libimage-exiftool-perl is currently installed" [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) [05:41:23] (03CR) 10Dereckson: "So does VIPS currently allow to use exiftool or do we need exiv2 pending further extension improvement?" [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) [06:16:42] PROBLEM - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100% [06:30:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See comments, I've already prepared a PS for addressing those." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [06:32:54] (03PS14) 10Giuseppe Lavagetto: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [06:48:25] !log reimaging mw2120-mw2123 to jessie [06:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:06:09] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2622342 (10grin) I respectfully disagree with most of the points, but as it's been said before: I have noted that the topic should be considered c... [07:10:33] (03PS1) 10Giuseppe Lavagetto: service::node: restrict readability of configurations. [puppet] - 10https://gerrit.wikimedia.org/r/309522 [07:17:10] !log puppet disabled on analytics1032, Hadoop services stopped - T145170 [07:17:11] T145170: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170 [07:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:21:42] PROBLEM - Hadoop NodeManager on analytics1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [07:21:51] (I am scheduling downtime) [07:22:38] (03PS1) 10Muehlenhoff: toollabs::proxy: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/309524 [07:24:49] acked all the services, sorry for the page [07:25:56] (brb in 30 mins, I need to commute :) [07:30:21] good morning [07:46:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "-1 just to point out there is a question about exiftool vs exiv2 that needs to be answered before merging" [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) 
[07:47:11] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2622431 (10MoritzMuehlenhoff) I don't think it would cause problems: The I/O performance of the Ganeti clusters should be adequate for deployments (but of course bare metal is still faster). Wrt av... [07:47:36] (03PS2) 10Alexandros Kosiaris: Sort by alphabetical order mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309504 (owner: 10Dereckson) [07:47:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Sort by alphabetical order mediawiki::packages::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/309504 (owner: 10Dereckson) [07:53:33] (03CR) 10Mobrovac: [C: 04-1] "Some comments about auto_refresh :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [08:00:02] (03CR) 10Hashar: [C: 031] "Added Tyler as a reviewer." [puppet] - 10https://gerrit.wikimedia.org/r/308132 (https://phabricator.wikimedia.org/T144043) (owner: 10Muehlenhoff) [08:02:11] (03CR) 10Alexandros Kosiaris: [C: 031] service::node: restrict readability of configurations. 
[puppet] - 10https://gerrit.wikimedia.org/r/309522 (owner: 10Giuseppe Lavagetto) [08:02:37] (03CR) 10Giuseppe Lavagetto: "See comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [08:02:55] (03CR) 10Hashar: "I am not sure what broke, though on beta/integration I have often noticed that puppet 3.4 get confused when the manifests are moved around" [puppet] - 10https://gerrit.wikimedia.org/r/308152 (owner: 10Hashar) [08:04:46] (03CR) 10Mobrovac: [C: 031] "Duh, indeed" [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [08:04:50] (03CR) 10Hashar: [C: 031] Workaround a bug in gerrit on Microsoft Edge [puppet] - 10https://gerrit.wikimedia.org/r/309385 (https://phabricator.wikimedia.org/T145130) (owner: 10Paladox) [08:06:04] PROBLEM - mediawiki-installation DSH group on mw2123 is CRITICAL: Host mw2123 is not in mediawiki-installation dsh group [08:06:33] PROBLEM - mediawiki-installation DSH group on mw2122 is CRITICAL: Host mw2122 is not in mediawiki-installation dsh group [08:06:35] PROBLEM - mediawiki-installation DSH group on mw2121 is CRITICAL: Host mw2121 is not in mediawiki-installation dsh group [08:06:53] PROBLEM - mediawiki-installation DSH group on mw2120 is CRITICAL: Host mw2120 is not in mediawiki-installation dsh group [08:06:54] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 1 minute ago with 8 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [08:09:36] (03CR) 10Hashar: contint: vary ssh from= for prod slave (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [08:09:43] (03PS2) 10Hashar: contint: vary ssh from= for prod slave [puppet] - 10https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) [08:09:57] (03PS2) 10Giuseppe Lavagetto: ci::master: drop legacy definitions [puppet] - 10https://gerrit.wikimedia.org/r/309274 (owner: 10Hashar) [08:15:45] (03CR) 10Giuseppe Lavagetto: [C: 032] ci::master: drop legacy definitions [puppet] - 10https://gerrit.wikimedia.org/r/309274 (owner: 10Hashar) [08:18:02] (03PS2) 10Giuseppe Lavagetto: ci::master: drop mwext-sync leftover [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [08:19:16] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ci::master: drop mwext-sync leftover [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [08:19:25] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:27] (03CR) 10Hashar: "I have manually cleaned up the files on gallium." 
[puppet] - 10https://gerrit.wikimedia.org/r/309274 (owner: 10Hashar) [08:25:45] (03CR) 10Hashar: "Manually cleaned up the files on gallium:" [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [08:27:03] (03PS3) 10Giuseppe Lavagetto: contint: vary ssh from= for prod slave [puppet] - 10https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [08:27:59] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] contint: vary ssh from= for prod slave [puppet] - 10https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [08:31:18] !log reimaging mw2124-mw2127 to jessie [08:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:31:23] (03PS2) 10Giuseppe Lavagetto: zuul: stop logging paramiko [puppet] - 10https://gerrit.wikimedia.org/r/308805 (https://phabricator.wikimedia.org/T137525) (owner: 10Hashar) [08:31:38] <_joe_> hashar: merging ^^ [08:31:48] ah yeah that as well [08:31:50] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] zuul: stop logging paramiko [puppet] - 10https://gerrit.wikimedia.org/r/308805 (https://phabricator.wikimedia.org/T137525) (owner: 10Hashar) [08:32:05] RECOVERY - Disk space on analytics1032 is OK: DISK OK [08:32:05] will restart Zuul eventually [08:32:28] <_joe_> I'm merging it [08:32:39] <_joe_> it's the last puppetswat-suitable patch [08:32:52] <_joe_> the other two will have to wait for monday [08:33:00] <_joe_> I'm not that comfortable with those [08:33:37] the hiera one is a bit scary ( https://gerrit.wikimedia.org/r/#/c/308778/ ) [08:33:42] <_joe_> yes [08:33:44] but ppc is all happy about it and I ran it through rspec [08:33:48] <_joe_> and the other is a followup, right? 
[08:33:51] yeah [08:33:55] should probably merge both [08:34:02] but I was less sure about the second [08:34:28] <_joe_> yeah let me look at those today/monday with some time [08:34:32] in theory both process (server and merger) can use the same config file. But on our setup I have split them so it is easier to understand [08:34:42] <_joe_> I am sure they're correct, but I want to look at the style and all [08:35:06] <_joe_> puppet ran, zuul reloaded [08:35:08] the only thing I have noticed is that PPC think the class 'contint::website' has disappeared which I believe is a bug in ppc [08:35:22] in the old manifest it is invoked with Class { 'contint::website': someparam } [08:35:28] in new it is: include contint::website [08:35:46] but looking at the catalog for the change, the class and resources are realized/included [08:36:21] <_joe_> hashar: I'm updating the deployments page now [08:37:37] also puppet does not notify the service when config files change :D have to manually restart it [08:38:58] <_joe_> hashar: uh? [08:39:15] <_joe_> Notice: /Stage[main]/Zuul::Server/Exec[zuul-reload]: Triggered 'refresh' from 1 events [08:42:26] hey [08:42:38] looks like change to the logging stuf trigger a reload [08:42:45] that is harmless to zuul though [08:43:05] and I am not sure it redo the logging module config on a reload anyway [08:44:59] it does [08:45:04] paramiko log bucket is gone \o/ [08:48:32] _joe_: so the hiera change you dont feel adventurous to land it ? [08:49:05] <_joe_> hashar: I have other things to do (goal related, so higher priority) [08:49:51] well getting rid of gallium is kind of goal as well :D [08:49:59] anyway I will prepare a bunch of patches for contint1001 [08:50:11] at least there is lot of cruft gone now, so that simplify the migration. Thank you! [08:52:02] (03CR) 10Hashar: "When moving the aptly puppet class to autoloader layout the labs puppet had some issue. 
I am not sure how bad this one is going to be, s" [puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [08:54:35] (03PS8) 10Volans: Automation: automatically reimage host [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) [08:55:15] !log reset power on ms-be2019, cpu "soft lockup" [08:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:55:38] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2622515 (10elukey) kern.log, syslog and jmxtrans kept getting errors logged ending up filling the disks, the major cause seemed to be a "du" process launched by the "hdf... [08:57:16] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2622517 (10Gehel) Some notes of a discussion with @EBernhardson and @dcausse: * Our most promising option is probably tuning... 
[08:59:24] RECOVERY - salt-minion processes on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:59:52] RECOVERY - DPKG on ms-be2019 is OK: All packages OK [08:59:52] RECOVERY - swift-object-auditor on ms-be2019 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:59:52] RECOVERY - swift-account-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:59:54] RECOVERY - swift-object-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:59:54] RECOVERY - Disk space on ms-be2019 is OK: DISK OK [09:00:04] RECOVERY - swift-account-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:00:14] RECOVERY - swift-object-server on ms-be2019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:00:14] RECOVERY - swift-container-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:00:32] RECOVERY - HP RAID on ms-be2019 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [09:00:53] RECOVERY - Check size of conntrack table on ms-be2019 is OK: OK: nf_conntrack is 10 % full [09:00:53] RECOVERY - swift-object-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:01:05] RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 29.08, 11.99, 4.50 [09:01:22] RECOVERY - MD RAID on ms-be2019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:01:45] RECOVERY - SSH on ms-be2019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [09:01:55] RECOVERY - configured eth on ms-be2019 is OK: OK - 
interfaces up [09:02:00] !log reimage ms-be1022 - T140597 [09:02:01] T140597: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597 [09:02:03] RECOVERY - dhclient process on ms-be2019 is OK: PROCS OK: 0 processes with command name dhclient [09:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:22] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:03:03] RECOVERY - swift-account-reaper on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:03:11] !log reimage mw2128->mw2131 to Jessie [09:03:13] RECOVERY - swift-account-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:28] (03PS2) 10Gehel: LVS configuration for maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/309357 (https://phabricator.wikimedia.org/T142393) [09:03:33] RECOVERY - swift-container-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:03:44] RECOVERY - NTP on ms-be2019 is OK: NTP OK: Offset 1.609325409e-05 secs [09:03:52] RECOVERY - swift-container-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:04:12] RECOVERY - swift-container-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:05:04] (03CR) 10Gehel: [C: 032] LVS configuration for maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/309357 (https://phabricator.wikimedia.org/T142393) (owner: 10Gehel) [09:05:22] !log deploying new LVS configuration for kartotherian.svc.eqiad.wmnet [09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, 
Master [09:06:53] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [09:07:34] RECOVERY - mediawiki-installation DSH group on mw2123 is OK: OK [09:07:53] RECOVERY - mediawiki-installation DSH group on mw2122 is OK: OK [09:08:04] RECOVERY - mediawiki-installation DSH group on mw2121 is OK: OK [09:08:14] RECOVERY - mediawiki-installation DSH group on mw2120 is OK: OK [09:09:18] 06Operations, 06Labs: Puppet broken on labcontrol1002 - https://phabricator.wikimedia.org/T145185#2622533 (10Volans) [09:16:54] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [09:25:24] !log restarting pybal on lvs1003 [09:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:27:04] (03CR) 10Muehlenhoff: "Two comments, this is looking good overall, I think I'll test this with a reimage next week." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [09:28:49] (03CR) 10Volans: Automation: automatically reimage host (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [09:30:05] (03CR) 10DCausse: "I'm good to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/307652 (https://phabricator.wikimedia.org/T127788) (owner: 10EBernhardson) [09:33:37] (03PS3) 10Gehel: Report partial result from mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/307652 (https://phabricator.wikimedia.org/T127788) (owner: 10EBernhardson) [09:35:03] PROBLEM - mediawiki-installation DSH group on mw2125 is CRITICAL: Host mw2125 is not in mediawiki-installation dsh group [09:35:06] (03CR) 10Gehel: [C: 032] Report partial result from mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/307652 (https://phabricator.wikimedia.org/T127788) (owner: 10EBernhardson) [09:35:34] PROBLEM - mediawiki-installation DSH group on mw2124 is CRITICAL: Host mw2124 is not in mediawiki-installation dsh group [09:35:35] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 12 seconds ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [09:36:31] (03PS2) 10Elukey: Add the pivot.wikimedia.org domain [dns] - 10https://gerrit.wikimedia.org/r/309312 (https://phabricator.wikimedia.org/T138262) [09:36:44] mw2127 is mine, silencing it [09:37:19] no not true sorry, I have mw2128 -> 31, but anyway is is a host being reimaged [09:37:47] added downtime [09:53:05] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:56:35] 06Operations: graphite-web cronspam - https://phabricator.wikimedia.org/T144797#2622629 (10elukey) We could also think to add https://github.com/graphite-project/graphite-web/commit/78cee79623ffc4c89665c072aa8948d6bd2927f8 to the package but afaics we should also create a jessie-wikimedia version. 
[09:59:31] (03CR) 10Giuseppe Lavagetto: "https://docs.puppet.com/puppet/3/reference/lang_variables.html#facts-and-built-in-variables this fact wasn't needed at all." [puppet] - 10https://gerrit.wikimedia.org/r/307086 (https://phabricator.wikimedia.org/T142531) (owner: 10Andrew Bogott) [10:02:43] (03PS1) 10Giuseppe Lavagetto: base: remove useless fact ::puppetmastername [puppet] - 10https://gerrit.wikimedia.org/r/309543 [10:03:00] !log gehel@palladium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=maps,service=kartotherian [10:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:40] !log reimage mw2132 to Jessie [10:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:00] (03PS1) 10Elukey: Test graphite-web's internal logrotation [puppet] - 10https://gerrit.wikimedia.org/r/309544 (https://phabricator.wikimedia.org/T144797) [10:05:06] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [10:05:27] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [10:05:39] elukey: ??? 
^^^ [10:05:47] (03PS2) 10Gehel: Maps LVS - activate icinga check [puppet] - 10https://gerrit.wikimedia.org/r/304221 (https://phabricator.wikimedia.org/T142393) [10:06:41] volans: yes yes [10:06:57] now I can schedule downtime [10:06:58] :) [10:08:57] (03CR) 10Gehel: [C: 032] Maps LVS - activate icinga check [puppet] - 10https://gerrit.wikimedia.org/r/304221 (https://phabricator.wikimedia.org/T142393) (owner: 10Gehel) [10:15:34] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10elukey) @Dereckson,... [10:18:58] !log reimaging mw2080, mw2083-mw2085 to jessie [10:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:26] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:26:10] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:30:16] (03Abandoned) 10Elukey: Test graphite-web's internal logrotation [puppet] - 10https://gerrit.wikimedia.org/r/309544 (https://phabricator.wikimedia.org/T144797) (owner: 10Elukey) [10:36:32] RECOVERY - mediawiki-installation DSH group on mw2125 is OK: OK [10:37:04] RECOVERY - mediawiki-installation DSH group on mw2124 is OK: OK [10:38:58] 06Operations, 13Patch-For-Review: graphite-web cronspam - https://phabricator.wikimedia.org/T144797#2622710 (10elukey) I was wrong: http://launchpadlibrarian.net/207471033/graphite-web_0.9.12+debian-7_0.9.13+debian-1.diff.gz in the package there is a patch called "remove_internal_logrotate.patch", so this is... [10:40:56] (03CR) 10Elukey: "Any update on this one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [10:42:13] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:43:20] 06Operations, 05Continuous-Integration-Scaling, 07Nodepool, 07WorkType-NewFunctionality: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#2622714 (10hashar) The commit just after the version we run ( 9e2937cedc9360e45c70f9038dca6dd44b0c6460 ) just ha... [10:48:17] 06Operations, 06Discovery, 06Maps, 10Maps-data, and 2 others: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2622721 (10Gehel) [10:48:20] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: Configure LVS in front of maps100? servers - https://phabricator.wikimedia.org/T142393#2622719 (10Gehel) 05Open>03Resolved LVS configuration deployed, host have been pooled. I checked that the service responds correctly. Thi... [10:55:13] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [10:59:20] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2622726 (10fgiunchedi) @Cmjohnson still seeing the same error on sdb, though I noticed it happens only after a reboot. If the server is powered down and then powered back on I don't see... [11:01:23] (03CR) 10Mobrovac: "@Elukey, Giuseppe's version is in I4c330e1e71e" [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [11:01:52] (03CR) 10Mobrovac: [C: 031] service::node: restrict readability of configurations. [puppet] - 10https://gerrit.wikimedia.org/r/309522 (owner: 10Giuseppe Lavagetto) [11:02:50] mobrovac: super, I think we can close 303846? 
[11:03:09] yup, think so elukey [11:04:42] (03Abandoned) 10Elukey: Install config owned by dedicated user [puppet] - 10https://gerrit.wikimedia.org/r/303846 (owner: 10Eevans) [11:04:48] PROBLEM - mediawiki-installation DSH group on mw2080 is CRITICAL: Host mw2080 is not in mediawiki-installation dsh group [11:04:48] PROBLEM - mediawiki-installation DSH group on mw2084 is CRITICAL: Host mw2084 is not in mediawiki-installation dsh group [11:04:56] PROBLEM - dhclient process on mw2083 is CRITICAL: Timeout while attempting connection [11:04:57] PROBLEM - nutcracker port on mw2085 is CRITICAL: Timeout while attempting connection [11:05:17] PROBLEM - mediawiki-installation DSH group on mw2083 is CRITICAL: Host mw2083 is not in mediawiki-installation dsh group [11:05:17] PROBLEM - HHVM jobrunner on mw2085 is CRITICAL: Connection timed out [11:05:28] PROBLEM - nutcracker process on mw2085 is CRITICAL: Timeout while attempting connection [11:05:28] PROBLEM - nutcracker port on mw2080 is CRITICAL: Timeout while attempting connection [11:05:28] PROBLEM - nutcracker port on mw2084 is CRITICAL: Timeout while attempting connection [11:05:36] PROBLEM - HHVM jobrunner on mw2080 is CRITICAL: Connection timed out [11:05:37] PROBLEM - HHVM jobrunner on mw2084 is CRITICAL: Connection timed out [11:05:38] new jobrunners [11:05:58] PROBLEM - HHVM jobrunner on mw2083 is CRITICAL: Connection timed out [11:06:06] PROBLEM - nutcracker process on mw2080 is CRITICAL: Timeout while attempting connection [11:06:06] PROBLEM - nutcracker port on mw2083 is CRITICAL: Timeout while attempting connection [11:06:06] PROBLEM - nutcracker process on mw2084 is CRITICAL: Timeout while attempting connection [11:06:06] PROBLEM - puppet last run on mw2085 is CRITICAL: Timeout while attempting connection [11:06:29] PROBLEM - puppet last run on mw2080 is CRITICAL: Timeout while attempting connection [11:06:29] PROBLEM - salt-minion processes on mw2085 is CRITICAL: Timeout while attempting connection [11:06:37] 
moritzm: silenced them [11:06:38] PROBLEM - salt-minion processes on mw2080 is CRITICAL: Connection refused by host [11:06:49] mmm not 2080 [11:06:58] done [11:07:17] RECOVERY - dhclient process on mw2083 is OK: PROCS OK: 0 processes with command name dhclient [11:07:47] RECOVERY - nutcracker port on mw2080 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:07:47] RECOVERY - nutcracker port on mw2084 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:07:47] RECOVERY - nutcracker process on mw2085 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [11:08:28] RECOVERY - nutcracker port on mw2083 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:08:28] RECOVERY - nutcracker process on mw2080 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [11:08:28] RECOVERY - nutcracker process on mw2084 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [11:08:57] RECOVERY - salt-minion processes on mw2085 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:09:17] RECOVERY - salt-minion processes on mw2080 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:09:35] me too, thanks :-) [11:09:47] RECOVERY - nutcracker port on mw2085 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:12:39] Nikerabbit, can you please tell me in which ways you are helping T69223 happen? 
[11:12:39] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [11:13:27] (03CR) 10Filippo Giunchedi: "recheck" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309317 (owner: 10Gilles) [11:14:01] "This has been fixed in Translate long time ago" is not a helpful response to our users [11:14:35] "look at the git code" seems to me rather unhelpful [11:16:06] hashar: (going to lunch) but how do we configure debian-jenkins-glue to take additional parameters like backports or use wikimedia repos? for https://gerrit.wikimedia.org/r/#/c/309317/ [11:16:21] The user did not seem dissatisfied with the response. [11:17:15] 06Operations: puppet run stopping qrunner on fermium - https://phabricator.wikimedia.org/T144933#2622744 (10MoritzMuehlenhoff) It's a problem in the puppet module: It hardcodes a number of debconf choices, but e.g. Asturian is not available when running "dpkg-reconfigure mailman" manually. This possibly depends... [11:17:37] RECOVERY - HHVM jobrunner on mw2085 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.084 second response time [11:18:01] RECOVERY - HHVM jobrunner on mw2084 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.086 second response time [11:18:01] RECOVERY - HHVM jobrunner on mw2080 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.084 second response time [11:18:16] RECOVERY - HHVM jobrunner on mw2083 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.086 second response time [11:18:17] Nemo_bis, I think he is rather unsatidfied: https://phabricator.wikimedia.org/T69223#2622513 [11:18:47] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:09] he wants to do something and he/she cannot [11:20:27] jynus: yes, many people are dissatisfied with T69223 not being fixed. 
[11:20:28] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [11:20:37] This has nothing to do with the responses they get from the developers. [11:21:04] a) the response is not adequate [11:21:33] b) it doesn't help get fixed faster [11:21:34] On https://phabricator.wikimedia.org/T145039#2622516 he expressed a desire for an ETA for T69223, do you know who can provide one? [11:21:35] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [11:22:05] a) Debatable, b) nobody claimed it would, AFAIK, it's just informing about the status quo. [11:22:11] Nemo_bis, you think your answer is also constructive? [11:22:28] Providing correct information is constructive in my opinion, yes. [11:22:41] is your information accurate, do you think? [11:22:49] or maybe biased and partial? [11:22:56] It's also helpful for MediaWiki users to know that MediaWiki works correctly on other clusters (so that they decide where to host documents). [11:23:01] like how you contributed to make that longer [11:23:03] Perfectly accurate [11:23:17] I have no impact on schema changes [11:23:24] oh, you have [11:23:38] you blocked me for the last 1 year, I think [11:24:12] Sounds like today is blame wheel day [11:24:20] blame? [11:24:35] no, I am asking you to please do more constructive comments [11:24:43] and helping get issues fixed faster [11:24:45] My comments are extremely constructive [11:24:52] I can give you better ways [11:24:58] Your accusations that I blocked a fix are insulting [11:25:02] to contribute to that, if you can [11:25:16] And it's extremely inappropriate for a WMF employee to insult a volunteer [11:25:25] what? [11:25:35] in which ways I am insulting? [11:25:41] But I don't care.
I just don't accept attitude lessons from you, so don't even get started [11:26:16] anyway [11:26:53] I only asked Nikerabbit to please not say "it is fixed on git", when it is not deployed [11:27:03] I mean, you can say it [11:27:12] but explain the whole story [11:27:20] it is not yet available, etc. [11:29:04] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [11:29:09] or how the mediawiki architecture comittee mentioned that it was indeed an error to create a feature without a proper deployment plan [11:29:10] (03PS15) 10Giuseppe Lavagetto: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [11:29:58] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/pybal-check] [11:31:16] RECOVERY - puppet last run on mw2085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:31:51] <_joe_> mobrovac: I'm merging the patch, disabling puppet on rb/scb first just to go the safe route [11:33:50] kk [11:35:30] jynus: the feature is deployed and in use by other wikis [11:35:44] the user did not ask about what errors were made or by whom [11:35:54] The user is only interested in an ETA. [11:36:22] So the only useful matter to discuss is who can provide an ETA, or how. [11:36:47] which Nikerabbit did not provide [11:36:54] <_joe_> restbase is a noop, moving to scb [11:37:05] jynus: do you want him to decide an ETA? 
[11:37:11] <_joe_> I guess this discussion is better suited for phabricator or a private chat [11:37:22] #wikimedia-tech would have been a better channel, yes [11:37:28] <_joe_> surely not this channel [11:37:32] I want him to be involved with the deployment process and ask if he doesn't know, yes [11:37:35] <_joe_> thanks :) [11:39:33] <_joe_> mobrovac: all ok, it's a noop as we expected [11:40:07] \o/ [11:40:21] _joe_: are we brave enough to switch mathoid to scap config deploys today? [11:40:30] <_joe_> nah [11:40:33] and hence ensure the patch we've just put up actually works [11:40:40] <_joe_> not on friday with this connection [11:40:45] <_joe_> sorry :( [11:40:53] <_joe_> i might drop off at any time [11:41:16] heh, if i were to judge moves based on my conn, i would be on holidays until the end of the month :) [11:41:24] kk, we'll do it monday [11:43:04] (03CR) 10Hashar: "Filippo asked on IRC:" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309317 (owner: 10Gilles) [11:46:38] godog: I have replied on the task with more details [11:46:51] godog: the TLDR is : lack of documentation (bug:1) [11:47:22] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2622824 (10BBlack) >>! In T144508#2622342, @grin wrote: >> it would take me pages just to explain in depth the huge range of problems with that se...
[11:54:16] (03PS1) 10Muehlenhoff: Mark mw2017 and mw2099 as codfw test app servers [puppet] - 10https://gerrit.wikimedia.org/r/309554 [11:54:58] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:05:35] !log reimaging mw2133-mw2136 to jessie [12:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:06:20] RECOVERY - mediawiki-installation DSH group on mw2080 is OK: OK [12:06:20] RECOVERY - mediawiki-installation DSH group on mw2084 is OK: OK [12:06:48] RECOVERY - mediawiki-installation DSH group on mw2083 is OK: OK [12:07:18] godog: akosiaris: gilles: I have added some basic doc about the Debian package builder on Jenkins https://wikitech.wikimedia.org/wiki/Debian_Glue [12:07:22] long overdue [12:07:27] should announce it on ops list probably [12:07:33] or wherever else [12:15:19] did to ops-l [12:20:49] (03PS2) 10Mobrovac: Mathoid: Use Scap3 to deploy the config [puppet] - 10https://gerrit.wikimedia.org/r/308574 (https://phabricator.wikimedia.org/T144755) [12:22:28] <_joe_> mobrovac: how would you deploy that? [12:22:41] what _joe_? mathoid? [12:22:51] <_joe_> I mean first you need to deploy mathoid I guess, with the config included [12:23:01] <_joe_> then you run puppet? [12:23:05] <_joe_> or the reverse? 
[12:23:07] no, vice-versa [12:23:16] first run puppet, then deploy [12:23:20] <_joe_> ok so we'd need to do that one-by-one [12:23:26] when puppet runs, the config will still be there in place [12:23:35] <_joe_> because puppet will try to call scap deploy-whatever [12:23:52] <_joe_> I see a potential race condition there [12:24:21] <_joe_> just because we're doing this transition [12:24:31] <_joe_> so prolly the right thing to do would be [12:24:57] <_joe_> uhm no, in fact, given the config file is not absented by us [12:25:03] well, scap will try to do the config deploy, but we can disable it on tin first, rendering the config-deploy phase a no-op [12:25:05] <_joe_> that should work [12:25:08] and then force a deploy from tin [12:25:15] <_joe_> or [12:25:25] <_joe_> we upload the new code to deploy to tin [12:25:29] <_joe_> including the config [12:25:33] <_joe_> then we run puppet [12:25:47] trying it out in labs now [12:25:52] <_joe_> that should work as long as the change is just a config change [12:26:10] <_joe_> mobrovac: does deployment-tin let you in today? :D [12:26:15] haha [12:26:18] it does, magically [12:27:26] _joe_: Failed to call refresh: /usr/local/bin/apply-config-mathoid returned 70 instead of one of [0] [12:27:27] :/ [12:27:40] CalledProcessError: Command 'mkdir -p '/srv/deployment/mathoid/deploy-cache/revs/9937e0fec4a26641b74be6709f17f21e236b193f/.git/config-files'' returned non-zero exit status 1 [12:27:45] !log reimaging mw213[789] and mw2075 to Jessie [12:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:53] the dir exists already so scap deploy-local dies [12:28:00] so that's a scap bug [12:28:07] i guess we're blocked now on that [12:29:16] <_joe_> which dir? [12:29:22] <_joe_> /etc/something? 
[12:29:33] <_joe_> so it won't work ever [12:29:49] <_joe_> sorry, gtg to lunch now [12:30:23] (03PS3) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) [12:31:28] (03CR) 10jenkins-bot: [V: 04-1] prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [12:35:25] _joe_: see above, in /srv/deployment/... [12:37:06] 06Operations: Remove mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2608981 (10elukey) @Joe can we proceed with the decom or would it be better to wait for the replacements? [12:37:43] 06Operations: Remove mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2622881 (10MoritzMuehlenhoff) We discussed that on IRC, I'll decom these. [12:38:51] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 185 bytes in 1.419 second response time [12:42:50] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2622884 (10elukey) @Papaul would you mind to check mw2075? 
I wasn't able to run wmf-reimage because of IPMI errors :( [12:43:26] (03CR) 10Elukey: [C: 032] Add the pivot.wikimedia.org domain [dns] - 10https://gerrit.wikimedia.org/r/309312 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [12:45:28] !log running authdns-update on ns0.w.o to pick up the new domain pivot.wikimedia.org (T138262) [12:45:29] T138262: Productionize Pivot UI - https://phabricator.wikimedia.org/T138262 [12:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:46:01] (03CR) 10Mobrovac: "PCC not particularly interesting - https://puppet-compiler.wmflabs.org/4034/" [puppet] - 10https://gerrit.wikimedia.org/r/308574 (https://phabricator.wikimedia.org/T144755) (owner: 10Mobrovac) [13:02:33] PROBLEM - Apache HTTP on mw2135 is CRITICAL: Connection timed out [13:02:53] PROBLEM - Apache HTTP on mw2134 is CRITICAL: Connection timed out [13:03:25] PROBLEM - Apache HTTP on mw2133 is CRITICAL: Connection timed out [13:03:25] this is me [13:03:51] PROBLEM - Apache HTTP on mw2132 is CRITICAL: Connection timed out [13:04:00] no ok not me but hosts reimaged [13:04:03] silencing [13:04:24] just did that [13:04:37] elukey: moritzm: thanks 2132 is mine [13:04:57] moritzm: you are too quick :P [13:05:05] volans: how is the reimage going? [13:05:29] running wmf_reimage now, had some issues before, let's see [13:06:06] elukey: icinga is just too slow... [13:06:22] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2622966 (10Cmjohnson) @fgiunchedi I will push HP for a new system board but I am beginning to think that this is not h/w related.
I will update once I speak with their tech support [13:07:12] RECOVERY - Apache HTTP on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [13:07:32] RECOVERY - Apache HTTP on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.076 second response time [13:08:12] RECOVERY - Apache HTTP on mw2133 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.074 second response time [13:08:37] moritzm: 4 servers to go (except the decoms and the video-scaler) and codfw done? [13:10:03] elukey: please leave at least 1 more for testing the script ;) [13:11:11] RECOVERY - Apache HTTP on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.078 second response time [13:11:16] 06Operations, 07RfC: #Varnish being used although archived - https://phabricator.wikimedia.org/T142244#2622969 (10Aklapper) I'm not aware of users confused by archived projects left associated to tasks - if there are any such reports, links are welcome so potential approaches could be discussed (e.g. I could i... [13:11:43] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 12 logical, 13 physical [13:12:44] this is weird [13:12:59] cmjohnson1: Hi! Did you do anything to analytics1032 by any chance? [13:13:13] yes....that's me [13:13:15] elukey: plus mira and wasat and mw2017, be generally yes [13:13:18] replacing the disk [13:13:36] volans: you can still just reimage a jessie system to jessie [13:13:55] yeah of course :) [13:15:01] cmjohnson1: thanksssssss! [13:15:08] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2622972 (10Cmjohnson) The disk on analytics1032 is failed. Replaced the failed disk, cleared the cache, added the disk back and all disks are back online @analytics1032... 
[13:15:09] elukey: new disk installed and disks back online [13:15:35] cmjohnson1: super, going to create the partition and bring it up again [13:15:41] (03PS1) 10Muehlenhoff: Update to 4.4.20 [debs/linux44] - 10https://gerrit.wikimedia.org/r/309560 [13:19:07] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2622982 (10Dereckson) The T1377... [13:23:24] <_joe_> /win 17 [13:23:44] RECOVERY - Hadoop NodeManager on analytics1032 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [13:24:55] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:25:16] !log analytics1032 back in service after disk swap [13:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:29] hashar: thanks for the documentation on wikitech! [13:26:14] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2622996 (10elukey) 05Open>03Resolved created the partition, rebooted the host since we don't have UUIDs and enabled puppet. All good! Thanks @Cmjohnson! [13:33:01] godog: it is very rough [13:33:44] godog: Alexandros would know a bit more about the pbuilder hooks we use. For the rest http://jenkins-debian-glue.org/docs/ is an overflow of infos :D [13:36:02] hashar: yeah no worries, it is enough to get started, though what about BACKPORTS, not supported yet? 
[13:36:37] not sure when and how we should rely on it [13:37:02] upstream uses a special distro jessie-backports [13:37:06] we have jessie-wikimedia [13:37:41] but maybe we can always inject the backports [13:38:14] PROBLEM - mediawiki-installation DSH group on mw2075 is CRITICAL: Host mw2075 is not in mediawiki-installation dsh group [13:38:18] else the solution is to manually inject BACKPORTS=yes from Zuul. So every single repo that needs it would need to be listed in the Zuul config [13:38:26] ah yeah so there's also a mismatch in what I meant by backports, backports-the-component of jessie-wikimedia vs jessie-backports the distro, I needed the latter [13:38:46] I haven't tested [13:38:53] hashar: ok I'll poke at it a bit and let you know if I get stuck [13:38:57] and I don't think our hooks would support jessie-backports [13:39:11] in theory in such a case it should use the 'jessie' cowimage and inject the apt component for backports [13:39:13] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [13:39:32] and maybe we can come up with something ugly like jessie-wikimedia-backports ;] [13:40:22] anyway modules/package_builder/templates/pbuilderrc.erb might be the place to hack [13:50:49] (03PS2) 10Ottomata: Update camus jar version [puppet] - 10https://gerrit.wikimedia.org/r/309323 (https://phabricator.wikimedia.org/T144716) (owner: 10Joal) [13:51:53] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2623030 (10Cmjohnson) I used an on-site spare to swap the disk, ordered a new one from Dell. Congratulations: Work Order SR935921121 was successfully submitted.
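Editor's sketch of the jessie-backports idea discussed above: take the distribution name, strip a trailing '-backports', build against the base image and switch backports on. This is only an illustration under stated assumptions. The function names and the returned keys are hypothetical, and it is not the actual pbuilderrc.erb or jenkins-debian-glue logic; only the BACKPORTS=yes setting and the distribution names come from the conversation.

```python
# Hypothetical illustration only; not the real package_builder code.
BACKPORTS_SUFFIX = "-backports"


def split_distribution(distribution):
    """Split a distribution name into (base_distribution, use_backports).

    'jessie-backports' -> ('jessie', True)
    'jessie-wikimedia' -> ('jessie-wikimedia', False)
    """
    if distribution.endswith(BACKPORTS_SUFFIX):
        base = distribution[: -len(BACKPORTS_SUFFIX)]
        if not base:
            raise ValueError("distribution has no base name: %r" % distribution)
        return base, True
    return distribution, False


def build_settings(distribution):
    """Return the (assumed) settings a build wrapper would export."""
    base, use_backports = split_distribution(distribution)
    settings = {"base_distribution": base}
    if use_backports:
        # Mirrors the BACKPORTS=yes switch mentioned in the chat.
        settings["BACKPORTS"] = "yes"
    return settings
```

With this shape, a repo would opt in just by naming jessie-backports as its distribution, instead of being listed one by one in the Zuul config.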
[13:52:04] (03PS1) 10Hashar: package_builder: support '-backports' in distribution [puppet] - 10https://gerrit.wikimedia.org/r/309568 [13:52:54] (03CR) 10Hashar: "Will be helpful in case one want to build against jessie-backports. Simply mention it in the debian/changelog distribution and Debian Glue" [puppet] - 10https://gerrit.wikimedia.org/r/309568 (owner: 10Hashar) [13:52:57] godog: ^^ [13:53:10] probably want alexandros to review it first [13:53:36] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Disk sde likely failing on analytics1032 - https://phabricator.wikimedia.org/T145170#2623032 (10Ottomata) Thanks you two! So much action between my bedtime and my coffee! :D :D [13:59:33] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2623036 (10Papaul) IPMI over LAN is enable on mw2075. [14:01:52] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 109, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/9: down - db1008BR [14:03:08] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2623039 (10Cmjohnson) @Jgreen frauth1001 is cabled to pfw1 2/0/9 [14:03:21] hashar: thanks! [14:03:30] cmjohnson1, Jeff_Green ^^^ is db1008 being down planned? [14:03:58] paravoid: yes, but where did it show up? [14:04:01] paravoid: yes [14:04:08] interface alert on the router [14:04:13] ah [14:04:22] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 111, down: 0, dormant: 0, excluded: 2, unused: 0 [14:04:25] it's been replaced with frdb1001 [14:04:39] frdb1001 or frauth1001 jeff_green? 
[14:04:54] task is frauth1001 [14:04:55] the interface description on the pfw wasn't changed [14:05:02] db1008's db service has been moved to frdb1001 [14:05:18] and we need to borrow that port to image the next box, since it's the only free interface left [14:05:45] I don't think it is? [14:05:54] I think I found a few free ports during that audit [14:06:08] according to that survey it's the only one free [14:06:53] the port survey is at boron:/a/fundraising_system_documentation/20160726_pfw_port_survey.txt [14:07:07] https://phabricator.wikimedia.org/T141363#2496417 [14:07:21] ge-0/0/2, ge-0/0/3, ge-9/0/2, ge-9/0/3 are all free [14:07:45] hmmm, ok even better :-) [14:07:55] as are ge-2/0/14, ge-2/0/15, ge-9/0/14, ge-2/0/15 [14:07:57] er [14:08:11] as ge-2/0/14, ge-2/0/15, ge-11/0/14, ge-11/0/15 [14:08:13] I thought we had concluded they weren't usable for some reason [14:08:29] they are, with the caveat of the last paragraph in the task above [14:10:20] ge-0/* is pfw1 and ge-9/* is pfw2? [14:10:59] ge-0 and ge-2 are pfw1, ge-9 and ge-11 are pfw2 [14:11:10] ok [14:11:23] we have 8 free ports atm [14:12:53] i see 0/0/2, 0/0/3, 9/0/2, 9/0/3, 2/0/9 (db1008), what are the others? [14:13:10] ge-2/0/14, ge-2/0/15, ge-11/0/14, ge-11/0/15 [14:13:21] ah, connected but not configured, ok [14:14:37] ok. i guess for now let's leave db1008's port alone, I want to use that box to do some VM prototyping for possible dev use, before we scrap it [14:15:06] does it being down trigger an alert? [14:17:23] yes, but that's fixable [14:17:33] (by adding "no-mon" to the interface description) [14:18:11] (03PS1) 10Muehlenhoff: Decom mw2061-mw2074 [puppet] - 10https://gerrit.wikimedia.org/r/309572 (https://phabricator.wikimedia.org/T144745) [14:18:13] done [14:18:15] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.20 [debs/linux44] - 10https://gerrit.wikimedia.org/r/309560 (owner: 10Muehlenhoff) [14:18:15] oic. 
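Editor's sketch of the port bookkeeping above: it groups the ports reported as free by firewall, using the rule stated in the chat (ge-0 and ge-2 are pfw1, ge-9 and ge-11 are pfw2). The port list and the mapping are taken from the conversation; the function names are hypothetical, and the code does not query any device.

```python
import re

# First number of the interface name -> firewall, as stated in the chat.
SLOT_TO_FIREWALL = {0: "pfw1", 2: "pfw1", 9: "pfw2", 11: "pfw2"}

# Ports reported as free in the chat (after the self-correction).
FREE_PORTS = [
    "ge-0/0/2", "ge-0/0/3", "ge-9/0/2", "ge-9/0/3",
    "ge-2/0/14", "ge-2/0/15", "ge-11/0/14", "ge-11/0/15",
]

PORT_RE = re.compile(r"^ge-(\d+)/(\d+)/(\d+)$")


def firewall_for(port):
    """Return the firewall ('pfw1' or 'pfw2') that a ge-X/Y/Z port sits on."""
    match = PORT_RE.match(port)
    if match is None:
        raise ValueError("not a ge-X/Y/Z interface name: %r" % port)
    slot = int(match.group(1))
    if slot not in SLOT_TO_FIREWALL:
        raise ValueError("unknown slot %d in %r" % (slot, port))
    return SLOT_TO_FIREWALL[slot]


def group_by_firewall(ports):
    """Group interface names by firewall, preserving input order."""
    grouped = {}
    for port in ports:
        grouped.setdefault(firewall_for(port), []).append(port)
    return grouped
```

Grouping FREE_PORTS this way gives four ports on each firewall, which matches the "8 free ports" count, and ge-2/0/9 (the old db1008 port) resolves to pfw1.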
we need a port for frauth1001 and chris already cabled it, so I guess we could instead just change the hostname on the interface [14:18:30] it's already in the correct vlan [14:18:41] as you guys wish, but cmjohnson1: please remember to update port descriptions when reshuffling servers [14:18:45] too confusing otherwise :) [14:19:29] fwiw the shuffling just happened this AM, it's been a little chaotic since I've been operating on the assumption we are down to one free port [14:20:52] maybe we should disconnect 2/0/14, 2/0/15, 11/0/14, 11/0/15 now so it's a little clearer visually what's in use too [14:21:29] 06Operations, 10ops-eqiad: system WMF3096 lacking details in racktables - https://phabricator.wikimedia.org/T145156#2623070 (10Cmjohnson) That server was one of the original R610's I updated ST and warranty expiration [14:22:14] 06Operations, 10ops-eqiad: system WMF3096 lacking details in racktables - https://phabricator.wikimedia.org/T145156#2623071 (10Cmjohnson) We should decom this server [14:22:32] paravoid while we're on the subject of pfws, i'm pondering how to image the replacement LVS servers [14:24:26] do you know if I can bring up additional lvs servers in the same network without disrupting the two that are in service, as long as we don't modify the BGP config on the pfw to pay attention to them? [14:24:36] (03CR) 10Eevans: [C: 031] service::node: restrict readability of configurations. 
[puppet] - 10https://gerrit.wikimedia.org/r/309522 (owner: 10Giuseppe Lavagetto) [14:30:44] (03CR) 10Muehlenhoff: "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/309267 (https://phabricator.wikimedia.org/T144938) (owner: 10Filippo Giunchedi) [14:30:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] package_builder: support '-backports' in distribution (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309568 (owner: 10Hashar) [14:31:03] (03CR) 10Muehlenhoff: "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/309279 (https://phabricator.wikimedia.org/T144938) (owner: 10Filippo Giunchedi) [14:31:08] Jeff_Green: I don't see why not [14:34:34] (03CR) 10Andrew Bogott: [C: 032] base: remove useless fact ::puppetmastername [puppet] - 10https://gerrit.wikimedia.org/r/309543 (owner: 10Giuseppe Lavagetto) [14:34:39] (03PS2) 10Andrew Bogott: base: remove useless fact ::puppetmastername [puppet] - 10https://gerrit.wikimedia.org/r/309543 (owner: 10Giuseppe Lavagetto) [14:35:34] ok great, that's what I concluded from the wikitech docs and pybal config, just wondering if I'm missing something obvious [14:36:25] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5165699 keys - replication_delay is 711 [14:37:35] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup/deploy puppetmaster100[12] - https://phabricator.wikimedia.org/T143219#2623081 (10akosiaris) [14:37:48] 06Operations, 10hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2623086 (10akosiaris) [14:37:48] I've written a few lines about systemtap, feedback/edits welcome https://wikitech.wikimedia.org/wiki/SystemTap [14:37:50] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup/deploy puppetmaster100[12] - https://phabricator.wikimedia.org/T143219#2561022 (10akosiaris) 05Open>03Resolved a:03akosiaris Boxes have been installed, 
resolving [14:38:11] (03PS1) 10Filippo Giunchedi: thumbor: tune nginx next_upstream behaviour [puppet] - 10https://gerrit.wikimedia.org/r/309574 (https://phabricator.wikimedia.org/T139606) [14:38:54] thanks ema! [14:39:46] 06Operations, 10hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2527576 (10akosiaris) I suppose we can resolve this ? [14:48:54] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5157442 keys - replication_delay is 0 [14:49:04] (03PS1) 10Jgreen: internal IP DNS for pay-lvs1003 & pay-lvs1004 [dns] - 10https://gerrit.wikimedia.org/r/309576 [14:51:07] (03CR) 10Jgreen: [C: 032] internal IP DNS for pay-lvs1003 & pay-lvs1004 [dns] - 10https://gerrit.wikimedia.org/r/309576 (owner: 10Jgreen) [14:51:36] !log change-prop deployed 34b23e7 [14:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:17] !log authdns-update for pay-lvs1001 & pay-lvs1002 [14:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:09] moritzm: elukey: can I pick mw2076 for some additional testing? [15:01:26] sure, just make sure to keep it in pooled=inactive status around deployment windows [15:01:44] as found at https://wikitech.wikimedia.org/wiki/Deployments [15:02:08] sure, I was planning to do them now.. 
and AFAIK no deployments on friday :) [15:02:21] nope :-) [15:05:26] (03PS1) 10Andrew Bogott: Redefine labs_global_puppet_master [puppet] - 10https://gerrit.wikimedia.org/r/309583 [15:10:25] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2623158 (10Jgreen) [15:11:01] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2582571 (10Jgreen) [15:11:39] (03PS2) 10Andrew Bogott: Don't use hiera setting labs_global_puppet_master [puppet] - 10https://gerrit.wikimedia.org/r/309583 [15:14:19] (03CR) 10Andrew Bogott: [C: 032] Don't use hiera setting labs_global_puppet_master [puppet] - 10https://gerrit.wikimedia.org/r/309583 (owner: 10Andrew Bogott) [15:15:21] !log T133805: Renabling Pupppet, forcing run, and restarting Cassandra to restore 8M region size on restbase1013-a.eqiad.wmnet [15:15:23] T133805: Isolated testing of GC settings for aggressive Cassandra chunk_length_kb values - https://phabricator.wikimedia.org/T133805 [15:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:03] (03PS1) 10Andrew Bogott: Rearrange labspuppetbackend hiera settings again [puppet] - 10https://gerrit.wikimedia.org/r/309588 [15:17:14] (03PS1) 10Muehlenhoff: Retroactively add CVE ID to changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/309589 [15:17:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Retroactively add CVE ID to changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/309589 (owner: 10Muehlenhoff) [15:17:55] (03CR) 10Andrew Bogott: [C: 032] Rearrange labspuppetbackend hiera settings again [puppet] - 10https://gerrit.wikimedia.org/r/309588 (owner: 10Andrew Bogott) [15:18:04] 06Operations, 10Traffic, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623166 (10RobH) [15:19:11] (03PS2) 
10Urbanecm: Lift of IP cap - WomenInSience [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) [15:23:33] 06Operations, 10Cassandra, 06Services: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044#2623184 (10fgiunchedi) note we'll need to renew some instance certs in codfw as part of this as they are about to expire anyway [15:24:26] (03PS4) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) [15:24:45] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:01] (03CR) 10jenkins-bot: [V: 04-1] prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [15:28:23] urandom: I got your ping sorry, I'll review the CR asap :( [15:29:21] elukey: no worries, i just didn't want to send your way via a stealth puppet update without you being aware :) [15:29:55] :D [15:31:55] elukey: best not to merge it until next week anyway [15:33:08] super, sorry again for the delay [15:36:06] (03PS1) 10Andrew Bogott: Wrap labspuppetbackend in a role class [puppet] - 10https://gerrit.wikimedia.org/r/309591 [15:36:53] (03CR) 10Alex Monk: "You renamed a class but didn't update the list of classes used by those servers." 
[puppet] - 10https://gerrit.wikimedia.org/r/308152 (owner: 10Hashar) [15:37:15] (03CR) 10jenkins-bot: [V: 04-1] Wrap labspuppetbackend in a role class [puppet] - 10https://gerrit.wikimedia.org/r/309591 (owner: 10Andrew Bogott) [15:38:43] (03PS1) 10RobH: ssl cert renewals: ldap-[codfw|eqiad].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/309592 (https://phabricator.wikimedia.org/T145201) [15:39:12] 06Operations, 10Traffic, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623236 (10RobH) [15:39:55] (03PS2) 10Andrew Bogott: Wrap labspuppetbackend in a role class [puppet] - 10https://gerrit.wikimedia.org/r/309591 [15:40:16] (03CR) 10jenkins-bot: [V: 04-1] ssl cert renewals: ldap-[codfw|eqiad].wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/309592 (https://phabricator.wikimedia.org/T145201) (owner: 10RobH) [15:40:59] (03PS5) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) [15:41:03] hrmm [15:41:09] why did that fail.... [15:41:10] <_joe_> andrewbogott: I don't really see how that patch can simplify anything [15:41:14] 15:39:15 ERROR: invocation failed (exit code 1), logfile: /home/jenkins/workspace/operations-puppet-tox-jessie/.tox/py27/log/py27-1.log [15:41:14] 15:39:15 ERROR: actionid=py27 [15:41:33] (03CR) 10jenkins-bot: [V: 04-1] Wrap labspuppetbackend in a role class [puppet] - 10https://gerrit.wikimedia.org/r/309591 (owner: 10Andrew Bogott) [15:41:35] seems some kind of jenkins error? [15:41:40] _joe_: ok... [15:41:53] andre__: you have same error [15:41:57] ack, wrong ping [15:41:58] andrewbogott: [15:41:58] <_joe_> andrewbogott: honestly, it just complicates things [15:42:10] <_joe_> andrewbogott: want to see what would make your life easier? 
[15:42:19] I got feedback yesterday from bblack that using hiera to set params on a class buried in a role was weird [15:42:23] _joe_: sure [15:42:31] (03CR) 10jenkins-bot: [V: 04-1] prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [15:42:36] 15:39:11 rsync: change_dir "/operations-puppet/production/operations-puppet-tox-jessie" (in caches) failed: No such file or directory (2) [15:43:07] i think there is something borked with jenkins though im not sure what yet, that above tox test failure seems unrelated to my actual patchset. [15:43:18] <_joe_> yes [15:43:22] <_joe_> robh: there is [15:44:57] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2623246 (10greg) On the note about Ganeti vs bare metal performance for deploys: Before/If we migrate to Ganeti VMs for tin/mira I'd like a performance test of a full scap and other common actions.... [15:45:26] heh, monkey broke ci. [15:45:30] (whatever it is) [15:47:02] 06Operations, 10Traffic, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623255 (10RobH) a:05RobH>03MoritzMuehlenhoff [15:47:30] (03PS1) 10Giuseppe Lavagetto: labs::puppetmaster: move hiera setting to the role [puppet] - 10https://gerrit.wikimedia.org/r/309593 [15:47:33] greg-g: heads up, we may need a friday deploy or undeploy. the zero: namespace on zerowiki is down, my hunch is content model related something or other. i've emailed legoktm and yurik concerning this and poked them on #wikimedia-mobile as well. i just spoke with yurik on the phone, too, and am going to give legoktm a call. [15:47:51] _joe_: "want to see what would make your life easier?" 
[15:48:17] <_joe_> andrewbogott: ^^ that change [15:48:41] <_joe_> variables (be them class variables or anything else) that make sense just for one role [15:48:53] <_joe_> should be in the role/common/... tree [15:49:32] ok [15:49:54] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [15:50:00] <_joe_> and now the only error is 'Error: Failed to compile catalog for node labcontrol1002.wikimedia.org: Must pass mysql_password to Class[Labspuppetbackend]' [15:50:05] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/309591 (owner: 10Andrew Bogott) [15:50:17] <_joe_> which I guess you have to set in the private repo [15:50:39] (03CR) 10Andrew Bogott: [C: 032] labs::puppetmaster: move hiera setting to the role [puppet] - 10https://gerrit.wikimedia.org/r/309593 (owner: 10Giuseppe Lavagetto) [15:51:19] dr0ptp4kt: yuck [15:51:30] greg-g: mmmhmm [15:51:31] dr0ptp4kt: noted [15:51:40] greg-g: thx. happy friday! [15:51:46] :) [15:52:22] 06Operations, 10Deployment-Systems, 10scap, 13Patch-For-Review, 03Scap3: Make keyholder work with systemd - https://phabricator.wikimedia.org/T144043#2623263 (10AlexMonk-WMF) [15:52:23] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:52:24] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2623262 (10AlexMonk-WMF) [15:52:37] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2623268 (10BBlack) I'm putting up a straw-man hostname-decom date of 2016-09-19, which is ~10 days out from now. We'll never actually eliminate the trailing traffic before decom, and mo... 
[15:55:04] (03Abandoned) 10Andrew Bogott: Wrap labspuppetbackend in a role class [puppet] - 10https://gerrit.wikimedia.org/r/309591 (owner: 10Andrew Bogott) [15:56:33] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:01] (03PS1) 10DCausse: Upgrade elasticsearch pluglins to 2.4.0 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) [16:00:15] (03CR) 10DCausse: [C: 04-1] "should be merged just before the rolling restart" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: 10DCausse) [16:01:14] (03CR) 10Alexandros Kosiaris: "first round of comments. Premise looks ok, questions inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [16:01:34] greg-g, dr0ptp4kt - worst case scenario - the zero banners don't show [16:01:55] its not like people won't see MW content [16:05:13] PROBLEM - mediawiki-installation DSH group on mw2139 is CRITICAL: Host mw2139 is not in mediawiki-installation dsh group [16:06:31] (03CR) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [16:06:42] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [16:08:02] this is me --^ [16:10:24] !log live hacking on mw1017 [16:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:03] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:20:41] (03CR) 10Elukey: [C: 031] "Checked out the patch, didn't see anything left with git grep. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/309572 (https://phabricator.wikimedia.org/T144745) (owner: 10Muehlenhoff) [16:21:34] (03CR) 10Madhuvishy: labstore: nfs-manage-binds a sync-exports replacments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/309387 (owner: 10Rush) [16:21:38] (03PS2) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 [16:21:40] (03PS1) 10Chad: WIP: Work towards not needed MWMinimalScriptInit.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309601 [16:26:35] bblack: heads up, legoktm is deploying a fix for that zerowiki-related thing, he and yurik have been troubleshooting. i'll be checking in 30 mins or so (assuming deploy is done within 15 mins) to see that it takes effect through the cascade all the way to action=zeroconfig and other header enrichment and things [16:26:44] thanks legoktm and yurik much appreciated! [16:27:00] all creds go to legoktm ;) [16:27:34] * yurik was up until 6am and is not thinking too well [16:28:13] legoktm, synced? [16:28:26] syncing [16:28:45] yep, seems to be up again [16:28:46] is this going to affect cached stuff that needs banning? 
[16:28:53] ah, no, waiting [16:29:07] !log legoktm@tin Synchronized php-1.28.0-wmf.18/extensions/JsonConfig/: Unbreak Zero namespace, Check globals in addition to attributes https://gerrit.wikimedia.org/r/309598 (duration: 00m 51s) [16:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:25] bblack, technically yes - anything that went for special:ZRMA page [16:29:34] I have no idea about caches and stuff, but the underying cause is fixed in MW [16:29:44] legoktm, yep, works!!! [16:29:46] awesome job [16:30:11] bblack, i think it should be ok because the cache for banners is fairly quick [16:30:23] we had it at 15 min i think [16:30:28] not worth the hassle [16:31:37] ok [16:32:08] ok, now that legoktm did the impossible, i will go do my morning routine :) [16:34:59] I'll wait those 15 min and then go back to sleep...I can write up the incident report later [16:35:27] legoktm: thanks [16:38:46] (03PS1) 10Chad: mwrepl: Use MWScript.php directly [puppet] - 10https://gerrit.wikimedia.org/r/309605 [16:42:41] RECOVERY - mediawiki-installation DSH group on mw2075 is OK: OK [16:44:31] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2623490 (10Gehel) I'm starting to understand a few things about scap: * the `[codfw.wmnet]` section in `scap.cfg` is activated if we depl... 
[16:50:07] there may be small amount of temporal lag during the weekend on s5 tokudb databases (dbstore and labsdb), that is normal and should disappear soon after it appears [16:50:43] it is part of the continued maintenance done for propagating production schema changes [16:50:57] (03CR) 10Chad: [C: 032] activeMWVersions.php: Remove script and get info for noc from scap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309372 (owner: 10Chad) [16:51:04] (03PS2) 10Chad: activeMWVersions.php: Remove script and get info for noc from scap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309372 [16:53:06] !log demon@tin Synchronized docroot/noc/conf/: Updating activeMWVersions data (duration: 00m 47s) [16:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:46] !log demon@tin Synchronized multiversion/: rm one more ugly file (duration: 01m 05s) [16:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:48] (03CR) 1020after4: [C: 031] "@dzahn: There is some support on the task now and I've set this in the live config, so might as well merge it to make the default config m" [puppet] - 10https://gerrit.wikimedia.org/r/281071 (https://phabricator.wikimedia.org/T131622) (owner: 10Paladox) [17:03:09] (03Abandoned) 10Chad: Remove mw2080 from scap-proxy list [puppet] - 10https://gerrit.wikimedia.org/r/309043 (owner: 10Chad) [17:03:10] 06Operations, 10Traffic, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2623166 (10akosiaris) FWIW, those were up to now certificates issued by our internal CA (along with the labvirt* certs from what I remember and can see). II... 
[17:04:21] PROBLEM - Apache HTTP on mw2076 is CRITICAL: Connection timed out [17:05:02] PROBLEM - nutcracker process on mw2076 is CRITICAL: Connection refused by host [17:06:43] RECOVERY - mediawiki-installation DSH group on mw2139 is OK: OK [17:06:45] RECOVERY - Apache HTTP on mw2076 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.077 second response time [17:07:25] RECOVERY - nutcracker process on mw2076 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [17:08:32] (03PS1) 10Alexandros Kosiaris: puppetmaster: Ship a gitconfig file [puppet] - 10https://gerrit.wikimedia.org/r/309608 [17:08:34] (03PS1) 10Alexandros Kosiaris: puppetmaster: Fixes to the post-receive hook on a frontend [puppet] - 10https://gerrit.wikimedia.org/r/309609 [17:08:36] (03PS1) 10Alexandros Kosiaris: puppetmaster: More strict permission on private on frontends [puppet] - 10https://gerrit.wikimedia.org/r/309610 [17:09:00] PROBLEM - mediawiki-installation DSH group on mw2076 is CRITICAL: Host mw2076 is not in mediawiki-installation dsh group [17:14:10] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2623524 (10Andrew) [17:18:10] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 2 minutes ago with 8 failures. Failed resources (up to 3 shown): File[/etc/apache2/mods-available/setenvif.conf],File[/etc/apache2/mods-available/userdir.conf],File[/etc/apache2/mods-available/autoindex.conf],Package[fonts-noto-cjk] [17:19:11] that's me, reimaging [17:29:39] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2623579 (10Smalyshev) > the [codfw.wmnet] section in scap.cfg is activated if we deploy from codfw, not to codfw That looks like a bug. W... 
[17:32:23] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2623601 (10Smalyshev) One strange thing that I see is that symlink is still created, even though the check is listed only for canary and w... [17:40:24] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:44:49] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2623697 (10Gehel) >>! In T144380#2623579, @Smalyshev wrote: >> the [codfw.wmnet] section in scap.cfg is activated if we deploy from codfw,... [17:50:32] (03PS1) 10Andrew Bogott: Move labs/puppetmaster ferm rules into the labs/puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/309613 [17:51:58] (03CR) 10jenkins-bot: [V: 04-1] Move labs/puppetmaster ferm rules into the labs/puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/309613 (owner: 10Andrew Bogott) [17:56:00] (03PS2) 10Andrew Bogott: Move labs/puppetmaster ferm rules into the labs/puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/309613 [17:58:00] (03CR) 10Andrew Bogott: [C: 032] Move labs/puppetmaster ferm rules into the labs/puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/309613 (owner: 10Andrew Bogott) [18:01:04] (03PS1) 10Andrew Bogott: Remove some redundant settings from labs/puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/309614 [18:02:50] (03CR) 10Andrew Bogott: [C: 032] Remove some redundant settings from labs/puppetmaster role [puppet] - 10https://gerrit.wikimedia.org/r/309614 (owner: 10Andrew Bogott) [18:07:55] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2623826 (10Jgreen) [18:18:19] (03CR) 10Hashar: [C: 031] "It works!" 
[puppet] - 10https://gerrit.wikimedia.org/r/309337 (owner: 10Alex Monk) [18:19:13] (03CR) 10Hashar: "Ah "Rename role::aptly to role::aptly::server ". Yeah sorry bout that :(" [puppet] - 10https://gerrit.wikimedia.org/r/308152 (owner: 10Hashar) [18:21:21] (03CR) 10Hashar: package_builder: support '-backports' in distribution (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309568 (owner: 10Hashar) [18:21:46] (03PS2) 10Hashar: package_builder: support '-backports' in distribution [puppet] - 10https://gerrit.wikimedia.org/r/309568 [18:23:00] !log demon@tin Synchronized wmf-config/: prune old ext messages files (duration: 00m 52s) [18:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:14] 06Operations, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2623880 (10MaxSem) [18:49:25] (03PS2) 10Rush: labstore: nfs-manage-binds a sync-exports replacments [puppet] - 10https://gerrit.wikimedia.org/r/309387 [18:49:36] ostriches: hey, around? [18:49:44] Yessir, what's up? [18:49:47] hi! 
[18:49:50] (03PS3) 10Rush: labstore: nfs-manage-binds a sync-exports replacments [puppet] - 10https://gerrit.wikimedia.org/r/309387 [18:50:08] we're getting these daily: [18:50:17] Cron /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/echo.dblist extensions/Echo/maintenance/processEchoEmailBatch.php [18:50:23] Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php [18:50:26] both fail with [18:50:31] Warning: require_once(/etc/mediawiki/WikitechPrivateSettings.php): failed to open stream: No such file or directory in /srv/mediawiki/wmf-config/wikitech.php on line 179 [18:50:35] Fatal error: require_once(): Failed opening required '/etc/mediawiki/WikitechPrivateSettings.php' (include_path='/srv/mediawiki/php-1.28.0-wmf.18:/usr/local/lib/php:/usr/share/php') in [18:50:39] /srv/mediawiki/wmf-config/wikitech.php on line 179 [18:50:39] seems to be happening for a while: https://phabricator.wikimedia.org/T137771#2404084 [18:50:42] Bahhhh [18:50:48] also https://phabricator.wikimedia.org/T140889 I suppose [18:50:49] I was told those were in /etc/mediawiki now [18:50:57] Something change in puppet? [18:51:00] also https://phabricator.wikimedia.org/T136926 I guess [18:51:04] haven't those never been available on terbium? [18:51:07] they are the labwiki config [18:51:08] and https://phabricator.wikimedia.org/T132383 ? [18:52:08] Ok, so wikitech private settings need to be on all apaches not just silver... [18:54:14] (03CR) 10Madhuvishy: [C: 031] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/309387 (owner: 10Rush) [18:54:52] Or at the very least, work hosts. 
[18:56:27] (03PS2) 10Chad: mwrepl: Use MWScript.php directly [puppet] - 10https://gerrit.wikimedia.org/r/309605 [18:56:29] (03PS1) 10Chad: Include wikitech's private settings on terbium-like hosts [puppet] - 10https://gerrit.wikimedia.org/r/309615 (https://phabricator.wikimedia.org/T137771) [18:56:44] (03PS2) 10Chad: Include wikitech's private settings on terbium-like hosts [puppet] - 10https://gerrit.wikimedia.org/r/309615 (https://phabricator.wikimedia.org/T137771) [18:56:46] (03CR) 10Rush: [C: 032] labstore: nfs-manage-binds a sync-exports replacments [puppet] - 10https://gerrit.wikimedia.org/r/309387 (owner: 10Rush) [18:57:12] * paravoid pukes at modules/role/manifests/mediawiki/maintenance.pp [18:57:20] ...including an openstack:: class [18:57:38] Yeah I'm not thrilled. [18:57:43] But that's how that file is generated right now. [18:58:00] tbh, we should move Wikitech into the normal private settings [18:58:14] (and make the whole thing come from puppet too) [18:58:16] in what universe is a mediawiki config under the openstack class ffs [18:58:37] (I realize this isn't your change) [18:58:45] yeah, that sounds cleaner, let's do that? [19:04:08] (03CR) 10EBernhardson: [C: 04-1] "it doesn't appear that the --debug-extension allows command line arguments, testing this script on terbium gets:" [puppet] - 10https://gerrit.wikimedia.org/r/309605 (owner: 10Chad) [19:04:22] (03CR) 10Nikerabbit: contint: vary ssh from= for prod slave (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [19:05:17] (03CR) 10Chad: "Yeah, this is gonna need my other change to land first :)" [puppet] - 10https://gerrit.wikimedia.org/r/309605 (owner: 10Chad) [19:08:24] (03CR) 10EBernhardson: "did the integer overflow patch for search-extra make it in here? 
It was after '" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: 10DCausse) [19:09:48] (03CR) 10Alex Monk: "Won't this give the restricted group access to the novaadmin pass etc. again?" [puppet] - 10https://gerrit.wikimedia.org/r/309615 (https://phabricator.wikimedia.org/T137771) (owner: 10Chad) [19:12:43] RECOVERY - mediawiki-installation DSH group on mw2076 is OK: OK [19:20:07] (03CR) 10Chad: "It should be on every MW host tbh...you can't assume that your special snowflake is always special :)" [puppet] - 10https://gerrit.wikimedia.org/r/309615 (https://phabricator.wikimedia.org/T137771) (owner: 10Chad) [19:23:56] (03Abandoned) 10Chad: Include wikitech's private settings on terbium-like hosts [puppet] - 10https://gerrit.wikimedia.org/r/309615 (https://phabricator.wikimedia.org/T137771) (owner: 10Chad) [19:26:31] (03PS1) 10Chad: Sometimes we want to pretend to be wikitech even when we aren't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309616 (https://phabricator.wikimedia.org/T137771) [19:27:25] (03CR) 10Chad: [C: 032] Sometimes we want to pretend to be wikitech even when we aren't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309616 (https://phabricator.wikimedia.org/T137771) (owner: 10Chad) [19:27:52] (03Merged) 10jenkins-bot: Sometimes we want to pretend to be wikitech even when we aren't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309616 (https://phabricator.wikimedia.org/T137771) (owner: 10Chad) [19:29:41] !log demon@tin Synchronized wmf-config/wikitech.php: bizarro config loading (duration: 00m 46s) [19:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:42] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2624103 (10demon) [19:32:40] paravoid: That should fix it for now. 
[19:32:51] I'm still not a huge fan of how this is structured though. [19:33:10] I agree that it shouldn't be in the openstack class, but... [19:33:25] well, it's not an accident that that file is limited to just silver [19:33:42] I guess it doesn't reveal secrets to anyone new to put it everywhere, now that all deployers have access to silver [19:33:55] yes it does [19:34:02] restricted users have access to terbium [19:35:19] sorry, 'restricted users' being more people than just deployers? [19:35:27] 'restricted' is a group [19:35:31] and yes, it's not a subset of deployers [19:35:54] the majority of them are not deployers: https://phabricator.wikimedia.org/T104671 [19:36:17] ah, looks like ostriches' patch addressed that without copying the file everywhere... [19:36:19] so maybe we're good [19:37:36] I still think being able to access a wiki half-configured is a flaw. [19:45:41] ostriches: thanks, btw :) [19:49:55] yw [19:57:33] (03PS1) 10Yuvipanda: labspuppetbackend: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/309619 [19:58:10] (03PS2) 10Yuvipanda: labspuppetbackend: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/309619 [19:58:18] (03CR) 10Yuvipanda: [C: 032 V: 032] labspuppetbackend: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/309619 (owner: 10Yuvipanda) [20:05:59] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Make elasticsearch actually uses shard allocation awareness - https://phabricator.wikimedia.org/T143571#2624340 (10debt) [20:06:03] 06Operations, 10ops-eqiad, 06Discovery, 10Elasticsearch, and 2 others: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685#2624339 (10debt) 05Open>03Resolved [20:09:59] (03PS1) 10Andrew Bogott: Fix the puppet backend tld for labtest [puppet] - 10https://gerrit.wikimedia.org/r/309621 [20:10:47] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for 
stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10AlexMonk-WMF) I did some digging and found /shared/pywikipedia/core/pywikibot/comms/rcstream.py in tools defaults to stream.wikimedia.org... [20:11:22] (03CR) 10Andrew Bogott: [C: 032] Fix the puppet backend tld for labtest [puppet] - 10https://gerrit.wikimedia.org/r/309621 (owner: 10Andrew Bogott) [20:12:19] (03PS1) 10Yuvipanda: labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 [20:12:38] andrewbogott: ^ this should work [20:13:23] or not [20:13:45] (03CR) 10jenkins-bot: [V: 04-1] labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 (owner: 10Yuvipanda) [20:15:58] (03PS2) 10Yuvipanda: labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 [20:16:00] (03PS1) 10Yuvipanda: labspuppetbackend: Listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/309625 [20:17:51] (03CR) 10jenkins-bot: [V: 04-1] labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 (owner: 10Yuvipanda) [20:18:14] what the fuck now, jenkins [20:18:52] (03CR) 10jenkins-bot: [V: 04-1] labspuppetbackend: Listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/309625 (owner: 10Yuvipanda) [20:18:57] over time I'm coming around to more of the ipython project's attitude towards style checks in CI [20:19:45] (03PS2) 10Yuvipanda: labspuppetbackend: Listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/309625 [20:19:47] (03PS3) 10Yuvipanda: labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 [20:20:24] (03CR) 10Andrew Bogott: [C: 031] labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 (owner: 10Yuvipanda) [20:22:26] (03CR) 10Yuvipanda: [C: 032] labs: Hit the REST backend for additional roles [puppet] - 10https://gerrit.wikimedia.org/r/309622 
(owner: 10Yuvipanda) [20:22:34] (03CR) 10Yuvipanda: [C: 032] labspuppetbackend: Listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/309625 (owner: 10Yuvipanda) [20:31:31] 06Operations, 10Pywikibot-core, 10Traffic, 07HTTPS: rcstream support defaults to stream.wikimedia.org:80 - https://phabricator.wikimedia.org/T145244#2624490 (10AlexMonk-WMF) a:03AlexMonk-WMF [20:41:15] (03PS1) 10Andrew Bogott: horizon puppet gui: Remove the 'labs' role filter [puppet] - 10https://gerrit.wikimedia.org/r/309633 [20:42:20] (03PS1) 10Thcipriani: Bump scap version to 3.2.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/309635 [20:42:39] (03CR) 10Thcipriani: [C: 04-1] "Blocked until new version is on carbon" [puppet] - 10https://gerrit.wikimedia.org/r/309635 (owner: 10Thcipriani) [20:43:20] (03CR) 10Andrew Bogott: [C: 032] horizon puppet gui: Remove the 'labs' role filter [puppet] - 10https://gerrit.wikimedia.org/r/309633 (owner: 10Andrew Bogott) [20:48:02] (03PS2) 10Thcipriani: Bump scap version to 3.2.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/309635 (https://phabricator.wikimedia.org/T127762) [20:49:03] (03CR) 10Thcipriani: [C: 04-1] Bump scap version to 3.2.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/309635 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [20:53:13] (03PS2) 10Rush: labstore: nfs-exportd refactor and updates [puppet] - 10https://gerrit.wikimedia.org/r/309382 [20:54:36] (03CR) 10Madhuvishy: [C: 031] labstore: nfs-exportd refactor and updates [puppet] - 10https://gerrit.wikimedia.org/r/309382 (owner: 10Rush) [20:55:22] (03PS3) 10Rush: labstore: nfs-exportd refactor and updates [puppet] - 10https://gerrit.wikimedia.org/r/309382 [21:01:29] (03CR) 10Rush: [C: 032] labstore: nfs-exportd refactor and updates [puppet] - 10https://gerrit.wikimedia.org/r/309382 (owner: 10Rush) [21:04:29] ostriches: I'm getting a merge failure from Jenkins due to unrelated Wikibase unit tests failing. Any idea what to do about that? 
Should I just wait and try rechecking it later? https://gerrit.wikimedia.org/r/#/c/260636/ [21:04:44] lez see [21:05:27] ostriches: hmm, actually maybe it is related [21:09:16] ostriches: actually, I think it's related but indirectly. At some point someone fixed a typo on a line in that file, and the rebase picked it up. But the typo fix is triggering a bug in Wikibase. [21:09:30] ostriches: I'll do some more digging [21:09:31] Aww :( [21:15:33] ostriches: I don't get it. It's failing because of a typo fix that was merged 6 months ago: https://gerrit.wikimedia.org/r/#/c/278277/ Clearly I'm missing something here. [21:16:41] https://gerrit.wikimedia.org/r/#/c/278277/2/includes/pager/ReverseChronologicalPager.php specifically [21:18:24] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py] [21:20:33] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:21:23] The 3 unit test failures have nothing to do with the code changes in that patch, but they are related to the file that was changed. This is baffling. 
[21:23:04] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [21:30:37] (03PS1) 10Alex Monk: hiera_lookup util: add support for labtest realm, fix check for labs [puppet] - 10https://gerrit.wikimedia.org/r/309685 [21:43:54] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:52:55] (03PS1) 10Alex Monk: fix labstore cluster: labsnfs [puppet] - 10https://gerrit.wikimedia.org/r/309689 [21:55:50] (03PS1) 10Alex Monk: couple more labs support host hiera cluster key cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309690 [22:09:36] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2624661 (10greg) >>! In T144661#2618330, @jcrespo wrote: >> It's kind of... [22:10:03] jouncebot: now [22:10:03] No deployments scheduled for the next 62 hour(s) and 49 minute(s) [22:10:12] neat, it's already pulled and running [22:11:14] (03PS1) 10Alex Monk: more labs regex hiera cluster key fix for wikimedia.org hosts [puppet] - 10https://gerrit.wikimedia.org/r/309691 [22:11:16] (03PS1) 10Alex Monk: set labtest up in ganglia like labs [puppet] - 10https://gerrit.wikimedia.org/r/309692 [22:11:18] (03PS1) 10Alex Monk: remove extra cluster: labvirt hieradata [puppet] - 10https://gerrit.wikimedia.org/r/309693 [22:14:00] (03Abandoned) 10Alex Monk: remove extra cluster: labvirt hieradata [puppet] - 10https://gerrit.wikimedia.org/r/309693 (owner: 10Alex Monk) [22:15:24] (03Restored) 10Alex Monk: remove extra cluster: labvirt hieradata [puppet] - 10https://gerrit.wikimedia.org/r/309693 (owner: 10Alex Monk) [22:16:05] (03PS2) 10Alex Monk: remove extra cluster: labvirt hieradata [puppet] - 10https://gerrit.wikimedia.org/r/309693 [22:16:07] (03PS2) 10Alex Monk: set labtest up in ganglia like labs [puppet] - 
10https://gerrit.wikimedia.org/r/309692 [22:22:45] (03PS1) 10RobH: robh on vacation, remove from paging [puppet] - 10https://gerrit.wikimedia.org/r/309694 [22:32:40] (03PS1) 10Alex Monk: Remove references to the old virt* servers [puppet] - 10https://gerrit.wikimedia.org/r/309695 [22:33:50] (03CR) 10jenkins-bot: [V: 04-1] Remove references to the old virt* servers [puppet] - 10https://gerrit.wikimedia.org/r/309695 (owner: 10Alex Monk) [22:34:36] greg-g: jouncebot's "now" is actually live, just not merged :) [22:34:39] jouncebot: now [22:34:39] No deployments scheduled for the next 62 hour(s) and 25 minute(s) [22:36:01] (03PS2) 10Alex Monk: Remove references to the old virt* servers [puppet] - 10https://gerrit.wikimedia.org/r/309695 [22:36:22] greg-g: I wonder if it would be useful to have jouncebot also manage a section of the channel topic when deploys are active? [22:40:00] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2624715 (10bd808) >>! In T144661#2624661, @greg wrote: > So, here's a pr... [22:42:39] bd808: also a good idea, one I/we've wanted for a while. And one we can do better now with jouncebot than we could with that etsy bot [22:43:04] Yeah. 
the etsy bot *almost* worked [22:43:12] * greg-g nods [22:43:31] that one probably needs a task so people can fret over /topic format ;) [22:44:11] (03PS9) 10Volans: Automation: automatically reimage host [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) [22:44:26] code talks, bikeshedding walks ;) [22:48:23] (03CR) 10Greg Grossmeier: [C: 032] Add a "now" command [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308086 (owner: 10BryanDavis) [22:48:32] :) [22:49:13] (03Merged) 10jenkins-bot: Add a "now" command [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308086 (owner: 10BryanDavis) [22:49:41] greg-g: while your +2 finger is working -- https://gerrit.wikimedia.org/r/#/c/308087/ [22:50:37] (03CR) 10Greg Grossmeier: [C: 032] "God yes." [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308087 (owner: 10BryanDavis) [22:51:40] (03Merged) 10jenkins-bot: Use normal messages rather than notices for help [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/308087 (owner: 10BryanDavis) [23:02:52] (03CR) 10Thcipriani: "Inline question/confusion" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis) [23:08:33] (03CR) 10BryanDavis: l10nupdate: aquire scap lock before changing files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) (owner: 10BryanDavis) [23:19:00] (03PS2) 10BryanDavis: l10nupdate: aquire scap lock before changing files [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) [23:22:03] (03PS2) 10Ppchelko: Change-Prop: Bump transcludes concurrency once again. [puppet] - 10https://gerrit.wikimedia.org/r/309377