[00:08:49] legoktm: I've got a great idea. Write a script to make 150 mw-config commits, each just changing one require_once() to wfLoadExtension() and not depending on each other. The backlog will shame us into merging as many of them as possible, and the not-yet-ready outstanding ones will shame us into making them ready. :-)
[00:09:02] :(
[00:09:36] RECOVERY - grafana.wikimedia.org on krypton is OK: HTTP OK: HTTP/1.1 200 OK - 1485 bytes in 0.007 second response time
[00:10:17] !log apache restart on krypton
[00:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:25] RECOVERY - HTTP on krypton is OK: HTTP OK: HTTP/1.1 200 OK - 1485 bytes in 0.002 second response time
[00:11:54] mutante: :)
[00:12:48] (03CR) 10Dzahn: "not needed anymore - fixed by https://gerrit.wikimedia.org/r/#/c/230682/" [puppet] - 10https://gerrit.wikimedia.org/r/230664 (owner: 10Dzahn)
[00:12:58] (03Abandoned) 10Dzahn: Revert "grafana: add role to krypton (VM)" [puppet] - 10https://gerrit.wikimedia.org/r/230664 (owner: 10Dzahn)
[00:13:18] ori: :) thanks
[00:13:38] (03CR) 10Faidon Liambotis: [C: 032] Add A/PTR for mr1-codfw and msw1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230696 (owner: 10Faidon Liambotis)
[00:16:30] (03CR) 10Dzahn: [C: 031] "@krypton:/etc/apache2/sites-enabled# curl localhost 2>/dev/null | grep body" [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn)
[00:18:54] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1526267 (10Krenair)
[00:19:57] (03PS2) 10Dzahn: misc-web: switch grafana to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008)
[00:21:08] (03PS4) 10Ori.livneh: Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) (owner: 10GWicke)
[00:21:17] (03CR) 10Ori.livneh: [C: 032 V: 032] Enforce a hard limit on RestbaseUpdateJobOnDependencyChange retries [puppet] - 10https://gerrit.wikimedia.org/r/226901 (https://phabricator.wikimedia.org/T73853) (owner: 10GWicke)
[00:23:07] (03PS1) 10Tim Starling: Enable ParsoidBatchAPI everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230708
[00:24:18] (03CR) 10Tim Starling: [C: 04-2] "Can be deployed once the ParsoidBatchAPI source tree is available in all active deployment branches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230708 (owner: 10Tim Starling)
[00:25:40] (03CR) 10Dzahn: "i get:" [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn)
[00:33:05] (03PS1) 10Dzahn: OTRS: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230709
[00:36:36] PROBLEM - Disk space on cp1054 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=88%)
[00:38:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: stat1002 access for tgr - https://phabricator.wikimedia.org/T108417#1526317 (10Dzahn) p:5Triage>3Normal
[00:39:02] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: stat1002 access for tgr - https://phabricator.wikimedia.org/T108417#1526320 (10Dzahn) a:3ArielGlenn
[00:39:02] bblack: cp1054 disk ^ ?
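legoktm's opening suggestion is mechanical enough to actually script. A minimal sketch of the idea in Python, assuming a local mw-config checkout; the config path, regex, and branch naming are illustrative, not the real wmf-config layout:

```python
#!/usr/bin/env python3
"""Sketch of legoktm's idea: one independent commit per conversion of
require_once() to wfLoadExtension(). Paths, regex and branch names are
hypothetical, not the actual mw-config layout."""
import re
import subprocess

CONFIG = 'wmf-config/CommonSettings.php'  # hypothetical path
LOAD_RE = re.compile(r'require_once\( "\$IP/extensions/(\w+)/\1\.php" \);')

with open(CONFIG) as f:
    base = f.read()

for m in LOAD_RE.finditer(base):
    call, ext = m.group(0), m.group(1)
    # Branch each change off master so the commits don't depend on each
    # other and can be merged in any order.
    subprocess.check_call(['git', 'checkout', '-B', 'load-' + ext,
                           'origin/master'])
    with open(CONFIG, 'w') as f:
        f.write(base.replace(call, "wfLoadExtension( '%s' );" % ext))
    subprocess.check_call(['git', 'commit', '-a', '-m',
                           'Use wfLoadExtension() for %s' % ext])
    # Each branch would then be pushed to Gerrit as its own change.
```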
[00:39:40] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1526322 (10Dzahn) p:5Triage>3Normal
[00:39:51] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1526324 (10Dzahn) a:3ArielGlenn
[00:42:29] 10Ops-Access-Reviews: Analytics-users membership for csteipp - https://phabricator.wikimedia.org/T108351#1526326 (10Dzahn) a:3ArielGlenn
[00:48:36] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0]
[00:50:41] (03PS1) 10Faidon Liambotis: Add pybal-testsvc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/230713
[00:51:32] 6operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#1526345 (10Dzahn) "comparision: Serving small static files with Nginx, Varnish, G-WAN, Lighthttpd, Apache Traffic Server" https://x443.wordpress.com/2012/07/07/comparision-serving-small-static-files-with-nginx-va...
[00:52:36] (03CR) 10Ori.livneh: [C: 031] "weee" [dns] - 10https://gerrit.wikimedia.org/r/230713 (owner: 10Faidon Liambotis)
[00:54:52] (03CR) 10Faidon Liambotis: [C: 032] Add pybal-testsvc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/230713 (owner: 10Faidon Liambotis)
[01:00:01] (03CR) 10Dzahn: "in hiera it would be public, i would assume it needs to be private in the private puppet repo and then read from there like we do for othe" [puppet] - 10https://gerrit.wikimedia.org/r/230549 (https://phabricator.wikimedia.org/T108610) (owner: 10Yurik)
[01:01:05] RECOVERY - Disk space on cp1054 is OK: DISK OK
[01:06:44] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1526397 (10Dzahn) that was fixed by https://gerrit.wikimedia.org/r/#/c/230682/ so the role is applied on krypton but "Could not contact Elasticsearch. Please ensure that Elasticsearch is reachable...
[01:08:26] (03PS1) 10BryanDavis: logging: Only send info and higher to logstash by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230719
[01:09:56] (03CR) 10BryanDavis: logging: Only send info and higher to logstash by default (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230719 (owner: 10BryanDavis)
[01:17:36] (03CR) 10Yurik: "Existing osm_importer password is pulled via" [puppet] - 10https://gerrit.wikimedia.org/r/230549 (https://phabricator.wikimedia.org/T108610) (owner: 10Yurik)
[01:20:08] !log ori@tin Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoader.php: I2089b21fc: Revert resourceloader: Add must-revalidate to Cache-Control (duration: 00m 12s)
[01:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:05:45] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[02:07:45] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18880 bytes in 0.044 second response time
[02:08:19] flappy flap flap
[02:08:25] page =P
[02:23:33] !log l10nupdate@tin Synchronized php-1.26wmf17/cache/l10n: l10nupdate for 1.26wmf17 (duration: 06m 48s)
[02:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:59] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf17) at 2015-08-11 02:26:58+00:00
[02:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:06] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[02:56:06] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18879 bytes in 1.054 second response time
[03:44:46] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[03:49:34] (03PS1) 10BBlack: strongswan: decrease log spam [puppet] - 10https://gerrit.wikimedia.org/r/230725
[03:50:50] (03CR) 10BBlack: [C: 032] strongswan: decrease log spam [puppet] - 10https://gerrit.wikimedia.org/r/230725 (owner: 10BBlack)
[03:52:56] (03PS1) 10BBlack: bugfix for aaa20e4ce [puppet] - 10https://gerrit.wikimedia.org/r/230726
[03:53:13] (03CR) 10BBlack: [C: 032 V: 032] bugfix for aaa20e4ce [puppet] - 10https://gerrit.wikimedia.org/r/230726 (owner: 10BBlack)
[03:54:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:56:46] PROBLEM - puppet last run on cp1055 is CRITICAL puppet fail
[03:57:47] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail
[03:58:26] PROBLEM - puppet last run on cp3039 is CRITICAL puppet fail
[03:58:27] PROBLEM - puppet last run on cp3012 is CRITICAL puppet fail
[03:58:45] those will self-correct
[03:58:55] I think! :)
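The mobile-lb.eqiad.wikimedia.org_ipv6 flaps above are an HTTPS check timing out once and passing again two minutes later. A sketch of what such a probe boils down to, forcing the IPv6 path to match the check name; the timeout and the bare status-line check are assumptions, not the actual Icinga plugin:

```python
#!/usr/bin/env python3
"""Sketch of an HTTPS health probe like the flapping LVS check above.
The hostname comes from the alert; everything else is assumed."""
import socket
import ssl
import time

HOST = 'mobile-lb.eqiad.wikimedia.org'

def probe(host, timeout=10):
    # Resolve explicitly over IPv6, matching the "_ipv6" check name.
    family, type_, proto, _, addr = socket.getaddrinfo(
        host, 443, socket.AF_INET6, socket.SOCK_STREAM)[0]
    start = time.time()
    ctx = ssl.create_default_context()
    with socket.socket(family, type_, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(addr)
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(('GET / HTTP/1.1\r\nHost: %s\r\n'
                         'Connection: close\r\n\r\n' % host).encode())
            status_line = tls.recv(4096).split(b'\r\n', 1)[0]
    return status_line, time.time() - start

print(probe(HOST))
```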
[04:00:26] RECOVERY - puppet last run on cp3039 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[04:02:05] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:02:56] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:04:46] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:36:41] (03PS3) 10BBlack: rename varnish backends more-explicitly [puppet] - 10https://gerrit.wikimedia.org/r/230687
[04:36:52] (03PS4) 10BBlack: rename varnish backends more-explicitly [puppet] - 10https://gerrit.wikimedia.org/r/230687
[04:37:50] (03CR) 10BBlack: [C: 032] rename varnish backends more-explicitly [puppet] - 10https://gerrit.wikimedia.org/r/230687 (owner: 10BBlack)
[04:40:28] (03PS1) 10BBlack: Revert "Revert "cache::config: replace lvs IP refs with service hostnames"" [puppet] - 10https://gerrit.wikimedia.org/r/230728
[04:41:21] (03PS2) 10BBlack: Revert "Revert "cache::config: replace lvs IP refs with service hostnames"" [puppet] - 10https://gerrit.wikimedia.org/r/230728
[04:43:26] PROBLEM - puppet last run on cp1069 is CRITICAL Puppet has 1 failures
[04:44:39] (03PS1) 10EBernhardson: fix incorrect whitespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230729
[04:44:55] PROBLEM - puppet last run on cp3035 is CRITICAL Puppet has 2 failures
[04:45:27] PROBLEM - puppet last run on cp4007 is CRITICAL Puppet has 1 failures
[04:46:27] PROBLEM - puppet last run on cp2007 is CRITICAL Puppet has 1 failures
[04:46:56] RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:47:26] PROBLEM - puppet last run on cp3034 is CRITICAL Puppet has 2 failures
[04:47:26] PROBLEM - puppet last run on cp2021 is CRITICAL Puppet has 1 failures
[04:48:26] PROBLEM - puppet last run on cp2023 is CRITICAL Puppet has 1 failures
[04:48:26] RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:48:57] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 2 failures
[04:49:05] PROBLEM - puppet last run on cp2026 is CRITICAL Puppet has 1 failures
[04:49:16] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 2 failures
[04:49:26] PROBLEM - puppet last run on cp3047 is CRITICAL Puppet has 2 failures
[04:49:35] RECOVERY - puppet last run on cp4007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:49:56] PROBLEM - puppet last run on cp2011 is CRITICAL Puppet has 1 failures
[04:50:15] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 2 failures
[04:50:16] PROBLEM - puppet last run on cp3018 is CRITICAL Puppet has 2 failures
[04:50:26] PROBLEM - puppet last run on cp2003 is CRITICAL Puppet has 1 failures
[04:50:30] on the bulk of them it's a race condition that will work itself out...
[04:51:15] PROBLEM - puppet last run on cp2016 is CRITICAL Puppet has 1 failures
[04:51:27] PROBLEM - puppet last run on cp2014 is CRITICAL Puppet has 1 failures
[04:51:36] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures
[04:51:56] PROBLEM - puppet last run on cp2010 is CRITICAL Puppet has 1 failures
[04:52:06] PROBLEM - puppet last run on cp4019 is CRITICAL Puppet has 2 failures
[04:52:16] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 2 failures
[04:52:16] PROBLEM - puppet last run on cp1058 is CRITICAL Puppet has 1 failures
[04:52:45] PROBLEM - puppet last run on cp4015 is CRITICAL Puppet has 1 failures
[04:53:16] PROBLEM - puppet last run on cp4017 is CRITICAL Puppet has 1 failures
[04:53:16] PROBLEM - puppet last run on cp3006 is CRITICAL Puppet has 2 failures
[04:53:36] PROBLEM - puppet last run on cp4005 is CRITICAL Puppet has 1 failures
[04:54:16] PROBLEM - puppet last run on cp3009 is CRITICAL Puppet has 2 failures
[04:54:25] PROBLEM - puppet last run on cp2019 is CRITICAL Puppet has 1 failures
[04:54:26] PROBLEM - puppet last run on cp2022 is CRITICAL Puppet has 1 failures
[04:54:35] PROBLEM - puppet last run on cp2015 is CRITICAL Puppet has 1 failures
[04:56:16] PROBLEM - puppet last run on cp3046 is CRITICAL Puppet has 2 failures
[04:56:17] PROBLEM - puppet last run on cp3013 is CRITICAL Puppet has 2 failures
[04:56:17] PROBLEM - puppet last run on cp3031 is CRITICAL Puppet has 2 failures
[04:56:55] PROBLEM - puppet last run on cp3045 is CRITICAL Puppet has 2 failures
[04:56:56] PROBLEM - puppet last run on cp2017 is CRITICAL Puppet has 1 failures
[04:59:06] PROBLEM - puppet last run on cp3016 is CRITICAL Puppet has 2 failures
[04:59:26] PROBLEM - puppet last run on cp3039 is CRITICAL Puppet has 2 failures
[04:59:39] PROBLEM - puppet last run on cp3012 is CRITICAL Puppet has 2 failures
[04:59:46] PROBLEM - puppet last run on cp4009 is CRITICAL Puppet has 1 failures
[05:00:46] PROBLEM - puppet last run on cp2002 is CRITICAL Puppet has 1 failures
[05:01:06] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures
[05:01:45] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 2 failures
[05:01:46] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures
[05:01:56] PROBLEM - puppet last run on cp1057 is CRITICAL Puppet has 1 failures
[05:02:05] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 2 failures
[05:02:05] PROBLEM - puppet last run on cp3017 is CRITICAL Puppet has 2 failures
[05:02:35] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures
[05:02:36] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures
[05:02:37] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures
[05:03:16] PROBLEM - puppet last run on cp3036 is CRITICAL Puppet has 2 failures
[05:03:56] PROBLEM - puppet last run on cp3032 is CRITICAL Puppet has 2 failures
[05:04:06] PROBLEM - puppet last run on cp3003 is CRITICAL Puppet has 2 failures
[05:05:15] PROBLEM - puppet last run on cp4018 is CRITICAL Puppet has 2 failures
[05:05:15] PROBLEM - puppet last run on cp4013 is CRITICAL Puppet has 2 failures
[05:05:55] PROBLEM - puppet last run on cp2009 is CRITICAL Puppet has 1 failures
[05:06:45] PROBLEM - puppet last run on cp3015 is CRITICAL Puppet has 2 failures
[05:06:56] PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 1 failures
[05:07:16] PROBLEM - puppet last run on cp3049 is CRITICAL Puppet has 2 failures
[05:07:56] PROBLEM - puppet last run on cp2024 is CRITICAL Puppet has 1 failures
[05:08:17] PROBLEM - puppet last run on cp1070 is CRITICAL Puppet has 1 failures
[05:09:16] PROBLEM - puppet last run on cp4006 is CRITICAL Puppet has 1 failures
[05:09:36] PROBLEM - puppet last run on cp4020 is CRITICAL Puppet has 2 failures
[05:09:46] PROBLEM - puppet last run on cp2005 is CRITICAL Puppet has 1 failures
[05:10:05] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 2 failures
[05:10:36] PROBLEM - puppet last run on cp2008 is CRITICAL Puppet has 1 failures
[05:10:46] PROBLEM - puppet last run on cp3038 is CRITICAL Puppet has 2 failures
[05:10:55] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 2 failures
[05:10:55] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures
[05:11:16] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 2 failures
[05:11:46] PROBLEM - puppet last run on cp4012 is CRITICAL Puppet has 2 failures
[05:11:55] PROBLEM - puppet last run on cp3004 is CRITICAL Puppet has 2 failures
[05:11:56] PROBLEM - puppet last run on cp2020 is CRITICAL Puppet has 1 failures
[05:12:05] RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:12:06] RECOVERY - puppet last run on cp2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:12:16] PROBLEM - puppet last run on cp3041 is CRITICAL Puppet has 2 failures
[05:12:46] PROBLEM - puppet last run on cp3033 is CRITICAL Puppet has 2 failures
[05:13:25] PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 2 failures
[05:13:47] PROBLEM - puppet last run on cp4011 is CRITICAL Puppet has 2 failures
[05:13:55] PROBLEM - puppet last run on cp3043 is CRITICAL Puppet has 2 failures
[05:14:16] PROBLEM - puppet last run on cp3044 is CRITICAL Puppet has 2 failures
[05:14:56] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[05:15:05] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:15:21] (03PS1) 10BBlack: cache_misc: convert backends to director-style [puppet] - 10https://gerrit.wikimedia.org/r/230730
[05:15:36] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[05:15:36] RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:15:47] RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:15:56] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[05:16:06] RECOVERY - puppet last run on cp3047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:16:37] RECOVERY - puppet last run on cp2011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:16:46] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:16:55] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:16:55] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[05:17:06] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[05:17:25] RECOVERY - puppet last run on cp4015 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[05:17:56] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[05:18:15] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:18:16] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:18:27] (03CR) 10BBlack: [C: 032] cache_misc: convert backends to director-style [puppet] - 10https://gerrit.wikimedia.org/r/230730 (owner: 10BBlack)
[05:18:36] RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:18:46] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:18:56] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:19:05] RECOVERY - puppet last run on cp2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:19:56] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:20:06] RECOVERY - puppet last run on cp1069 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:20:16] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:20:16] (03PS3) 10BBlack: Revert "Revert "cache::config: replace lvs IP refs with service hostnames"" [puppet] - 10https://gerrit.wikimedia.org/r/230728
[05:20:56] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[05:20:57] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[05:20:57] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:21:05] RECOVERY - puppet last run on cp2019 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[05:21:12] (03CR) 10BBlack: [C: 032] Revert "Revert "cache::config: replace lvs IP refs with service hostnames"" [puppet] - 10https://gerrit.wikimedia.org/r/230728 (owner: 10BBlack)
[05:21:16] RECOVERY - puppet last run on cp2015 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[05:21:35] RECOVERY - puppet last run on cp2017 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:22:46] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:22:57] RECOVERY - puppet last run on cp3046 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:27:07] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 2 failures
[05:27:07] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures
[05:27:15] PROBLEM - puppet last run on cp2002 is CRITICAL Puppet has 1 failures
[05:27:16] PROBLEM - puppet last run on cp2004 is CRITICAL Puppet has 1 failures
[05:27:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Aug 11 05:27:33 UTC 2015 (duration 27m 32s)
[05:27:35] PROBLEM - puppet last run on cp4018 is CRITICAL Puppet has 2 failures
[05:27:36] PROBLEM - puppet last run on cp4006 is CRITICAL Puppet has 1 failures
[05:27:36] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 2 failures
[05:27:36] PROBLEM - puppet last run on cp4013 is CRITICAL Puppet has 2 failures
[05:27:36] PROBLEM - puppet last run on cp3036 is CRITICAL Puppet has 2 failures
[05:27:37] PROBLEM - puppet last run on cp3049 is CRITICAL Puppet has 2 failures
[05:27:37] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 5 minutes ago with 0 failures
[05:27:37] PROBLEM - puppet last run on cp3005 is CRITICAL Puppet has 2 failures
[05:27:37] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures
[05:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:27:46] PROBLEM - puppet last run on cp4020 is CRITICAL Puppet has 2 failures
[05:27:46] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:28:06] PROBLEM - puppet last run on cp2005 is CRITICAL Puppet has 1 failures
[05:28:06] PROBLEM - puppet last run on cp4012 is CRITICAL Puppet has 2 failures
[05:28:06] PROBLEM - puppet last run on cp4011 is CRITICAL Puppet has 2 failures
[05:28:07] PROBLEM - puppet last run on cp3043 is CRITICAL Puppet has 2 failures
[05:28:07] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 2 failures
[05:28:07] PROBLEM - puppet last run on cp3004 is CRITICAL Puppet has 2 failures
[05:28:07] RECOVERY - puppet last run on cp3039 is OK Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:28:15] PROBLEM - puppet last run on cp2020 is CRITICAL Puppet has 1 failures
[05:28:15] PROBLEM - puppet last run on cp2024 is CRITICAL Puppet has 1 failures
[05:28:15] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures
[05:28:15] PROBLEM - puppet last run on cp2009 is CRITICAL Puppet has 1 failures
[05:28:16] PROBLEM - puppet last run on cp1057 is CRITICAL Puppet has 1 failures
[05:28:16] PROBLEM - puppet last run on cp3032 is CRITICAL Puppet has 2 failures
[05:28:17] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 2 failures
[05:28:17] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:28:26] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:28:26] PROBLEM - puppet last run on cp3044 is CRITICAL Puppet has 2 failures
[05:28:27] PROBLEM - puppet last run on cp3041 is CRITICAL Puppet has 2 failures
[05:28:27] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 2 failures
[05:28:27] PROBLEM - puppet last run on cp3017 is CRITICAL Puppet has 2 failures
[05:28:27] PROBLEM - puppet last run on cp3003 is CRITICAL Puppet has 2 failures
[05:28:27] PROBLEM - puppet last run on cp1070 is CRITICAL Puppet has 1 failures
[05:28:56] PROBLEM - puppet last run on cp2008 is CRITICAL Puppet has 1 failures
[05:28:56] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures
[05:28:56] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures
[05:29:06] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures
[05:29:06] PROBLEM - puppet last run on cp3038 is CRITICAL Puppet has 2 failures
[05:29:06] PROBLEM - puppet last run on cp3033 is CRITICAL Puppet has 2 failures
[05:29:06] PROBLEM - puppet last run on cp3015 is CRITICAL Puppet has 2 failures
[05:29:07] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[05:29:19] that the puppet check re-emits old failures when you flip the local disable switch makes it even more spammy :P
[05:29:37] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:29:37] RECOVERY - puppet last run on cp4013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:29:39] salting puppets all over the place now just to clean it all faster
[05:30:16] RECOVERY - puppet last run on cp2024 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[05:30:25] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[05:30:26] RECOVERY - puppet last run on cp1057 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:30:26] RECOVERY - puppet last run on cp3032 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[05:30:37] RECOVERY - puppet last run on cp3044 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[05:30:37] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:31:17] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[05:31:17] RECOVERY - puppet last run on cp3033 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[05:31:26] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[05:31:36] RECOVERY - puppet last run on cp2002 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[05:31:36] RECOVERY - puppet last run on cp2004 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[05:31:55] RECOVERY - puppet last run on cp4006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:31:55] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:31:56] RECOVERY - puppet last run on cp3005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:32:16] RECOVERY - puppet last run on cp2005 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[05:32:26] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures
[05:32:26] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:32:46] RECOVERY - puppet last run on cp3041 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[05:32:46] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:32:46] RECOVERY - puppet last run on cp1070 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[05:33:16] RECOVERY - puppet last run on cp2008 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[05:33:25] RECOVERY - puppet last run on cp3015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:33:56] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:33:56] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:34:06] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:34:25] RECOVERY - puppet last run on cp4011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:34:26] RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[05:34:26] RECOVERY - puppet last run on cp3043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
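"Salting puppets" above means forcing puppet runs through salt so the stale failure states clear faster than the normal agent interval would. A minimal sketch, assuming a salt master that can target the cache hosts; the glob, batch size and agent invocation are illustrative, not the actual command used:

```python
#!/usr/bin/env python3
"""Sketch of the "salting puppets" remark above: batch-run the puppet
agent across cache hosts via salt. Target glob and batch size are
assumptions."""
import subprocess

TARGET = 'cp*'   # illustrative glob for the cache hosts
BATCH = '10%'    # keep load sane; don't hit every host at once

subprocess.check_call([
    'salt', '-b', BATCH, TARGET,
    'cmd.run', 'puppet agent --test --color=false',
])
```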
[05:34:36] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:34:47] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:35:15] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:35:16] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:35:25] RECOVERY - puppet last run on cp3038 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[05:35:56] RECOVERY - puppet last run on cp3036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:36:26] RECOVERY - puppet last run on cp2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:36:26] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:45:20] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526649 (10awight) > That probably made sense in 2006, when the article that SO post is based on was...
[05:56:47] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526651 (10BBlack) Even if browsers allow >2K URLs, they seem like a poor idea in general. Even a 1...
[06:00:00] (03PS2) 10BBlack: tlsproxy: multi_accept off [puppet] - 10https://gerrit.wikimedia.org/r/230553
[06:02:55] (03PS3) 10BBlack: tlsproxy: multi_accept off [puppet] - 10https://gerrit.wikimedia.org/r/230553
[06:03:38] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: multi_accept off [puppet] - 10https://gerrit.wikimedia.org/r/230553 (owner: 10BBlack)
[06:12:20] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1526653 (10BBlack)
[06:21:00] (03CR) 10Matanya: [C: 031] OTRS: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230709 (owner: 10Dzahn)
[06:21:21] 6operations, 10MediaWiki-General-or-Unknown: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1526661 (10Joe)
[06:30:45] PROBLEM - puppet last run on mc2015 is CRITICAL Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on lvs1003 is CRITICAL Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on pybal-test2002 is CRITICAL puppet fail
[06:31:47] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2145 is CRITICAL Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on db1045 is CRITICAL Puppet has 1 failures
[06:33:06] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:33:25] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:33:26] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures
[06:33:55] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:33:56] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures
[06:37:40] puppet o'clock!
[06:53:44] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1526679 (10BBlack) Probably the most important question (since I haven't really looked at UrlShortener) is: are there subdomains involved, or just `...
[06:55:26] RECOVERY - puppet last run on lvs1003 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:55] RECOVERY - puppet last run on mw2145 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on db1045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mc2015 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:35] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:35] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:56] RECOVERY - puppet last run on pybal-test2002 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:06:19] (03PS1) 10BBlack: no varnish::instance uses "backends" directly anymore [puppet] - 10https://gerrit.wikimedia.org/r/230733
[07:10:27] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1526686 (10Arrbee) a:3KartikMistry
[07:15:34] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526691 (10Tgr) So why don't we just use POST? `sendBeacon` actually does that, we just abuse it cur...
[07:31:52] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1526718 (10BBlack) Probably because beacon is used with the analytics pipeline rather than the appse...
[07:51:22] (03PS1) 10ArielGlenn: add iridum to dumps rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/230734
[07:53:56] (03CR) 10ArielGlenn: [C: 032] add iridum to dumps rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/230734 (owner: 10ArielGlenn)
[08:30:37] 6operations, 10MediaWiki-General-or-Unknown: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1526743 (10fgiunchedi) a:5fgiunchedi>3Joe @joe has kindly agreed to investigate this, he's been already bouncing ideas with @t...
[08:31:16] (03CR) 10Jcrespo: [C: 031] Change test for log_type to a list [software] - 10https://gerrit.wikimedia.org/r/230645 (owner: 10coren)
[08:43:59] (03CR) 10Filippo Giunchedi: [C: 031] "thanks Bryan for the explanation!" [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis)
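Tgr's point on T91347 above is that moving the payload out of the query string into a POST body (which `sendBeacon` already supports) makes the ~1014-byte URL truncation irrelevant. A server-side sketch of the same idea; the endpoint URL and payload are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch of Tgr's suggestion above: POST the event instead of packing
it into the query string. The endpoint and schema are hypothetical."""
import json
import urllib.request

EVENT = {
    'schema': 'ExampleSchema',
    'event': {'field': 'x' * 2000},  # would overflow a 1014-byte URL
}

req = urllib.request.Request(
    'https://example.wikimedia.org/beacon',  # hypothetical endpoint
    data=json.dumps(EVENT).encode(),
    headers={'Content-Type': 'application/json'},
    method='POST',
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```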
[08:45:01] (03CR) 10Filippo Giunchedi: "also to clarify, since logstash pushes directly to statsd we can avoid per-host stats for now since the host where it comes from isn't int" [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis)
[08:47:53] (03PS2) 10ArielGlenn: Add Chris Steipp to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/230142 (https://phabricator.wikimedia.org/T108227) (owner: 10Andrew Bogott)
[08:49:13] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1526758 (10ArielGlenn) updated patchset. I miscounted days after manager approval so I guess it's tomorrow that this can go out.
[08:49:43] Anyone from ops around?
[08:50:37] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1526759 (10fgiunchedi) >>! In T103335#1524694, @brion wrote: > Sample command line for VP9->ogv co...
[08:51:45] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1526761 (10ArielGlenn) @BBlack, is this something I can hand to you?
[08:51:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] "2 linting comments inline, also move this to the role, not in the module. Other services might want to use the same module and configure f" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff)
[08:53:51] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1526764 (10ArielGlenn) @hashar can you clarify?
[08:56:50] (03PS1) 10Alexandros Kosiaris: ganeti: move role from manifests/ into the role module [puppet] - 10https://gerrit.wikimedia.org/r/230735
[08:56:56] (03CR) 10Alexandros Kosiaris: [C: 031] etherpad: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230686 (owner: 10Dzahn)
[08:57:37] (03CR) 10Alexandros Kosiaris: "Just noting btw, that thanks to mod_access_compat (enabled by default?), etherpad is already on jessie. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/230686 (owner: 10Dzahn)
[08:59:00] <_joe_> akosiaris: we should really get to the point where we disable mod_access_compat btw
[08:59:03] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me, there is though a syntax error (missing comma)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230601 (owner: 10Chad)
[08:59:08] <_joe_> I'm sure it's a perf penalty
[08:59:50] (03CR) 10Alexandros Kosiaris: [C: 032] access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 (owner: 10Matanya)
[08:59:54] (03PS3) 10Alexandros Kosiaris: access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 (owner: 10Matanya)
[09:00:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] access: stat1002 access for tgr [puppet] - 10https://gerrit.wikimedia.org/r/230510 (owner: 10Matanya)
[09:00:35] _joe_: yup, obviously
[09:00:42] I just never noticed it on etherpad
[09:00:52] also etherpad needs mpm_event
[09:01:02] needs is an overstatement
[09:01:11] but it would be nice to try it
[09:01:21] <_joe_> akosiaris: let's do it then
[09:01:32] <_joe_> why event and not worker, btw?
[09:02:23] I like event more ?
[09:02:33] no seriously both are obviously better than prefork
[09:02:58] but event is webscale!!!
[09:03:56] RECOVERY - puppet last run on ms-be2009 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[09:04:16] seriously, for etherpad there shouldn't be a major difference between the 2
[09:05:04] <_joe_> when you say "$x is webscale" it goes with more exclamation marks and at least one "1"
[09:05:09] but the way etherpad clients (the javascript) works event might actually be just slightly better off
[09:05:39] occasional requests and the like
[09:05:47] _joe_: oh yes, you are right
[09:05:54] but event is webscale!!!!!1111
[09:05:57] better ?
[09:05:59] <_joe_> yes
[09:09:15] !log reboot ms-be2009, cpu soft lockup
[09:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:16:40] (03PS1) 10ArielGlenn: dataset: add redirect for fundraising data and link on web page [puppet] - 10https://gerrit.wikimedia.org/r/230738
[09:18:47] 6operations, 10Wikimedia-Fundraising: Add /fundraising to dumps.wikimedia.org - https://phabricator.wikimedia.org/T42847#1526793 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/230738/ for fundraising. not clear what people want to happen with frdata.wm.o
[09:21:48] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: stat1002 access for tgr - https://phabricator.wikimedia.org/T108417#1526795 (10ArielGlenn) 5Open>3Resolved this was merged a little ahead of time but no matter. closing.
[09:22:26] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 482593 msg: ocg_render_job_queue 3423 msg (=3000 critical)
[09:22:55] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 483239 msg: ocg_render_job_queue 3646 msg (=3000 critical)
[09:24:05] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 485316 msg: ocg_render_job_queue 4731 msg (=3000 critical)
[09:33:30] (03PS1) 10ArielGlenn: dumps mirrors rsync conf: remove/update useless comments [puppet] - 10https://gerrit.wikimedia.org/r/230739
[09:34:08] 6operations, 10ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1526805 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi thanks @papaul! I'm assuming this is a new disk, anyways after clearing the raid array and mounting the fs the machine crashed, upon reboot t...
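The OTRS and etherpad "make compatible with Apache 2.4" patches, and _joe_'s wish to disable mod_access_compat, revolve around the 2.2-to-2.4 authorization syntax change: `Order allow,deny` plus `Allow from all` becomes `Require all granted`. A rough sketch of a one-shot rewrite over a config tree, covering only the two most common patterns; the path is illustrative:

```python
#!/usr/bin/env python3
"""Sketch: rewrite the common Apache 2.2 access directives to their 2.4
equivalents, the change behind the "compatible with Apache 2.4" patches
above. Only the two simplest patterns are handled; anything fancier
(Allow from <net>, Satisfy, etc.) needs human eyes."""
import pathlib
import re

REWRITES = [
    (re.compile(r'^\s*Order\s+allow,deny\s*\n\s*Allow\s+from\s+all\s*$',
                re.IGNORECASE | re.MULTILINE),
     '    Require all granted'),
    (re.compile(r'^\s*Order\s+deny,allow\s*\n\s*Deny\s+from\s+all\s*$',
                re.IGNORECASE | re.MULTILINE),
     '    Require all denied'),
]

for conf in pathlib.Path('/etc/apache2/sites-available').glob('*.conf'):
    text = conf.read_text()
    new = text
    for pattern, replacement in REWRITES:
        new = pattern.sub(replacement, new)
    if new != text:
        conf.write_text(new)
        print('rewrote', conf)
```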
[09:34:39] (03CR) 10ArielGlenn: [C: 032] dumps mirrors rsync conf: remove/update useless comments [puppet] - 10https://gerrit.wikimedia.org/r/230739 (owner: 10ArielGlenn)
[09:44:21] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1526823 (10fgiunchedi) upgrade plan, starting today: * upgrade row A machines, (restbase100[127]) with `sudo apt-get install cassandra` * check regressions, http://grafana.wikimedia.org/#/...
[09:45:26] 6operations, 10Citoid, 6Security, 6Security-Team, and 2 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632#1526828 (10mobrovac) >>! In T108632#1526127, @BBlack wrote: > If you think we can flip the switch now, I'm all for it. I gather from this that if we f...
[09:46:15] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1526829 (10fgiunchedi) ``` root@carbon:~# reprepro --noskipold --restrict cassandra update aptmethod 'http' seems to have a obsoleted redirect handling which causes reprepro to request fil...
[09:50:43] 6operations, 10Datasets-General-or-Unknown: Find docs on dataset mirrors - https://phabricator.wikimedia.org/T107510#1526843 (10ArielGlenn) 5Open>3Resolved that file is dead and at this point wouldn't have useful information in it. if there are complaints about the mirrors, I know that the administrator of...
[09:52:36] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1526847 (10ArielGlenn) where are we on this?
[09:56:22] PROBLEM - Host google is DOWN: /bin/ping6 -n -U -w 15 -c 5 google.com
[09:56:35] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 8.67 ms
[09:56:50] !log switched routing-system autonomous-system to eqiad's subAS on cr1-eqiad/cr2-eqiad
[09:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:06:57] (03PS2) 10Filippo Giunchedi: update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/230582 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans)
[10:06:58] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1526856 (10ArielGlenn) @greg so what's the decision; there's also https://phabricator.wikimedia.org/T75919 and https://phabricator.wikimedia.or...
[10:08:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, I've updated the code review not to remove the old version because that's racy with puppet in https://gerrit.wikimedia.org/r/#/c/230" [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/230582 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans)
[10:17:24] 6operations, 6Discovery, 10MediaWiki-Search, 7Monitoring: Search service monitoring should fail if search results only return exact matches and suggestions don't work - https://phabricator.wikimedia.org/T101914#1526896 (10ArielGlenn) it looks like they want to check a request of the form e.g. http://en.wik...
[10:24:13] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1526914 (10ArielGlenn) adding @jcrespo to this to...
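godog's upgrade plan above proceeds one row-A node at a time, which is exactly what the `!log upgrade cassandra on restbase100x` entries that follow show. A sketch of that rolling loop; the host list matches the plan's restbase100[127] shorthand, while the ssh invocations and the Up/Normal check are assumptions, not the actual procedure:

```python
#!/usr/bin/env python3
"""Sketch of the rolling Cassandra upgrade described above: upgrade one
node, wait for it to rejoin, move on. Commands and the "UN" (Up/Normal)
nodetool check are assumptions."""
import subprocess
import time

ROW_A = ['restbase1001', 'restbase1002', 'restbase1007']

def ssh(host, cmd):
    return subprocess.run(['ssh', host, cmd], capture_output=True,
                          text=True).stdout

for host in ROW_A:
    subprocess.run(['ssh', host, 'sudo apt-get install -y cassandra'],
                   check=True)
    ip = ssh(host, 'hostname -i').strip()
    # Wait until nodetool reports this node Up/Normal ("UN") again.
    while not any(line.startswith('UN') and ip in line
                  for line in ssh(host, 'nodetool status').splitlines()):
        time.sleep(30)
    print(host, 'upgraded and rejoined')
```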
[10:25:31] !log upgrade cassandra on restbase1001
[10:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:29:30] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1526919 (10ArielGlenn) who might be able to take o...
[10:30:37] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1526926 (10ArielGlenn) @yuvipanda: can you describe network expertise you need?
[10:31:25] !log upgrade cassandra on restbase1002
[10:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:34:17] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1526928 (10ArielGlenn) do we still want to do this?
[10:35:02] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1526931 (10jcrespo) @ArielGlenn I already talked t...
[10:35:56] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1526933 (10fgiunchedi) cosmetic issue output contains `%s` spotted while looking at the logs, benign ``` restbase1002:~$ grep %s /var/log/cassandra/system.log INFO [MemtableFlushWriter:...
[10:38:15] Forwarding stuff from #wikimedia-tech from the last 45 minutes:
[10:38:19] ContentTranslation-servers seems dead, anything known about that?
[10:38:21] Hello. Are there any issues? 208.80.152.0/22 seems to have dropped off the routing table.
[10:38:37] !log upgrade cassandra on restbase1007
[10:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:01] andre__: the latter issue is related to a recent network maint, should be fully recovered. no idea for the former
[10:44:44] (03PS2) 10ArielGlenn: remove now obselete snapshot hosts sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/230524
[10:45:21] !log general maintenance on db1042 (restart, upgrade, db reconstruction)
[10:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:45:49] (03CR) 10ArielGlenn: [C: 032] remove now obselete snapshot hosts sudoers file [puppet] - 10https://gerrit.wikimedia.org/r/230524 (owner: 10ArielGlenn)
[10:47:56] I'm about to perform a bigdelete at enwiki for a page with +10k revids
[10:48:00] so you're aware
[10:49:59] 7Puppet, 6operations: Clean up files/snapshot/sudoers.snapshot - https://phabricator.wikimedia.org/T107479#1526984 (10ArielGlenn) 5Open>3Resolved it's gone.
[10:52:41] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1526992 (10ArielGlenn) yeah I saw and already remo...
[11:09:22] godog, ah, thanks. will forward that!
[11:24:55] akosiaris, hi, do you know if repl is done?
[11:28:39] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1527035 (10mobrovac) >>! In T107900#1508656, @GWicke wrote: > ... and [for good reasons](https://wikitech.wikimedia.org/wiki/Incident_documentation/20140211-Parsoid). I think it would be preferable to do the same for ot...
[11:46:25] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 554308 msg: ocg_render_job_queue 479 msg
[11:46:50] akosiaris: can you look at apertium-apy service?
[11:47:25] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 554359 msg: ocg_render_job_queue 0 msg
[11:47:47] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 554393 msg: ocg_render_job_queue 0 msg
[11:49:05] kart_: not sure what you mean. look at what ?
[11:49:18] yurik: no it's not done yet
[11:49:31] akosiaris: anything with it? (ie sca has some issues?)
[11:50:33] kart_: not that I know of
[11:51:20] akosiaris: okay
[11:51:50] akosiaris: It will be great if I can have access to /var/log/apertium till we move to service-runner :)
[11:52:36] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 2 others: Apertium leaves a ton of stale processes, consumes all the available - https://phabricator.wikimedia.org/T107270#1527098 (10Unhammer) I've now implemented the above mentioned option -...
[11:53:16] kart_: wanna file a task ?
[11:58:37] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1527104 (10mark) Looking at the current quotes in the spreadsheet, I think it seems best to move forward with quote 712030866 for 3 instances. We could order with an extra drive carrier (...
[12:03:52] akosiaris: sure.
[12:05:53] 6operations: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1527114 (10KartikMistry) 3NEW a:3akosiaris
[12:06:02] akosiaris: ^
[12:12:02] (03PS2) 10BBlack: no varnish::instance uses "backends" directly anymore [puppet] - 10https://gerrit.wikimedia.org/r/230733
[12:12:57] (03CR) 10BBlack: [C: 032] no varnish::instance uses "backends" directly anymore [puppet] - 10https://gerrit.wikimedia.org/r/230733 (owner: 10BBlack)
[12:18:46] (03PS1) 10Faidon Liambotis: Allocate neighbor block for cr2-ulsfo<->cr1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230764
[12:19:18] (03CR) 10Faidon Liambotis: [C: 032] Allocate neighbor block for cr2-ulsfo<->cr1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230764 (owner: 10Faidon Liambotis)
[12:19:38] (03PS5) 10Faidon Liambotis: Repurpose s/cr2-eqiad/cr1-eqord/ for link with codfw [dns] - 10https://gerrit.wikimedia.org/r/220811
[12:19:42] (03CR) 10Faidon Liambotis: [C: 032] Repurpose s/cr2-eqiad/cr1-eqord/ for link with codfw [dns] - 10https://gerrit.wikimedia.org/r/220811 (owner: 10Faidon Liambotis)
[12:30:51] (03PS8) 10Alexandros Kosiaris: Added tilerator service, granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik)
[12:34:08] (03CR) 10Yurik: Added tilerator service, granted kartotherian OSM DB read access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik)
[12:35:39] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580#1527177 (10BBlack)
[12:43:54] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1527183 (10fgiunchedi) >>! In T95253#1524978, @GWicke wrote: > We talked about this at the last hardware planning meeting. The consensus was to keep things simple for...
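T107270 ("Apertium leaves a ton of stale processes") above is the classic orphaned-children problem: a request times out but the pipeline of helper processes it spawned lives on. The generic fix, of the sort Unhammer's truncated comment alludes to, is to run each pipeline in its own process group and kill the group as a unit; a sketch, with the command and timeout purely illustrative of the technique, not apertium-apy's actual code:

```python
#!/usr/bin/env python3
"""Sketch: run a subprocess pipeline in its own process group and kill
the whole group on timeout, so no stale workers linger (cf. T107270)."""
import os
import signal
import subprocess

def run_with_timeout(cmd, timeout=10):
    # start_new_session=True puts the child (and its own children) in a
    # new process group that we can signal as a unit.
    proc = subprocess.Popen(cmd, start_new_session=True,
                            stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)  # reap the group
        proc.wait()
        raise

print(run_with_timeout(['echo', 'translated text']))
```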
[12:59:19] 6operations: codfw misc cluster ganglia not working - https://phabricator.wikimedia.org/T108680#1527201 (10BBlack) 3NEW
[13:04:13] (03CR) 10coren: [C: 032] Change test for log_type to a list [software] - 10https://gerrit.wikimedia.org/r/230645 (owner: 10coren)
[13:04:25] (03CR) 10coren: [V: 032] Change test for log_type to a list [software] - 10https://gerrit.wikimedia.org/r/230645 (owner: 10coren)
[13:09:41] (03PS1) 10ArielGlenn: dumps: get rid of one more eval.php call, correct usage message [puppet] - 10https://gerrit.wikimedia.org/r/230767
[13:10:44] (03CR) 10ArielGlenn: [C: 032] dumps: get rid of one more eval.php call, correct usage message [puppet] - 10https://gerrit.wikimedia.org/r/230767 (owner: 10ArielGlenn)
[13:11:33] 6operations, 10Traffic, 7HTTPS: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#1527216 (10DaBPunkt) >>! In T108579#1524034, @BBlack wrote: > @dabpunkt can you provide details on the client software (browser version, OS version, etc?) and any local softw...
[13:20:13] (03PS1) 10Alexandros Kosiaris: ganeti: assign cluster variable [puppet] - 10https://gerrit.wikimedia.org/r/230768
[13:21:56] (03CR) 10Alexandros Kosiaris: Added tilerator service, granted kartotherian OSM DB read access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) (owner: 10Yurik)
[13:33:05] 6operations, 10Traffic, 7HTTPS: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#1527225 (10BBlack) Based on the actual error message, I don't think the issue is coming from our servers in any case. There are various FF bug reports linked to this error t...
[13:42:02] (03PS2) 10Alexandros Kosiaris: ganeti: move role from manifests/ into the role module [puppet] - 10https://gerrit.wikimedia.org/r/230735
[13:42:08] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: move role from manifests/ into the role module [puppet] - 10https://gerrit.wikimedia.org/r/230735 (owner: 10Alexandros Kosiaris)
[13:42:13] (03CR) 10Alexandros Kosiaris: [V: 032] ganeti: move role from manifests/ into the role module [puppet] - 10https://gerrit.wikimedia.org/r/230735 (owner: 10Alexandros Kosiaris)
[13:43:45] (03PS2) 10Alexandros Kosiaris: ganeti: assign cluster variable [puppet] - 10https://gerrit.wikimedia.org/r/230768
[13:43:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ganeti: assign cluster variable [puppet] - 10https://gerrit.wikimedia.org/r/230768 (owner: 10Alexandros Kosiaris)
[13:45:29] (03PS1) 10Alexandros Kosiaris: Revert "ganeti: move role from manifests/ into the role module" [puppet] - 10https://gerrit.wikimedia.org/r/230770
[13:46:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "ganeti: move role from manifests/ into the role module" [puppet] - 10https://gerrit.wikimedia.org/r/230770 (owner: 10Alexandros Kosiaris)
[13:46:52] (03PS1) 10BBlack: define puppet ganglia stuff for caches @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/230771
[13:47:36] PROBLEM - puppet last run on fluorine is CRITICAL puppet fail
[13:47:36] PROBLEM - puppet last run on cp2023 is CRITICAL puppet fail
[13:47:45] PROBLEM - puppet last run on wtp2020 is CRITICAL puppet fail
[13:47:45] PROBLEM - puppet last run on mw2188 is CRITICAL puppet fail
[13:47:45] PROBLEM - puppet last run on ganeti1002 is CRITICAL puppet fail
[13:47:46] PROBLEM - puppet last run on ms-be1006 is CRITICAL puppet fail
[13:47:46] PROBLEM - puppet last run on mw2114 is CRITICAL puppet fail
[13:47:46] PROBLEM - puppet last run on copper is CRITICAL puppet fail
[13:47:55] PROBLEM - puppet last run on mc2016 is CRITICAL puppet fail
[13:47:56] PROBLEM - puppet last run on mw1010 is CRITICAL puppet fail
[13:47:56] PROBLEM - puppet last run on mw2075 is CRITICAL puppet fail
[13:47:56] PROBLEM - puppet last run on ms-fe2004 is CRITICAL puppet fail
[13:47:56] PROBLEM - puppet last run on ms-be2004 is CRITICAL puppet fail
[13:47:57] PROBLEM - puppet last run on mw1066 is CRITICAL puppet fail
[13:48:05] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::salt::minions for cp2023.codfw.wmnet on node cp2023.codfw.wmnet
[13:48:06] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[13:48:06] PROBLEM - puppet last run on lvs1002 is CRITICAL puppet fail
[13:48:06] PROBLEM - puppet last run on lvs1005 is CRITICAL puppet fail
[13:48:16] PROBLEM - puppet last run on calcium is CRITICAL puppet fail
[13:48:17] PROBLEM - puppet last run on mw2105 is CRITICAL puppet fail
[13:48:17] PROBLEM - puppet last run on achernar is CRITICAL puppet fail
[13:48:17] PROBLEM - puppet last run on analytics1003 is CRITICAL puppet fail
[13:48:17] PROBLEM - puppet last run on pybal-test2001 is CRITICAL puppet fail
[13:48:25] PROBLEM - puppet last run on helium is CRITICAL puppet fail
[13:48:26] PROBLEM - puppet last run on wtp2001 is CRITICAL puppet fail
[13:48:26] PROBLEM - puppet last run on db2039 is CRITICAL puppet fail
[13:48:26] PROBLEM - puppet last run on mw2109 is CRITICAL puppet fail
[13:48:26] PROBLEM - puppet last run on bast1001 is CRITICAL puppet fail
[13:48:26] PROBLEM - puppet last run on mw2015 is CRITICAL puppet fail
[13:48:26] PROBLEM - puppet last run on mw2004 is CRITICAL puppet fail
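The failure storm above starts minutes after the ganeti role move is merged (and the move was promptly reverted), with the catalog error pointing at a class the autoloader can no longer find. A sketch of a pre-merge sanity check for that failure mode; the include regex and the layout assumptions are simplifications of the real repo:

```python
#!/usr/bin/env python3
"""Sketch: check that every role::a::b class mentioned in site.pp maps
to a file the puppet autoloader can find, the failure mode behind
"Could not find class role::salt::minions" above. Layout assumptions
are simplifications of the real repository."""
import pathlib
import re

site = pathlib.Path('manifests/site.pp').read_text()
for cls in sorted(set(re.findall(r'\brole::([a-z0-9_]+(?:::[a-z0-9_]+)*)',
                                 site))):
    # Autoloader layout: class role::a::b lives in
    # modules/role/manifests/a/b.pp. Legacy manifests/role/*.pp files
    # don't follow this mapping, which is how moving one role file can
    # strand unrelated classes.
    rel = cls.replace('::', '/')
    if not pathlib.Path('modules/role/manifests/%s.pp' % rel).exists():
        print('not autoloadable: role::%s' % cls)
```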
on mw2004 is CRITICAL puppet fail [13:48:27] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail [13:48:27] PROBLEM - puppet last run on ganeti2004 is CRITICAL puppet fail [13:48:28] PROBLEM - puppet last run on ganeti2001 is CRITICAL puppet fail [13:48:36] PROBLEM - puppet last run on cp3040 is CRITICAL puppet fail [13:48:37] PROBLEM - puppet last run on db1050 is CRITICAL puppet fail [13:48:37] PROBLEM - puppet last run on mw1046 is CRITICAL puppet fail [13:48:45] PROBLEM - puppet last run on cp4002 is CRITICAL puppet fail [13:48:45] PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail [13:48:45] PROBLEM - puppet last run on mw1069 is CRITICAL puppet fail [13:48:46] PROBLEM - puppet last run on db1031 is CRITICAL puppet fail [13:48:46] PROBLEM - puppet last run on db1033 is CRITICAL puppet fail [13:48:46] PROBLEM - puppet last run on iodine is CRITICAL puppet fail [13:48:46] PROBLEM - puppet last run on mw1173 is CRITICAL puppet fail [13:48:47] PROBLEM - puppet last run on elastic1004 is CRITICAL puppet fail [13:48:47] PROBLEM - puppet last run on cp3047 is CRITICAL puppet fail [13:48:55] PROBLEM - puppet last run on einsteinium is CRITICAL puppet fail [13:48:56] PROBLEM - puppet last run on mw1091 is CRITICAL puppet fail [13:48:56] PROBLEM - puppet last run on mw1153 is CRITICAL puppet fail [13:49:05] PROBLEM - puppet last run on lvs2004 is CRITICAL puppet fail [13:49:05] PROBLEM - puppet last run on cp2011 is CRITICAL puppet fail [13:49:05] PROBLEM - puppet last run on mw1068 is CRITICAL puppet fail [13:49:06] PROBLEM - puppet last run on mw1241 is CRITICAL puppet fail [13:49:06] PROBLEM - puppet last run on db2059 is CRITICAL puppet fail [13:49:06] PROBLEM - puppet last run on db1040 is CRITICAL puppet fail [13:49:06] PROBLEM - puppet last run on mw2087 is CRITICAL puppet fail [13:49:07] PROBLEM - puppet last run on mw1027 is CRITICAL puppet fail [13:49:07] PROBLEM - puppet last run on mw1235 is CRITICAL puppet fail [13:49:08] PROBLEM - puppet last run on mw1021 is CRITICAL puppet fail [13:49:08] PROBLEM - puppet last run on mw1205 is CRITICAL puppet fail [13:49:15] PROBLEM - puppet last run on mw2123 is CRITICAL puppet fail [13:49:15] PROBLEM - puppet last run on mw2117 is CRITICAL puppet fail [13:49:15] PROBLEM - puppet last run on mw1143 is CRITICAL puppet fail [13:49:16] PROBLEM - puppet last run on mw1150 is CRITICAL puppet fail [13:49:16] PROBLEM - puppet last run on mw2212 is CRITICAL puppet fail [13:49:16] PROBLEM - puppet last run on db2045 is CRITICAL puppet fail [13:49:16] PROBLEM - puppet last run on elastic1018 is CRITICAL puppet fail [13:49:17] PROBLEM - puppet last run on mw2113 is CRITICAL puppet fail [13:49:17] PROBLEM - puppet last run on mw1025 is CRITICAL puppet fail [13:49:18] PROBLEM - puppet last run on lvs1004 is CRITICAL puppet fail [13:49:18] PROBLEM - puppet last run on ms-be1015 is CRITICAL puppet fail [13:49:19] PROBLEM - puppet last run on mw2019 is CRITICAL puppet fail [13:49:19] PROBLEM - puppet last run on db1022 is CRITICAL puppet fail [13:49:25] PROBLEM - puppet last run on db1066 is CRITICAL puppet fail [13:49:26] PROBLEM - puppet last run on lvs3003 is CRITICAL puppet fail [13:49:26] PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail [13:49:26] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail [13:49:26] PROBLEM - puppet last run on labcontrol1001 is CRITICAL puppet fail [13:49:36] PROBLEM - puppet last run on mw1189 is CRITICAL puppet fail [13:49:36] PROBLEM - puppet last run on analytics1035 is CRITICAL puppet 
fail [13:49:46] PROBLEM - puppet last run on mw2134 is CRITICAL puppet fail [13:49:46] PROBLEM - puppet last run on mw2163 is CRITICAL puppet fail [13:49:46] PROBLEM - puppet last run on mw2176 is CRITICAL puppet fail [13:49:46] PROBLEM - puppet last run on mw2083 is CRITICAL puppet fail [13:49:46] PROBLEM - puppet last run on mw2079 is CRITICAL puppet fail [13:49:46] PROBLEM - puppet last run on mw1092 is CRITICAL puppet fail [13:49:47] PROBLEM - puppet last run on ms-fe1002 is CRITICAL puppet fail [13:49:55] PROBLEM - puppet last run on mw1003 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on db2047 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on mw2070 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on mw1166 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on mw1213 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on mw1107 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on elastic1027 is CRITICAL puppet fail [13:49:57] PROBLEM - puppet last run on mw2184 is CRITICAL puppet fail [13:49:58] PROBLEM - puppet last run on wtp2016 is CRITICAL puppet fail [13:49:58] PROBLEM - puppet last run on es2010 is CRITICAL puppet fail [13:49:59] PROBLEM - puppet last run on mw2039 is CRITICAL puppet fail [13:49:59] PROBLEM - puppet last run on cp2026 is CRITICAL puppet fail [13:50:00] PROBLEM - puppet last run on db2054 is CRITICAL puppet fail [13:50:05] PROBLEM - puppet last run on ms-fe1001 is CRITICAL puppet fail [13:50:06] PROBLEM - puppet last run on elastic1008 is CRITICAL puppet fail [13:50:06] PROBLEM - puppet last run on analytics1030 is CRITICAL puppet fail [13:50:07] PROBLEM - puppet last run on tmh1001 is CRITICAL puppet fail [13:50:18] PROBLEM - puppet last run on cp2016 is CRITICAL puppet fail [13:50:18] PROBLEM - puppet last run on mw2127 is CRITICAL puppet fail [13:50:18] PROBLEM - puppet last run on mw1204 is CRITICAL puppet fail [13:50:19] PROBLEM - puppet last run on mw1118 is CRITICAL puppet fail [13:50:19] PROBLEM - puppet last run on db1034 is CRITICAL puppet fail [13:50:19] PROBLEM - puppet last run on mw1155 is CRITICAL puppet fail [13:50:19] PROBLEM - puppet last run on db1002 is CRITICAL puppet fail [13:50:20] PROBLEM - puppet last run on db1021 is CRITICAL puppet fail [13:50:21] PROBLEM - puppet last run on mw2082 is CRITICAL puppet fail [13:50:21] PROBLEM - puppet last run on db2065 is CRITICAL puppet fail [13:50:25] PROBLEM - puppet last run on cp2014 is CRITICAL puppet fail [13:50:25] PROBLEM - puppet last run on stat1002 is CRITICAL Puppet last ran 6 hours ago [13:50:26] PROBLEM - puppet last run on mw2196 is CRITICAL puppet fail [13:50:26] PROBLEM - puppet last run on mw2143 is CRITICAL puppet fail [13:50:26] PROBLEM - puppet last run on ms-fe2003 is CRITICAL puppet fail [13:50:26] PROBLEM - puppet last run on ruthenium is CRITICAL puppet fail [13:50:26] PROBLEM - puppet last run on mw2084 is CRITICAL puppet fail [13:50:27] PROBLEM - puppet last run on mw2131 is CRITICAL puppet fail [13:50:27] PROBLEM - puppet last run on mw2093 is CRITICAL puppet fail [13:50:45] PROBLEM - puppet last run on db1027 is CRITICAL puppet fail [13:50:46] PROBLEM - puppet last run on mw1137 is CRITICAL puppet fail [13:50:46] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [13:50:46] PROBLEM - puppet last run on mw1154 is CRITICAL puppet fail [13:50:46] PROBLEM - puppet last run on mw1131 is CRITICAL puppet fail [13:50:46] PROBLEM - 
puppet last run on etherpad1001 is CRITICAL puppet fail [13:50:46] PROBLEM - puppet last run on mw1054 is CRITICAL puppet fail [13:50:56] PROBLEM - puppet last run on mc1012 is CRITICAL puppet fail [13:51:05] PROBLEM - puppet last run on mw1253 is CRITICAL puppet fail [13:51:06] PROBLEM - puppet last run on cp2010 is CRITICAL puppet fail [13:51:06] PROBLEM - puppet last run on mw2142 is CRITICAL puppet fail [13:51:07] PROBLEM - puppet last run on analytics1031 is CRITICAL puppet fail [13:51:15] PROBLEM - puppet last run on labsdb1005 is CRITICAL puppet fail [13:51:15] PROBLEM - puppet last run on mw2182 is CRITICAL puppet fail [13:51:15] PROBLEM - puppet last run on mw2110 is CRITICAL puppet fail [13:51:15] PROBLEM - puppet last run on es1004 is CRITICAL puppet fail [13:51:16] PROBLEM - puppet last run on dbproxy1001 is CRITICAL puppet fail [13:51:16] PROBLEM - puppet last run on mw1179 is CRITICAL puppet fail [13:51:16] PROBLEM - puppet last run on db1051 is CRITICAL puppet fail [13:51:17] PROBLEM - puppet last run on mw1047 is CRITICAL puppet fail [13:51:17] PROBLEM - puppet last run on mw1129 is CRITICAL puppet fail [13:51:18] PROBLEM - puppet last run on logstash1002 is CRITICAL puppet fail [13:51:18] PROBLEM - puppet last run on mw1211 is CRITICAL puppet fail [13:51:19] PROBLEM - puppet last run on mw2055 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on wtp2010 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on mw2090 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on mw2096 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on mw2130 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on mw1194 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on mw2049 is CRITICAL puppet fail [13:51:26] PROBLEM - puppet last run on cp1071 is CRITICAL puppet fail [13:51:27] PROBLEM - puppet last run on mw1020 is CRITICAL puppet fail [13:51:35] PROBLEM - puppet last run on mw1075 is CRITICAL puppet fail [13:51:46] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:51:46] PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail [13:51:46] PROBLEM - puppet last run on mw1011 is CRITICAL puppet fail [13:51:47] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [13:51:47] PROBLEM - puppet last run on mw1128 is CRITICAL puppet fail [13:51:55] PROBLEM - puppet last run on cp1066 is CRITICAL puppet fail [13:51:55] PROBLEM - puppet last run on db1005 is CRITICAL puppet fail [13:51:56] PROBLEM - puppet last run on db1068 is CRITICAL puppet fail [13:51:57] PROBLEM - puppet last run on mw2056 is CRITICAL puppet fail [13:51:57] PROBLEM - puppet last run on mw2168 is CRITICAL puppet fail [13:52:06] PROBLEM - puppet last run on mw2092 is CRITICAL puppet fail [13:52:06] PROBLEM - puppet last run on mw2047 is CRITICAL puppet fail [13:52:07] PROBLEM - puppet last run on cp4015 is CRITICAL puppet fail [13:52:17] PROBLEM - puppet last run on mw1208 is CRITICAL puppet fail [13:52:22] bblack, so commit related? [13:52:26] PROBLEM - puppet last run on polonium is CRITICAL puppet fail [13:52:35] PROBLEM - puppet last run on db2007 is CRITICAL puppet fail [13:52:36] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail [13:52:55] PROBLEM - puppet last run on db1016 is CRITICAL puppet fail [13:53:28] jynus: akosiaris reverted the offending commit (although I still can't understand how it caused this). I've re-run one node manually and it was fixed. 
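For context, the change reverted above is a pure file move; under Puppet's module autoloader the relocated file should resolve to exactly the same class name. A minimal sketch of the before/after layout (the class body is illustrative, not the actual ganeti role):

```
# Before: manifests/role/ganeti.pp, only visible because site.pp imports
# manifests/role/*.pp when the master parses its entry point.
# After: modules/role/manifests/ganeti.pp, where the autoloader resolves
# the name role::ganeti to role/manifests/ganeti.pp on demand.
class role::ganeti {
    include ::ganeti   # body unchanged by the move; contents illustrative
}
```

On paper the two locations are interchangeable, which is why the cluster-wide compile failures above are so surprising.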
[13:53:45] it's just going to take a while for them all to re-run and recover I think [13:54:02] and I have no idea why that commit failed [13:54:06] perfect, I do not care about the error, but do not want to duplicate work [13:54:41] or that it was global [13:54:45] weird... [13:55:00] yeah it's somehow interrelated with hieradata's lookup of the $cluster variable [13:55:03] I think [13:55:13] I hope not [13:55:18] cause if it is ... [13:55:26] that still doesn't completely explain it, but I think it's related [13:55:33] https://gerrit.wikimedia.org/r/#/c/220085/ [13:55:40] this also exhibits the same exact problem [13:55:46] which is why I haven't merged it [13:55:58] probably the role/manifests/etherpad.pp ? [13:56:13] cause the others already there are under various directories [13:56:28] but where should I put the file then ? and how should I name it ? [13:56:35] akosiaris, indeed it is strange [13:56:50] I think that's just a catalyst, not the real problem [13:57:14] possibly the custom role.rb thing is interrelated with this somehow too [13:57:25] and my poor catalog compiler is broken again so I can't test [13:57:32] bblack: hmm could be [13:58:36] but also, ganeti is one of the few that set "cluster:" in hieradata per-DC instead of in common/ [13:58:51] by that I mean: [13:58:51] hieradata/role/common/wdqs.yaml:cluster: wdqs [13:58:51] hieradata/role/eqiad/ganeti.yaml:cluster: ganeti [13:59:00] that could be some factor in this somehow too [14:00:16] bblack: no, that showed up before I merged the cluster change [14:00:20] it's unrelated [14:02:20] (03PS1) 10Alexandros Kosiaris: Revert "Revert "ganeti: move role from manifests/ into the role module"" [puppet] - 10https://gerrit.wikimedia.org/r/230773 [14:07:58] is it possible it's temporary and would've fixed itself later regardless? as in, some kind of FS sync issue where a catalog compilation sees both files missing or both files existing temporarily when the role moves paths, and that confuses/kills puppet role lookup for a short window? [14:08:15] similar to what we see when we move a template or file path, race condition for a bit and then it cleans up [14:10:56] 6operations, 10RESTBase, 10RESTBase-Cassandra: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1527327 (10mobrovac) >>! In T95253#1524978, @GWicke wrote: > The exact startup solution remains tbd, but one option might be to leverage http://0pointer.de/blog/projec...
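If the race hypothesized above is real, the usual mitigation is to make the working-copy update atomic: stage the new tree out of band, then swap it into place with a single rename(), so a catalog compile in flight sees either the old tree or the new one and never a half-moved role file. A sketch of the idea, assuming the live checkout is reached through a symlink (paths are illustrative, not the actual puppetmaster layout):

```
# Bring a staging copy of the repo up to date, off to the side:
git -C /srv/puppet.stage pull --ff-only
# Build a new symlink pointing at the staged tree...
ln -sfn /srv/puppet.stage /srv/puppet.new
# ...and rename() it over the live one; rename is atomic, so readers see
# the old target or the new target, nothing in between:
mv -Tf /srv/puppet.new /srv/puppet
```

This only helps if every reader resolves paths through the swapped link, and it would not address the separate point raised below about imports being cached inside the running master process.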
[14:13:02] RECOVERY - puppet last run on db1031 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:13:42] RECOVERY - puppet last run on wtp2020 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:13:52] RECOVERY - puppet last run on calcium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:03] RECOVERY - puppet last run on achernar is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:14:03] (03PS12) 10Giuseppe Lavagetto: puppet-compiler: first commit [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) [14:14:11] RECOVERY - puppet last run on iodine is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:11] RECOVERY - puppet last run on ganeti2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:12] RECOVERY - puppet last run on db1033 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:14:12] RECOVERY - puppet last run on elastic1004 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:14:22] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:14:22] RECOVERY - puppet last run on helium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:31] RECOVERY - puppet last run on db2039 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:14:32] RECOVERY - puppet last run on mw2015 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:14:32] RECOVERY - puppet last run on db1050 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:14:32] RECOVERY - puppet last run on db2059 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:14:42] RECOVERY - puppet last run on ms-be1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:51] RECOVERY - puppet last run on copper is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:14:52] RECOVERY - puppet last run on mw1091 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:15:02] RECOVERY - puppet last run on mc2016 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:15:02] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:15:02] RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:15:02] RECOVERY - puppet last run on ms-fe2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:03] RECOVERY - puppet last run on ganeti1002 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:15:03] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [14:15:03] RECOVERY - puppet last run on mw2075 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [14:15:12] RECOVERY - puppet last run on mw1066 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:15:12] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:15:22] RECOVERY - puppet last run on db1040 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:15:30] <_joe_> what happened? 
[14:15:31] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:15:31] RECOVERY - puppet last run on elastic1018 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:15:31] RECOVERY - puppet last run on elastic1008 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:15:31] RECOVERY - puppet last run on cp3047 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:15:32] RECOVERY - puppet last run on mw1173 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:15:33] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:15:33] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:15:39] <_joe_> I was not looking at the chat, sorry [14:15:41] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:42] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:42] RECOVERY - puppet last run on lvs1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:51] RECOVERY - puppet last run on db2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:51] RECOVERY - puppet last run on cp4002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:51] RECOVERY - puppet last run on lvs1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:52] RECOVERY - puppet last run on einsteinium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:52] RECOVERY - puppet last run on mw1189 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:53] RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:15:53] RECOVERY - puppet last run on mw1253 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:16:01] RECOVERY - puppet last run on mw2004 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [14:16:02] RECOVERY - puppet last run on db1002 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:16:02] RECOVERY - puppet last run on mw1155 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:16:02] RECOVERY - puppet last run on analytics1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:03] RECOVERY - puppet last run on ganeti2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:11] RECOVERY - puppet last run on mw2082 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:16:12] RECOVERY - puppet last run on mw2105 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:12] RECOVERY - puppet last run on mw2019 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:16:12] RECOVERY - puppet last run on mw1027 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:16:17] _joe_: strange puppet bug with a commit, reverted [14:16:21] RECOVERY - puppet last run on db1022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:21] RECOVERY - puppet last run on mw2087 is OK Puppet is currently enabled, last run 50 
seconds ago with 0 failures [14:16:21] RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:21] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:16:21] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:16:21] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:16:22] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:22] RECOVERY - puppet last run on mw1153 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:23] RECOVERY - puppet last run on mw2117 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:16:31] RECOVERY - puppet last run on cp2011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:31] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:16:31] RECOVERY - puppet last run on db1066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:31] RECOVERY - puppet last run on bast1001 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:16:31] RECOVERY - puppet last run on wtp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:32] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:16:32] RECOVERY - puppet last run on mw2109 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:33] RECOVERY - puppet last run on mw1241 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:16:33] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:34] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [14:16:34] RECOVERY - puppet last run on es1004 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:16:36] <_joe_> which commit specifically? 
[14:16:39] <_joe_> I hate icinga [14:16:41] RECOVERY - puppet last run on labsdb1005 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:16:41] RECOVERY - puppet last run on mw1150 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:16:42] RECOVERY - puppet last run on ms-fe1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:52] RECOVERY - puppet last run on ms-be1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:52] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [14:16:52] RECOVERY - puppet last run on db1068 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [14:16:52] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:16:53] RECOVERY - puppet last run on mw1069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:53] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:16:53] RECOVERY - puppet last run on mw1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:54] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:17:01] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:17:01] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:02] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:02] RECOVERY - puppet last run on mw1010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:03] RECOVERY - puppet last run on etherpad1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:17:03] RECOVERY - puppet last run on db2054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:04] RECOVERY - puppet last run on es2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:04] RECOVERY - puppet last run on lvs1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:11] RECOVERY - puppet last run on ms-be2004 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:17:11] RECOVERY - puppet last run on wtp2016 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:17:11] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:17:11] RECOVERY - puppet last run on mw1068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:12] RECOVERY - puppet last run on mw1143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:12] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:17:18] _joe_: https://gerrit.wikimedia.org/r/230735 , which led to this on all nodes: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::salt::minions for cp2023.codfw.wmnet on node cp2023.codfw.wmnet [14:17:21] RECOVERY - puppet last run on mw2083 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:21] RECOVERY - puppet last run on mw2176 is OK Puppet is
currently enabled, last run 58 seconds ago with 0 failures [14:17:22] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:31] RECOVERY - puppet last run on db1051 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:32] RECOVERY - puppet last run on wtp1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:32] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:32] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:32] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:42] RECOVERY - puppet last run on mw1104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:42] RECOVERY - puppet last run on mw1154 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:42] RECOVERY - puppet last run on tmh1001 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [14:17:48] the commit in question moved an unrelated role class from manifests/role/foo to modules/role/manifests/ [14:17:52] RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:17:52] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:52] RECOVERY - puppet last run on logstash1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:52] RECOVERY - puppet last run on mw1021 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:17:52] RECOVERY - puppet last run on db1005 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:17:53] RECOVERY - puppet last run on mw1131 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:01] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:18:01] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:02] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:18:02] RECOVERY - puppet last run on mw2163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:02] RECOVERY - puppet last run on mw2049 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:18:02] RECOVERY - puppet last run on mw1204 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:18:02] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:11] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:11] RECOVERY - puppet last run on dbproxy1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:11] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:12] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:12] RECOVERY - puppet last run on labvirt1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:12] RECOVERY - puppet last run on cp2003 is OK Puppet is currently 
enabled, last run 1 minute ago with 0 failures [14:18:12] RECOVERY - puppet last run on polonium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:13] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:21] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [14:18:21] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:21] RECOVERY - puppet last run on mw2142 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:18:21] RECOVERY - puppet last run on mw1054 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:22] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [14:18:22] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [14:18:22] RECOVERY - puppet last run on analytics1031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:23] RECOVERY - puppet last run on db2065 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:31] RECOVERY - puppet last run on mw1137 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:32] RECOVERY - puppet last run on wtp2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:32] RECOVERY - puppet last run on pybal-test2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:32] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:18:32] RECOVERY - puppet last run on mw2196 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:32] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:32] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:18:33] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:41] RECOVERY - puppet last run on mw1020 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:18:41] RECOVERY - puppet last run on mw2084 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:41] RECOVERY - puppet last run on mw2131 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:18:42] RECOVERY - puppet last run on db2007 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:42] RECOVERY - puppet last run on mw1194 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:18:42] RECOVERY - puppet last run on mw2067 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:42] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:18:43] RECOVERY - puppet last run on mw1211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:43] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:18:51] RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:18:51] RECOVERY - puppet last run on mw1011 is OK Puppet is currently 
enabled, last run 39 seconds ago with 0 failures [14:18:52] RECOVERY - puppet last run on db1027 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:01] RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:01] RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:11] RECOVERY - puppet last run on mw2070 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:19:12] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:12] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:12] RECOVERY - puppet last run on mw2168 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:19:12] RECOVERY - puppet last run on mw2110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:12] RECOVERY - puppet last run on db1016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:12] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:19:13] RECOVERY - puppet last run on mw2039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:13] RECOVERY - puppet last run on mw2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:21] RECOVERY - puppet last run on mw1179 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:21] RECOVERY - puppet last run on mw2090 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:22] RECOVERY - puppet last run on cp4015 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:19:32] RECOVERY - puppet last run on mw1128 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:32] RECOVERY - puppet last run on mw2130 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:20:02] RECOVERY - puppet last run on mw1208 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:11] <_joe_> bblack: it was the same on all nodes? [14:20:19] <_joe_> because I think that might be a race condition [14:20:32] <_joe_> you have "import(roles/*.pp)" [14:20:39] <_joe_> in site.pp [14:21:07] yeah that was my last thought too [14:21:14] <_joe_> meh [14:21:16] 14:07 < bblack> is it possible it's temporary and would've fixed itself later regardless? as in, some kind of FS sync issue where a catalog compilation sees both files missing or both files existing temporarily when the role moves paths, and that confuses/kills puppet role lookup for a short window? 
[14:21:21] 14:08 < bblack> similar to what we see when we move a template or file path, race condition for a bit and then it cleans up [14:22:01] <_joe_> yeah seems plausible [14:22:06] <_joe_> lemme check a random host [14:25:04] <_joe_> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Role class role::mediawiki::appserver not found at /etc/puppet/manifests/site.pp:1925 on node mw2090.codfw.wmnet [14:25:13] <_joe_> so yeah our hypothesis seems almost correct [14:28:57] <_joe_> and I see just a few threads having that problem, it seems [14:30:07] <_joe_> ok, I stand corrected - I fear we need to restart puppet every time we want to move away any of the role files from manifest/roles [14:32:27] <_joe_> akosiaris: ^^ [14:34:30] 6operations, 10ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1527418 (10Papaul) @ fgiunchedi no it wasn't a new drive. I just pulled the drive out and plugged it back in. [14:35:31] argh ? [14:37:16] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1527437 (10Eevans) >>! In T107949#1526823, @fgiunchedi wrote: > upgrade plan, starting today: > * upgrade row A machines, (restbase100[127]) with `nodetool flush && sudo apt-get -o Dpkg::O... [14:37:41] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1527439 (10Eevans) >>! In T107949#1526933, @fgiunchedi wrote: > cosmetic issue output contains `%s` spotted while looking at the logs, benign > > ``` > > restbase1002:~$ grep %s /var/log... [14:38:56] _joe_: really it's just another aspect of the same problem we see with the files/templates, etc. Just worse. there's no transactionality to the git filesystem updates on the master -> active threads running for clients, etc [14:42:03] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1527473 (10Eevans) >>! In T107949#1526823, @fgiunchedi wrote: > upgrade plan, starting today: > * upgrade row A machines, (restbase100[127]) with `nodetool flush && sudo apt-get -o Dpkg::O... [14:43:57] (03PS1) 10Alexandros Kosiaris: Introduce mobileapps.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/230780 (https://phabricator.wikimedia.org/T105538) [14:50:48] (03CR) 10Eevans: [C: 031] "> LGTM, I've updated the code review not to remove the old version because that's racy with puppet" [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/230582 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans) [14:51:19] Is there a bug open for the non-atomic updates? Should be solvable with some symlink magic [14:55:35] <_joe_> bblack: i think it's actually worse [14:55:49] <_joe_> bblack: imports are set in stone in the ruby process [14:55:57] 6operations, 10ContentTranslation-Deployments, 3LE-CX6-Sprint 2: Access to /var/log/apertium for Kartik - https://phabricator.wikimedia.org/T108678#1527517 (10KartikMistry) [14:55:59] <_joe_> I am pretty sure of that [14:57:33] (03PS1) 10Alexandros Kosiaris: lvs: remove the old unused osm lvs_service [puppet] - 10https://gerrit.wikimedia.org/r/230783 [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150811T1500). Please do the needful. [15:06:20] No SWAT changes this morning?
[15:07:24] bd808: don't jinx it [15:07:43] :) I've got something that can go [15:08:13] (03CR) 10BryanDavis: "> (ideally we'd load balance across hosts anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [15:08:32] (03CR) 10Alexandros Kosiaris: [C: 032] lvs: remove the old unused osm lvs_service [puppet] - 10https://gerrit.wikimedia.org/r/230783 (owner: 10Alexandros Kosiaris) [15:09:42] (03PS1) 10Alexandros Kosiaris: postgres: Enable streaming replication [puppet] - 10https://gerrit.wikimedia.org/r/230785 [15:10:16] (03CR) 10BryanDavis: [C: 032] logging: Only send info and higher to logstash by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230719 (owner: 10BryanDavis) [15:10:24] (03Merged) 10jenkins-bot: logging: Only send info and higher to logstash by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230719 (owner: 10BryanDavis) [15:11:53] !log bd808@tin Synchronized wmf-config/logging.php: logging: Only send info and higher to logstash by default (4388a84) 1/2 (duration: 00m 12s) [15:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:27] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: logging: Only send info and higher to logstash by default (4388a84) 2/2 (duration: 00m 12s) [15:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:25] hmmm... do jobrunners not pick up wmf-config changes rapidly? [15:15:29] (03CR) 10Yuvipanda: [C: 031] labstore: add timers for backups [puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [15:15:35] Coren: ^ +1'd it [15:16:43] * Coren goes to test now. Yeay. [15:16:55] (03PS2) 10coren: labstore: add timers for backups [puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) [15:17:24] (03CR) 10coren: [C: 032] "LGTM. Let's see if it also looks good to systemd." [puppet] - 10https://gerrit.wikimedia.org/r/230569 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [15:17:28] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Touched wmf-config/InitialiseSettings.php (duration: 00m 13s) [15:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:56] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1527574 (10greg) @ArielGlenn: there's an NDA's task at T97593 which Brandon is driving. [15:22:51] (03CR) 10BryanDavis: "I am still seeing debug level logs in Logstash after syncing this and a var_dump() of $wgMWLoggerDefaultSpi is still showing the use of th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230719 (owner: 10BryanDavis) [15:24:17] bd808: wave when you are ready with logstash to merge the relevant puppet changes btw [15:24:26] YuviPanda: The backups are now enabled. [15:24:40] * Coren watches the alerts, which should go critical now. [15:25:19] godog: thanks. give me a few minutes to try and puzzle out why my wmf-config change isn't doing what I'd hoped [15:27:13] Aaaah. No. They /never/ ran so they don't have a last run time. [15:28:09] bd808: np, is it not picking up the change at all? [15:29:03] godog: apparently not [15:29:31] uhhhh...
maybe I forgot to rebase [15:29:54] * bd808 facepalms [15:29:56] yup [15:30:28] (03PS2) 10Alexandros Kosiaris: postgres: stream WALs while doing pg_basebackup [puppet] - 10https://gerrit.wikimedia.org/r/230785 [15:30:36] Coren: :) I guess we will have to wait? [15:30:43] !log bd808@tin Synchronized wmf-config/logging.php: logging: Only send info and higher to logstash by default (4388a84) 1/2 (actually rebased this time) (duration: 00m 11s) [15:30:45] (03PS3) 10Alexandros Kosiaris: postgres: stream WALs while doing pg_basebackup [puppet] - 10https://gerrit.wikimedia.org/r/230785 [15:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] postgres: stream WALs while doing pg_basebackup [puppet] - 10https://gerrit.wikimedia.org/r/230785 (owner: 10Alexandros Kosiaris) [15:31:04] (03PS1) 10Alexandros Kosiaris: new_wmf_service: fix bug with wrong function name [puppet] - 10https://gerrit.wikimedia.org/r/230787 [15:31:06] (03PS1) 10Alexandros Kosiaris: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) [15:31:08] (03PS1) 10Alexandros Kosiaris: Assign mobileapps service to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/230789 (https://phabricator.wikimedia.org/T105538) [15:31:09] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: logging: Only send info and higher to logstash by default (4388a84) 2/2 (actually rebased this time) (duration: 00m 11s) [15:31:10] (03PS1) 10Alexandros Kosiaris: Setup LVS for mobileapps service on sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/230790 (https://phabricator.wikimedia.org/T105538) [15:31:17] YuviPanda: I'm starting a manual run of replicate-others to see if that updates the last run - but I think that the timer might only track the last time /it/ started it. We'll soon see. [15:31:24] that's better [15:31:37] Ok [15:31:41] (03CR) 10Alexandros Kosiaris: "probably needs config.yaml.erb updated ?" [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [15:31:48] YuviPanda: Not as useful as using the log like my first draft, but doesn't require elevated privileges so it's a reasonable compromise. [15:32:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "some minor editing needed to assign to scb" [puppet] - 10https://gerrit.wikimedia.org/r/230789 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [15:32:10] (03CR) 10BryanDavis: "*facepalm* I fetched to tin but didn't rebase before running sync-file. Nothing to see here. 
:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230719 (owner: 10BryanDavis) [15:32:37] (03PS1) 10Faidon Liambotis: Allocate neighbor blocks for cr1/2-codfw<->mr1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230791 [15:32:39] (03PS1) 10Faidon Liambotis: Add AAAA/PTR for mr1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230792 [15:32:45] Ok [15:34:47] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [15:35:51] Coren: ^ [15:36:00] We should figure outwhy this is happening [15:36:00] (03PS2) 10Alexandros Kosiaris: new_wmf_service: fix bug with wrong function name [puppet] - 10https://gerrit.wikimedia.org/r/230787 [15:36:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] new_wmf_service: fix bug with wrong function name [puppet] - 10https://gerrit.wikimedia.org/r/230787 (owner: 10Alexandros Kosiaris) [15:36:15] * Coren grumbles. [15:36:22] !log Disabled puppet on logstash100[1-3] in preparation for upgrade to 1.5.3 [15:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:35] YuviPanda: I'll have to dig in the check source. There's obviously a bug in it. [15:36:48] godog: I'm ready for you to merge puppet patches when you can get to it [15:37:10] YuviPanda: Annoyingly though, it's part of the icinga default. [15:37:13] Coren: is there a task for it? If not can you create one? [15:37:22] bd808: yup, I'll merge the three patches from https://phabricator.wikimedia.org/T99735#1525049 [15:37:33] perfect [15:37:36] YuviPanda: Nope. I'll create one now. [15:38:00] Thanks [15:38:39] (03PS18) 10Filippo Giunchedi: Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [15:38:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [15:39:08] (03PS27) 10Filippo Giunchedi: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [15:39:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [15:39:29] (03PS4) 10Filippo Giunchedi: logstash: Enable doc_values in template mapping [puppet] - 10https://gerrit.wikimedia.org/r/230250 (https://phabricator.wikimedia.org/T74930) (owner: 10BryanDavis) [15:39:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] logstash: Enable doc_values in template mapping [puppet] - 10https://gerrit.wikimedia.org/r/230250 (https://phabricator.wikimedia.org/T74930) (owner: 10BryanDavis) [15:40:03] 6operations, 10ops-codfw: es2007 degraded RAID - disk failure - https://phabricator.wikimedia.org/T108592#1527624 (10Papaul) @jcrespo please check and see if you stay have the same error. Thanks. [15:40:50] bd808: kicking puppet on tin [15:42:01] godog: if that works as hoped, you should see it create /srv/deployment/logstash/plugins [15:42:41] 10Ops-Access-Requests, 6operations, 7LDAP: Add WMF engineer VolkerE to ldap/wmf group - https://phabricator.wikimedia.org/T107985#1527630 (10greg) [15:43:28] 10Ops-Access-Requests, 6operations, 7LDAP: Add WMF engineer VolkerE to ldap/wmf group - https://phabricator.wikimedia.org/T107985#1527636 (10greg) Sorry for the late reply. RelEng doesn't manage LDAP. 
Did something indicate that you should ask us? Old documentation? [15:43:51] greg-g: The process for that has mostly been "ask ostriches" [15:43:57] bd808: ish, puppet needs to run on palladium first [15:44:03] Which is an absolutely terrible process. [15:44:04] :) [15:44:06] (03PS1) 10Alexandros Kosiaris: new_wmf_service.py: fix icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/230795 [15:44:24] ostriches: I was just about to say the same thing on the ticket [15:44:40] godog: oh? for the salt master? [15:44:54] hello, sorry for the intrusion and the silly qs; is this request correct? can someone take a look? https://phabricator.wikimedia.org/T107992 [15:45:03] bd808: yeah, otherwise repo_config isn't updated [15:45:25] supernino: Seems fine, yeah. [15:45:40] bd808: you should be good to go [15:46:33] godog: looks good. thanks [15:46:34] (03CR) 10Alexandros Kosiaris: [C: 032] new_wmf_service.py: fix icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/230795 (owner: 10Alexandros Kosiaris) [15:46:38] (03PS2) 10Alexandros Kosiaris: new_wmf_service.py: fix icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/230795 [15:46:50] (03CR) 10Alexandros Kosiaris: [V: 032] new_wmf_service.py: fix icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/230795 (owner: 10Alexandros Kosiaris) [15:46:54] 7Puppet, 6operations, 6Release-Engineering, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1527648 (10greg) [15:47:12] ok thanks ostriches, who usually takes care of these things? [15:47:13] !log Trebuchet deploy of logstash/plugins: Add logstash-filter-prune 0.1.5 (36144b2) [15:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:39] supernino: magical elves! Or Krenair and Reedy and the releng folks [15:47:55] ostriches: for ldap? [15:47:57] * ostriches wonders if he's a magical elf or a releng folk. [15:47:59] greg-g: Yes. [15:48:03] ostriches: :( [15:48:08] lol elves are cool [15:48:35] I don't add users to groups in LDAP. [15:48:50] I probably shouldn't be able to. [15:48:53] Krenair: No, but you would do something like supernino's question, T107992 [15:49:04] Two conversations going on :p [15:49:15] oh, right [15:49:21] !log upgrading logstash on logstash1001 [15:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:29] yes [15:50:12] bd808: np, let me know if sth comes up [15:50:21] 6operations: Investigate why Icinga's check_disk panics on snapshot mounts - https://phabricator.wikimedia.org/T108694#1527659 (10coren) 3NEW a:3coren [15:50:35] !log nuking db1002-db1007 on icinga [15:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:55] 6operations: Investigate why Icinga's check_disk panics on snapshot mounts - https://phabricator.wikimedia.org/T108694#1527667 (10coren) p:5Triage>3High Set to high priority as this causes inappropriate critical icinga alerts. [15:51:03] everything seems to be going ok, but just in case [15:52:33] 10Ops-Access-Requests, 6operations, 7LDAP: Add WMF engineer VolkerE to ldap/wmf group - https://phabricator.wikimedia.org/T107985#1527683 (10demon) 5Open>3Resolved a:3demon Done.
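On the check_disk task just filed (T108694): the stock monitoring-plugins check_disk can be told to skip a mount rather than go critical when the path is unreadable. Something along these lines would exclude the transient snapshot mounts (thresholds and the exact regex are illustrative, not the deployed check command):

```
# Ignore any mount whose path matches the regex, e.g. the short-lived
# /run/lock/storage-replicate-*/snapshot mounts created by the backup timers:
/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -i '/snapshot$'

# Or exclude whole filesystem types that should never be disk-checked:
/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -X tmpfs -X devtmpfs
```

-i (--ignore-ereg-path) and -X (--exclude-type) are long-standing check_disk options; whether the shared Icinga default command can carry such an exclusion is presumably part of what the task has to settle.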
[15:53:13] 6operations: Investigate why Icinga's check_disk panics on snapshot mounts - https://phabricator.wikimedia.org/T108694#1527692 (10fgiunchedi) see also discussion in {T104975}, possibly a filesystem we shouldn't be checking anyway [15:53:14] mutante: shouldn't that ^ (LDAP) be done by someone in ops vs "ask chad"? /me shrugs [15:53:36] mutante: er, mis-timing, that == adding someone to LDAP [15:53:53] It should be done by ldap-admins. [15:54:01] ostriches is one of them, so are Reedy and robla [15:54:56] In practice, anyone in ops can do it too [15:55:17] I'd just love to merge processes and get rid of random "ask $X" processes [15:55:37] ldap is just fubar'd [15:55:54] see also: new employee onboarding and ex-employee off-boarding pain felt by ops/oit [15:56:32] I don't think wikitech ldap is on any on/offboarding documents at all. [15:56:45] It should be. [15:56:49] right. [15:56:49] It's basically "find out you need it and ask until someone points you to the right person" [15:56:54] Which is bad, right [15:56:56] At least offboarding. [15:57:00] aka: 'not a process' ;) [15:57:00] there is a ticket open in phab now to redefine this process I think [15:57:07] godog: found a problem. In https://gerrit.wikimedia.org/r/#/c/227175/27/manifests/role/kibana.pp,unified I messed up the ldap_bindpass [15:57:09] chasemp: yeah, probably at least part of it [15:57:12] I'd prefer to get T62412 fixed before putting it on onboarding docs [15:57:15] godog: I'll make a patch to fix [15:57:19] (03PS5) 10Ori.livneh: logstash: Count MediaWiki log events with statsd [puppet] - 10https://gerrit.wikimedia.org/r/230233 (https://phabricator.wikimedia.org/T100735) (owner: 10BryanDavis) [15:57:19] 6operations, 10ops-eqiad, 7Database, 5Patch-For-Review: Remove db1002-db1007 from production - https://phabricator.wikimedia.org/T105768#1527724 (10jcrespo) * Removed from icinga * Puppet certs revoked * Salt keys revoked [15:57:43] Krenair: heh :) [15:57:49] bd808: oh ok [15:58:37] Krenair: I forget where I made the comment (must have been IRC as it's not on that ticket) but I was very surprised when I found out I had +2 :) [15:59:20] oh right, meeting time [15:59:56] godog: hmmm... I think I may need help to fix correctly [16:00:04] bd808: Respected human, time to deploy Logstash cluster updates (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150811T1600). Please do the needful. [16:00:14] godog: I did this 'role::kibana::ldap_bindpass: "%{scope('passwords::ldap::production::proxypass')} [16:00:14] "' in hieradata/role/common/logstash.yaml [16:00:34] T62412 is a silly meta bug [16:00:45] which apparently doesn't work (probably because the private passwords aren't in scope at the right time) [16:01:03] godog: so maybe the best fix would be to set that in the private hiera to the right value [16:01:15] and remove from the non-private [16:01:16] bd808: failing on logstash1001 ? [16:01:19] yeah [16:01:35] 6operations, 10ops-codfw: es2007 degraded RAID - disk failure - https://phabricator.wikimedia.org/T108592#1527733 (10jcrespo) 5Open>3Resolved a:3jcrespo Error went away. @Papaul Did you change it or did it go away on its own?
``` Firmware state: Rebuild ``` [16:01:40] It applies as an empty string and that keeps apache2 from starting [16:01:52] Krenair: for the record, I don't think that's a blocker for getting the onboarding process written down more clearly, we can always change after [16:03:13] (03CR) 10BryanDavis: "Problem with hiera settings and apache2 config noted inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [16:03:51] greg-g, okay, but we shouldn't start giving out ldap/wmf automatically to new engineers as part of onboarding until it's fixed [16:04:02] godog: the hack thing I could do would be to pull the param out and set the variable inline with a scoped lookup [16:04:43] bd808: ah yeah, like graphite does $ldap_bindpass = $passwords::ldap::production::proxypass so it'll be in scope in the template [16:05:07] yeah. that's how it worked before my patch tried to move it all to hiera [16:05:46] (03PS2) 10Alexandros Kosiaris: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) [16:05:48] (03PS2) 10Alexandros Kosiaris: Assign mobileapps service to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/230789 (https://phabricator.wikimedia.org/T105538) [16:05:50] (03PS2) 10Alexandros Kosiaris: Setup LVS for mobileapps service on sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/230790 (https://phabricator.wikimedia.org/T105538) [16:06:17] bd808: nod, I'm not deep enough in the puppet rabbithole to tell for sure why that doesn't work [16:06:39] k. I'll make the quick and dirty fix patch then [16:08:00] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1527748 (10RobH) [16:09:16] (03CR) 10Faidon Liambotis: [C: 032] Allocate neighbor blocks for cr1/2-codfw<->mr1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230791 (owner: 10Faidon Liambotis) [16:09:26] (03CR) 10Faidon Liambotis: [C: 032] Add AAAA/PTR for mr1-codfw [dns] - 10https://gerrit.wikimedia.org/r/230792 (owner: 10Faidon Liambotis) [16:10:40] (03PS1) 10BryanDavis: logstash: fix ldap_bindpass [puppet] - 10https://gerrit.wikimedia.org/r/230798 [16:10:46] godog: ^ [16:11:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] logstash: fix ldap_bindpass [puppet] - 10https://gerrit.wikimedia.org/r/230798 (owner: 10BryanDavis) [16:11:31] bd808: looks good, merged! [16:11:51] * bd808 tries it out [16:12:25] godog: that worked :) [16:12:36] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=1 dev=sdb failed - https://phabricator.wikimedia.org/T108561#1527753 (10Papaul) @Filippo drive replacement complete [16:13:06] Krenair: you mean "we should stop" ;) [16:13:28] bd808: \o/ [16:14:04] greg-g, right now they have to know what to get and ask for it, right? [16:15:13] PROBLEM - RAID on db2023 is CRITICAL 1 failed LD(s) (Degraded) [16:16:33] !log logstash upgrade on logstash1001 complete [16:16:34] RECOVERY - RAID on es2007 is OK optimal, 1 logical, 2 physical [16:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:39] ^oh, one goes up, another down [16:18:19] they are team-tagging!
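The ldap_bindpass fix merged above (230798) follows the graphite precedent bd808 cites at 16:04:43: drop the hiera interpolation, which resolved to an empty string and kept apache2 from starting, and bind the secret from the private passwords class at class scope so the template can see it. Roughly, with the class layout abbreviated (a sketch, not the literal patch):

```
class role::kibana {
    include ::passwords::ldap::production

    # In scope for any template rendered by this class; the hiera form
    # "%{scope('passwords::ldap::production::proxypass')}" came out empty.
    $ldap_bindpass = $passwords::ldap::production::proxypass
}
```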
[16:20:13] !log logstash upgrade on logstash1002 complete [16:20:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/230582 (https://phabricator.wikimedia.org/T101764) (owner: 10Eevans) [16:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:14] !log logstash upgrade on logstash1003 complete [16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:31] (03PS1) 10Alexandros Kosiaris: Get scb up to par with sca [puppet] - 10https://gerrit.wikimedia.org/r/230800 [16:23:53] godog: done with the logstash upgrades. Let's wait 10-15 minutes to make sure nothing melts and then start on the elasticsearch update [16:24:21] godog: you could schedule the icinga watch downtime now from T108040 if you want [16:25:24] (03PS1) 10Faidon Liambotis: Fix mr1-codfw AAAA to match PTR [dns] - 10https://gerrit.wikimedia.org/r/230802 [16:25:46] godog: the check that needs to be silenced is "ElasticSearch health check for shards" for all 6 hosts in the logstash_eqiad group [16:25:59] (03CR) 10Faidon Liambotis: [C: 032] Fix mr1-codfw AAAA to match PTR [dns] - 10https://gerrit.wikimedia.org/r/230802 (owner: 10Faidon Liambotis) [16:27:14] RECOVERY - Disk space on labstore1002 is OK: DISK OK [16:27:15] bd808: yup, that's muted until 19.30 UTC [16:27:48] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1527783 (10Eevans) a:3Eevans [16:27:58] godog: awesome, thanks [16:30:19] greg-g: should be "ask an LDAP admin" which has ops plus a couple others, but via phab would make a difference for process [16:30:23] Krenair: I didn't [16:30:34] ? [16:30:39] explicitly ask for +2 [16:30:40] oh [16:30:56] bd808: np, we'll need to look at es-tool and jessie, I think it might run out of the box [16:30:58] greg-g: but you were hanging out with the cool kids (mw-core) [16:33:04] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1527795 (10Eevans) [16:33:09] (03CR) 10Dzahn: "oh, interesting re: "mod_access_compat", well, it did not seem to be default everywhere, saw at least one host break because of Apache 2.2" [puppet] - 10https://gerrit.wikimedia.org/r/230686 (owner: 10Dzahn) [16:33:26] (03PS1) 10Alex Monk: Set wgNamespaceRobotPolicies on itwiki's NS_USER to noindex,follow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230804 (https://phabricator.wikimedia.org/T107992) [16:34:13] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL - Last run was over 1:00:00 ago [16:34:18] bd808: indeed :) [16:34:54] godog: things look stable. I'm going to start the elasticsearch 1.7.1 update now [16:36:33] !log upgraded elasticsearch to 1.7.1 on logstash1001 [16:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:00] the onboarding process shouldn't even be defined as limited to 1 or 2 teams.
to actually work it needs to be a WMF-wide thing, from when a contract gets signed until they have all they need to work, so at least HR + Facilities + IT + Ops + TeamTheyWorkIn, theoretically Finances [16:37:21] !log upgraded elasticsearch to 1.7.1 on logstash1002 [16:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:20] !log upgraded elasticsearch to 1.7.1 on logstash1003 [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:36] (03PS4) 10Dzahn: etherpad: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230686 [16:39:55] 6operations, 10Traffic: Fix Varnish TTLs across the board - https://phabricator.wikimedia.org/T108612#1527843 (10BBlack) Ok I've dug into this some (read varnish source code to confirm behavior there, re-read our VCL, stared at lots of parsed varnish logs, etc) and it's not as bad as I initially thought. Most... [16:40:50] (03CR) 10Dzahn: [C: 032] etherpad: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230686 (owner: 10Dzahn) [16:40:59] (03CR) 10Ori.livneh: "> If that is a high traffic service we might still want to exclude it from connection tracking, though." [puppet] - 10https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [16:41:23] (03PS1) 10Alex Monk: Allow ptwiki bureaucrats to remove sysop+bureaucrat rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230805 (https://phabricator.wikimedia.org/T107661) [16:41:44] (03PS1) 10BBlack: varnish: director->backends is now always an array [puppet] - 10https://gerrit.wikimedia.org/r/230806 [16:42:13] 6operations, 6Collaboration-Team-Backlog, 10Flow, 10MediaWiki-Redirects, 3Reading-Web: Flow url doesn't redirect to mobile - https://phabricator.wikimedia.org/T107108#1527852 (10Krenair) [16:42:16] !log restarted Apache on Etherpad [16:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:26] (03PS1) 10BBlack: cache_(text|upload): remove runtime_params conditional [puppet] - 10https://gerrit.wikimedia.org/r/230807 [16:42:28] (03PS1) 10BBlack: cache_(text|upload): frontend default_ttl => 30d [puppet] - 10https://gerrit.wikimedia.org/r/230808 (https://phabricator.wikimedia.org/T108612) [16:42:30] (03PS1) 10BBlack: cache_mobile: def_ttl 30d [puppet] - 10https://gerrit.wikimedia.org/r/230809 (https://phabricator.wikimedia.org/T108612) [16:42:38] 6operations, 6Collaboration-Team-Backlog, 10Flow, 10MediaWiki-Redirects, 3Reading-Web: Flow url doesn't redirect to mobile - https://phabricator.wikimedia.org/T107108#1486874 (10Krenair) Moved from #Wikimedia-site-requests to #operations since that's in the puppet repository, not mediawiki-config [16:42:47] !log upgraded elasticsearch to 1.7.1 on logstash1004; logstash-2015.08.11 shard recovering [16:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:29] 6operations, 6Collaboration-Team-Backlog, 10Flow, 10MediaWiki-Redirects, 3Reading-Web: Flow url doesn't redirect to mobile - https://phabricator.wikimedia.org/T107108#1527871 (10Dzahn) Yes, in the puppet repository but in the mediawiki module. Since it's config you might argue for it to be in mediawiki-c... [16:46:55] (03PS1) 10coren: Labs: put the real check interval for backups [puppet] - 10https://gerrit.wikimedia.org/r/230810 [16:47:07] YuviPanda: ^^ extra easy changeset to +1?
[16:47:24] (03CR) 10jenkins-bot: [V: 04-1] Labs: put the real check interval for backups [puppet] - 10https://gerrit.wikimedia.org/r/230810 (owner: 10coren) [16:47:31] 6operations, 10ops-codfw: RAID disk failure on db2023 - https://phabricator.wikimedia.org/T108701#1527873 (10jcrespo) 3NEW [16:47:44] Oh d'uh [16:47:55] Coren: line 35 looks wrong [16:47:59] Coren: :) [16:48:08] (03PS2) 10coren: Labs: put the real check interval for backups [puppet] - 10https://gerrit.wikimedia.org/r/230810 [16:48:09] Yeah, copypasta failz [16:48:19] 6operations, 10ops-codfw: RAID disk failure on db2023 - https://phabricator.wikimedia.org/T108701#1527884 (10jcrespo) @Papaul, let's wait for now: it says "Rebuild", it may fix itself like the other one. [16:48:49] 7Blocked-on-Operations, 6Collaboration-Team-Backlog, 10Flow, 3Collaboration-Team-Current, and 2 others: Separate reference tables by wiki - https://phabricator.wikimedia.org/T107204#1527887 (10DannyH) p:5High>3Unbreak! [16:48:54] ACKNOWLEDGEMENT - RAID on db2023 is CRITICAL 1 failed LD(s) (Degraded) Jcrespo T108701 [16:48:55] YuviPanda: You noticed the properly triggering 1h crit? [16:49:23] Coren: nice! [16:49:33] (and I haven't, no) [16:49:35] but whee [16:51:10] (03PS3) 10Mobrovac: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [16:51:54] (03CR) 10jenkins-bot: [V: 04-1] Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [16:52:54] RECOVERY - puppet last run on ms-be2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:53:20] (03PS4) 10Mobrovac: Introducing mobileapps role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [16:56:30] (03CR) 10Mobrovac: [C: 031] "Looks GTG to me." [puppet] - 10https://gerrit.wikimedia.org/r/230788 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [16:59:30] (03CR) 10Mobrovac: [C: 04-1] "The plan is to have it only on SCB, so I don't understand why we would want to assign it to SCA at all." [puppet] - 10https://gerrit.wikimedia.org/r/230789 (https://phabricator.wikimedia.org/T105538) (owner: 10Alexandros Kosiaris) [17:04:47] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=1 dev=sdb failed - https://phabricator.wikimedia.org/T108561#1527930 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi rebuilding, thanks @papaul ! ``` /dev/sdb1 1.9T 4.2G 1.9T 1% /srv/swift-storage/sdb1 ``` [17:06:14] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=1 dev=sdb failed - https://phabricator.wikimedia.org/T108561#1527938 (10Papaul) You're welcome.
[17:11:02] (03PS2) 10BBlack: cache_(text|upload): remove runtime_params conditional [puppet] - 10https://gerrit.wikimedia.org/r/230807 [17:11:10] (03CR) 10BBlack: [C: 032 V: 032] cache_(text|upload): remove runtime_params conditional [puppet] - 10https://gerrit.wikimedia.org/r/230807 (owner: 10BBlack) [17:11:20] (03PS2) 10BBlack: cache_(text|upload): frontend default_ttl => 30d [puppet] - 10https://gerrit.wikimedia.org/r/230808 (https://phabricator.wikimedia.org/T108612) [17:13:14] (03PS2) 10Dzahn: OTRS: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230709 [17:13:18] !log logstash cluster recovered after upgrade of elasticsearch on logstash1004 [17:13:22] (03CR) 10BBlack: [C: 032] cache_(text|upload): frontend default_ttl => 30d [puppet] - 10https://gerrit.wikimedia.org/r/230808 (https://phabricator.wikimedia.org/T108612) (owner: 10BBlack) [17:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:04] !log log event volume in logstash dropped dramatically at 16:49; investigating [17:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:45] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1527982 (10brion) Made an attempt to sidestep ffmpeg2theora by using ffmpeg for conversion and ogg... [17:20:51] (03PS2) 10BBlack: cache_mobile: def_ttl 30d [puppet] - 10https://gerrit.wikimedia.org/r/230809 (https://phabricator.wikimedia.org/T108612) [17:21:50] (03CR) 10BBlack: [C: 032] cache_mobile: def_ttl 30d [puppet] - 10https://gerrit.wikimedia.org/r/230809 (https://phabricator.wikimedia.org/T108612) (owner: 10BBlack) [17:22:03] RECOVERY - RAID on db2023 is OK optimal, 1 logical, 2 physical [17:22:34] (03PS2) 10BBlack: define puppet ganglia stuff for caches @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/230771 [17:22:41] (03CR) 10BBlack: [C: 032 V: 032] define puppet ganglia stuff for caches @ codfw [puppet] - 10https://gerrit.wikimedia.org/r/230771 (owner: 10BBlack) [17:25:07] (03PS13) 10Giuseppe Lavagetto: puppet-compiler: first commit [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) [17:27:20] !log logstash event volume recovered after restarting all 3 logstash services [17:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:25] !log upgrading elasticsearch on logstash1005 [17:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "- Diff detection is broken with the latest catalog diff puppet face" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/228849 (https://phabricator.wikimedia.org/T96802) (owner: 10Giuseppe Lavagetto) [17:29:46] !log upgraded elasticsearch to 1.7.1 on logstash1005; logstash-2015.08.11 shard recovering [17:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:19] (03PS3) 10coren: Labs: put the real check interval for backups [puppet] - 10https://gerrit.wikimedia.org/r/230810 [17:35:51] (03CR) 10coren: [C: 032] Labs: put the real check interval for backups [puppet] - 10https://gerrit.wikimedia.org/r/230810 (owner: 10coren) [17:38:54] RECOVERY - Last backup of the others filesystem on labstore1002 is OK - Last run successful [17:39:37] YuviPanda: ^^ 
[17:43:03] 6operations: Investigate why Icinga's check_disk panics on snapshot mounts - https://phabricator.wikimedia.org/T108694#1528103 (10coren) Good catch, @fgiunchedi - clearly the same issue. Merging. [17:43:07] !log log event volume in logstash dropped dramatically again; seems to correlate with final recovery of logstash-2015.08.11 shard [17:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:22] 6operations: Investigate why Icinga's check_disk panics on snapshot mounts - https://phabricator.wikimedia.org/T108694#1528104 (10coren) [17:43:24] 6operations, 5Continuous-Integration-Isolation, 7Icinga, 7Monitoring, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1528106 (10coren) [17:45:38] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1528115 (10Dzahn) 10:08 "Could not contact Elasticsearch. Please ensure that Elasticsearch is reachable from your browser." 10:09 and the way i tested is: 10:09 ssh -D... [17:50:19] 6operations, 5Continuous-Integration-Isolation, 7Icinga, 7Monitoring, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1528152 (10coren) We get the same issue on the labstore* with the required exclusion being `/va... [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150811T1800). Please do the needful. [18:00:48] twentyafterfour: don't trust logstash/kibana to tell you what's broken/working right now. [18:01:34] !log logstash cluster recovered after upgrade of elasticsearch on logstash1005 [18:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:02:15] !log upgrading elasticsearch on logstash1006 [18:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:44] !log upgraded elasticsearch to 1.7.1 on logstash1006; logstash-2015.08.11 shard recovering [18:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:37] bd808: I don't usually trust it ;) [18:05:01] fine then ;) [18:06:57] !log logstash cluster recovered after upgrade of elasticsearch on logstash1006 [18:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:20] godog: I'm done!
Thanks for your help [18:07:50] (03PS1) 1020after4: 1.26wmf18 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230819 [18:07:52] (03PS1) 1020after4: delete 1.26wmf10 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230820 [18:08:04] (03CR) 1020after4: [C: 032] 1.26wmf18 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230819 (owner: 1020after4) [18:08:10] (03Merged) 10jenkins-bot: 1.26wmf18 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230819 (owner: 1020after4) [18:08:20] (03CR) 1020after4: [C: 032] delete 1.26wmf10 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230820 (owner: 1020after4) [18:08:26] (03Merged) 10jenkins-bot: delete 1.26wmf10 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230820 (owner: 1020after4) [18:14:58] (03PS1) 10Dzahn: grafana: needs to load Apache mod_proxy_http too [puppet] - 10https://gerrit.wikimedia.org/r/230824 [18:15:18] !log twentyafterfour@tin Started scap: sync new branch 1.26wmf18 and update testwiki [18:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:38] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1528327 (10Dzahn) >>! In T105008#1528115, @Dzahn wrote: > 10:33 it's related to mod_proxy, figuring it out > 10:34 3945] AH01144: No protocol handler was valid for the URL /grafan... [18:16:52] (03PS1) 10Mforns: Change percentage in EventLogging validation alert [puppet] - 10https://gerrit.wikimedia.org/r/230825 (https://phabricator.wikimedia.org/T108339) [18:18:14] (03PS2) 10Dzahn: grafana: needs to load Apache mod_proxy_http too [puppet] - 10https://gerrit.wikimedia.org/r/230824 [18:19:08] (03PS3) 10Dzahn: grafana: needs to load Apache mod_proxy_http too [puppet] - 10https://gerrit.wikimedia.org/r/230824 [18:19:21] (03PS4) 10Dzahn: grafana: needs to load Apache mod_proxy_http too [puppet] - 10https://gerrit.wikimedia.org/r/230824 (https://phabricator.wikimedia.org/T105008) [18:19:47] (03CR) 10Dzahn: [C: 032] grafana: needs to load Apache mod_proxy_http too [puppet] - 10https://gerrit.wikimedia.org/r/230824 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [18:21:30] !log logstash log event volume back to normal levels following elasticsearch upgrade [18:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:47] (03CR) 10Dzahn: [C: 031] "works now after issues fixed with I0e8f2e7aaf1ae7e178b9" [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [18:23:54] (03PS3) 10Dzahn: misc-web: switch grafana to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) [18:27:28] 6operations, 10RESTBase-Cassandra: upgrade RESTBase cluster to Cassandra 2.1.8 - https://phabricator.wikimedia.org/T107949#1528368 (10Eevans) >>! In T107949#1527439, @Eevans wrote: >>>! In T107949#1526933, @fgiunchedi wrote: >> cosmetic issue output contains `%s` spotted while looking at the logs, benign >> >... 
[18:27:45] (03CR) 10Dzahn: [C: 032] misc-web: switch grafana to backend krypton [puppet] - 10https://gerrit.wikimedia.org/r/230660 (https://phabricator.wikimedia.org/T105008) (owner: 10Dzahn) [18:29:53] (03PS1) 10Dzahn: grafana: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/230827 (https://phabricator.wikimedia.org/T104946) [18:30:16] (03PS2) 10Dzahn: grafana: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/230827 (https://phabricator.wikimedia.org/T104946) [18:31:57] ah, the switched varnishes [18:32:09] gotta use a bunch of ssh-keygen -f [18:32:23] and 4 boxen now for misc-web [18:34:17] 6operations, 7network: smokeping loss of ping for codfw rows - https://phabricator.wikimedia.org/T108715#1528390 (10fgiunchedi) 3NEW [18:36:36] paravoid: ^ https://phabricator.wikimedia.org/T108715 [18:36:40] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1528411 (10chasemp) @bblack @faidon @mmodell I have talked this over with a few people gracious enough to lend me their ear but t... [18:37:13] !log grafana switched to node krypton (jessie/VM) [18:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:30] bd808: no problem! glad it was smooth [18:38:10] ottomata2, hi, any updates on the hadoop? https://gerrit.wikimedia.org/r/#/c/230535/ [18:39:28] 6operations, 5Patch-For-Review, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1528423 (10Dzahn) [18:39:30] 6operations, 5Patch-For-Review: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1528421 (10Dzahn) 5Open>3Resolved 11:38 < mutante> !log grafana switched to node krypton (jessie/VM) [18:40:14] mutante: that made me realize it'd be nice if we could cc phab tickets with !log [18:40:48] godog: yes, that would be nice [18:41:07] morebots could mail phab i suppose [18:41:08] I am a logbot running on tools-exec-1210. [18:41:08] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:41:08] To log a message, type !log <msg>. [18:42:08] 6operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1528439 (10Dzahn) [18:42:51] (03CR) 10Dzahn: [C: 032] grafana: remove role from zirconium [puppet] - 10https://gerrit.wikimedia.org/r/230827 (https://phabricator.wikimedia.org/T104946) (owner: 10Dzahn) [18:44:45] !log twentyafterfour@tin scap failed: OSError [Errno 1] Operation not permitted: '/srv/mediawiki-staging/wikiversions.php' (duration: 29m 27s) [18:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:25] 6operations: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1528451 (10Dzahn) [18:45:53] ori: ^ [18:46:14] twentyafterfour: was it the chmod that failed? [18:47:11] bd808: yes [18:47:15] os.chmod(php_file, 0664) [18:47:34] -rw-rw-r-- 1 ori wikidev 33069 Aug 11 18:44 wikiversions.php [18:47:57] I would have to own it to chmod, right? [18:48:19] we do the same thing for the cdb... [18:48:28] but it is moved into place [18:49:08] created, then moved instead of chmod on the existing file?
[18:49:26] yeah see https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/tasks.py#L163-L164 [18:49:58] the problem here is chmod on a file owned by another deployer [18:50:21] you can hot patch it by taking the chmod() on line 178 out entirely [18:50:26] 6operations, 5Patch-For-Review, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1528450 (10Dzahn) 5Open>3Resolved [18:50:57] but it should probably follow the tmp_file, rename, chmod model used for the cdb [18:52:41] yeah [18:52:42] (03PS1) 10Dzahn: zirconium: decom, rm from site.pp,DHCP,netboot [puppet] - 10https://gerrit.wikimedia.org/r/230828 (https://phabricator.wikimedia.org/T105510) [18:53:10] !log twentyafterfour@tin Started scap: again: sync new branch 1.26wmf18 and update testwiki [18:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:37] (03CR) 10Dzahn: [C: 032] zirconium: decom, rm from site.pp,DHCP,netboot [puppet] - 10https://gerrit.wikimedia.org/r/230828 (https://phabricator.wikimedia.org/T105510) (owner: 10Dzahn) [18:57:50] 6operations, 5Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1528492 (10Dzahn) root@palladium:~# puppet cert clean zirconium.wikimedia.org Notice: Revoked certificate with serial 346 Notice: Removing file Puppet::SSL::Certificate zirconium.wikimedia.org at '/var/lib/puppet/ser... [18:58:08] !log twentyafterfour@tin Finished scap: again: sync new branch 1.26wmf18 and update testwiki (duration: 04m 58s) [18:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:39] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1528512 (10brion) Ok I've got a provisional patch for ffmpeg2theora master, which gets a local bui... [18:59:44] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1528513 (10brion) Ok I've got a provisional patch for ffmpeg2theora master, which gets a local build of ffmpeg2theora working in MediaWiki-Vagrant for me. https... [19:01:25] (03PS1) 10Dzahn: decom zirconium [dns] - 10https://gerrit.wikimedia.org/r/230830 (https://phabricator.wikimedia.org/T105510) [19:02:12] (03CR) 10Dzahn: "just wait a little while, make sure it's gone from icinga and shut it down first" [dns] - 10https://gerrit.wikimedia.org/r/230830 (https://phabricator.wikimedia.org/T105510) (owner: 10Dzahn) [19:04:16] (03PS1) 10Alexandros Kosiaris: maps: Include tuning.conf in slaves as well [puppet] - 10https://gerrit.wikimedia.org/r/230832 [19:05:20] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1528555 (10mmodell) >>! In T100519#1528411, @chasemp wrote: > 2) Put iridium in a public VLAN setting up SSH to go through LVS incomi... 
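The tmp_file/rename/chmod model bd808 describes above boils down to: write the new content to a temp file you own, chmod that temp file, then rename() it over the destination. chmod() requires ownership of the target, but rename() only needs write access to the containing directory, which is why this sidesteps the "file owned by another deployer" failure that broke scap here. A minimal sketch of the pattern in Python; the atomic_write() name and signature are made up for illustration and this is not the actual scap code:

```python
import os
import tempfile


def atomic_write(path, data, mode=0o664):
    """Replace path with new content without chmod()ing a file we may not own.

    Write to a temp file in the same directory (so the final rename stays on
    one filesystem and is atomic), set the mode on the temp file we created,
    then rename it over the destination.
    """
    dirname = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix='.tmp-')
    try:
        with os.fdopen(fd, 'w') as tmp_file:
            tmp_file.write(data)
        os.chmod(tmp_path, mode)   # always works: mkstemp made us the owner
        os.rename(tmp_path, path)  # atomic; only needs directory write access
    except OSError:
        os.unlink(tmp_path)
        raise
```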
[19:05:40] (03PS1) 10BryanDavis: Fix wikiversions compilation problem [tools/scap] - 10https://gerrit.wikimedia.org/r/230833 [19:05:46] twentyafterfour: ^ [19:05:56] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Include tuning.conf in slaves as well [puppet] - 10https://gerrit.wikimedia.org/r/230832 (owner: 10Alexandros Kosiaris) [19:06:13] 6operations, 5Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1528570 (10Dzahn) once the name zirconium is gone from DNS, this hardware can still be accessed as wmf3427.mgmt.eqiad.wmnet. [19:07:06] 6operations, 5Patch-For-Review: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1528578 (10Dzahn) a:3Dzahn [19:07:50] akosiaris, the backup process seems to have finished! [19:07:57] (slaves repl) [19:10:32] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1528619 (10mmodell) summary of my goals, roughly in order of importance: # Get phabricator's git repos exposed over ssh, with a stro... [19:11:39] (03PS1) 10Alexandros Kosiaris: maps: Fix typo introduced in 5daab4f [puppet] - 10https://gerrit.wikimedia.org/r/230838 [19:12:41] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1528625 (10BBlack) >>! In T100519#1528411, @chasemp wrote: > The model of having SSH (for git) set up as a service in LVS and termina... [19:13:19] (03CR) 1020after4: [C: 032] Fix wikiversions compilation problem [tools/scap] - 10https://gerrit.wikimedia.org/r/230833 (owner: 10BryanDavis) [19:14:58] (03CR) 10Ori.livneh: "Yikes, thanks." [tools/scap] - 10https://gerrit.wikimedia.org/r/230833 (owner: 10BryanDavis) [19:15:36] (03Merged) 10jenkins-bot: Fix wikiversions compilation problem [tools/scap] - 10https://gerrit.wikimedia.org/r/230833 (owner: 10BryanDavis) [19:15:49] twentyafterfour: What's the status of the wmf18 roll-out? Do I have time to sneak in an unbreak now fix for the wmf18 branch? [19:17:24] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Fix typo introduced in 5daab4f [puppet] - 10https://gerrit.wikimedia.org/r/230838 (owner: 10Alexandros Kosiaris) [19:18:00] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1528677 (10BBlack) >>! In T100519#1528619, @mmodell wrote: > summary of my goals, roughly in order of importance: > > # Get phabrica... [19:19:08] RoanKattouw: it's on testwiki [19:19:37] RoanKattouw: so, sneak it in by all means [19:21:30] yurik: yeah, it's fixed now [19:21:35] 6operations, 6Discovery, 10Maps, 10Traffic, and 2 others: Set up standard HTTPS Termination -> 2layer caching for maps service - https://phabricator.wikimedia.org/T105076#1528698 (10Yurik) [19:21:37] and with that I am going to bed [19:21:43] akosiaris, i can't connect [19:21:59] ? [19:22:09] twentyafterfour: OK, going to do that now, Jenkins volente [19:22:28] akosiaris, never mind, works now! [19:22:30] awesome!!! [19:22:33] thanks!!!!!!!!!!!!!!!! [19:22:41] gnight :) [19:25:43] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1528724 (10GWicke) As discussed on the mail thread, my proposal is to go with 6 nodes with 8 cores, 96G RAM and 4 Samsung SSDs each. 
This variant gives us the best cost / performance rati... [19:26:39] !log catrope@tin Synchronized php-1.26wmf18/extensions/Flow/modules/editor/editors/visualeditor/mw.flow.ve.Target.js: Fix missing editor switcher (duration: 00m 12s) [19:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:58] twentyafterfour: OK, done. Thanks! [19:28:10] !log ori@tin Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoader.php: I2089b21fc: ResourceLoader: make "cacheReport" option false by default (duration: 00m 11s) [19:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:28:24] !log ori@tin Synchronized php-1.26wmf18/includes/resourceloader/ResourceLoader.php: I2089b21fc: ResourceLoader: make "cacheReport" option false by default (duration: 00m 13s) [19:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:03] thcipriani: I applaud your vim live it thing [19:34:17] Negative24: heh, thanks. Nice github stalking. Some guy on the internet actually made vinyl stickers of that version of the vim logo and sent me a bunch. Made the whole project worthwhile :) [19:35:04] thcipriani: I actually saw a link of it on http://vimcasts.org/blog/2013/02/habit-breaking-habit-making/ then saw it was you :P WMF is everywhere! [19:35:38] but your github is awesome as well [19:36:17] 6operations: codfw misc cluster ganglia not working - https://phabricator.wikimedia.org/T108680#1528766 (10Dzahn) a:3Dzahn [19:36:57] so testwiki is 503, trying to figure out the cause [19:40:12] ah so that's why nothing's working. It's always great when it's not just me [19:42:27] 2015-08-11 19:42:01 mw1017 testwiki fatal ERROR: [f40e9261] /wiki/Main_Page ErrorException from line 267 of /srv/mediawiki/php-1.26wmf18/includes/exception/MWExceptionHandler.php: Fatal Error: Cannot pass parameter 2 by reference {"exception":"[Exception ErrorException] (/srv/mediawiki/php-1.26wmf18/includes/exception/MWExceptionHandler.php:267) Fatal Error: Cannot pass parameter 2 by reference\n[stacktrace]\n#0 [internal func [19:42:28] tion]: MWExceptionHandler::handleFatalError()\n#1 {main}\n"} [19:43:40] 6operations: codfw misc cluster ganglia not working - https://phabricator.wikimedia.org/T108680#1528775 (10Dzahn) 12:38 < mutante> bblack: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous%2520codfw&tab=m&vn=&hide-hf=false 12:38 < mutante> the ganglia data is coming in no... [19:43:58] 6operations: codfw misc cluster ganglia not working - https://phabricator.wikimedia.org/T108680#1528776 (10Dzahn) 5Open>3Resolved [19:44:29] bd808: ^^ [19:55:21] twentyafterfour: i think it might be I7ea050a2eabba635f2aadb4e33b6f8fbfb1b01a8 [19:57:04] yeah [19:57:08] i fixed it, i'll submit a patch [19:57:29] ori: awesome thanks [19:59:06] AaronSchulz: yt? [19:59:25] hm [19:59:59] AaronSchulz: it's http://stackoverflow.com/a/9716982/582542 [20:00:18] so the question is, is the proper fix to create $dummy = null; and pass that instead of a null literal (that's what I live-hacked on mw1017, and it works) [20:00:21] or to pass $casToken [20:01:20] I suspect you meant to do the latter, no? [20:01:34] * AaronSchulz is trying to find the lines in question [20:01:59] AaronSchulz: includes/objectcache/MultiWriteBagOStuff.php L66 [20:02:21] also did you mean to clobber $flags like that?
[20:04:56] * AaronSchulz looks at what $casToken did before [20:06:04] (03PS1) 10Alex Monk: Kill ee-prototype.wikipedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/230854 (https://phabricator.wikimedia.org/T107397) [20:06:32] 6operations, 6Discovery, 10SEO, 3Discovery-Analysis-Sprint: Get Oliver Keyes access to Google Webmaster Tools for all Wikimedia domains - https://phabricator.wikimedia.org/T101157#1528884 (10Deskana) Taking out of the sprint, because I'm clearly not finding time to get around to this. [20:06:37] 6operations, 6Discovery, 10SEO: Get Oliver Keyes access to Google Webmaster Tools for all Wikimedia domains - https://phabricator.wikimedia.org/T101157#1528885 (10Deskana) [20:07:33] yeah the actual cas value doesn't matter for that class [20:07:46] so I can make that <<$value = $cache->get( $key, $casToken, $flags );>> [20:08:11] phpstorm actually sees that error when the $caches var doc is fixed too [20:09:08] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/230853/ [20:10:44] !log ori@tin Synchronized php-1.26wmf18/includes/objectcache/MultiWriteBagOStuff.php: 0acfe6a5bb: Fix argument handling in MultiWriteBagOStuff::get() (duration: 00m 12s) [20:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:09] twentyafterfour: all yours [20:13:49] (03PS1) 1020after4: group0 wikis to 1.26wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230913 [20:14:05] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230913 (owner: 1020after4) [20:14:10] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230913 (owner: 1020after4) [20:18:54] TypeError: not all arguments converted during string formatting [20:19:07] when running sync-wikiversions [20:19:20] TypeError: not all arguments converted during string formatting [20:19:24] er, [20:19:33] err_msg = 'ExtensionMessages not found in {}' % ext_msg [20:19:47] isn't the {} syntax only applicable to string.format() [20:20:37] yes [20:20:41] it should be %s [20:21:03] that's what I thought [20:21:05] ok fixing it [20:21:44] it's your bug :P [20:21:45] I50adefa19c0e3916d25703d78934b743a5f64da7 [20:22:06] * ori only responsible for 1 of 3 deployment bugs \o/ [20:22:18] greg-g: are you not proud? 
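The sync-wikiversions TypeError above is easy to reproduce: '{}' placeholders belong to str.format(), while the % operator only substitutes %-style conversions such as %s, so with no %s in the template the supplied argument is left unconverted and Python raises. A quick illustration (the path value is made up):

```python
ext_msg = '/srv/example/ExtensionMessages.php'  # illustrative path only

try:
    # the buggy pattern: a format()-style template used with the % operator
    msg = 'ExtensionMessages not found in {}' % ext_msg
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting

# either style works on its own; just don't mix them
msg = 'ExtensionMessages not found in %s' % ext_msg
msg = 'ExtensionMessages not found in {}'.format(ext_msg)
```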
* twentyafterfour is getting better at python [20:31:16] (03PS1) 1020after4: use %s not {} for string templating [tools/scap] - 10https://gerrit.wikimedia.org/r/230914 [20:32:15] (03CR) 1020after4: [C: 032] use %s not {} for string templating [tools/scap] - 10https://gerrit.wikimedia.org/r/230914 (owner: 1020after4) [20:34:35] (03Merged) 10jenkins-bot: use %s not {} for string templating [tools/scap] - 10https://gerrit.wikimedia.org/r/230914 (owner: 1020after4) [20:42:05] queste diavolerie moderne ("these modern devilries") [20:59:31] (03PS1) 1020after4: fix method naming mismatch [tools/scap] - 10https://gerrit.wikimedia.org/r/230917 [21:00:32] (03CR) 1020after4: [C: 032] fix method naming mismatch [tools/scap] - 10https://gerrit.wikimedia.org/r/230917 (owner: 1020after4) [21:00:53] (03Merged) 10jenkins-bot: fix method naming mismatch [tools/scap] - 10https://gerrit.wikimedia.org/r/230917 (owner: 1020after4) [21:02:30] !log deployed scap fixes for my dumb mistakes [21:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:09] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.26wmf18 [21:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:32] ori: well done, 2 out of 3 "not me!" ain't bad! [21:06:44] (sorry, was afk for longer than I planned) [21:07:10] greg-g: 1.26wmf18 is {{done}} [21:07:18] (and I updated the wiki) [21:07:21] 6operations, 3Discovery-Maps-Sprint: Postgres replication is not working - https://phabricator.wikimedia.org/T108545#1529194 (10Yurik) 5Open>3Resolved a:3Yurik Awesome, works, thanks! [21:09:03] twentyafterfour: word [21:24:14] (03PS1) 10Hoo man: Add ssh key for new notebook [puppet] - 10https://gerrit.wikimedia.org/r/230920 [21:26:31] 7Blocked-on-Operations, 10Beta-Cluster, 6Collaboration-Team-Backlog, 5Patch-For-Review: Decide what to do with ee_prototypewiki in beta - https://phabricator.wikimedia.org/T107397#1529284 (10Krenair) [21:28:35] (03CR) 10Lucie Kaffee: [C: 031] "This is really Marius, I am sitting next to him. Absolutely legit." [puppet] - 10https://gerrit.wikimedia.org/r/230920 (owner: 10Hoo man) [21:32:04] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1529324 (10Legoktm) >>! In T108649#1526679, @BBlack wrote: > Probably the most important question (since I haven't really looked at UrlShortener) is... [21:33:47] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1529366 (10Legoktm) [21:42:09] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1529450 (10ori) >>! In T108649#1526679, @BBlack wrote: > 1) We could add another SAN onto our unified list for w.wiki > 2) We could simply get a new... [21:42:23] PROBLEM - puppet last run on mw1018 is CRITICAL Puppet has 1 failures [21:46:11] (03PS1) 10BryanDavis: logstash: normalize "level" fields across log types [puppet] - 10https://gerrit.wikimedia.org/r/230922 [21:51:26] (03PS2) 10BryanDavis: logstash: normalize "level" fields across log types [puppet] - 10https://gerrit.wikimedia.org/r/230922 [22:04:18] (03CR) 10BryanDavis: [C: 04-1] "Cherry-picked to beta cluster. Case normalization working but gelf level changes not applied."
[puppet] - 10https://gerrit.wikimedia.org/r/230922 (owner: 10BryanDavis) [22:05:46] 6operations, 7network: smokeping loss of ping for codfw rows - https://phabricator.wikimedia.org/T108715#1529617 (10faidon) 5Open>3Resolved a:3faidon Yup, I set up an overly aggressive security policy that blocked pings among other things. Thanks for noticing! Should be fixed now. [22:06:33] (03PS1) 10EBernhardson: Introduce new labs role for vagrant+lxc [puppet] - 10https://gerrit.wikimedia.org/r/230928 [22:07:14] RECOVERY - puppet last run on mw1018 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:16:58] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1529683 (10faidon) >>! In T100519#1528677, @BBlack wrote: >>>! In T100519#1528619, @mmodell wrote: >> summary of my goals, roughly in... [22:22:30] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1529705 (10BBlack) >>! In T108649#1529324, @Legoktm wrote: > SAN = https://en.wikipedia.org/wiki/SubjectAltName ? I don't understand the details and... [22:24:32] (03PS1) 10Ori.livneh: Introduce ConfigurationObserver class [debs/pybal] - 10https://gerrit.wikimedia.org/r/230931 [22:25:52] paravoid: ^ (when you're done with the Other Thing.) [22:26:34] PROBLEM - mysqld processes on db1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:26:36] (03PS3) 10BryanDavis: logstash: normalize "level" fields across log types [puppet] - 10https://gerrit.wikimedia.org/r/230922 [22:29:14] PROBLEM - DPKG on db1042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:29:46] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1529708 (10BBlack) Basically, yeah. I ran down a similar plan with @Chasemp and I think he's working on some patches for it. Howeve... [22:33:03] (03CR) 10BryanDavis: [C: 04-1] "Updated default mapping in beta cluster and prod. Will retry tomorrow with new mapping in place." [puppet] - 10https://gerrit.wikimedia.org/r/230922 (owner: 10BryanDavis) [22:33:25] RECOVERY - DPKG on db1042 is OK: All packages OK [22:36:47] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1529720 (10BBlack) In the interest of full disclosure of options, there's a middle-ground option where we get a simple separate cert for w.wiki, and... [22:49:13] RECOVERY - mysqld processes on db1042 is OK: PROCS OK: 1 process with command name mysqld [22:57:44] is there a way to see disk io load in ganglia? https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=maps+Cluster+codfw&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=2&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [22:59:16] yurik: ganglia, no, graphite, yes [22:59:17] http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1439333942.041&target=servers.maps-*.iostat.sda.io [22:59:57] there are lots of metrics available, see servers/maps-test2001/iostat hierarchy in graphite.wikimedia.org [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150811T2300). Please do the needful. 
[23:00:04] Krenair matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:36] Present [23:00:48] ok [23:01:12] ori, thx, is there a view there to combine all maps servers? [23:01:16] (03CR) 10Alex Monk: [C: 032] Set wgNamespaceRobotPolicies on itwiki's NS_USER to noindex,follow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230804 (https://phabricator.wikimedia.org/T107992) (owner: 10Alex Monk) [23:01:32] neither here nor there, but FWIW the tidy extension we're currently using uses HNI and is pretty clean and small https://github.com/wikimedia/mediawiki-php-tidy/blob/hni/ext_tidy.cpp [23:01:49] mischan [23:01:51] (03Merged) 10jenkins-bot: Set wgNamespaceRobotPolicies on itwiki's NS_USER to noindex,follow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230804 (https://phabricator.wikimedia.org/T107992) (owner: 10Alex Monk) [23:02:02] "Adsum" ("present") [23:02:06] (03CR) 10BryanDavis: Introduce new labs role for vagrant+lxc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/230928 (owner: 10EBernhardson) [23:02:33] Coren, gwicke, twentyafterfour: Anyone deploying? [23:02:56] yurik: if you find a metric that is useful, click on graph data, then edit, and replace e.g. "servers.maps-test2001.iostat.sda.io" with "servers.maps-test*.iostat.sda.io" [23:03:04] graphite supports wildcards that way [23:03:34] ty [23:04:04] yurik: once you identify the metrics you find useful, it's easy to create a persistent view of these metrics in the form of a dashboard in graphite [23:04:11] *a dashboard in grafana [23:05:15] no? [23:05:16] ok [23:05:37] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/230804/ (duration: 00m 12s) [23:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:56] (03CR) 10Alex Monk: [C: 032] Allow ptwiki bureaucrats to remove sysop+bureaucrat rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230805 (https://phabricator.wikimedia.org/T107661) (owner: 10Alex Monk) [23:06:21] (03Merged) 10jenkins-bot: Allow ptwiki bureaucrats to remove sysop+bureaucrat rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230805 (https://phabricator.wikimedia.org/T107661) (owner: 10Alex Monk) [23:08:25] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/230805/ (duration: 00m 12s) [23:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:58] matt_flaschen, want to do your one? [23:10:01] or do you want me to? [23:10:09] Krenair, sure, I can. [23:10:34] You know this code far better than I do :) [23:11:09] Krenair, it's just a submodule bump. I will start the script tonight, but not immediately. [23:12:28] Would someone in ops be able to do https://gerrit.wikimedia.org/r/#/c/230854/ ?
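The wildcard targets ori describes also work against Graphite's render API directly with format=json, which is handy for quick ad-hoc checks outside the browser. A minimal sketch, assuming the render endpoint is reachable from wherever this runs and that the maps-test* iostat metric path from the discussion still exists:

```python
import json
import urllib.request

# One series comes back per host matched by the wildcard, e.g.
# servers.maps-test2001.iostat.sda.io, servers.maps-test2002.iostat.sda.io, ...
url = ('http://graphite.wikimedia.org/render/'
       '?target=servers.maps-test*.iostat.sda.io'
       '&from=-1h&format=json')

with urllib.request.urlopen(url) as resp:
    series = json.loads(resp.read().decode('utf-8'))

for s in series:
    # each datapoint is a [value, timestamp] pair; value is None for gaps
    values = [v for v, _ in s['datapoints'] if v is not None]
    avg = sum(values) / len(values) if values else 0.0
    print('%-50s avg io: %.2f' % (s['target'], avg))
```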
YuviPanda, ^ [23:19:46] !log mattflaschen@tin Synchronized php-1.26wmf18/extensions/Flow/: Sync Flow 1.26wmf18 for memory leaks (duration: 00m 14s) [23:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:34] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 555290 msg: ocg_render_job_queue 3120 msg (=3000 critical) [23:22:34] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 555301 msg: ocg_render_job_queue 3132 msg (=3000 critical) [23:23:44] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 557785 msg: ocg_render_job_queue 4758 msg (=3000 critical) [23:24:41] we ok? ^ [23:25:16] ocg issues = PDF downloads might not work [23:25:30] not the end of the world, unless you're an info-en volunteer [23:25:45] in which case it might be the end of your inbox [23:26:38] well, we have CRITICAL warnings, we either care and respond or we don't and we don't warn [23:26:49] * greg-g is hardline sometimes [23:27:40] https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=rendering&return_to=Ch%C3%A2teau+de+Louveciennes&collection_id=1e53e473afb8b2337e06e50f39ea1603ae0b032b&writer=rdf2latex is stuck for me, so it's probably broken [23:28:01] gwicke, know anything about that? ^ [23:28:07] cscott's not around this week. [23:28:41] or subbu|gardening [23:28:46] arlolra is not in this channel [23:29:06] dammit [23:29:12] and just as I switched to -parsoid to ask he quit [23:29:30] Krenair, SWAT is done, right? [23:29:41] I guess so [23:29:51] I was hoping to get rid of ee-prototype [23:30:03] But that can't really happen properly without ops [23:30:29] ping apergos [23:31:01] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=PDF+servers+eqiad&m=cpu_report&s=descending&mc=2&g=cpu_report [23:31:01] Krenair: IIRC there was some work going on to decommission some of the OCG servers [23:31:11] because beta sites stuff is all in operations/puppet :( [23:31:30] since there doesn't seem to be anyone around with OCG-specific expertise, here's what I suggest: [23:31:50] - high CPU probably means they're all stuck on some job that is causing pathological performance [23:32:03] - clearing the queue sucks but better than having the service be totally down [23:32:10] - restarting the service plausibly clears the queue [23:32:13] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [23:32:16] .:. we should restart the service [23:32:21] no, the queue is in redis [23:32:27] and hooked up in intricate ways [23:32:35] hrm [23:32:42] (+1 to ".:.") [23:32:44] clearing the redis instances used completely might help [23:32:56] did someone just press something on an ocg box? [23:33:06] looking at the logs could help too [23:33:19] I'd propose doing that first [23:33:44] -rw-r--r-- 1 syslog adm 0 Aug 10 06:25 ocg.log [23:33:44] -rw-r--r-- 1 syslog adm 109444181 Aug 10 06:25 ocg.log-20150810.gz [23:33:53] and, indeed, there are a lot of logs [23:34:18] my example url above just started working [23:34:48] there are "Bundle completed successfully!"
messages in the logs [23:35:06] yeah [23:35:10] ('ocg' in logstash) [23:35:16] there is quite a lot of backlog, but it is making progress [23:35:16] sometimes I get "Progress: 0.00% Status: Waiting for job runner to pick up render job" [23:35:17] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=PDF+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=ocg_job_queue&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=descending [23:35:22] a couple of times it has just worked [23:35:24] so i suggest we leave it [23:35:26] and let it recover [23:35:31] http://ganglia.wikimedia.org/latest/stacked.php?m=ocg_job_queue&c=PDF%20servers%20eqiad&r=hour&st=1439336084&host_regex= [23:37:19] I still regret that I didn't manage to convince Matt to go for a simple stateless design [23:39:01] https://wikitech.wikimedia.org/wiki/OCG#Pruning_the_queue [23:39:44] worth trying, imho ^^ [23:40:58] it has nearly recovered [23:41:37] yeah, dropping quickly now [23:42:14] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 565794 msg: ocg_render_job_queue 96 msg [23:42:56] (03PS1) 10Dzahn: admin: mailman-admins on fermium, not just users [puppet] - 10https://gerrit.wikimedia.org/r/230946 (https://phabricator.wikimedia.org/T108349) [23:43:04] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 565972 msg: ocg_render_job_queue 0 msg [23:43:04] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 565972 msg: ocg_render_job_queue 0 msg [23:44:22] greg-g: ^ [23:45:20] ori: yay [23:45:35] (03CR) 10Dzahn: [C: 032] admin: mailman-admins on fermium, not just users [puppet] - 10https://gerrit.wikimedia.org/r/230946 (https://phabricator.wikimedia.org/T108349) (owner: 10Dzahn) [23:47:23] 10Ops-Access-Reviews, 5Patch-For-Review: John Lewis sudo as 'list' on mailman staging VM - https://phabricator.wikimedia.org/T108349#1529960 (10Dzahn) [fermium:/etc/sudoers.d] $ id johnflewis uid=2744(johnflewis) gid=500(wikidev) groups=500(wikidev),756(mailman-users),757(mailman-admins) [fermium:/etc/sudoers... [23:47:57] 10Ops-Access-Reviews, 5Patch-For-Review: John Lewis sudo as 'list' on mailman staging VM - https://phabricator.wikimedia.org/T108349#1529961 (10Dzahn) 5Open>3Resolved [23:48:32] 10Ops-Access-Reviews: John Lewis sudo as 'list' on mailman staging VM - https://phabricator.wikimedia.org/T108349#1518752 (10Dzahn) [23:48:45] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations: John Lewis sudo as 'list' on mailman staging VM - https://phabricator.wikimedia.org/T108349#1529968 (10Dzahn) [23:52:13] (03PS3) 10Dzahn: OTRS: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230709 [23:53:22] (03CR) 10Dzahn: [C: 032] OTRS: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230709 (owner: 10Dzahn) [23:55:44] (03CR) 10Dzahn: "[iodine:/etc/apache2/sites-available] $ sudo apache2ctl configtest" [puppet] - 10https://gerrit.wikimedia.org/r/230709 (owner: 10Dzahn) [23:59:03] (03PS3) 10Dzahn: dbtree: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/230693