[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T0000). Please do the needful. [00:00:42] !log restbase restarting in labs for T158628 [00:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:00:47] T158628: Create beta hewiktionary for testing InterwikiSorting & Cognate - https://phabricator.wikimedia.org/T158628 [00:01:34] mutante: ^ [00:04:57] I tried to sneak something into swat at the last minute, not sure if it's too late [00:07:20] bawolff: jouncebot doesn't catch it if you add it close to deploy time, but I can get your change out for you :) [00:07:35] thanks [00:08:00] I literally got back to my apartment like 5 minutes ago [00:08:31] looks like you missed a newline in your comment on line 3491, not a big deal, but doesn't look intentional, want to amend before I merge? [00:09:41] (PS2) Brian Wolff: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 [00:09:56] That's embarrassing. 
Fixed :) [00:10:05] cool :) [00:10:14] (PS3) Thcipriani: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff) [00:10:23] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff) [00:12:28] (Merged) jenkins-bot: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff) [00:12:38] (CR) jenkins-bot: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff) [00:15:31] bawolff: hrm, not sure whether or not this can be tested on mwdebug1002, but if so, I pulled it there just now [00:15:46] thcipriani: it can, just not with the firefox extension [00:17:08] Confirmed works (or at least the header is present, I can only really test with wget) [00:18:04] ok [00:18:09] going live everywhere then [00:18:24] (CR) Chad: [C: 1] "Let's go ahead with this" [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: Paladox) [00:18:54] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4197503 keys, up 126 days 15 hours - replication_delay is 0 [00:19:41] Thanks :) [00:19:59] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341331|Add other WMF domains to foundationwiki CSP policy for Special:HideBanners]] (duration: 00m 40s) [00:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:06] ^ bawolff live everywhere now [00:20:14] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4197560 keys, up 126 days 15 hours - replication_delay is 0 
[00:22:34] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:23:02] oh meh. The links were for wikidata.org, but it's actually www.wikidata.org. Oh well, I'll deal with that at some point later [00:23:59] Operations, Education-Program-Dashboard, Programs-and-Events-Dashboard-Sprint 2, Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3078262 (Dzahn) >>! In T126295#3078016, @Ragesoss wrote: > Some changes:... [00:26:09] (CR) Dzahn: [C: 2] Partman: Add ms-be20[2-3][0-9] [puppet] - https://gerrit.wikimedia.org/r/341460 (owner: Papaul) [00:27:04] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:31:35] (PS1) Brian Wolff: In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 [00:31:47] Operations, Education-Program-Dashboard, Programs-and-Events-Dashboard-Sprint 2, Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3078276 (Ragesoss) > Since that is a package manager that looks like it mi... [00:32:53] thcipriani: Don't suppose I could also have https://gerrit.wikimedia.org/r/#/c/341472/1 (Fix for me being stupid and missing the www on www.wikidata.org)? 
[00:33:18] :) [00:33:47] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 (owner: Brian Wolff) [00:33:51] no problem [00:34:39] (CR) Dzahn: [C: 2] Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: Paladox) [00:34:44] (PS9) Dzahn: Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: Paladox) [00:35:21] (Merged) jenkins-bot: In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 (owner: Brian Wolff) [00:36:03] bawolff: pulled on mwdebug1002, everything look right there? [00:36:16] Oh, I thought y'alls were done [00:36:45] yep [00:37:02] RainbowSprinkles: me too, one last one came up. Am I in your way? [00:37:02] (CR) jenkins-bot: In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 (owner: Brian Wolff) [00:37:15] thcipriani: No, was just taking advantage of quietness [00:37:18] :) [00:37:31] ok, one quick sync-file and there will be real quietness :) [00:37:36] * RainbowSprinkles nods [00:37:41] (taking gerrit down for a min or two) [00:39:04] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341472|In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org]] (duration: 00m 40s) [00:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:10] ^ bawolff live now [00:40:06] Whee, no more console warnings on https://wikimediafoundation.org/wiki/Thank_You/da :D [00:40:14] :) [00:40:17] Thanks :) [00:40:27] yw, glad all's well [00:41:54] submitting https://gerrit.wikimedia.org/r/#/c/326150/ [00:42:05] (since you don't see that part here) [00:42:10] i wish it would show [00:43:58] !log 
gerrit: taking offline for a minute or two for case-insensitive login conversion [00:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:24] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [00:47:04] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [00:48:46] ACKNOWLEDGEMENT - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn maintenance [00:48:54] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [00:49:26] !log gerrit: coming back online now [00:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:40] ACKNOWLEDGEMENT - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused daniel_zahn maintenance [00:50:04] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [00:50:24] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [00:50:24] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0) [00:50:27] Back up, and it looks like existing logins were preserved (was afraid due to the conversion) [00:50:38] great! [00:50:44] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment] [00:51:04] I was able to login via multiple case scenarios: chad, CHAD, cHAd [00:51:06] All worked [00:51:21] that bromine stuff is because it tried to clone in that moment...no biggie [00:51:33] RainbowSprinkles: nice [00:51:34] Yeah, couple of things just had bad timing, that always happens when we go offline [00:51:35] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [00:51:41] fixes those [00:52:34] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [00:52:43] logged in as dZaHn [00:52:44] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [00:52:54] gets corrected to normal spelling [00:53:29] Yep, that stays the same [00:55:04] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:56:24] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/discovery-stats] [01:01:34] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:11:04] Operations, Ops-Access-Requests: Requesting access to researchers and statistics-users groups for niharika29 - https://phabricator.wikimedia.org/T159780#3078358 (Niharika) [01:15:54] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [01:18:24] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [01:21:24] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:25:44] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:30:34] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:53:44] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [02:23:28] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 08m 19s) [02:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Mar 7 02:28:59 UTC 2017 (duration 5m 32s) [02:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:14] PROBLEM - tileratorui on maps1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:04] RECOVERY - tileratorui on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.114 second response time [03:01:14] PROBLEM - tileratorui on maps1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:06] RECOVERY - tileratorui on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.008 second response time [03:10:24] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [03:10:44] PROBLEM - puppet last run on cp1046 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:10:54] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [03:28:58] !log foreachwikiindblist closed deleteEqualMessages.php (T45917) - purge upstreamed translations from closed wikis [03:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:04] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [03:38:44] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [04:18:24] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:30:24] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:43:34] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:46:24] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [04:59:24] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [05:09:32] Operations, Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3078828 (Nuria) On our end we have no preference how to fix issue (adding typo back would break other things, but I do not dispute that it might be fine), traffic & reading (... 
[05:11:34] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:21:35] !log foreachwikiindblist 'all - closed - private' deleteEqualMessages.php (T45917) - purge upstreamed translations from remaining wikis [05:21:35] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [05:22:05] !log foreachwikiindblist 'all - closed - private' deleteEqualMessages.php (T45917) - purge upstreamed translations from remaining wikis [05:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:54] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:47:24] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:49:14] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [05:52:14] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:54:54] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:02:34] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:07:08] Operations, Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3078910 (Nuria) I was going to summarize but @mforns already did it in a prior ticket. For what is causing this issue see: https://phabricator.wikimedia.org/T148780#2891117... 
[06:15:24] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:30:34] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:33:14] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:59:24] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:02:14] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:05:46] Operations, ops-eqiad, DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079045 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1060.eqiad.wmnet'] ``` The l... 
[07:08:56] (PS1) Marostegui: db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) [07:13:19] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [07:14:56] (Merged) jenkins-bot: db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [07:16:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2053 - T159414 (duration: 00m 41s) [07:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:23] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [07:16:25] !log Deploy ALTER table on db2053 (s6) for the revision table - T159414 [07:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:09] (CR) jenkins-bot: db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [07:17:37] (PS3) Marostegui: production.my.cnf: Enable gtid_domaid_id [puppet] - https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418) [07:20:32] (CR) Marostegui: [C: 2] production.my.cnf: Enable gtid_domaid_id [puppet] - https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [07:27:24] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [07:30:34] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 782506 msg: ocg_render_job_queue 0 msg [07:30:35] Operations, ops-eqiad, DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - 
https://phabricator.wikimedia.org/T158193#3079091 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1060.eqiad.wmnet'] ``` and were **ALL** successful. [07:37:24] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 31 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [07:39:41] !log Stop MySQL db1067 to clone db1060 from it - T158193 [07:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:48] T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193 [07:42:24] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [07:55:04] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:56:02] Operations, ops-eqiad, DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079117 (Marostegui) The data transfer between db1067 and db1060 was started around 20 minutes ago. 
[08:21:39] (PS1) Marostegui: dbstore.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341500 (https://phabricator.wikimedia.org/T149418) [08:24:04] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:24:21] (PS1) Muehlenhoff: Enable base::firewall on ruthenium [puppet] - https://gerrit.wikimedia.org/r/341501 [08:25:18] !log installing systemd bugfix updates from jessie point release [08:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:03] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/5668/" [puppet] - https://gerrit.wikimedia.org/r/341500 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [08:27:14] (CR) Marostegui: [C: 2] dbstore.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341500 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [08:31:30] (PS1) Ema: cache_text varnishtest: vary cookie [puppet] - https://gerrit.wikimedia.org/r/341502 (https://phabricator.wikimedia.org/T155314) [08:33:54] (PS1) Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) [08:34:44] (CR) Marostegui: [C: 1] Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) (owner: Jcrespo) [08:35:52] (CR) Ema: [V: 2 C: 2] cache_text varnishtest: vary cookie [puppet] - https://gerrit.wikimedia.org/r/341502 (https://phabricator.wikimedia.org/T155314) (owner: Ema) [08:37:18] (PS2) Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) [08:37:51] (CR) Jcrespo: "This actually requires a restart to take effect :-/" [puppet] - 
https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) (owner: Jcrespo) [08:38:01] moritzm: good morning. Did you manage to upgrade hhvm on beta cluster ? [08:38:21] (PS1) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/341504 (https://phabricator.wikimedia.org/T159803) [08:39:16] Operations, MediaWiki-API, Traffic, Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3079196 (ema) >>! In T155314#3077838, @Tgr wrote: > Setting (or not) `Vary` seems to be the right way to tell Varnish whether to cache or n... [08:39:24] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/varnish/tests/text/11-vary-cookie.vtc] [08:39:55] looking ^ [08:40:57] hashar: I first need to run some more tests, I'll upgrade the beta cluster once these are successful, probably later this morning [08:41:24] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [08:41:34] that was a transient puppetfart [08:41:45] moritzm: last time there were some failures with the hhvm extension but no clue how to check that. [08:43:23] it was a problem with the shipped default config, the proper fix is https://phabricator.wikimedia.org/T157306, but it can be addressed by running puppet after the upgrade [08:43:32] (CR) Alexandros Kosiaris: [C: -1] "Looks mostly good, minor comments inline about the schedule and we also need to add the actual target by updating backup::set to have a jo" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341371 (owner: Chad) [08:52:34] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:01:28] (PS1) Marostegui: tendri.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418) [09:02:06] (PS2) Marostegui: tendril.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418) [09:05:25] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/5669/ looks good" [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [09:06:10] (CR) Marostegui: [C: 2] tendril.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [09:10:19] !log temporary live hacking analytics-flex.cfg partman config on install1002 [09:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:34] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:23:55] !log cache_text, cache_upload: upgrading to varnish 4.1.5 T159424 [09:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:00] T159424: Upgrade text and upload cache clusters to varnish 4.1.5 - https://phabricator.wikimedia.org/T159424 [09:24:05] Operations, Performance-Team, Thumbor: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3079330 (Gilles) [09:25:24] (PS3) Volans: Add support for batch processing [software/cumin] - https://gerrit.wikimedia.org/r/341310 [09:35:32] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (Esc3300) I think it might be worth attempting to determine the factors that lead to the rapid raise. - The edit rate didn't seem that high and we could easily h... 
[09:40:22] (PS6) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987 [09:41:13] (CR) jenkins-bot: [V: -1] Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987 (owner: Jcrespo) [09:41:34] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:42:59] (PS7) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987 [09:50:44] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:52:04] (CR) Alexandros Kosiaris: [C: 2] hieradata: add oresrdb in codfw [puppet] - https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: Filippo Giunchedi) [09:52:11] (PS2) Alexandros Kosiaris: hieradata: add oresrdb in codfw [puppet] - https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: Filippo Giunchedi) [09:52:14] (CR) Alexandros Kosiaris: [V: 2 C: 2] hieradata: add oresrdb in codfw [puppet] - https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: Filippo Giunchedi) [10:03:07] Operations, Revision-Scoring-As-A-Service-Backlog, Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3079430 (akosiaris) I 've just merged the patch above and now oresrdb2001 is setup and `oresrdb.svc.codfw.wmnet` is working fine. I 'll leave the task o... 
[10:05:24] (PS1) Marostegui: labs.my.cnf: Add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418) [10:07:47] Operations, HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3079452 (MoritzMuehlenhoff) HHVM extensions need to be rebuilt for the new 3.18 ABI. [10:07:56] (CR) Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/5671/" [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [10:08:03] (PS1) Alexandros Kosiaris: Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512 [10:08:05] (PS2) Marostegui: labs.my.cnf: Add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418) [10:08:57] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:12:51] (CR) Marostegui: [C: 2] labs.my.cnf: Add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui) [10:16:36] (PS8) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987 [10:19:47] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:24:06] (CR) Jcrespo: [C: 1] "I am ready to deploy this. 
This will deploy the wrong socket parameter for most servers, but they will not be affected until a new templat" [puppet] - https://gerrit.wikimedia.org/r/340987 (owner: Jcrespo) [10:25:21] (CR) Alexandros Kosiaris: [C: 2] Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512 (owner: Alexandros Kosiaris) [10:25:32] (PS2) Alexandros Kosiaris: Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512 [10:25:35] (CR) Alexandros Kosiaris: [V: 2 C: 2] Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512 (owner: Alexandros Kosiaris) [10:29:47] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:00] (PS9) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896) [10:33:23] (PS3) Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) [10:34:48] Operations, Release-Engineering-Team, Beta-Cluster-reproducible, HHVM, User-Joe: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#3079463 (hashar) [10:34:50] Operations, Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#3079462 (hashar) [10:37:57] PROBLEM - DPKG on scb2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:38:24] (CR) Jcrespo: [C: 2] Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) (owner: Jcrespo) [10:39:57] RECOVERY - DPKG on scb2006 is OK: All packages OK [10:43:11] Operations, 
MediaWiki-General-or-Unknown, Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3079486 (elukey) ``` [cb02e6a5dec913ef6b98416a] [no req] DBConnectionError from line 753 of /srv/mediawiki/php-1.29.0-wmf.14/includes/libs/rdbms/database/Data... [10:44:27] PROBLEM - Disk space on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:44:47] PROBLEM - salt-minion processes on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:44:47] PROBLEM - Check systemd state on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:44:57] PROBLEM - MD RAID on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:44:58] PROBLEM - dhclient process on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:45:07] PROBLEM - DPKG on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:45:07] PROBLEM - configured eth on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:45:07] PROBLEM - puppet last run on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:45:27] PROBLEM - DPKG on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:46:57] PROBLEM - MD RAID on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:07] PROBLEM - puppet last run on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:07] PROBLEM - configured eth on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:07] PROBLEM - dhclient process on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:07] PROBLEM - Check systemd state on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:37] PROBLEM - Disk space on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:47] PROBLEM - salt-minion processes on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:10] do we have a scb2005? 
O.o [10:50:25] these are all up to me [10:50:50] PROBLEM - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.08 seconds [10:50:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Send logs to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/321096 (https://phabricator.wikimedia.org/T149010) (owner: 10Ladsgroup) [10:51:29] checking db1038 [10:52:05] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 370.02 seconds Marostegui checking [10:52:14] what is the issue? [10:52:32] don't know yet [10:53:58] 1000 inserts/s [10:54:10] 300 updates/s [10:54:39] on the vslow host? [10:54:45] yep, the traffic has increased a lot since around 6am or so [10:54:48] yes, it is a vslow host [10:55:20] weird [10:55:34] storage looks fine [10:56:14] (03PS6) 10Gehel: Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 (owner: 10Smalyshev) [10:56:23] cache invalidations?
[10:56:57] on urwiki [10:56:59] probably [10:57:11] there is a huge increase on tmp tables happening at 10 [10:57:47] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:57:54] the last thing on SAL is ema upgrading to varnish 4.1.5 at 9:23 [10:58:12] UPDATE /* Title::invalidateCache */ `page` SET page_touched = '20170307104739' WHERE page_id = '184305' AND (page_touched < '20170307104739') [10:59:07] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 53 minutes ago with 0 failures [10:59:08] RECOVERY - configured eth on scb2006 is OK: OK - interfaces up [10:59:08] RECOVERY - DPKG on scb2006 is OK: All packages OK [10:59:17] RECOVERY - Disk space on scb2006 is OK: DISK OK [10:59:37] RECOVERY - salt-minion processes on scb2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:59:37] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [10:59:47] RECOVERY - MD RAID on scb2006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [10:59:48] RECOVERY - dhclient process on scb2006 is OK: PROCS OK: 0 processes with command name dhclient [11:01:19] Something happened at 10:00 and at 10:40, as per the graphs [11:01:47] RECOVERY - MD RAID on scb2005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [11:02:13] RECOVERY - DPKG on scb2005 is OK: All packages OK [11:02:23] RECOVERY - Disk space on scb2005 is OK: DISK OK [11:02:31] marostegui, avalanche of invalidations [11:02:33] RECOVERY - salt-minion processes on scb2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:02:47] RECOVERY - MariaDB Slave Lag: s3 on db1038 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:03:06] jynus: how is that triggered? 
[11:03:08] 5x the regular traffic [11:03:35] and from 100 wps to 2000 wps [11:03:53] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [11:03:53] PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 4 minutes ago with 10 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy],Package[eventstreams/deploy],Package[changeprop/deploy],Package[electron-render/deploy] [11:04:12] servers are ready to slow down, but the slow one is out of the replication check [11:04:23] to allow for it to lag if necessary [11:04:27] RECOVERY - mysqld processes on db1060 is OK: PROCS OK: 1 process with command name mysqld [11:04:27] PROBLEM - mathoid endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=10042): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb0533c7950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:04:27] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=19000): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fecb353a950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:04:43] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=8888): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fec37c8e950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:04:43] PROBLEM - mathoid endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=10042): Max retries exceeded with url: /?spec 
(Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1dbfd9c950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:04:53] PROBLEM - ores on scb2005 is CRITICAL: connect to address 10.192.0.34 and port 8081: Connection refused [11:04:53] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=8888): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f469a0db950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:05:03] RECOVERY - configured eth on scb2005 is OK: OK - interfaces up [11:05:04] PROBLEM - ores on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 8081: Connection refused [11:05:04] PROBLEM - ores uWSGI web app on scb2005 is CRITICAL: NRPE: Command check_uwsgi-ores not defined [11:05:18] mobrovac, services are not very happy on codfw - is there a rolling restart going on?
[11:05:23] PROBLEM - pdfrender on scb2005 is CRITICAL: connect to address 10.192.0.34 and port 5252: Connection refused [11:05:25] PROBLEM - ores uWSGI web app on scb2006 is CRITICAL: NRPE: Command check_uwsgi-ores not defined [11:05:25] RECOVERY - dhclient process on scb2005 is OK: PROCS OK: 0 processes with command name dhclient [11:05:33] PROBLEM - pdfrender on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 5252: Connection refused [11:05:53] PROBLEM - trendingedits endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=6699): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f61fded6950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:06:03] PROBLEM - trendingedits endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=6699): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f65eea5d950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:06:17] RECOVERY - MariaDB Slave SQL: s2 on db1060 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:06:21] RECOVERY - MariaDB Slave IO: s2 on db1060 is OK: OK slave_io_state Slave_IO_Running: Yes [11:06:33] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy [11:06:33] PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=7272): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8022248950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:06:33] RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.074 second response time [11:06:43] RECOVERY - mathoid 
endpoints health on scb2006 is OK: All endpoints are healthy [11:06:43] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=1970): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f367f9c4950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:07:03] PROBLEM - changeprop endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=7272): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7ff9967ad950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:07:03] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [11:07:13] PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=8080): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1b86ca5950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:07:43] PROBLEM - eventstreams on scb2005 is CRITICAL: connect to address 10.192.0.34 and port 8092: Connection refused [11:07:43] PROBLEM - cxserver endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=8080): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa923c57950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:07:53] PROBLEM - eventstreams on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 8092: Connection refused [11:07:53] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, 
port=19000): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f9189884950: Failed to establish a new connection: [Errno 111] Connection refused,)) [11:08:43] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:09:14] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3079557 (10Emijrp) @Esc3300 About 3 days (72 hours), at a roughly edit rate of 600 epm, I did 2.5 million edits, more or less. Just a note, my bot edits add descriptions in do... [11:10:23] RECOVERY - pdfrender on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.096 second response time [11:10:33] RECOVERY - mathoid endpoints health on scb2005 is OK: All endpoints are healthy [11:10:43] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [11:10:44] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [11:10:53] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy [11:11:37] PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refused [11:12:04] that's me, sorry about that [11:12:53] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:16:33] RECOVERY - changeprop endpoints health on scb2005 is OK: All endpoints are healthy [11:16:33] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:16:43] RECOVERY - eventstreams on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.081 second response time [11:16:53] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [11:16:53] RECOVERY - trendingedits endpoints health on scb2005 is OK: All endpoints are healthy [11:16:53] RECOVERY - ores on scb2005 is OK: HTTP OK: HTTP/1.0 200 OK - 3147 bytes in 0.090 second response time [11:17:04] RECOVERY - ores uWSGI web app on scb2005 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [11:17:13] RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy [11:17:43] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [11:17:53] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:17:54] RECOVERY - eventstreams on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.088 second response time [11:18:03] RECOVERY - changeprop endpoints health on scb2006 is OK: All endpoints are healthy [11:18:03] RECOVERY - trendingedits endpoints health on scb2006 is OK: All endpoints are healthy [11:18:04] RECOVERY - ores on scb2006 is OK: HTTP OK: HTTP/1.0 200 OK - 3147 bytes in 0.083 second response time [11:18:23] RECOVERY - ores uWSGI web app on scb2006 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app [11:18:43] RECOVERY - cxserver endpoints health on scb2006 is OK: All endpoints are healthy [11:18:55] (03PS1) 10Alexandros Kosiaris: Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516 [11:20:07] (03CR) 10Alexandros Kosiaris: [C: 032] Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516 
(owner: 10Alexandros Kosiaris) [11:20:13] (03PS2) 10Alexandros Kosiaris: Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516 [11:20:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516 (owner: 10Alexandros Kosiaris) [11:20:28] 06Operations, 10ops-eqiad, 10DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079570 (10Marostegui) db1060 has been reimaged and recloned and it is now trying to catch up (GTID is enabled) [11:23:34] (03PS4) 10Elukey: Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) [11:24:47] (03CR) 10Elukey: [C: 032] Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [11:26:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] [11:27:09] !log end of hacking on install1002 (puppet re-enabled) [11:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:35] 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3079624 (10Beetlebeard) >>! In T158638#3076473, @Dzahn wrote: > @Beetlebeard how does that Gerrit link above look to you? > > direct link to diff: https://g... 
[11:32:41] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mobileapps']) [11:32:42] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [11:32:43] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=graphoid']) [11:32:44] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=citoid']) [11:32:45] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [11:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:46] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=cxserver']) [11:32:47] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=ores']) [11:32:48] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=eventstreams']) [11:32:49] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=pdfrender']) [11:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:51] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits']) [11:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:55] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet 
(tags: ['dc=codfw', 'cluster=scb', 'service=mobileapps']) [11:32:56] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid']) [11:32:57] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=graphoid']) [11:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=citoid']) [11:33:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [11:33:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=cxserver']) [11:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:02] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=ores']) [11:33:03] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=eventstreams']) [11:33:04] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=pdfrender']) [11:33:05] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits']) [11:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:28] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3079631 (10akosiaris) [11:36:31] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3079629 (10akosiaris) 05Open>03Resolved I 've just applied the puppet configs on the hosts and pooled them for all the services. @mobrovac, I suppose we... [11:37:20] 06Operations, 10ops-eqiad, 10DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079632 (10Marostegui) For the record, we are seeing the following disk errors (raid is fine and disks are online though): ``` #1 Media error count: 2... 
[11:37:26] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3079634 (10akosiaris) With T156631 and T159486 done we now have the required cap... [11:37:35] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3079637 (10akosiaris) 05Open>03Resolved a:03akosiaris [11:37:37] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3079639 (10akosiaris) [11:38:14] (03PS2) 10Gehel: wdqs: cleanup old GC logs [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248) [11:39:03] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3122: Connection refused [11:39:22] (03CR) 10Gehel: [C: 032] wdqs: cleanup old GC logs [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248) (owner: 10Gehel) [11:39:23] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3120: Connection refused [11:39:29] looking, host is depooled ^ [11:40:03] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.168 second response time [11:40:21] (03PS4) 10Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - 10https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) [11:40:23] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.167 second response time [11:41:00] wow, today is a 
crazy day [11:41:08] !log cleaning empty log file on elastic2001 (cronspam) [11:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:01] 06Operations, 10DNS, 10Domains, 10Traffic, 13Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#3079660 (10tomasz) Does anything else need to be done here or can we just close this task? [11:48:23] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:49] (03PS1) 10Marostegui: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) [11:54:40] (03PS10) 10Jcrespo: Start refactoring of mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896) [11:56:36] (03CR) 10Jcrespo: [C: 04-1] "The comment has to go away, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [11:58:33] (03PS2) 10Marostegui: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) [11:58:43] (03CR) 10Marostegui: "Thanks - as soon as I pushed it I noticed and I was amending :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [12:00:58] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [12:02:33] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [12:02:49] (03CR) 10jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [12:03:49] (03CR) 10Gehel: [C: 04-1] deployment-prep: Use apt experimental for elasticsearch servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson) [12:03:59] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2034 - T132416 (duration: 00m 50s) [12:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:05] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [12:07:54] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3079736 (10Esc3300) Many seem to be descriptions for category items (Something that might not be of much use to Wikipedia). On a few items I check, some languages use only th... [12:08:45] gehel: experimental at this point also contains HHVM 3.18, so make sure this is only applied to the elastic nodes, not sure if deployment-prep updates its packages automatically like the rest of labs [12:09:56] moritzm: yep, I'm looking into that. Since this is a temporary configuration (until ES5 is moved out of experimental) it probably makes sense to configure it on each of the elasticsearch nodes in deployment-prep, and not on something more generic [12:10:41] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 [12:10:46] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 [12:11:27] RECOVERY - MariaDB Slave Lag: s2 on db1060 is OK: OK slave_sql_lag Replication lag: 57.20 seconds [12:11:36] (03CR) 10Gehel: [C: 04-2] "While we do have project specific configurations (%{::labsproject}), we do not have a good way to specific something more specific. 
Since " [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson) [12:14:25] gehel: ack [12:16:21] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 (owner: 10Marostegui) [12:17:45] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 (owner: 10Marostegui) [12:17:55] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 (owner: 10Marostegui) [12:18:23] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:19:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2053 - T159414 (duration: 00m 43s) [12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:18] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [12:24:02] (03PS1) 10Marostegui: db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) [12:26:53] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [12:31:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui) [12:33:02] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui) [12:33:14] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui) 
[12:34:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 with less weight - T158193 (duration: 00m 40s) [12:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:19] T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193 [12:34:34] (03CR) 10Gehel: [C: 04-2] "In any case, the `apt` class is not applied on labs. I need to dig a bit more to see what the impact is if we add it to the elasticsearch " [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson) [12:39:56] !log Deploy ALTER table on db2028 (codfw s6 master) on the revision table - T159414 [12:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:02] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [12:47:04] 06Operations, 06Performance-Team, 10Thumbor: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3079788 (10Gilles) [12:53:00] !log analytics1040 back in service - testing the new Debian configuration [12:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:38] !log Just for the sake of having it logged: gtid_domain_id has been deployed in all the database servers - T149418 [12:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:44] T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418 [12:59:07] (03CR) 10Reedy: [C: 031] Save logs of generate CAPTCHA cron to /var/log/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/341197 (https://phabricator.wikimedia.org/T159610) (owner: 10Florianschmidtwelzow) [13:10:50] 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3079893 (10MoritzMuehlenhoff) hhvm-wikidiff2 and hhvm-luasandbox 
rebuilt without changes against the new 3.18 API. hhvm-tidy needed to be patched. Initially the build failed with ``` /home/jmm/rebuild/hhvm-tidy-0... [13:11:38] (03PS1) 10Reedy: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 [13:11:48] jouncebot: next [13:11:48] In 0 hour(s) and 48 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1400) [13:30:44] 06Operations, 10media-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830#3080003 (10Reedy) [13:37:08] 06Operations, 10media-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830#3080033 (10Reedy) [13:37:44] jouncebot: now [13:37:44] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [13:37:50] jouncebot: next [13:37:51] In 0 hour(s) and 22 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1400) [13:38:07] sorry for the spam :/ i hate doing jouncebot commands [13:49:35] Zppix: you can query it ;) [13:49:56] i know but i have enough bots in my pms [13:50:06] lol [13:50:24] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [13:54:54] (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) [13:56:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui) [13:57:32] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui) 
[13:57:40] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui) [13:58:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1060 weight - T158193 (duration: 00m 58s) [13:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:57] T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193 [13:59:46] jouncebot: now [13:59:46] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1400). Please do the needful. [14:00:04] reedy: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:33] It's my patch, so I'll deploy it :P [14:00:48] (03PS2) 10Reedy: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 [14:00:52] (03CR) 10Reedy: [C: 032] Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 (owner: 10Reedy) [14:00:58] but jouncebot told me there was nothing! 
[14:00:59] :D [14:01:24] hashar: to be fair it didnt lie it said 0 minutes it didnt say seconds [14:01:40] it said no deplokyments [14:01:46] guess It needed a refresh [14:01:56] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [14:02:42] (03Merged) 10jenkins-bot: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 (owner: 10Reedy) [14:02:55] (03CR) 10jenkins-bot: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 (owner: 10Reedy) [14:03:58] !log reedy@tin Synchronized docroot/: Fixup filebackend symlinks (duration: 00m 41s) [14:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:51] (03PS4) 10Jcrespo: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 [14:06:36] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080104 (10TheDJ) As far as i'm aware, this is NOT due to a typo. Safari simply implements a very limited set of referrer policies: 'no-referrer', 'origin', 'no-referrer-when-d... [14:08:09] (03CR) 10Jcrespo: "Question (out of scope)- shouldn't be the whole noc stuff in a different repo?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 10Jcrespo) [14:10:58] (03CR) 10Marostegui: [C: 031] Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 10Jcrespo) [14:22:23] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:25:45] (03PS11) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [14:25:47] (03PS1) 10Filippo Giunchedi: performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532 [14:25:49] (03PS1) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) [14:25:51] (03PS1) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) [14:25:53] (03PS1) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [14:27:13] (03CR) 10jerkins-bot: [V: 04-1] facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:27:24] (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:27:30] (03CR) 10jerkins-bot: [V: 04-1] facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:29:09] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080150 (10TheDJ) The Referrer-Policy HTTP Header is actually quite different from the meta header. I have created a mock request http://www.mocky.io/v2/58bec319260000201bf07c5... 
[14:30:31] (03CR) 10Jcrespo: [C: 032] Start refactoring of mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [14:31:11] (03PS2) 10Filippo Giunchedi: performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532 [14:31:13] (03PS2) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) [14:31:15] (03PS2) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) [14:31:17] (03PS12) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [14:31:19] (03PS2) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [14:31:51] (03PS3) 10Filippo Giunchedi: performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532 [14:31:56] (03CR) 10jerkins-bot: [V: 04-1] facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:32:01] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532 (owner: 10Filippo Giunchedi) [14:32:25] (03CR) 10jerkins-bot: [V: 04-1] facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:32:31] !log uploaded HHVM 3.18 builds of hhvm-tidy, hhvm-luasandbox and hhvm-wikidiff2 to the experimental section of apt.wikimedia.org (Bug: T158176) [14:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:32:36] T158176: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176 [14:33:02] (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [14:34:14] (03CR) 10Gehel: [C: 04-2] "Actually, experimental seems already available on deployment-prep nodes. Not where this comes from..." [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson) [14:35:03] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 607 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4217999 keys, up 127 days 6 hours - replication_delay is 607 [14:36:10] gehel: ^which host? [14:36:23] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4217967 keys, up 127 days 6 hours - replication_delay is 650 [14:36:45] * akosiaris looking at the redis replication issues [14:37:04] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195777 keys, up 127 days 6 hours - replication_delay is 58 [14:37:13] moritzm: I checked the elasticsearch and deployment-mediawiki05 [14:37:33] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:37:40] akosiaris: Giuseppe and I have a plan to tackle this issue, reducing the replication of the rdb redis instances..
[14:37:49] something more trivial like eqiad -> codfw [14:37:57] 1:1 [14:38:12] yeah it's kind of convoluted right now [14:38:37] and the lua stuff afaik sometimes heavily blocks the Redis thread, introducing lag [14:38:59] moritzm: actually, all the deployment-mediawiki* seem to have experimental configured [14:38:59] master_host:10.64.32.18 [14:38:59] master_port:6379 [14:38:59] master_link_status:down [14:39:08] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#3080156 (10ema) [14:39:27] hmm so it thinks rdb1007 is down ? [14:39:50] master_link_down_since_seconds:195 [14:39:53] yeah it's increasing [14:40:12] should alert again in 5 minutes or so [14:40:30] elukey: and you say that behavior is because of lua ? [14:41:49] akosiaris: no no I am reporting something that Giuseppe told me a while ago, that should be the issue.. I followed the replication like you did without finding why the rdb1007 link was down [14:42:18] it goes away after a while [14:42:28] I thought it was network blips [14:42:33] but everything looks good [14:42:51] so the theory that the master lags because of lua could be a good path of investigation [14:43:28] gehel: ah, indeed. according to git log godog enabled it for the git backport [14:44:33] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:44:51] (03Abandoned) 10DCausse: Add a bash script to fetch and update this repo [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340973 (owner: 10DCausse) [14:45:34] elukey: hmmm slowlog get 10 returns 9/10 PSYNC commands on rdb1007 [14:45:41] moritzm: you mean commit 80ba9cc7 ? Isn't that about tin / mira? Not deployment-prep ? [14:46:23] akosiaris: what does it mean?
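The replication fields pasted above (master_host, master_port, master_link_status, master_link_down_since_seconds) are in the `key:value` format printed by the Redis INFO command. A minimal sketch, assuming the text has already been captured from a slave, of how such output could be checked for a broken master link. The function names here are hypothetical and not part of any tooling mentioned in the channel.

```python
def parse_info_replication(text):
    """Parse 'key:value' lines as printed by the Redis INFO command."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def master_link_down_seconds(fields):
    """Return how long the master link has been down, or None if it is up."""
    if fields.get("master_link_status") != "down":
        return None
    return int(fields.get("master_link_down_since_seconds", 0))


# Sample values copied from the messages above.
sample = """master_host:10.64.32.18
master_port:6379
master_link_status:down
master_link_down_since_seconds:195"""

fields = parse_info_replication(sample)
print(fields["master_host"], master_link_down_seconds(fields))
```

A check like this only reports what the slave believes; as the discussion goes on to show, the master's own log is where the actual cause turned up.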
[14:46:25] !log restart labsdb1004 for config and data check [14:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:03] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195777 keys, up 127 days 6 hours - replication_delay is 658 [14:47:17] elukey: that at least 1 slave is trying repeatedly to get the differences [14:47:20] gehel: yeah, you're right, that was only tin/mira [14:47:32] and is (I allege) blocked on something [14:48:24] moritzm: I actually have no idea where this experimental config comes from. I can't see any class on those hosts that would bring it in. But there is probably some labs magic that I don't understand... [14:48:47] ^ hashar: any idea? [14:49:38] akosiaris: ahh ok so the last 10 commands received on rdb1007 are psync [14:50:05] the last 10 slow commands [14:50:15] where slow > X seconds.. lemme find the value of X [14:50:23] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:51:07] config get slowlog-log-slower-than [14:51:07] 1) "slowlog-log-slower-than" [14:51:07] 2) "10000" [14:51:14] that value is in μsecs [14:51:23] so 0.01 secs [14:52:46] this is weird [14:52:58] looks like those psync are not from codfw [14:54:27] pfff... the slave reports it's trying to sync, but the master doesn't know it ? Or the output of the "role" command is utter crap [14:54:36] (03PS2) 10Muehlenhoff: Enable base::firewall on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341501 [14:56:45] lsof does say they have established TCP connections though [14:58:11] but the client believes it's not connected ... [14:58:17] er the slave [14:59:23] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [14:59:37] but the server reports it as connected [14:59:49] elukey: the lua theory is becoming more and more credible to my eyes [15:00:53] akosiaris: maybe $sometimes something blocks the only running thread and game over, all blocks [15:01:18] elukey: wouldn't that also block all the clients (the mediawikis) ? [15:01:52] ah wait we might be gracefully failing there... [15:02:32] akosiaris: this is a good point, it could be useful to see if there is impact somewhere [15:05:48] (03CR) 10Muehlenhoff: [C: 032] Enable base::firewall on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341501 (owner: 10Muehlenhoff) [15:08:07] elukey: there we go [1932] 07 Mar 15:04:03.086 # Client id=46011339065 addr=10.192.32.133:38222 ... cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits. [15:08:08] [1932] 07 Mar 15:04:03.186 # Connection with slave 10.192.32.133:6479 lost. [15:08:33] woa [15:08:41] nice finding!! [15:08:49] where was it?? [15:08:52] logs [15:08:57] on the master [15:09:12] there is a setting according to my google-fu [15:09:20] client-output-buffer-limit [15:10:03] so the master drops the connection with the slave since it is filling the output buffers [15:10:07] this might rule out lua [15:10:27] so the slave lags and the master abandons it [15:10:32] ouch [15:10:33] poor redis [15:10:34] so [15:10:40] [1932] 07 Mar 15:04:04.793 * Background saving started by pid 7152 [15:10:40] [7152] 07 Mar 15:04:32.553 * DB saved on disk [15:10:44] [7152] 07 Mar 15:04:32.630 * RDB: 367 MB of memory used by copy-on-write [15:11:04] if I read this correctly the RDB file on this run was 367MB [15:11:25] config get client-output-buffer-limit [15:11:25] 1) "client-output-buffer-limit" [15:11:25] 2) "normal 0 0 0 slave 536870912 209715200 60 pubsub 33554432 8388608 60" [15:11:35] which is less than the 512MB in the config file.. no ?
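The `client-output-buffer-limit` value pasted above packs three client classes into one string, each as `<class> <hard limit> <soft limit> <soft seconds>` with the limits in bytes. A minimal sketch of splitting that string and converting the byte counts to MiB; the helper names are hypothetical and only the sample string comes from the log.

```python
MIB = 1024 * 1024


def parse_output_buffer_limits(value):
    """Split a client-output-buffer-limit string into per-class limits."""
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = {
            "hard_bytes": int(hard),
            "soft_bytes": int(soft),
            "soft_seconds": int(seconds),
        }
    return limits


# Value copied from the CONFIG GET output above.
raw = "normal 0 0 0 slave 536870912 209715200 60 pubsub 33554432 8388608 60"
limits = parse_output_buffer_limits(raw)

for name, lim in limits.items():
    print(
        name,
        "hard=%d MiB" % (lim["hard_bytes"] // MIB),
        "soft=%d MiB" % (lim["soft_bytes"] // MIB),
        "for %d s" % lim["soft_seconds"],
    )
```

Read this way, the slave class has two thresholds rather than one: a 512 MiB hard limit and a 200 MiB soft limit that applies over 60 seconds.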
[15:12:22] ah no, it's the 200MB soft limit [15:12:29] lol [15:13:45] the funny thing is previous log lines of today mention different rdb file sizes .. most well below the 200MB limit [15:14:09] mmm could it be that the slave tries a full sync when it gets dropped? [15:14:24] going to end up in a mess [15:14:34] until it finally auto-recovers [15:14:45] it supposedly should not [15:14:51] try the full resync that is [15:15:09] If you set up a slave, upon connection it sends a PSYNC command. If this is a reconnection and the master has enough backlog, only the difference (what the slave missed) is sent. Otherwise what is called a full resynchronization is triggered. [15:15:17] well, it's opportunistic of course [15:15:24] so it could be that you are right [15:15:36] I've increased the buffer limits to see what will happen btw [15:15:53] !log increase client-output-buffer-limit soft-limit to 500MB temporarily on rdb1007 [15:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] akosiaris: thanks! [15:17:19] https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer/ [15:17:22] :D :D :D [15:18:24] akosiaris: so raising the soft-limit means avoiding any throttling? [15:18:41] for the slaves yeah [15:18:56] I only changed the slave softlimit [15:20:12] super, just wanted to get the change [15:20:12] [1932] 07 Mar 15:19:47.398 * Full resync requested by slave.
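The hard and soft limits discussed above behave differently: a client is dropped as soon as its output buffer reaches the hard limit, or once it has stayed at or above the soft limit for longer than the configured number of seconds. A simplified model of that rule, not Redis's actual code; the buffer samples are invented for illustration and a limit of 0 is treated as disabled.

```python
MIB = 1024 * 1024


def disconnect_time(samples, hard_bytes, soft_bytes, soft_seconds):
    """Return the timestamp at which the client would be dropped, or None.

    samples: (timestamp_seconds, buffer_bytes) pairs in time order.
    """
    over_soft_since = None
    for ts, size in samples:
        if hard_bytes and size >= hard_bytes:
            return ts  # hard limit: dropped immediately
        if soft_bytes and size >= soft_bytes:
            if over_soft_since is None:
                over_soft_since = ts  # start of the soft-limit window
            elif ts - over_soft_since > soft_seconds:
                return ts  # stayed over the soft limit for too long
        else:
            over_soft_since = None  # dipped below: window resets
    return None


# Hypothetical slave stuck at 250 MiB of pending output, sampled every 10 s.
samples = [(t, 250 * MIB) for t in range(0, 91, 10)]

# Limits as configured: 512 MiB hard, 200 MiB soft for 60 s.
print(disconnect_time(samples, 512 * MIB, 200 * MIB, 60))

# Soft limit raised to 500 MiB, as in the !log entry above.
print(disconnect_time(samples, 512 * MIB, 500 * MIB, 60))
```

Under these assumed samples the first call reports a drop and the second does not, which is the effect the temporary soft-limit increase was meant to test.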
[15:20:18] you were right [15:20:23] it requests a full resync [15:20:30] (03PS1) 10Muehlenhoff: Enable base::firewall in role::test::system by default [puppet] - 10https://gerrit.wikimedia.org/r/341550 [15:20:36] I should have noticed it before [15:22:24] but now we know what the issue with Redis is, really nice finding [15:22:33] !log joal@tin Started deploy [analytics/aqs/deploy@e0da1bd]: (no justification provided) [15:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:40] I am also guilty of having increased the alarm's backoff to see if it was too sensitive [15:22:49] before even checking the logs [15:23:06] (03PS1) 10Jcrespo: Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 [15:23:24] I see now rdb2005 doing something with ~50Mbps [15:23:26] ^marostegui [15:23:40] well fetching data at ~50Mbps from rdb1007 [15:23:44] (03CR) 10Marostegui: [C: 031] Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo) [15:24:01] I am going to add moritz so he can give it a second look [15:24:38] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3080282 (10Andrew) [15:24:46] elukey: just failed again..
exact same error [15:25:17] maybe we should increase the max limit if possible [15:25:54] I am wondering if it's trying to sync the aof or the rdb file [15:26:42] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3080300 (10Papaul) [15:26:45] 06Operations, 10ops-codfw: apply hostname labels and update racktables for scb2005 (WMF6466) and scb2006 (WMF6468) - https://phabricator.wikimedia.org/T159487#3080298 (10Papaul) 05Open>03Resolved complete [15:26:59] (03PS1) 10Ema: cache_upload: test cookie stripping [puppet] - 10https://gerrit.wikimedia.org/r/341552 (https://phabricator.wikimedia.org/T137609) [15:27:12] the documentation does not mention AOF at all [15:27:30] which has me puzzled cause the RDB file sizes are lower than the limits now [15:28:10] er.. no... [15:28:19] -rw-r--r-- 1 redis redis 2.9G Mar 7 15:28 rdb1007-6379.aof [15:28:19] -rw-r--r-- 1 redis redis 1.3G Mar 7 15:25 rdb1007-6379.rdb [15:28:23] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:28:24] lol [15:28:31] ahaah [15:28:38] ok so the logs report something unrelated to the actual RDB file size [15:28:41] !log joal@tin Finished deploy [analytics/aqs/deploy@e0da1bd]: (no justification provided) (duration: 06m 08s) [15:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:46] [12621] 07 Mar 15:28:38.545 * RDB: 80 MB of memory used by copy-on-write [15:28:59] akosiaris: AOF should be a sort of journal IRIC [15:29:02] *IIRC [15:29:08] ok, my bad for assuming that number was the actual RDB filesize [15:29:26] ok so..
lemme retry that limit [15:29:34] (03CR) 10Ema: [V: 032 C: 032] cache_upload: test cookie stripping [puppet] - 10https://gerrit.wikimedia.org/r/341552 (https://phabricator.wikimedia.org/T137609) (owner: 10Ema) [15:30:14] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=rdb1007&panelId=14&fullscreen shows tons of space [15:30:22] and it is not even cached by the kernel [15:32:07] we should know if my change is working in about 3 mins [15:32:17] I've set the limit to the absurdly high number of 5G [15:32:24] both hard and soft limits [15:33:52] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10Ottomata) [15:34:02] (03PS2) 10Jcrespo: Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 [15:34:16] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Ottomata) [15:34:19] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10Ottomata) [15:34:20] (03CR) 10jerkins-bot: [V: 04-1] Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo) [15:34:44] akosiaris: +1 [15:34:55] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Ottomata) [15:34:57] 06Operations, 10ops-eqiad: check stat1004 (or another identical R430) for PCIe expansion space - https://phabricator.wikimedia.org/T151080#3080346 (10Ottomata) 05Open>03declined We will be getting GPU as the stat1002 replacement, instead of installing one in stat1004.
See: T159838 [15:35:32] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3080351 (10Papaul) @fgiunchedi thank you. However, there are some steps that I will not be able to perform such as - disable puppet on host - remove all remaining puppet... [15:35:37] (03PS3) 10Jcrespo: Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 [15:35:43] elukey: ok it actually worked [15:35:56] I expect icinga to notice it soon [15:36:04] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4192429 keys, up 127 days 7 hours - replication_delay is 0 [15:36:09] there we go [15:36:50] niceeeeeeeeeeeeeeeeee [15:36:53] \o/ [15:37:17] (03PS1) 10Elukey: Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530) [15:37:24] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4191759 keys, up 127 days 7 hours - replication_delay is 17 [15:38:38] (03PS2) 10Elukey: Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530) [15:39:39] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080355 (10Ottomata) [15:40:09] (03PS3) 10Elukey: Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530) [15:40:20] 06Operations, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata) [15:41:00] 06Operations, 10hardware-requests: EQIAD: stat1003 
replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata) [15:41:10] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10Ottomata) [15:41:38] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata) p:05Triage>03Normal [15:41:46] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080375 (10Ottomata) p:05Triage>03Normal [15:42:09] (03CR) 10Ottomata: [C: 032] Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [15:46:56] 06Operations, 10Ops-Access-Requests: Requesting access to researchers and statistics-users groups for niharika29 - https://phabricator.wikimedia.org/T159780#3080393 (10Ottomata) Hi @Niharika, thanks for the request. You'll need approver from your manager, and @Nuria should approve as well. FYI, I believe that... [15:47:11] 06Operations, 10Ops-Access-Requests: Requesting access to researchers and statistics-users groups for niharika29 - https://phabricator.wikimedia.org/T159780#3080395 (10Ottomata) Please have your manager post their approval here. 
[15:47:37] 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3080399 (10Nuria) Approved [15:47:39] 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3080400 (10Ottomata) [15:47:54] 06Operations, 10Traffic, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#3080401 (10ema) 05Open>03Resolved a:03ema [15:48:14] (03CR) 10Marostegui: [C: 031] Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo) [15:49:58] 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3080405 (10Ottomata) Oh prefect! We will soon be replacing stat1002 (T159838) and stat1003 (T159839) with newer hardware. When we do so, we will upgrade these to Jessie. [15:50:11] 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3080410 (10Ottomata) [15:53:37] 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3080432 (10matmarex) This is a misconfiguration of Wikipedia's short URLs. Correctly configured MediaWiki does not exhibit this issue, I can't reproduce locally. Sounds like... 
[15:56:48] (03PS1) 10Jcrespo: mariadb: Separate sanitarium role && monitore it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896) [15:58:39] (03PS2) 10Jcrespo: mariadb: Separate sanitarium role && monitor it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896) [16:01:03] jouncebot: next [16:01:03] In 0 hour(s) and 58 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1700) [16:05:16] (03PS1) 10Jcrespo: mariadb: separate sanitarium role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 [16:05:34] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:06:06] (03CR) 10jerkins-bot: [V: 04-1] mariadb: separate sanitarium role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo) [16:06:15] (03PS2) 10Jcrespo: mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 [16:06:39] (03CR) 10Marostegui: [C: 031] "For tracking can you include the bugid: https://phabricator.wikimedia.org/T150850" [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo) [16:07:12] (03CR) 10jerkins-bot: [V: 04-1] mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo) [16:08:17] (03CR) 10Jcrespo: "> For tracking can you include the bugid: https://phabricator.wikimedia.org/T150850" [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo) [16:08:52] (03PS3) 10Jcrespo: mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850) [16:09:22] (03CR) 10Marostegui: [C: 031] "The key word I normally use to look for this ticket is: decouple :-)" [puppet] - 
10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [16:09:36] (03PS3) 10Jcrespo: mariadb: Separate sanitarium role && monitor it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896) [16:11:22] 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3080480 (10matmarex) [16:15:44] 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#3080506 (10ema) 05Open>03Resolved a:03ema Fix confirmed by @mobrovac, closing [16:17:02] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#3080524 (10ema) 05Open>03Resolved a:03ema [16:18:30] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5673/db1069.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [16:18:36] (03PS4) 10Jcrespo: mariadb: Separate sanitarium role && monitor it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896) [16:19:06] (03PS3) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) [16:19:08] (03PS3) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) [16:19:10] (03PS13) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [16:19:12] (03PS3) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 
(https://phabricator.wikimedia.org/T148541) [16:19:16] (03Draft1) 10Paladox: Fix gerritbot [puppet] - 10https://gerrit.wikimedia.org/r/341559 [16:19:18] (03PS2) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) [16:20:24] (03CR) 10Paladox: "Tested on http://gerrit-new.wmflabs.org/r/#/c/58/ with https://phab-01.wmflabs.org/T20" [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [16:20:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [16:22:13] !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1005.eqiad.wmnet [16:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:20] !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1006.eqiad.wmnet [16:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:26] !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1007.eqiad.wmnet [16:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:33] !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1008.eqiad.wmnet [16:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:03] (03PS1) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) [16:23:56] (03CR) 10jerkins-bot: [V: 04-1] authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [16:24:14] (03PS2) 10Filippo Giunchedi: swift: ignore spammy 507s from 
container-server [puppet] - 10https://gerrit.wikimedia.org/r/340142 (https://phabricator.wikimedia.org/T157237) [16:26:15] paladox: You may want to update your commit message, you put "better" twice ;) [16:26:22] (03CR) 10Filippo Giunchedi: [C: 032] swift: ignore spammy 507s from container-server [puppet] - 10https://gerrit.wikimedia.org/r/340142 (https://phabricator.wikimedia.org/T157237) (owner: 10Filippo Giunchedi) [16:26:38] (03PS3) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) [16:26:48] Oh thanks [16:26:56] elukey jynus merging your patches too [16:27:05] (03PS2) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) [16:27:07] thanks [16:27:28] de nada [16:27:30] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080579 (10Nuria) >As far as i'm aware, this is NOT due to a typo. Safari simply implements a very limited set of referrer policies: 'no-referrer', 'origin', 'no-referrer-when-... [16:28:10] (03PS4) 10Jcrespo: mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850) [16:28:13] (03CR) 10jerkins-bot: [V: 04-1] authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [16:28:54] godog: thanks! [16:29:36] de nada [16:29:44] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080581 (10Ottomata) @JKatzWMF, I think if you really really really want this fixed, you'll need to find a MediaWiki dev to revisit https://meta.wikimedia.org/wiki/Research_tal... 
[16:30:24] (03CR) 10Gehel: [C: 031] "I'd say merge this right now, it has gone through some testing and even if some bug remain, it should not break anything badly (no commit " [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [16:30:29] (03PS3) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) [16:31:19] (03CR) 10DCausse: [V: 032 C: 032] automate management of elasticsearch plugin repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [16:31:30] dcausse: thanks! [16:31:33] yw :) [16:32:12] (03Abandoned) 10Filippo Giunchedi: swift: add lvs configuration for esams [puppet] - 10https://gerrit.wikimedia.org/r/318145 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [16:33:04] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195378 keys, up 127 days 8 hours - replication_delay is 630 [16:33:24] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4194556 keys, up 127 days 8 hours - replication_delay is 651 [16:34:34] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:36:51] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#3080610 (10fgiunchedi) [16:36:54] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Experiment with Swift as docker registry backend - https://phabricator.wikimedia.org/T149098#3080608 (10fgiunchedi) 05stalled>03Resolved We're using codfw to test swift as docker 
registry backend. [16:37:56] (03CR) 10Chad: "Minor inline nit, otherwise lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [16:38:04] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:39:18] (03CR) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [16:40:22] !log decrease client-output-buffer-limit soft-limit back to normal values [16:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:24] PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:41:25] (03CR) 10Jcrespo: [C: 032] mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [16:43:04] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:43:12] (03CR) 10Chad: Gerrit: Fix bot so it uses the name of user instead of username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [16:43:44] (03PS4) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) [16:43:55] akosiaris: should we open a task for the redis lag? [16:44:12] just to track it (done, todo, etc..) 
[16:45:01] yeah we should [16:45:03] (03CR) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [16:45:22] (03PS4) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) [16:45:31] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3080649 (10RobH) @papaul: You can actually remove the puppet references, but you won't be able to self merge. You up to doing that? If not, I'll handle that step for you... [16:46:09] papaul: just let me know when you are ready to work on the decom for ms-be systems on https://phabricator.wikimedia.org/T159413 and i'll handle the steps you mentioned (or you can prepare the patches and I can merge, whichever you want! [16:46:10] =] [16:46:37] ms-fe [16:47:09] akosiaris: I'll do it later on or tomorrow! [16:47:38] akosiaris: so atm we still have the higher limit but not the soft? [16:47:49] no I 've reverted everything [16:47:53] ah okok [16:48:03] would it be good to leave it there for say a day? [16:48:07] just to see how it goes [16:48:09] but it seems like it broke again [16:48:13] yep.. [16:51:06] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080658 (10dr0ptp4kt) Just a note that I've reached out internally to a contact to see if this is okay and achievable. In searching the internet about imple... 
[16:52:14] (03PS1) 10Jcrespo: mariadb: Decouple parsercache role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341565 (https://phabricator.wikimedia.org/T150850) [16:54:11] (03Abandoned) 10EBernhardson: deployment-prep: Use apt experimental for elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson) [16:54:13] (03PS1) 10Gehel: osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) [16:54:42] (03PS5) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) [16:54:49] (03PS1) 10Reedy: Remove EducationProgram config back compat hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341567 [16:55:15] (03PS3) 10EBernhardson: deployment-prep: Use elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/341372 [16:58:03] !log re-increase temporarily the client-output-buffer-limit for rdb1007, phab task filing to follow [16:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:14] elukey: ^ [16:58:32] akosiaris: do you mind to paste the exact command? I'll add it to the task [16:58:39] (just for the records) [16:58:41] config set client-output-buffer-limit "normal 0 0 0 slave 2536870912 2536870912 60 pubsub 33554432 8388608 60" [16:58:45] super [16:59:14] so that's 2.5GB? [16:59:29] yup.. number straight from an RNG [16:59:35] (03PS1) 10Jcrespo: mariadb: Decouple beta role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) [16:59:37] only req that is it > 1.3G [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1700). Please do the needful. 
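Side note on the `config set client-output-buffer-limit` command pasted at 16:58:41: Redis reads that value as groups of four fields (client class, hard limit in bytes, soft limit in bytes, soft-limit seconds), and a group of `0 0 0` means no limit for that class. The sketch below is only an illustration for reading the pasted value; the helper name `parse_limits` and the constant names are made up here and are not part of Redis or of any tooling mentioned in the channel.

```python
# Sketch: unpack a Redis client-output-buffer-limit value into per-class limits.
# parse_limits, GIB and PASTED are hypothetical names used for illustration only.
GIB = 2 ** 30


def parse_limits(value):
    """Return {client_class: (hard_bytes, soft_bytes, soft_seconds)}."""
    fields = value.split()
    if len(fields) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(fields), 4):
        name, hard, soft, seconds = fields[i:i + 4]
        limits[name] = (int(hard), int(soft), int(seconds))
    return limits


# Value copied from the command pasted in the channel.
PASTED = ("normal 0 0 0 slave 2536870912 2536870912 60 "
          "pubsub 33554432 8388608 60")

limits = parse_limits(PASTED)
hard, soft, seconds = limits["slave"]

# Show the replica ("slave") hard limit in decimal GB and binary GiB.
print(f"slave hard limit: {hard / 1e9:.2f} GB ({hard / GIB:.2f} GiB)")
```

Read this way, the replica limit is about 2.54 GB (2.36 GiB), which matches the "2.5GB" estimate and clears the "> 1.3G" requirement mentioned in the channel. Because the hard and soft limits are equal, the 60-second soft window has no practical effect for that class.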
[17:00:04] Pchelolo and Smalyshev: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:24] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:42] 06Operations, 10DNS, 10Domains, 10Traffic, 13Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#3080701 (10Dzahn) @tomasz From Operations side it's done, but let's have @Croslof confirm for the legal side. [17:02:04] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4190860 keys, up 127 days 8 hours - replication_delay is 0 [17:02:11] (03PS1) 10Filippo Giunchedi: hieradata: make mwlog1001 primary log host [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) [17:04:04] (03CR) 10Filippo Giunchedi: "Nothing significant should change. 
Statistics jobs will see a bunch of changed files in the past due to overlap while rsync'ing from mwlo" [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [17:04:16] ACKNOWLEDGEMENT - HP RAID on db2048 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:11 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T159849 [17:04:20] 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159849#3080704 (10ops-monitoring-bot) [17:04:24] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4190693 keys, up 127 days 8 hours - replication_delay is 0 [17:05:01] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Decouple beta role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [17:05:49] ^papaul, is it you changing the disk or did it finally fail? [17:06:22] (03CR) 10Filippo Giunchedi: "Also I'll cleanup "fluorine" in comments and so on in later reviews." [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [17:06:40] 06Operations, 10ops-codfw, 10DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3080713 (10jcrespo) [17:06:43] 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159849#3080715 (10jcrespo) [17:07:02] 06Operations, 10ops-codfw, 10DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (10jcrespo) It finally failed, see T159849 summary. [17:07:16] (03CR) 10BBlack: [C: 031] "puppet-lint passes, and compiler output on cp1008 + radon looks good for not impacting non-lint usages. 
I don't think there's a good way " [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [17:07:21] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080720 (10jcrespo) [17:07:55] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (10jcrespo) ``` physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Failed) ``` [17:08:01] 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3080727 (10mobrovac) [17:08:05] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3080725 (10mobrovac) 05Resolved>03Open >>! In T159486#3079629, @akosiaris wrote: > I 've just applied the puppet configs on the hosts and pooled them for... 
[17:08:25] 06Operations, 06Services (doing), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3080729 (10mobrovac) [17:09:29] (03CR) 10Chad: [C: 031] "Lgtm, let's get this live" [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [17:09:53] 06Operations: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3080731 (10elukey) [17:09:58] akosiaris: --^ [17:10:43] ah snap wrong logs [17:10:48] fixing [17:12:45] (03PS2) 10Jcrespo: mariadb: Decouple beta role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) [17:13:10] 06Operations: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3080776 (10elukey) [17:13:28] better [17:13:59] 06Operations, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3080731 (10elukey) [17:16:52] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080832 (10Papaul) a:05Papaul>03Marostegui disk replacement complete [17:17:36] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3080836 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete [17:20:23] !Log phab task for Redis rdb1007 client-output-buffer-limit temp increase is T159850 [17:20:24] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [17:21:27] (03CR) 10Dzahn: "Error: Could not find template 'phabricator/initscripts/sshd-phab.service.erb' at /mnt/jenkins-workspace/puppet-compiler/5677/change/src/m" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox) [17:21:58] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3080845 (10Marostegui) Thanks! 
Disk is rebuilding! ``` root@db2044:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337F5EF0) Port Name: 1I Port Name: 2I... [17:22:49] (03PS1) 10Madhuvishy: nfs: Enable nfs exports in new instance maps-warper2 [puppet] - 10https://gerrit.wikimedia.org/r/341572 [17:22:51] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080848 (10Marostegui) Thanks - raid getting rebuilt ``` root@db2048:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E3350) Gen8 ServBP 12+2 at Port 1I,... [17:25:12] (03CR) 10Madhuvishy: [C: 032] nfs: Enable nfs exports in new instance maps-warper2 [puppet] - 10https://gerrit.wikimedia.org/r/341572 (owner: 10Madhuvishy) [17:25:25] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080881 (10NahidSultan) I'm closing this task as Google+ policy on this matter has changed since we started this discussion. This tick mark beside the websit... [17:26:19] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080883 (10NahidSultan) 05Open>03Invalid [17:26:51] 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080886 (10dr0ptp4kt) @NahidSultan, thanks for the update. 
[17:28:24] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:32:00] (03PS1) 10BBlack: linting: remove config-geo-test [dns] - 10https://gerrit.wikimedia.org/r/341573 (https://phabricator.wikimedia.org/T156100) [17:32:01] (03PS1) 10BBlack: add first discovery records [dns] - 10https://gerrit.wikimedia.org/r/341574 (https://phabricator.wikimedia.org/T156100) [17:32:08] (03CR) 10jenkins-bot: [V: 04-1] linting: remove config-geo-test [dns] - 10https://gerrit.wikimedia.org/r/341573 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [17:32:12] (03CR) 10jenkins-bot: [V: 04-1] add first discovery records [dns] - 10https://gerrit.wikimedia.org/r/341574 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [17:38:33] (03PS1) 10Ema: cache_maps: do not set cookies [puppet] - 10https://gerrit.wikimedia.org/r/341575 [17:44:20] godog joe Is puppet SWAT happening? [17:44:27] (03PS1) 10Ema: cache_misc: set timeout_idle to 120s [puppet] - 10https://gerrit.wikimedia.org/r/341576 (https://phabricator.wikimedia.org/T159429) [17:46:44] (03PS4) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) [17:46:46] (03PS4) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) [17:46:48] (03PS14) 10Filippo Giunchedi: prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [17:46:50] (03PS4) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [17:46:54] Pchelolo: sigh, I got dragged into other things and forgot, I can do it now though [17:47:08] Thank you :) [17:47:37] that's the patch carried over from last week. 
Tue is indeed better time for it [17:47:39] (03PS8) 10Filippo Giunchedi: Enable local logging for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) (owner: 10Ppchelko) [17:48:29] (03CR) 10jenkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [17:52:29] (03CR) 10Filippo Giunchedi: [C: 032] Enable local logging for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) (owner: 10Ppchelko) [17:53:20] Pchelolo: merged, I'm trying puppet on restbase1007 [17:53:30] thank you godog [17:54:10] It will be a no-op for now, need to restart RB to pick up the new config, but we have a deploy planned for later today that would make RB restart [17:54:33] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3080947 (10Legoktm) >>! In T159618#3079361, @Esc3300 wrote: > I think it might be worth attempting to determine the factors that lead to the rapid raise. > > - The edit ra... [17:54:56] godog: I will deploy RB in 10 mins or so, which will pick up the change [17:55:29] mobrovac: I can deploy too, but a bit later, still need to bike to the office [17:55:40] mobrovac Pchelolo neat, I'll force a puppet run now [17:56:43] godog: force for the whole RB cluster? [17:57:02] mobrovac: staggered but yeah [17:57:07] :) [17:58:58] mobrovac: I guess you have it all under control now, I will take 15 mins afk to bike to the office. Cant resist the temptation to get my coffee [17:59:28] yes please do :) [17:59:30] enjoy it [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1800). 
[18:00:15] no parsoid deploy today [18:00:22] (03PS2) 10Bartosz Dziewoński: Turn off patrolling for FlaggedRevs in bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341350 (https://phabricator.wikimedia.org/T158662) (owner: 10DatGuy) [18:00:55] No ores today [18:02:36] paladox: i can't really explain this compiler fail: http://puppet-compiler.wmflabs.org/5677/phab2001.codfw.wmnet/ except it's because the change itself is moving the template around and there is some kind of race [18:03:01] paladox: since it only shows up on phab2001, i will merge anyways to see if it happens or not [18:03:53] (03PS7) 10Dzahn: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox) [18:06:35] (03CR) 10Dzahn: [C: 032] "i can't explain why the compiler fails on phab2001 but i assume it must be some kind of race because the template gets renamed in this cha" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox) [18:09:13] (03CR) 10Dzahn: "no-op on iridium, but fail on phab2001 is real .. ehmm..." [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox) [18:09:32] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3080998 (10Betacommand) >>! In T159618#3080947, @Legoktm wrote: >>>! In T159618#3079361, @Esc3300 wrote: >> I think it might be worth attempting to determine the factors that... [18:09:34] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:10:04] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:07] ACKNOWLEDGEMENT - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/#/c/339786/7 [18:12:27] duh, "initscript" vs "initscripts" ... fixing [18:14:27] (03PS1) 10Dzahn: phabricator: fix location of sshd-phab.service template [puppet] - 10https://gerrit.wikimedia.org/r/341579 [18:16:25] 06Operations, 06Discovery, 06Discovery-Search (Current work): remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884#3081045 (10Gehel) The following should be sufficient: ``` swapoff -a sed -i.bak '/swap/d' fstab ``` This does not recover the 1Go of the swap partition (but we do... [18:16:49] (03CR) 10Dzahn: [C: 032] phabricator: fix location of sshd-phab.service template [puppet] - 10https://gerrit.wikimedia.org/r/341579 (owner: 10Dzahn) [18:17:03] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081047 (10RobH) [18:18:24] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [18:21:06] (03PS1) 10Muehlenhoff: Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581 [18:21:24] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:23:58] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081119 (10RobH) Switch ports disabled, diff below since the port info will be needed once these systems are unracked. [edit interfaces ge-6/0/0] - enable; + disable; [edit interfaces... 
[18:24:04] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:24:30] (03CR) 10Dzahn: "< icinga-wm> RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures" [puppet] - 10https://gerrit.wikimedia.org/r/341579 (owner: 10Dzahn) [18:25:06] (03CR) 10Dzahn: "follow-up fix on https://gerrit.wikimedia.org/r/341579, no-op on both now" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox) [18:25:21] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#2912333 (10EBernhardson) Yes it does, i've declined the 2.x task [18:25:37] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081145 (10RobH) [18:26:48] (03PS5) 10Dzahn: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [18:27:19] (03CR) 10Filippo Giunchedi: [C: 031] Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581 (owner: 10Muehlenhoff) [18:27:41] (03PS2) 10Muehlenhoff: Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581 [18:28:17] (03PS4) 10MarcoAurelio: Modify add/remove groups for I984157d5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341382 [18:28:24] RECOVERY - swift-account-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:28:24] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object 
at 0x7f5d1763c950: Failed to establish a new connection: [Errno 111] Connection refused,)) [18:28:34] PROBLEM - Restbase root url on restbase-dev1002 is CRITICAL: connect to address 10.64.32.112 and port 7231: Connection refused [18:30:21] (03CR) 10Muehlenhoff: [C: 032] Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581 (owner: 10Muehlenhoff) [18:31:05] (03CR) 10Dzahn: [C: 032] Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [18:31:10] (03PS6) 10Dzahn: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [18:32:00] 06Operations, 10DNS, 10Domains, 10Traffic, 13Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#3081194 (10CRoslof) 05Open>03Resolved All good with me. [18:33:57] (03PS1) 10RobH: decom of db2001-db2009 [puppet] - 10https://gerrit.wikimedia.org/r/341582 [18:34:19] (03CR) 10RobH: [C: 032] decom of db2001-db2009 [puppet] - 10https://gerrit.wikimedia.org/r/341582 (owner: 10RobH) [18:34:31] (03PS2) 10RobH: decom of db2001-db2009 [puppet] - 10https://gerrit.wikimedia.org/r/341582 [18:35:54] (03PS1) 10Mholloway: [Android] Create symlink to repo licenses dir in the SDK on CI [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099) [18:36:05] mutante: When you merge those template changes for its-* templates, you don't do a full gerrit restart right? 
It just needs a plugin reload [18:37:18] !log restbase deploy start of cd53670b [18:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:34] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [18:37:34] RECOVERY - Restbase root url on restbase-dev1002 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.017 second response time [18:38:06] (03PS2) 10Mholloway: [Android] Create symlink to repo licenses dir in the SDK on CI [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099) [18:38:34] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:39:10] (03PS1) 10RobH: decom of db2001-db2009 [dns] - 10https://gerrit.wikimedia.org/r/341585 [18:39:44] RainbowSprinkles: yes, it never needed a gerrit restart to change gerrit bot [18:39:58] Ok, just double checking :) [18:40:00] (03PS7) 10Dzahn: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox) [18:40:20] (03CR) 10RobH: [C: 032] decom of db2001-db2009 [dns] - 10https://gerrit.wikimedia.org/r/341585 (owner: 10RobH) [18:44:22] (03PS19) 10Dzahn: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [18:48:14] RECOVERY - HP RAID on db2044 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [18:49:28] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3081277 (10JKatzWMF) @Ottomata @Nuria Just spoke with @Nuria and between the above comments and our conversation I think I have what we need to figure out next steps....which a... 
[18:50:42] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3081287 (10Ottomata) Great! :) [18:52:50] !log rmmod acpi_pad on baham, was using 100% CPU T137647 [18:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:56] T137647: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647 [18:53:34] (03CR) 10Dzahn: [C: 04-1] "Error: Could not find template 'phabricator/initscripts/ssh-phab.systemd.erb'" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [18:54:54] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:55:37] (03PS1) 10Chad: Gerrit: Sort config sections alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/341587 [18:55:58] (03CR) 10Chad: "Technically a no-op, although puppet compiler will disagree. Needs visual check" [puppet] - 10https://gerrit.wikimedia.org/r/341587 (owner: 10Chad) [18:56:05] (03CR) 10Dzahn: [C: 04-1] "there are 2 separate issues here: a) "sshd-phab" vs. "ssh-phab" b) .conf and .service vs. .systemd and .upstart" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [18:57:27] (03CR) 10MaxSem: [C: 031] cache_maps: do not set cookies [puppet] - 10https://gerrit.wikimedia.org/r/341575 (owner: 10Ema) [18:57:43] 06Operations, 10RESTBase, 10service-runner, 13Patch-For-Review, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3081321 (10mobrovac) Something is wrong there. RB is not even creating the file, despite the fact that the directory permissions are correct:... 
[18:57:46] 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3081322 (10Beetlebeard) But without the CNAME entry and the verification the e-mails will be redirected to gmail and saved to google servers, but the users a... [19:02:03] 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3081347 (10Dzahn) I noticed that the existing entry for wikimedia.org using Google is just a CNAME for "google.com." but in this case it is supposed to be "... [19:02:58] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3081362 (10Nuria) >you'll need to find a MediaWiki dev to revisit https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_referrer_policy I *think* policy as is works, we will... [19:08:32] (03CR) 10Mholloway: [C: 04-1] "Needs update for non-"periodic" CI machines..." [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099) (owner: 10Mholloway) [19:08:46] !log rebooting baham (ns1) - low cpu frequencies issues like T147905 [19:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:52] T147905: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905 [19:09:04] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081365 (10RobH) [19:09:39] 06Operations, 10ops-codfw, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#1998016 (10RobH) a:05RobH>03Papaul Ok, this is now ready for on-site disk wipes of all the systems. Assigning to @papaul for followup. 
[19:09:55] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100% [19:10:44] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [19:11:14] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:54] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [19:15:04] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms [19:15:14] PROBLEM - Check whether ferm is active by checking the default input chain on baham is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [19:15:24] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:34] PROBLEM - Auth DNS on baham is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:15:44] PROBLEM - Check systemd state on baham is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:16:24] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms [19:17:19] PROBLEM - Auth DNS on ns1-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:17:39] expected. [19:18:17] because of reboot eh? [19:18:44] yep [19:19:19] PROBLEM - Auth DNS on ns1-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:19:34] PROBLEM - Check size of conntrack table on baham is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:20:30] !log rebooting baham (ns1) AGAIN - low cpu frequencies issues like T147905 - checking bios/idrac stuff
[19:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:35] T147905: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905
[19:21:14] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:44] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:22:24] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:22:54] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[19:23:04] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[19:23:39] !log branching 1.29.0-wmf15 refs T158996
[19:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:45] T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[19:28:00] Operations, Analytics, ChangeProp, Reading-Web-Trending-Service, Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3081445 (Ottomata) Good news! I think we don't have to rebuild varnishkafka. Quick test on MWV has varnishkafka working...
[19:28:09] RECOVERY - Auth DNS on ns1-v4 is OK: DNS OK: 0.046 seconds response time. www.wikipedia.org returns 208.80.154.224
[19:28:14] RECOVERY - Auth DNS on ns1-v6 is OK: DNS OK: 0.065 seconds response time. www.wikipedia.org returns 208.80.154.224
[19:28:15] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
[19:28:15] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[19:28:15] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[19:28:34] RECOVERY - Auth DNS on baham is OK: DNS OK: 0.050 seconds response time. www.wikipedia.org returns 208.80.154.224
[19:30:45] (PS1) Dzahn: phabricator: fix file names of systemd/upstart templates [puppet] - https://gerrit.wikimedia.org/r/341589 (https://phabricator.wikimedia.org/T137928)
[19:34:21] mutante: https://gerrit.wikimedia.org/r/#/c/341587/ will require some rebases for some outstanding patches, but will make our lives easier :)
[19:38:46] (PS20) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[19:38:51] (PS21) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[19:39:26] (PS22) Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[19:40:01] (PS15) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[19:40:38] Operations, ops-codfw, Traffic: baham (ns1) CPU-related issues - https://phabricator.wikimedia.org/T159870#3081551 (BBlack)
[19:42:38] (CR) Gehel: [C: 1] "We don't seem to be using any cookies on maps. This change looks fine." [puppet] - https://gerrit.wikimedia.org/r/341575 (owner: Ema)
[19:52:24] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[19:52:43] (PS2) Gehel: osm - waterline import script fix and adding logging [puppet] - https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631)
[19:55:05] Operations, Analytics-Kanban, ChangeProp, Reading-Web-Trending-Service, Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3081625 (Ottomata)
[20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T2000).
[20:00:43] (CR) Dzahn: [C: 2] "no-op in compiler http://puppet-compiler.wmflabs.org/5680/" [puppet] - https://gerrit.wikimedia.org/r/341589 (https://phabricator.wikimedia.org/T137928) (owner: Dzahn)
[20:01:36] RainbowSprinkles: added to queue, just wanna get the phab-ssh stuff done first
[20:02:21] Ok :)
[20:03:08] (CR) MaxSem: [C: 1] osm - waterline import script fix and adding logging [puppet] - https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) (owner: Gehel)
[20:03:59] (PS1) Chad: Gerrit: Remove reviewer counts cron, nobody is using it [puppet] - https://gerrit.wikimedia.org/r/341593
[20:04:29] https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.15 <-- no changes?
[20:05:13] Operations, Discovery, Discovery-Search (Current work): remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884#3081652 (RobH) >>! In T158884#3081045, @Gehel wrote: > The following should be sufficient: > > ``` > swapoff -a > sed -i.bak '/swap/d' fstab > ``` > > This does...
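The quoted suggestion pairs `swapoff -a` with a `sed` that deletes every fstab line containing "swap". As a rough illustration of the second step only, here is a Python sketch that is deliberately a little stricter than the quoted `sed`: it drops only entries whose filesystem-type column is `swap` and leaves comments alone. The name `drop_swap_entries` and the sample fstab text are invented for this example, and the function works on a string rather than on the real `/etc/fstab`.

```python
def drop_swap_entries(fstab_text: str) -> str:
    """Return fstab_text without entries whose type field (3rd column) is 'swap'.

    Hypothetical helper for illustration; it never touches /etc/fstab itself.
    """
    kept = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        # fstab columns: device, mount point, type, options, dump, pass
        if not is_comment and len(fields) >= 3 and fields[2] == "swap":
            continue  # skip the swap entry
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up sample content, not taken from any real host.
SAMPLE_FSTAB = (
    "# /etc/fstab: static file system information.\n"
    "UUID=1111 /    ext4 errors=remount-ro 0 1\n"
    "UUID=2222 none swap sw                0 0\n"
)

if __name__ == "__main__":
    print(drop_swap_entries(SAMPLE_FSTAB), end="")
```

Matching on the type column avoids removing a comment or mount point that merely mentions the word "swap", which a plain substring match would also delete.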
[20:06:39] TabbyCat: More likely someone else decided to create the page
[20:06:55] With just the boilerplate
[20:07:18] (CR) Dzahn: "after https://gerrit.wikimedia.org/r/#/c/341589/ has been merged, rebasing this and re-compiling it and it should work now ..." [puppet] - https://gerrit.wikimedia.org/r/339786 (owner: Paladox)
[20:08:39] (CR) Dzahn: "after https://gerrit.wikimedia.org/r/#/c/341589/ has been merged, rebasing this and re-compiling it and it should work now ..." [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[20:09:02] (CR) Dzahn: "i meant to put this comment on https://gerrit.wikimedia.org/r/#/c/339763/22" [puppet] - https://gerrit.wikimedia.org/r/339786 (owner: Paladox)
[20:09:16] ah
[20:09:18] woops
[20:09:20] wrong place
[20:10:08] paladox: what was the change since PS19 on https://gerrit.wikimedia.org/r/#/c/339763/22
[20:10:09] RainbowSprinkles: I see. Well, we might see some changes when twentyafterfour makes the train depart from the station ;)
[20:10:15] i dont think there was a need to change anything there, paladox
[20:10:29] what it needed was the separate fix to be merged
[20:10:53] oh
[20:10:58] those are rebases i think
[20:11:23] and "published edit"
[20:11:44] rebases only needed if it can also be merged
[20:11:51] compiles that again
[20:14:24] mutante ok it re adds the files now
[20:14:26] but i get
[20:14:27] Failed at step EXEC spawning /usr/bin/chown: No such file or directory
[20:15:15] oh i see
[20:15:22] /usr/bin/chown does not exist now
[20:15:23] strange
[20:15:43] paladox: /bin/chown ?
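The failure above comes from a hard-coded `/usr/bin/chown`, with `/bin/chown` suggested as the location on that host. A small Python sketch of checking where a binary actually resolves before hard-coding an absolute path; `resolve_binary` is a hypothetical helper, and the default search directories are an assumption about a typical Linux layout, not something taken from the patch under discussion.

```python
import os
import shutil
from typing import Optional, Sequence


def resolve_binary(name: str, search_dirs: Sequence[str] = ("/bin", "/usr/bin")) -> Optional[str]:
    """Return the first path under search_dirs where `name` is executable, else None."""
    # shutil.which accepts a PATH-style string through its `path` argument,
    # so the lookup is limited to the directories listed here.
    return shutil.which(name, path=os.pathsep.join(search_dirs))


if __name__ == "__main__":
    # The result depends on the host; no particular path is assumed here.
    print(resolve_binary("chown"))
```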
[20:15:49] but that sounds really broken
[20:15:56] thanks
[20:16:04] patch incoming to fix that
[20:16:15] i don't know the context, but ok
[20:16:56] (PS1) Urbanecm: Add HD logos for several projects [mediawiki-config] - https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618)
[20:17:01] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5681/" [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[20:17:16] (Draft1) Paladox: Phabricator: Fix incorrect path to chown [puppet] - https://gerrit.wikimedia.org/r/341598
[20:17:19] (PS2) Paladox: Phabricator: Fix incorrect path to chown [puppet] - https://gerrit.wikimedia.org/r/341598
[20:17:21] Operations, RESTBase, service-runner, Patch-For-Review, Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3081725 (mobrovac) Ok, it turns out the problem is that firejail doesn't have `/var/log/restbase` whitelisted. RB is actually logging stuff...
[20:17:22] mutante ^^ :)
[20:17:49] (CR) Dzahn: [C: 2] Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[20:18:35] ok, added to queue as well, one by one
[20:18:39] that fixes it :)
[20:18:41] tested
[20:19:01] and ok
[20:19:51] aaah, yes, i see what you mean. we'll get there in a minute
[20:19:54] (PS3) Paladox: Phabricator: Fix incorrect path to chown [puppet] - https://gerrit.wikimedia.org/r/341598 (https://phabricator.wikimedia.org/T158434)
[20:20:00] (PS4) Paladox: Phabricator: Fix incorrect path to chown [puppet] - https://gerrit.wikimedia.org/r/341598 (https://phabricator.wikimedia.org/T158434)
[20:20:06] Ok thanks :)
[20:22:29] (CR) Dzahn: [C: 2] Phabricator: Fix incorrect path to chown [puppet] - https://gerrit.wikimedia.org/r/341598 (https://phabricator.wikimedia.org/T158434) (owner: Paladox)
[20:23:15] thanks
[20:23:31] !log iridium - temp disabled puppet - converting phab-ssh service to base::service_unit, systemd on phab2001, upstart on iridium
[20:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:39] (PS16) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[20:24:04] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[ssh-phab]
[20:24:19] Hmm, /me wonders why ^^ is failing?
[20:24:50] because of the thing you just uploaded the fix for :)
[20:25:02] Oh ah
[20:25:02] :)
[20:25:04] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[20:25:07] :)
[20:26:26] !log installing librdkafka 0.9.4 on cp1045 (cache misc host) via .deb package to try it with varnishkafka in prod (ping bblack, ema, just in case)
[20:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:23] !log phab2001 - phab-ssh service converted to base::service_unit and with working systemd unit file. 'systemctl ssh-phab status' is active (running) (T158434)
[20:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:29] T158434: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434
[20:27:49] :)
[20:29:27] !log iridium - re-enabling puppet, ssh-phab service converted to base::service_unit, upstart template moved but unchanged, service restarted just fine.
[20:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:54] mutante i wonder if we should do a phd.conf file since i don't think if iridium was restarted phd will start, see this https://secure.phabricator.com/T4181#133830 script
[20:30:31] paladox: we should just focus on getting it to work properly with systemd and then reinstall iridium as phab1001 with jessie
[20:30:34] Operations, Analytics-Kanban, ChangeProp, Reading-Web-Trending-Service, Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3081776 (Ottomata) @elukey, I've dpkg -i librdkafka 0.9.4 on cp1045 and restarted varnishkafka. Let's let this ru...
[20:30:37] ok :)
[20:30:58] rebooting works too
[20:31:05] for systemd phd and ssh-phab
[20:31:12] so for converting things to base::service_unit purposes, keep the upstart part unchanged /no-op
[20:31:34] easier to merge if prod server isn't affected
[20:31:36] yep
[20:31:39] yep
[20:32:02] other fixes could go separate, but we should also just get rid of the trusty install and then we don't care
[20:32:17] (CR) Paladox: "Tested and works :)" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[20:32:23] ok
[20:32:38] mutante could you run puppet compiler on ^^ please?
[20:33:04] (PS1) Rush: nova: up fullstack allowed pool to 7 [puppet] - https://gerrit.wikimedia.org/r/341601
[20:33:09] yes
[20:34:00] thanks
[20:34:03] :)
[20:34:05] (CR) Andrew Bogott: [C: 1] nova: up fullstack allowed pool to 7 [puppet] - https://gerrit.wikimedia.org/r/341601 (owner: Rush)
[20:34:13] (PS2) Rush: nova: up fullstack allowed pool to 7 [puppet] - https://gerrit.wikimedia.org/r/341601
[20:34:58] (CR) Rush: [V: 2 C: 2] nova: up fullstack allowed pool to 7 [puppet] - https://gerrit.wikimedia.org/r/341601 (owner: Rush)
[20:35:07] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3081786 (Paladox) I believe that we now have full support for debian jessie as far as i can tell....
[20:35:20] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5682/" [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[20:35:36] :)
[20:35:43] only phab2001 is affected by the change
[20:35:51] iridium has no changes :_
[20:36:05] :)
[20:36:11] thanks for doing that :)
[20:37:19] !log twentyafterfour@tin Started scap: bump test wikis to 1.29.0-wmf.5 refs T158996
[20:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:24] T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[20:37:33] * Reedy hands twentyafterfour an extra 1
[20:39:23] paladox: no change on iridium is good, but i wish we could avoid 2 x "if $::initsystem == 'systemd'". an advantage of base::service_unit as we used it for ssh-phab is that we did not need those anymore
[20:40:04] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:40:14] i assume you tried that but it did not work in this case?
[20:40:53] with "systemd => true" and "upstart => true" in base::service_unit instead of "if"s
[20:41:55] can we reduce it from 2 x to 1 x if we can't remove both
[20:44:14] paladox: ideally i want the change for phd to be a nice one like the one for ssh-phab at https://gerrit.wikimedia.org/r/#/c/339763/22/modules/phabricator/manifests/vcs.pp where all the if/else, file, service is gone and only base::service_unit stays
[20:48:03] (CR) Dzahn: "@20after4 do you have an opinion on this? did you need --force before?" [puppet] - https://gerrit.wikimedia.org/r/340424 (owner: Paladox)
[20:49:07] (CR) Dzahn: "12:41 < mutante> paladox: no change on iridium is good, but i wish we could avoid 2 x "if $::initsystem == 'systemd'". an advantage of " [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[20:50:45] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3081822 (Dzahn) >>! In T158434#3081786, @Paladox wrote: > I believe that we now have full support...
[20:53:26] (PS2) Dzahn: Gerrit: Sort config sections alphabetically [puppet] - https://gerrit.wikimedia.org/r/341587 (owner: Chad)
[20:53:35] Operations, RESTBase, service-runner, Services (doing), User-mobrovac: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3081824 (mobrovac)
[20:54:52] mutante i would need to create the .conf file to do that since it will cause a lot of problems if i didnt do that if and else
[20:55:53] paladox: if it's true that phd service is supposed to be permanently stopped on the server that is not the "hot" one, and i think it is. then we need to get rid of "
[20:55:57] PHD should be supervising processes"
[20:56:01] icinga check
[20:56:15] Yep, i was meaning for iridium
[20:56:48] the .conf file is not needed for phab2001 as we will use systemd there. Just systemd is not on iridium
[20:56:50] i said that without any relation to the .conf file thing :)
[20:57:04] oh
[20:57:27] i am not sure yet what you mean will cause a lot of problems, but the best is you just show me with gerrit
[20:58:33] ;)
[20:58:37] oh, wait
[20:58:58] twentyafterfour: is it true that phd service should always be stopped on the server that is not "hot" ?
[20:59:04] (CR) 20after4: "I've rarely needed to use --force but it might make sense to have it just in case." [puppet] - https://gerrit.wikimedia.org/r/340424 (owner: Paladox)
[20:59:05] twentyafterfour would base::service_unit work for starting the phd service on iridium
[20:59:06] or could it just run on both
[20:59:19] i was thinking again/.
[20:59:34] mutante: phd needs to run on only the primary server until we have phab configured for proper cluster awareness
[21:00:11] twentyafterfour: ok, i thought so and just wanted to confirm. i will do something to ensure Icinga only adds the check for the primary one
[21:00:12] mutante i wont be able to test the change that would affect iridium
[21:00:17] I believe the ipv6 problems are resolved so we can probably enable clustering now and get phd running on multiple hosts
[21:00:19] but it may start it
[21:00:32] yep, ipv6 problems resolved :)
[21:00:50] (PS3) Paladox: Phabricator: Start and stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424
[21:00:52] paladox: I don't know about service_unit on iridium, no idea at all
[21:00:56] (PS4) Paladox: Phabricator: Start and stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424
[21:01:12] (PS5) Paladox: Phabricator: Start and stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424
[21:01:23] twentyafterfour: the good change today: ssh-phab service is converted to base::service_unit and just works on either, upstart or systemd without all the "if-then"
[21:01:36] twentyafterfour oh, it's used to manage upstart, sysvinit and systemd scripts.
[21:01:40] mutante: awesome
[21:01:50] paladox: I see
[21:01:51] mutante i could do the change, though how will we test it?
[21:02:06] this one https://gerrit.wikimedia.org/r/#/c/340158/16/modules/phabricator/manifests/init.pp
[21:02:49] (PS1) Rush: etcd: etcd-backup.py needs a type set for argparse [puppet] - https://gerrit.wikimedia.org/r/341607
[21:03:47] (PS2) Rush: etcd: etcd-backup.py needs a type set for argparse 'keep' [puppet] - https://gerrit.wikimedia.org/r/341607
[21:04:06] paladox: ? that's the one you already tested on a labs instance. did you mean clustering ?
[21:04:14] Yes
[21:04:20] wrong link then?
[21:04:43] the one you linked i already commented on
[21:04:59] mutante: the issue with phd running on multiple servers is this: when phd updates repositories it needs to run the git (push|pull) from the server that owns the repo. With cluster support enabled then phd will know how to schedule the job on the git master for that repo
[21:05:30] without clustering it'll assume the repo is local and run the operation on the rsync'd copy of the repo instead of the authoritative master copy
[21:06:17] aha, yea, that makes sense. i think you told me about the git pull before. thanks for the details
[21:06:46] ok, let's get the "phd" service converted to base::service_unit next
[21:06:53] like we did for ssh-phab
[21:06:56] there may be other issues but I think that's the only one that we need to deal with to get multiple PHDs working
[21:07:05] cool
[21:07:09] then let's re-install iridium maybe :)
[21:07:12] ;)
[21:07:32] mutante: sounds good
[21:08:04] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[21:08:06] i just would like to try and improve this: https://gerrit.wikimedia.org/r/#/c/340158/16/modules/phabricator/manifests/init.pp to
[21:09:06] to be more like this: https://gerrit.wikimedia.org/r/#/c/339763/22/modules/phabricator/manifests/vcs.pp
[21:09:31] mutante: indeed, that's a lot cleaner
[21:09:33] by which i mean "less or no if $::initsystem"
[21:10:10] mutante i could do a change
[21:10:20] though i would not know its impact on iridium
[21:10:45] well we can test it on iridium, if it breaks we can fix it
[21:10:47] paladox: happy to compile one to find that out
[21:10:52] ok
[21:10:53] not the end of the world if phd goes down for a minute
[21:10:53] thanks
[21:11:18] (PS1) BBlack: dns: add 10/8 to geo map [dns] - https://gerrit.wikimedia.org/r/341615
[21:11:26] (CR) jerkins-bot: [V: -1] dns: add 10/8 to geo map [dns] - https://gerrit.wikimedia.org/r/341615 (owner: BBlack)
[21:11:28] (PS6) BBlack: authdns lint support for full puppetized config [puppet] - https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[21:11:30] (PS1) BBlack: authdns: add 10/8 to geo map [puppet] - https://gerrit.wikimedia.org/r/341616
[21:11:37] (PS17) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[21:11:39] phd should work fine with service_unit I think
[21:11:40] mutante ^^ done
[21:11:48] twentyafterfour yeh
[21:12:21] systemd works with phd, just because we symlink phd for iridium (trusty only) it would not be easy to implement. Though im hoping it is
[21:12:30] twentyafterfour ^^
[21:12:48] service_unit i mean
[21:13:13] (PS2) BBlack: dns: add 10/8 to geo map [dns] - https://gerrit.wikimedia.org/r/341615
[21:13:20] (CR) jerkins-bot: [V: -1] dns: add 10/8 to geo map [dns] - https://gerrit.wikimedia.org/r/341615 (owner: BBlack)
[21:13:36] (CR) 20after4: Phabricator: Migrate to base::service_unit for phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[21:14:07] paladox: much better but some more things to change there
[21:14:12] paladox: what symlink are you talking about?
[21:14:15] (CR) Paladox: Phabricator: Migrate to base::service_unit for phd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[21:14:17] paladox: upstart => is set to false, but you want "true"
[21:14:23] +1
[21:14:46] twentyafterfour https://github.com/wikimedia/puppet/blob/production/modules/phabricator/manifests/phd.pp#L26
[21:14:51] ok
[21:14:57] the "before" line for class ::phabricator::phd ... hmm ...yea
[21:15:08] i understand why you had a special case there
[21:15:21] but maybe that can be removed ?
[21:15:51] mutante if i set it to upstart => true, then it will fail since there will be no template for upstart, see https://github.com/wikimedia/puppet/blob/production/modules/base/manifests/service_unit.pp#L96
[21:16:02] or can you do upstart => true without needing to define a template
[21:16:22] (PS18) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[21:16:44] paladox: why not the same procedure we just did with ssh-phab? first move the templates... then make the switch
[21:16:57] paladox: I don't think /etc/init.d/phd symlink is important, though we could skip upstart and configure it as sysv init
[21:16:58] the upstart init file is somewhere in the repo, right
[21:17:13] mutante: phd IS the initscript
[21:17:20] on iridium
[21:17:33] /etc/init.d/phd is a symlink to the phab code
[21:17:36] Yes
[21:17:40] lets configure it as a sysvinit
[21:17:40] phd is just an initscript written in php
[21:17:41] uhmm..
[21:17:54] upstart handles it because it's in /etc/init.d/
[21:18:02] i understand what you mean now, paladox
[21:18:09] yep
[21:18:09] *nods*
[21:18:18] theres
[21:18:18] https://secure.phabricator.com/T4181#133830
[21:18:20] so it needs to have an upstart config written or we need to keep it as sysv init
[21:18:21] we could use
[21:18:38] yeah something like that
[21:18:53] Ok, i will create a separate patch to introduce that
[21:19:05] paladox: that looks like it's close to what we need.
[21:19:09] earlier i said we should just focus on systemd and replacing iridium.. and keep iridium to "no-op" while converting this.. but now ... yea...
[21:19:10] :)
[21:19:30] what you guys said then
[21:19:36] :) :)
[21:20:24] unless service_unit { sysvinit=>true } would do the trick?
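The exchange above turns on two points: whether one wrapper declaration with per-init-system flags can replace `if $::initsystem` branches, and the concern that switching a flag on fails when no template exists for that init system. A toy Python model of that selection logic follows. It is not the `base::service_unit` implementation; the function name, the flag dictionary and the template path layout are all assumptions made for illustration.

```python
def pick_init_template(service, initsystem, enabled, available_templates):
    """Choose the init template for `service` on a host running `initsystem`.

    enabled: dict mapping init system name -> bool (the wrapper's flags).
    available_templates: set of template paths that exist in the repo.
    Returns the template path, or None if the init system is not enabled.
    Raises LookupError when the flag is on but no template exists.
    """
    if not enabled.get(initsystem, False):
        return None  # nothing is managed for this init system
    # Hypothetical layout: initscripts/<service>.<initsystem>.erb
    template = f"initscripts/{service}.{initsystem}.erb"
    if template not in available_templates:
        raise LookupError(f"no {initsystem} template for {service}: {template}")
    return template


if __name__ == "__main__":
    templates = {"initscripts/phd.systemd.erb"}
    flags = {"systemd": True, "upstart": False}
    print(pick_init_template("phd", "systemd", flags, templates))
    print(pick_init_template("phd", "upstart", flags, templates))
```

In this model the caller needs no branch on the init system; the cost is that every enabled init system must have its template present, which mirrors the concern raised about `upstart => true`.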
[21:21:30] i dunno, or i can live with one "if $::initsystem", but one should be enough, not 2 of them
[21:21:45] and once we are on jessie we can remove that again
[21:21:58] Yeh
[21:22:22] that would be fine, i just assumed first we can avoid all of them with base::service_unit
[21:22:29] like it was true for the other service
[21:22:44] PROBLEM - HHVM rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:04] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:14] PROBLEM - Nginx local proxy to apache on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:54] !log mw1177 - service hhvm restart
[21:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:34] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 74544 bytes in 0.134 second response time
[21:25:54] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.024 second response time
[21:26:04] RECOVERY - Nginx local proxy to apache on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.028 second response time
[21:26:08] (Draft1) Paladox: Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630
[21:26:12] (PS2) Paladox: Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630
[21:26:15] twentyafterfour mutante ^^
[21:27:20] (PS1) 20after4: group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - https://gerrit.wikimedia.org/r/341632
[21:28:17] (CR) Dzahn: "i downloaded "before" and "after" file and sorted them with "sort". then diff showed they are identical" [puppet] - https://gerrit.wikimedia.org/r/341587 (owner: Chad)
[21:28:58] (CR) Dzahn: [C: 2] Gerrit: Sort config sections alphabetically [puppet] - https://gerrit.wikimedia.org/r/341587 (owner: Chad)
[21:30:08] ah
[21:30:15] i can test that on phab-01
[21:30:20] i can start the instance now
[21:30:36] !log twentyafterfour@tin Finished scap: bump test wikis to 1.29.0-wmf.5 refs T158996 (duration: 53m 17s)
[21:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:42] T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[21:32:45] (PS3) 20after4: Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[21:33:12] jouncebot: now
[21:33:12] For the next 0 hour(s) and 26 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T2000)
[21:33:37] (CR) 20after4: [C: 1] "this should work" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[21:34:34] (CR) 20after4: [C: 2] group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - https://gerrit.wikimedia.org/r/341632 (owner: 20after4)
[21:36:51] (Merged) jenkins-bot: group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - https://gerrit.wikimedia.org/r/341632 (owner: 20after4)
[21:37:31] (PS6) Paladox: Phabricator: stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424
[21:37:35] (CR) Dzahn: "maybe we have reasons to do it for "stop" but for "start" i think we should rather know if there are errors rather than forcing it" [puppet] - https://gerrit.wikimedia.org/r/340424 (owner: Paladox)
[21:37:47] (CR) jenkins-bot: group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - https://gerrit.wikimedia.org/r/341632 (owner: 20after4)
[21:38:12] (PS7) Paladox: Phabricator: stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424
[21:38:43] (PS8) Paladox: Phabricator: stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424
[21:40:23] (PS4) Paladox: Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630
[21:40:37] (PS5) Paladox: Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630
[21:40:45] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.15 refs T158996
[21:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:51] T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[21:42:27] (CR) Dzahn: "i think "start on started mysql" will be an issue since mysql isn't running on same machine." [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[21:49:15] (CR) Dzahn: [C: 2] Phabricator: stop phd by force [puppet] - https://gerrit.wikimedia.org/r/340424 (owner: Paladox)
[21:50:48] (CR) Paladox: "> i think "start on started mysql" will be an issue since mysql isn't" [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[21:54:14] !log mobrovac@tin Started deploy [trending-edits/deploy@f855460]: (no justification provided)
[21:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:02] !log mobrovac@tin Finished deploy [trending-edits/deploy@f855460]: (no justification provided) (duration: 04m 48s)
[21:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:34] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[22:00:48] Operations, ops-eqiad, Phabricator, Release-Engineering-Team, hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3082031 (Paladox) Bump, any update on this please?
[22:01:58] !log mobrovac@tin Started deploy [zotero/translators@35da336]: Update transators for T158675
[22:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:04] !log mobrovac@tin Finished deploy [zotero/translators@35da336]: Update transators for T158675 (duration: 00m 06s)
[22:02:04] T158675: Update zotero translators on gerrit from the zotero repository on github - https://phabricator.wikimedia.org/T158675
[22:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:44] (PS19) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:07:48] (PS20) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:08:23] (PS21) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:08:47] !log mobrovac@tin Started deploy [citoid/deploy@5a7e053]: Deploy for T158675 T103478 T159486
[22:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:55] T103478: Citoid service should validate ISSN in mediawiki format - https://phabricator.wikimedia.org/T103478
[22:08:56] T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486
[22:08:56] T158675: Update zotero translators on gerrit from the zotero repository on github - https://phabricator.wikimedia.org/T158675
[22:11:24] !log mobrovac@tin Finished deploy [citoid/deploy@5a7e053]: Deploy for T158675 T103478 T159486 (duration: 02m 36s)
[22:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:01] Operations, ops-codfw, Patch-For-Review: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3082200 (Papaul)
[22:29:41] (PS1) Hashar: zuul: the deb packages creates /etc/zuul [puppet] - https://gerrit.wikimedia.org/r/341700
[22:30:06] Operations, ops-eqiad: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082218 (Jgreen)
[22:32:29] (CR) Hashar: [V: 1 C: 1] "Puppet compiler https://puppet-compiler.wmflabs.org/5683/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/341700 (owner: Hashar)
[22:33:34] Operations, ops-eqiad, fundraising-tech-ops: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3082237 (Jgreen)
[22:33:54] Operations, ops-eqiad: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082254 (Jgreen) a:Jgreen>None
[22:34:30] Operations, ops-eqiad: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082218 (Jgreen)
[22:35:48] (PS1) Chad: Gerrit: lower heap to 20g [puppet] - https://gerrit.wikimedia.org/r/341701
[22:38:17] (CR) Pnorman: [C: 1] osm - waterline import script fix and adding logging [puppet] - https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) (owner: Gehel)
[22:45:42] !log ms-be2028-ms-be2039 - signing puppet certs, salt-key, initial run
[22:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:44] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[23:02:50] (PS2) Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:03:19] (PS3) Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:10:28] (CR) Dzahn: "before:" [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: Dzahn)
[23:11:43] (CR) Dzahn: [C: 2] "tested on mwdebug1001 with apache-fast-test from tin" [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: Dzahn)
[23:14:53] (PS4) Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:16:02] (PS5) Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:16:15] (CR) Dzahn: "PS4: insert literal tab chars that we use here unlike almost everywhere else now" [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: Dzahn)
[23:20:34] (CR) Dzahn: [C: 2] redirect 2030.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: Dzahn)
[23:20:42] (PS6) Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:25:21] (PS2) Dzahn: zuul: the deb packages creates /etc/zuul [puppet] - https://gerrit.wikimedia.org/r/341700 (owner: Hashar)
[23:26:42] (CR) Dzahn: [C: 2] zuul: the deb packages creates /etc/zuul [puppet] - https://gerrit.wikimedia.org/r/341700 (owner: Hashar)
[23:30:33] (CR) Dzahn: "confirmed no change on contint1001/2001 as they are already running" [puppet] - https://gerrit.wikimedia.org/r/341700 (owner: Hashar)
[23:32:41] jouncebot: next
[23:32:42] In 0 hour(s) and 27 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T0000)
[23:34:19] (PS2) Dzahn: Gerrit: lower heap to 20g [puppet] - https://gerrit.wikimedia.org/r/341701 (owner: Chad)
[23:35:33] (CR) Dzahn: [C: 2] Gerrit: lower heap to 20g [puppet] - https://gerrit.wikimedia.org/r/341701 (owner: Chad)
[23:38:03] Gerrit just went down :(
[23:38:07] (PS1) Krinkle: Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597)
[23:38:08] 20 minutes before the SWAT
[23:38:15] Oh, I see mutante is probably just restarting it
[23:38:23] !log gerrit restarting for config changes 341701, 341587
[23:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:46] James_F: ^ config patch can go out ahead at any time - no-op since the var doesn't exist (it'll just create an unused var).
[23:39:57] * James_F nods. Let's SWAT it now so that it doesn't disrupt Beta Cluster QAers.
[23:41:19] RoanKattouw: yes, with RainbowSprinkles
[23:41:27] It's back now
[23:41:28] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[23:41:42] Sorry for freaking out
[23:41:57] one thing is odd, i still see your merged change in "incoming reviews"
[23:42:59] mutante: Lemme force a reindex of that change. I've noticed that sometimes happens with things merged right before a shutdown
[23:43:17] Index not flushed before shutdown, probably
[23:43:25] ok, cool. i tested logging out and in again but it stayed
[23:43:42] Fixed
[23:43:56] indeed, thanks :)
[23:45:07] Can anyone else load https://gerrit.wikimedia.org/r/#/c/341708/ ?
[23:45:17] RainbowSprinkles that is meant to be fixed
[23:45:36] that sounds like the bug wasn't fixed
[23:45:50] https://gerrit-review.googlesource.com/#/c/93479/
[23:45:54] It's not a big deal
[23:45:59] ok
[23:46:24] James_F: No, looking
[23:46:29] Ta.
[23:47:34] Weirdly indexed as well
[23:47:49] * RainbowSprinkles grumbles something about proper use of lucene
[23:47:57] James_F: Fixed'd
[23:48:04] Ta.
[23:48:05] Operations, ops-codfw, Patch-For-Review: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3082407 (Papaul)
[23:50:25] I went ahead and reindexed about 10 more changes on either side of the restart out of paranoia
[23:51:35] (CR) Jforrester: [C: 1] Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597) (owner: Krinkle)
[23:53:28] (CR) Thcipriani: [C: 1] Scap clean: abort if a branch is still in use [mediawiki-config] - https://gerrit.wikimedia.org/r/340250 (owner: Chad)
[23:53:58] PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:54:36] (PS6) Dzahn: Phabricator: Add a upstart init phd script [puppet] - https://gerrit.wikimedia.org/r/341630 (owner: Paladox)
[23:55:32] (PS22) Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)