[00:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T0000). Please do the needful.
[00:00:42] <mobrovac>	 !log restbase restarting in labs for T158628
[00:00:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:47] <stashbot>	 T158628: Create beta hewiktionary for testing InterwikiSorting & Cognate - https://phabricator.wikimedia.org/T158628
[00:01:34] <mobrovac>	 mutante: ^
[00:04:57] <bawolff>	 I tried to sneak something into swat at the last minute, not sure if it's too late
[00:07:20] <thcipriani>	 bawolff: jouncebot doesn't catch it if you add it close to deploy time, but I can get your change out for you :)
[00:07:35] <bawolff>	 thanks
[00:08:00] <bawolff>	 I literally got back to my apartment like 5 minutes ago
[00:08:31] <thcipriani>	 looks like you missed a newline in your comment on line 3491, not a big deal, but doesn't look intentional, want to amend before I merge?
[00:09:41] <wikibugs_>	 (PS2) Brian Wolff: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331
[00:09:56] <bawolff>	 That's embarrassing. Fixed :)
[00:10:05] <thcipriani>	 cool :)
[00:10:14] <wikibugs_>	 (PS3) Thcipriani: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff)
[00:10:23] <wikibugs_>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff)
[00:12:28] <wikibugs_>	 (Merged) jenkins-bot: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff)
[00:12:38] <wikibugs_>	 (CR) jenkins-bot: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - https://gerrit.wikimedia.org/r/341331 (owner: Brian Wolff)
[00:15:31] <thcipriani>	 bawolff: hrm, not sure whether or not this can be tested on mwdebug1002, but if so, I pulled it there just now
[00:15:46] <bawolff>	 thcipriani: it can, just not with the firefox extension
[00:17:08] <bawolff>	 Confirmed works (or at least the header is present, I can only really test with wget)
[00:18:04] <thcipriani>	 ok
[00:18:09] <thcipriani>	 going live everywhere then
[00:18:24] <wikibugs_>	 (CR) Chad: [C: 1] "Let's go ahead with this" [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: Paladox)
[00:18:54] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4197503 keys, up 126 days 15 hours - replication_delay is 0
[00:19:41] <bawolff>	 Thanks :)
[00:19:59] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341331|Add other WMF domains to foundationwiki CSP policy for Special:HideBanners]] (duration: 00m 40s)
[00:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:06] <thcipriani>	 ^ bawolff live everywhere now
[00:20:14] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4197560 keys, up 126 days 15 hours - replication_delay is 0
[00:22:34] <icinga-wm>	 RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:23:02] <bawolff>	 oh meh. The links were for wikidata.org, but it's actually www.wikidata.org. Oh well, I'll deal with that at some point later
[00:23:59] <wikibugs_>	 Operations, Education-Program-Dashboard, Programs-and-Events-Dashboard-Sprint 2, Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3078262 (Dzahn) >>! In T126295#3078016, @Ragesoss wrote:  > Some changes:...
[00:26:09] <wikibugs_>	 (CR) Dzahn: [C: 2] Partman: Add ms-be20[2-3][0-9] [puppet] - https://gerrit.wikimedia.org/r/341460 (owner: Papaul)
[00:27:04] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:31:35] <wikibugs_>	 (PS1) Brian Wolff: In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/341472
[00:31:47] <wikibugs_>	 Operations, Education-Program-Dashboard, Programs-and-Events-Dashboard-Sprint 2, Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3078276 (Ragesoss)  > Since that is a package manager that looks like it mi...
[00:32:53] <bawolff>	 thcipriani: Don't suppose I could also have https://gerrit.wikimedia.org/r/#/c/341472/1 (Fix for me being stupid and missing the www on www.wikidata.org)?
[00:33:18] <thcipriani>	 :)
[00:33:47] <wikibugs_>	 (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 (owner: Brian Wolff)
[00:33:51] <thcipriani>	 no problem
[00:34:39] <wikibugs_>	 (CR) Dzahn: [C: 2] Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: Paladox)
[00:34:44] <wikibugs_>	 (PS9) Dzahn: Gerrit: Enable config localUsernameToLowerCase [puppet] - https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: Paladox)
[00:35:21] <wikibugs_>	 (Merged) jenkins-bot: In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 (owner: Brian Wolff)
[00:36:03] <thcipriani>	 bawolff: pulled on mwdebug1002, everything look right there?
[00:36:16] <RainbowSprinkles>	 Oh, I thought y'alls were done
[00:36:45] <bawolff>	 yep
[00:37:02] <thcipriani>	 RainbowSprinkles: me too, one last one came up. Am I in your way?
[00:37:02] <wikibugs_>	 (CR) jenkins-bot: In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/341472 (owner: Brian Wolff)
[00:37:15] <RainbowSprinkles>	 thcipriani: No, was just taking advantage of quietness
[00:37:18] <thcipriani>	 :)
[00:37:31] <thcipriani>	 ok, one quick sync-file and there will be real quietness :)
[00:37:36] * RainbowSprinkles nods
[00:37:41] <RainbowSprinkles>	 (taking gerrit down for a min or two)
[00:39:04] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341472|In CSP policy for foundationwiki, wikidata.org -> www.wikidata.org]] (duration: 00m 40s)
[00:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:10] <thcipriani>	 ^ bawolff live now
[00:40:06] <bawolff>	 Whee, no more console warnings on https://wikimediafoundation.org/wiki/Thank_You/da :D
[00:40:14] <thcipriani>	 :)
[00:40:17] <bawolff>	 Thanks :)
[00:40:27] <thcipriani>	 yw, glad all's well
[00:41:54] <mutante>	 submitting https://gerrit.wikimedia.org/r/#/c/326150/
[00:42:05] <mutante>	 (since you don't see that part here)
[00:42:10] <mutante>	 i wish it would show 
[00:43:58] <RainbowSprinkles>	 !log gerrit: taking offline for a minute or two for case-insensitive login conversion
[00:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:24] <icinga-wm>	 PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused
[00:47:04] <icinga-wm>	 PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[00:48:46] <icinga-wm>	 ACKNOWLEDGEMENT - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn maintenance
[00:48:54] <icinga-wm>	 PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[00:49:26] <RainbowSprinkles>	 !log gerrit: coming back online now
[00:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:40] <icinga-wm>	 ACKNOWLEDGEMENT - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused daniel_zahn maintenance
[00:50:04] <icinga-wm>	 RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[00:50:24] <icinga-wm>	 PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[00:50:24] <icinga-wm>	 RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0)
[00:50:27] <RainbowSprinkles>	 Back up, and it looks like existing logins were preserved (was afraid due to the conversion)
[00:50:38] <mutante>	 great!
[00:50:44] <icinga-wm>	 PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment]
[00:51:04] <RainbowSprinkles>	 I was able to login via multiple case scenarios: chad, CHAD, cHAd
[00:51:06] <RainbowSprinkles>	 All worked
[00:51:21] <mutante>	 that bromine stuff is because it tried to clone in that moment...no biggie
[00:51:33] <mutante>	 RainbowSprinkles: nice
[00:51:34] <RainbowSprinkles>	 Yeah, couple of things just had bad timing, that always happens when we go offline
[00:51:35] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui]
[00:51:41] <mutante>	 fixes those
[00:52:34] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[00:52:43] <mutante>	 logged in as dZaHn
[00:52:44] <icinga-wm>	 RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[00:52:54] <mutante>	 gets corrected to normal spelling
[00:53:29] <RainbowSprinkles>	 Yep, that stays the same
[00:55:04] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:56:24] <icinga-wm>	 PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/discovery-stats]
[01:01:34] <icinga-wm>	 PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:11:04] <wikibugs_>	 Operations, Ops-Access-Requests: Requesting access to researchers and statistics-users groups for niharika29 - https://phabricator.wikimedia.org/T159780#3078358 (Niharika)
[01:15:54] <icinga-wm>	 RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:18:24] <icinga-wm>	 RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[01:21:24] <icinga-wm>	 RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[01:25:44] <icinga-wm>	 PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:30:34] <icinga-wm>	 RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:53:44] <icinga-wm>	 RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:23:28] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 08m 19s)
[02:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:59] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Mar  7 02:28:59 UTC 2017 (duration 5m 32s)
[02:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:14] <icinga-wm>	 PROBLEM - tileratorui on maps1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:56:04] <icinga-wm>	 RECOVERY - tileratorui on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.114 second response time
[03:01:14] <icinga-wm>	 PROBLEM - tileratorui on maps1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:02:06] <icinga-wm>	 RECOVERY - tileratorui on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.008 second response time
[03:10:24] <icinga-wm>	 PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[03:10:44] <icinga-wm>	 PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:10:54] <icinga-wm>	 RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[03:28:58] <Krinkle>	 !log foreachwikiindblist closed deleteEqualMessages.php (T45917) - purge upstreamed translations from closed wikis
[03:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:04] <stashbot>	 T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[03:38:44] <icinga-wm>	 RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[04:18:24] <icinga-wm>	 PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:30:24] <icinga-wm>	 PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:43:34] <icinga-wm>	 PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:46:24] <icinga-wm>	 RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:59:24] <icinga-wm>	 RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[05:09:32] <wikibugs_>	 Operations, Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3078828 (Nuria) On our end we have no preference how to fix issue (adding typo back would break other things, but I do not dispute that it might be fine), traffic & reading (...
[05:11:34] <icinga-wm>	 RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[05:21:35] <Krinkle>	  !log foreachwikiindblist 'all - closed - private' deleteEqualMessages.php (T45917) - purge upstreamed translations from remaining wikis
[05:21:35] <stashbot>	 T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[05:22:05] <Krinkle>	 !log foreachwikiindblist 'all - closed - private' deleteEqualMessages.php (T45917) - purge upstreamed translations from remaining wikis
[05:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:54] <icinga-wm>	 PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:24] <icinga-wm>	 PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:49:14] <icinga-wm>	 RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[05:52:14] <icinga-wm>	 PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:54:54] <icinga-wm>	 RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:02:34] <icinga-wm>	 PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:07:08] <wikibugs_>	 Operations, Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3078910 (Nuria) I was going to summarize but @mforns already did it in a prior ticket. For what is causing this issue see: https://phabricator.wikimedia.org/T148780#2891117...
[06:15:24] <icinga-wm>	 RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:30:34] <icinga-wm>	 RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:33:14] <icinga-wm>	 PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:59:24] <icinga-wm>	 PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:02:14] <icinga-wm>	 RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:05:46] <wikibugs_>	 Operations, ops-eqiad, DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079045 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1060.eqiad.wmnet'] ``` The l...
[07:08:56] <wikibugs_>	 (PS1) Marostegui: db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414)
[07:13:19] <wikibugs_>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:14:56] <wikibugs_>	 (Merged) jenkins-bot: db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:16:15] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2053 - T159414 (duration: 00m 41s)
[07:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:23] <stashbot>	 T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[07:16:25] <marostegui>	 !log  Deploy ALTER table on db2053 (s6) for the revision table - T159414
[07:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:09] <wikibugs_>	 (CR) jenkins-bot: db-codfw.php: Depool db2053 [mediawiki-config] - https://gerrit.wikimedia.org/r/341492 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:17:37] <wikibugs_>	 (PS3) Marostegui: production.my.cnf: Enable gtid_domaid_id [puppet] - https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418)
[07:20:32] <wikibugs_>	 (CR) Marostegui: [C: 2] production.my.cnf: Enable gtid_domaid_id [puppet] - https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[07:27:24] <icinga-wm>	 RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[07:30:34] <icinga-wm>	 RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 782506 msg: ocg_render_job_queue 0 msg
[07:30:35] <wikibugs_>	 Operations, ops-eqiad, DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079091 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1060.eqiad.wmnet'] ```  and were **ALL** successful.
[07:37:24] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 31 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:39:41] <marostegui>	 !log Stop MySQL db1067 to clone db1060 from it - T158193
[07:39:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:48] <stashbot>	 T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193
[07:42:24] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:55:04] <icinga-wm>	 PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:56:02] <wikibugs_>	 Operations, ops-eqiad, DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079117 (Marostegui) The data transfer between db1067 and db1060 was started around 20 minutes ago.
[08:21:39] <wikibugs_>	 (PS1) Marostegui: dbstore.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341500 (https://phabricator.wikimedia.org/T149418)
[08:24:04] <icinga-wm>	 RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[08:24:21] <wikibugs_>	 (PS1) Muehlenhoff: Enable base::firewall on ruthenium [puppet] - https://gerrit.wikimedia.org/r/341501
[08:25:18] <moritzm>	 !log installing systemd bugfix updates from jessie point release
[08:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:03] <wikibugs_>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/5668/" [puppet] - https://gerrit.wikimedia.org/r/341500 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[08:27:14] <wikibugs_>	 (CR) Marostegui: [C: 2] dbstore.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341500 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[08:31:30] <wikibugs_>	 (PS1) Ema: cache_text varnishtest: vary cookie [puppet] - https://gerrit.wikimedia.org/r/341502 (https://phabricator.wikimedia.org/T155314)
[08:33:54] <wikibugs_>	 (PS1) Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572)
[08:34:44] <wikibugs_>	 (CR) Marostegui: [C: 1] Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) (owner: Jcrespo)
[08:35:52] <wikibugs_>	 (CR) Ema: [V: 2 C: 2] cache_text varnishtest: vary cookie [puppet] - https://gerrit.wikimedia.org/r/341502 (https://phabricator.wikimedia.org/T155314) (owner: Ema)
[08:37:18] <wikibugs_>	 (PS2) Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572)
[08:37:51] <wikibugs_>	 (CR) Jcrespo: "This actually requires a restart to take effect :-/" [puppet] - https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) (owner: Jcrespo)
[08:38:01] <hashar>	 moritzm: good morning. Did you manage to upgrade hhvm on beta cluster ?  
[08:38:21] <wikibugs_>	 (PS1) Urbanecm: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/341504 (https://phabricator.wikimedia.org/T159803)
[08:39:16] <wikibugs_>	 Operations, MediaWiki-API, Traffic, Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3079196 (ema) >>! In T155314#3077838, @Tgr wrote:  > Setting (or not) `Vary` seems to be the right way to tell Varnish whether to cache or n...
[08:39:24] <icinga-wm>	 PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/varnish/tests/text/11-vary-cookie.vtc]
[08:39:55] <ema>	 looking ^
[08:40:57] <moritzm>	 hashar: I first need to run some more tests, I'll upgrade the beta cluster once these are successful, probably later this morning
[08:41:24] <icinga-wm>	 RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[08:41:34] <ema>	 that was a transient puppetfart
[08:41:45] <hashar>	 moritzm: last time there was some failure with the hhvm extension but no clue how to check that.
[08:43:23] <moritzm>	 it was a problem with the shipped default config, the proper fix is https://phabricator.wikimedia.org/T157306, but it can be addressed by running puppet after the upgrade
[08:43:32] <wikibugs_>	 (CR) Alexandros Kosiaris: [C: -1] "Looks mostly good, minor comments inline about the schedule and we also need to add the actual target by updating backup::set to have a jo" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/341371 (owner: Chad)
[08:52:34] <icinga-wm>	 PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:01:28] <wikibugs_>	 (PS1) Marostegui: tendri.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418)
[09:02:06] <wikibugs_>	 (PS2) Marostegui: tendril.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418)
[09:05:25] <wikibugs_>	 (CR) Marostegui: "https://puppet-compiler.wmflabs.org/5669/ looks good" [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:06:10] <wikibugs_>	 (CR) Marostegui: [C: 2] tendril.my.cnf: Enable gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341505 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[09:10:19] <elukey>	 !log temporary live hacking analytics-flex.cfg partman config on install1002
[09:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:34] <icinga-wm>	 RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[09:23:55] <ema>	 !log cache_text, cache_upload: upgrading to varnish 4.1.5 T159424
[09:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:00] <stashbot>	 T159424: Upgrade text and upload cache clusters to varnish 4.1.5 - https://phabricator.wikimedia.org/T159424
[09:24:05] <wikibugs_>	 Operations, Performance-Team, Thumbor: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3079330 (Gilles)
[09:25:24] <wikibugs_>	 (PS3) Volans: Add support for batch processing [software/cumin] - https://gerrit.wikimedia.org/r/341310
[09:35:32] <wikibugs_>	 Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (Esc3300) I think it might be worth attempting to determine the factors that lead to the rapid raise.     - The edit rate didn't seem that high and we could easily h...
[09:40:22] <wikibugs_>	 (PS6) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987
[09:41:13] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987 (owner: Jcrespo)
[09:41:34] <icinga-wm>	 PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:42:59] <wikibugs_>	 (PS7) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987
[09:50:44] <icinga-wm>	 PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:52:04] <wikibugs_>	 (CR) Alexandros Kosiaris: [C: 2] hieradata: add oresrdb in codfw [puppet] - https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: Filippo Giunchedi)
[09:52:11] <wikibugs_>	 (PS2) Alexandros Kosiaris: hieradata: add oresrdb in codfw [puppet] - https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: Filippo Giunchedi)
[09:52:14] <wikibugs_>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] hieradata: add oresrdb in codfw [puppet] - https://gerrit.wikimedia.org/r/340485 (https://phabricator.wikimedia.org/T139372) (owner: Filippo Giunchedi)
[10:03:07] <wikibugs_>	 Operations, Revision-Scoring-As-A-Service-Backlog, Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3079430 (akosiaris) I 've just merged the patch above and now oresrdb2001 is setup and `oresrdb.svc.codfw.wmnet` is working fine.  I 'll leave the task o...
[10:05:24] <wikibugs_>	 (PS1) Marostegui: labs.my.cnf: Add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418)
[10:07:47] <wikibugs_>	 Operations, HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3079452 (MoritzMuehlenhoff) HHVM extensions need to be rebuilt for the new 3.18 ABI.
[10:07:56] <wikibugs_>	 (CR) Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/5671/" [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[10:08:03] <wikibugs_>	 (PS1) Alexandros Kosiaris: Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512
[10:08:05] <wikibugs_>	 (PS2) Marostegui: labs.my.cnf: Add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418)
[10:08:57] <icinga-wm>	 RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[10:12:51] <wikibugs_>	 (CR) Marostegui: [C: 2] labs.my.cnf: Add gtid_domain_id [puppet] - https://gerrit.wikimedia.org/r/341511 (https://phabricator.wikimedia.org/T149418) (owner: Marostegui)
[10:16:36] <wikibugs_>	 (PS8) Jcrespo: Start refactoring of mariadb config template system [puppet] - https://gerrit.wikimedia.org/r/340987
[10:19:47] <icinga-wm>	 RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[10:24:06] <wikibugs_>	 (CR) Jcrespo: [C: 1] "I am ready to deploy this. This will deploy the wrong socket parameter for most servers, but they will not be affected until a new templat" [puppet] - https://gerrit.wikimedia.org/r/340987 (owner: Jcrespo)
[10:25:21] <wikibugs_>	 (CR) Alexandros Kosiaris: [C: 2] Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512 (owner: Alexandros Kosiaris)
[10:25:32] <wikibugs_>	 (PS2) Alexandros Kosiaris: Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512
[10:25:35] <wikibugs_>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] Assign scb2005, scb2006 their scb role [puppet] - https://gerrit.wikimedia.org/r/341512 (owner: Alexandros Kosiaris)
[10:29:47] <icinga-wm>	 PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:00] <wikibugs_>	 (03PS9) 10Jcrespo: Start refactoring of mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896)
[10:33:23] <wikibugs_>	 (03PS3) 10Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - 10https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572)
[10:34:48] <wikibugs_>	 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM, 15User-Joe: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#3079463 (10hashar)
[10:34:50] <wikibugs_>	 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found  when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#3079462 (10hashar)
[10:37:57] <icinga-wm>	 PROBLEM - DPKG on scb2006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:38:24] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - 10https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572) (owner: 10Jcrespo)
[10:39:57] <icinga-wm>	 RECOVERY - DPKG on scb2006 is OK: All packages OK
[10:43:11] <wikibugs_>	 06Operations, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3079486 (10elukey) ``` [cb02e6a5dec913ef6b98416a] [no req]   DBConnectionError from line 753 of /srv/mediawiki/php-1.29.0-wmf.14/includes/libs/rdbms/database/Data...
[10:44:27] <icinga-wm>	 PROBLEM - Disk space on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:44:47] <icinga-wm>	 PROBLEM - salt-minion processes on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:44:47] <icinga-wm>	 PROBLEM - Check systemd state on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:44:57] <icinga-wm>	 PROBLEM - MD RAID on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:44:58] <icinga-wm>	 PROBLEM - dhclient process on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:45:07] <icinga-wm>	 PROBLEM - DPKG on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:45:07] <icinga-wm>	 PROBLEM - configured eth on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:45:07] <icinga-wm>	 PROBLEM - puppet last run on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:45:27] <icinga-wm>	 PROBLEM - DPKG on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:46:57] <icinga-wm>	 PROBLEM - MD RAID on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:47:07] <icinga-wm>	 PROBLEM - puppet last run on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:47:07] <icinga-wm>	 PROBLEM - configured eth on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:47:07] <icinga-wm>	 PROBLEM - dhclient process on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:47:07] <icinga-wm>	 PROBLEM - Check systemd state on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:47:37] <icinga-wm>	 PROBLEM - Disk space on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:47:47] <icinga-wm>	 PROBLEM - salt-minion processes on scb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:49:10] <Amir1>	 do we have a scb2005? O.o
[10:50:25] <jynus>	 these are all up to me
[10:50:50] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.08 seconds
[10:50:54] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Send logs to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/321096 (https://phabricator.wikimedia.org/T149010) (owner: 10Ladsgroup)
[10:51:29] <marostegui>	 checking db1038
[10:52:05] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 370.02 seconds Marostegui checking
[10:52:14] <jynus>	 what is the issue?
[10:52:32] <marostegui>	 don't know yet
[10:53:58] <jynus>	 1000 inserts/s
[10:54:10] <jynus>	 300 updates/s
[10:54:39] <apergos>	 on the vslow host? 
[10:54:45] <marostegui>	 yep, the traffic has increased a lot since around 6am or so
[10:54:48] <marostegui>	 yes, it is a vslow host
[10:55:20] <apergos>	 weird
[10:55:34] <marostegui>	 storage looks fine
[10:56:14] <wikibugs_>	 (03PS6) 10Gehel: Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 (owner: 10Smalyshev)
[10:56:23] <jynus>	 cache invalidations?
[10:56:57] <jynus>	 on urwiki
[10:56:59] <jynus>	 probably
[10:57:11] <marostegui>	 there is a huge increase in tmp tables happening at 10
[10:57:47] <icinga-wm>	 RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[10:57:54] <marostegui>	 the last thing on SAL is ema upgrading to varnish 4.1.5 at 9:23
[10:58:12] <jynus>	 UPDATE /* Title::invalidateCache  */ `page` SET page_touched = '20170307104739' WHERE page_id = '184305' AND (page_touched < '20170307104739')
[10:59:07] <icinga-wm>	 RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 53 minutes ago with 0 failures
[10:59:08] <icinga-wm>	 RECOVERY - configured eth on scb2006 is OK: OK - interfaces up
[10:59:08] <icinga-wm>	 RECOVERY - DPKG on scb2006 is OK: All packages OK
[10:59:17] <icinga-wm>	 RECOVERY - Disk space on scb2006 is OK: DISK OK
[10:59:37] <icinga-wm>	 RECOVERY - salt-minion processes on scb2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:59:37] <icinga-wm>	 RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational
[10:59:47] <icinga-wm>	 RECOVERY - MD RAID on scb2006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:59:48] <icinga-wm>	 RECOVERY - dhclient process on scb2006 is OK: PROCS OK: 0 processes with command name dhclient
[11:01:19] <marostegui>	 Something happened at 10:00 and at 10:40, as per the graphs
[11:01:47] <icinga-wm>	 RECOVERY - MD RAID on scb2005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[11:02:13] <icinga-wm>	 RECOVERY - DPKG on scb2005 is OK: All packages OK
[11:02:23] <icinga-wm>	 RECOVERY - Disk space on scb2005 is OK: DISK OK
[11:02:31] <jynus>	 marostegui, avalanche of invalidations
[11:02:33] <icinga-wm>	 RECOVERY - salt-minion processes on scb2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:02:47] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1038 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[11:03:06] <marostegui>	 jynus: how is that triggered?
[11:03:08] <jynus>	 5x the regular traffic
[11:03:35] <jynus>	 and from 100 wps to 2000 wps
[11:03:53] <icinga-wm>	 RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational
[11:03:53] <icinga-wm>	 PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 4 minutes ago with 10 failures. Failed resources (up to 3 shown): Package[trending-edits/deploy],Package[eventstreams/deploy],Package[changeprop/deploy],Package[electron-render/deploy]
[11:04:12] <jynus>	 servers are ready to slow down, but the slow one is out of the replication check
[11:04:23] <jynus>	 to allow for it to lag if necessary
[11:04:27] <icinga-wm>	 RECOVERY - mysqld processes on db1060 is OK: PROCS OK: 1 process with command name mysqld
[11:04:27] <icinga-wm>	 PROBLEM - mathoid endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=10042): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb0533c7950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:04:27] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=19000): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fecb353a950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:04:43] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=8888): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fec37c8e950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:04:43] <icinga-wm>	 PROBLEM - mathoid endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=10042): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1dbfd9c950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:04:53] <icinga-wm>	 PROBLEM - ores on scb2005 is CRITICAL: connect to address 10.192.0.34 and port 8081: Connection refused
[11:04:53] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=8888): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f469a0db950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:05:03] <icinga-wm>	 RECOVERY - configured eth on scb2005 is OK: OK - interfaces up
[11:05:04] <icinga-wm>	 PROBLEM - ores on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 8081: Connection refused
[11:05:04] <icinga-wm>	 PROBLEM - ores uWSGI web app on scb2005 is CRITICAL: NRPE: Command check_uwsgi-ores not defined
[11:05:18] <jynus>	 mobrovac, services are not very happy on codfw; is there a rolling restart going on?
[11:05:23] <icinga-wm>	 PROBLEM - pdfrender on scb2005 is CRITICAL: connect to address 10.192.0.34 and port 5252: Connection refused
[11:05:25] <icinga-wm>	 PROBLEM - ores uWSGI web app on scb2006 is CRITICAL: NRPE: Command check_uwsgi-ores not defined
[11:05:25] <icinga-wm>	 RECOVERY - dhclient process on scb2005 is OK: PROCS OK: 0 processes with command name dhclient
[11:05:33] <icinga-wm>	 PROBLEM - pdfrender on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 5252: Connection refused
[11:05:53] <icinga-wm>	 PROBLEM - trendingedits endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=6699): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f61fded6950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:06:03] <icinga-wm>	 PROBLEM - trendingedits endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=6699): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f65eea5d950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:06:17] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: s2 on db1060 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:06:21] <icinga-wm>	 RECOVERY - MariaDB Slave IO: s2 on db1060 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:06:33] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy
[11:06:33] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=7272): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f8022248950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:06:33] <icinga-wm>	 RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.074 second response time
[11:06:43] <icinga-wm>	 RECOVERY - mathoid endpoints health on scb2006 is OK: All endpoints are healthy
[11:06:43] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=1970): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f367f9c4950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:07:03] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=7272): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7ff9967ad950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:07:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[11:07:13] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=8080): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1b86ca5950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:07:43] <icinga-wm>	 PROBLEM - eventstreams on scb2005 is CRITICAL: connect to address 10.192.0.34 and port 8092: Connection refused
[11:07:43] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb2006 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.20, port=8080): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa923c57950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:07:53] <icinga-wm>	 PROBLEM - eventstreams on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 8092: Connection refused
[11:07:53] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.34, port=19000): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f9189884950: Failed to establish a new connection: [Errno 111] Connection refused,))
[11:08:43] <icinga-wm>	 PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:09:14] <wikibugs_>	 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3079557 (10Emijrp) @Esc3300 About 3 days (72 hours), at a roughly edit rate of 600 epm, I did 2.5 million edits, more or less. Just a note, my bot edits add descriptions in do...
[11:10:23] <icinga-wm>	 RECOVERY - pdfrender on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.096 second response time
[11:10:33] <icinga-wm>	 RECOVERY - mathoid endpoints health on scb2005 is OK: All endpoints are healthy
[11:10:43] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[11:10:44] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[11:10:53] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy
[11:11:37] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refused
[11:12:04] <godog>	 that's me, sorry about that
[11:12:53] <icinga-wm>	 PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:16:33] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2005 is OK: All endpoints are healthy
[11:16:33] <icinga-wm>	 RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[11:16:43] <icinga-wm>	 RECOVERY - eventstreams on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.081 second response time
[11:16:53] <icinga-wm>	 RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational
[11:16:53] <icinga-wm>	 RECOVERY - trendingedits endpoints health on scb2005 is OK: All endpoints are healthy
[11:16:53] <icinga-wm>	 RECOVERY - ores on scb2005 is OK: HTTP OK: HTTP/1.0 200 OK - 3147 bytes in 0.090 second response time
[11:17:04] <icinga-wm>	 RECOVERY - ores uWSGI web app on scb2005 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app
[11:17:13] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb2005 is OK: All endpoints are healthy
[11:17:43] <icinga-wm>	 RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational
[11:17:53] <icinga-wm>	 RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[11:17:54] <icinga-wm>	 RECOVERY - eventstreams on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.088 second response time
[11:18:03] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2006 is OK: All endpoints are healthy
[11:18:03] <icinga-wm>	 RECOVERY - trendingedits endpoints health on scb2006 is OK: All endpoints are healthy
[11:18:04] <icinga-wm>	 RECOVERY - ores on scb2006 is OK: HTTP OK: HTTP/1.0 200 OK - 3147 bytes in 0.083 second response time
[11:18:23] <icinga-wm>	 RECOVERY - ores uWSGI web app on scb2006 is OK: ● uwsgi-ores.service - uwsgi-ores uwsgi app
[11:18:43] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb2006 is OK: All endpoints are healthy
[11:18:55] <wikibugs_>	 (03PS1) 10Alexandros Kosiaris: Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516
[11:20:07] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [C: 032] Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516 (owner: 10Alexandros Kosiaris)
[11:20:13] <wikibugs_>	 (03PS2) 10Alexandros Kosiaris: Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516
[11:20:17] <wikibugs_>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add scb2005, scb2006 in ORES redis firewalling [puppet] - 10https://gerrit.wikimedia.org/r/341516 (owner: 10Alexandros Kosiaris)
[11:20:28] <wikibugs_>	 06Operations, 10ops-eqiad, 10DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079570 (10Marostegui) db1060 has been reimaged and recloned and it is now trying to catch up (GTID is enabled)
[11:23:34] <wikibugs_>	 (03PS4) 10Elukey: Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530)
[11:24:47] <wikibugs_>	 (03CR) 10Elukey: [C: 032] Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey)
[11:26:53] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0]
[11:27:09] <elukey>	 !log end of hacking on install1002 (puppet re-enabled)
[11:27:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:35] <wikibugs_>	 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3079624 (10Beetlebeard) >>! In T158638#3076473, @Dzahn wrote: > @Beetlebeard how does that Gerrit link above look to you? >  > direct link to diff: https://g...
[11:32:41] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mobileapps'])
[11:32:42] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid'])
[11:32:43] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=graphoid'])
[11:32:44] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=citoid'])
[11:32:45] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium'])
[11:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:46] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=cxserver'])
[11:32:47] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=ores'])
[11:32:48] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=eventstreams'])
[11:32:49] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=pdfrender'])
[11:32:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:51] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits'])
[11:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:55] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mobileapps'])
[11:32:56] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=mathoid'])
[11:32:57] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=graphoid'])
[11:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:59] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=citoid'])
[11:33:00] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium'])
[11:33:01] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=cxserver'])
[11:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:02] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=ores'])
[11:33:03] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=eventstreams'])
[11:33:04] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=pdfrender'])
[11:33:05] <logmsgbot>	 !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=trendingedits'])
[11:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:28] <wikibugs_>	 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3079631 (10akosiaris)
[11:36:31] <wikibugs_>	 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3079629 (10akosiaris) 05Open>03Resolved I 've just applied the puppet configs on the hosts and pooled them for all the services.  @mobrovac, I suppose we...
[11:37:20] <wikibugs_>	 06Operations, 10ops-eqiad, 10DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3079632 (10Marostegui) For the record, we are seeing the following disk errors (raid is fine and disks are online though):  ``` #1 Media error count: 2...
[11:37:26] <wikibugs_>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW  preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3079634 (10akosiaris) With T156631 and T159486 done we now have the required cap...
[11:37:35] <wikibugs_>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac, 07Wikimedia-Multiple-active-datacenters: Assess SCB@CODFW  preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3079637 (10akosiaris) 05Open>03Resolved a:03akosiaris
[11:37:37] <wikibugs_>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3079639 (10akosiaris)
[11:38:14] <wikibugs_>	 (03PS2) 10Gehel: wdqs: cleanup old GC logs [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248)
[11:39:03] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3122: Connection refused
[11:39:22] <wikibugs_>	 (03CR) 10Gehel: [C: 032] wdqs: cleanup old GC logs [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248) (owner: 10Gehel)
[11:39:23] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3120: Connection refused
[11:39:29] <ema>	 looking, host is depooled ^
[11:40:03] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.168 second response time
[11:40:21] <wikibugs_>	 (03PS4) 10Jcrespo: Move tmpdir to /srv/labsdb/tmp to avoid filling up / partition [puppet] - 10https://gerrit.wikimedia.org/r/341503 (https://phabricator.wikimedia.org/T159572)
[11:40:23] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.167 second response time
[11:41:00] <jynus>	 wow, today is a crazy day
[11:41:08] <gehel>	 !log cleaning empty log file on elastic2001 (cronspam)
[11:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:01] <wikibugs_>	 06Operations, 10DNS, 10Domains, 10Traffic, 13Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#3079660 (10tomasz) Does anything else need to be done here or can we just close this task?
[11:48:23] <icinga-wm>	 PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:53:49] <wikibugs_>	 (03PS1) 10Marostegui: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416)
[11:54:40] <wikibugs_>	 (03PS10) 10Jcrespo: Start refactoring of mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896)
[11:56:36] <wikibugs_>	 (03CR) 10Jcrespo: [C: 04-1] "The comment has to go away, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[11:58:33] <wikibugs_>	 (03PS2) 10Marostegui: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416)
[11:58:43] <wikibugs_>	 (03CR) 10Marostegui: "Thanks - as soon as I pushed it I noticed and I was amending :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[12:00:58] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[12:02:33] <wikibugs_>	 (03Merged) 10jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[12:02:49] <wikibugs_>	 (03CR) 10jenkins-bot: db-codfw.php: Repool db2034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341517 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[12:03:49] <wikibugs_>	 (03CR) 10Gehel: [C: 04-1] deployment-prep: Use apt experimental for elasticsearch servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson)
[12:03:59] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2034 - T132416 (duration: 00m 50s)
[12:04:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:05] <stashbot>	 T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[12:07:54] <wikibugs_>	 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3079736 (10Esc3300) Many seem to be descriptions for category items (Something that might not be of much use to Wikipedia).  On a few items I check, some languages use only th...
[12:08:45] <moritzm>	 gehel: experimental at this point also contains HHVM 3.18, so make sure this is only applied to the elastic nodes, not sure if deployment-prep updates its packages automatically like the rest of labs
[12:09:56] <gehel>	 moritzm: yep, I'm looking into that. Since this is a temporary configuration (until ES5 is moved out of experimental) it probably makes sense to configure it on each of the elasticsearch nodes in deployment-prep, and not on something more generic
[12:10:41] <wikibugs_>	 (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519
[12:10:46] <wikibugs_>	 (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519
[12:11:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db1060 is OK: OK slave_sql_lag Replication lag: 57.20 seconds
[12:11:36] <wikibugs_>	 (03CR) 10Gehel: [C: 04-2] "While we do have project specific configurations (%{::labsproject}), we do not have a good way to specify something more specific. Since " [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson)
[12:14:25] <moritzm>	 gehel: ack
[12:16:21] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 (owner: 10Marostegui)
[12:17:45] <wikibugs_>	 (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 (owner: 10Marostegui)
[12:17:55] <wikibugs_>	 (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341519 (owner: 10Marostegui)
[12:18:23] <icinga-wm>	 RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[12:19:12] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2053 - T159414 (duration: 00m 43s)
[12:19:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:18] <stashbot>	 T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[12:24:02] <wikibugs_>	 (03PS1) 10Marostegui: db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193)
[12:26:53] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[12:31:56] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui)
[12:33:02] <wikibugs_>	 (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui)
[12:33:14] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Repool db1060 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341520 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui)
[12:34:13] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 with less weight - T158193 (duration: 00m 40s)
[12:34:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:19] <stashbot>	 T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193
[12:34:34] <wikibugs_>	 (03CR) 10Gehel: [C: 04-2] "In any case, the `apt` class is not applied on labs. I need to dig a bit more to see what the impact is if we add it to the elasticsearch " [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson)
[12:39:56] <marostegui>	 !log Deploy ALTER table on db2028 (codfw s6 master) on the revision table - T159414
[12:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:02] <stashbot>	 T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[12:47:04] <wikibugs_>	 06Operations, 06Performance-Team, 10Thumbor: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065#3079788 (10Gilles)
[12:53:00] <elukey>	 !log analytics1040 back in service - testing the new Debian configuration
[12:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:38] <marostegui>	 !log Just for the sake of having it logged: gtid_domain_id has been deployed in all the database servers - T149418
[12:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:44] <stashbot>	 T149418: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418
[12:59:07] <wikibugs_>	 (03CR) 10Reedy: [C: 031] Save logs of generate CAPTCHA cron to /var/log/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/341197 (https://phabricator.wikimedia.org/T159610) (owner: 10Florianschmidtwelzow)
[13:10:50] <wikibugs_>	 06Operations, 07HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3079893 (10MoritzMuehlenhoff) hhvm-wikidiff2 and hhvm-luasandbox rebuilt without changes against the new 3.18 API. hhvm-tidy needed to be patched. Initially the build failed with     ``` /home/jmm/rebuild/hhvm-tidy-0...
[13:11:38] <wikibugs_>	 (03PS1) 10Reedy: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524
[13:11:48] <Reedy>	 jouncebot: next
[13:11:48] <jouncebot>	 In 0 hour(s) and 48 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1400)
[13:30:44] <wikibugs_>	 06Operations, 10media-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830#3080003 (10Reedy)
[13:37:08] <wikibugs_>	 06Operations, 10media-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830#3080033 (10Reedy)
[13:37:44] <Zppix>	 jouncebot:  now
[13:37:44] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 22 minute(s)
[13:37:50] <Zppix>	 jouncebot:  next
[13:37:51] <jouncebot>	 In 0 hour(s) and 22 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1400)
[13:38:07] <Zppix>	 sorry for the spam :/ i hate doing jouncebot commands
[13:49:35] <volans>	 Zppix: you can query it ;)
[13:49:56] <Zppix>	 i know but i have enough bots in my pms
[13:50:06] <volans>	 lol
[13:50:24] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo)
[13:54:54] <wikibugs_>	 (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193)
[13:56:08] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui)
[13:57:32] <wikibugs_>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui)
[13:57:40] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase weight db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341530 (https://phabricator.wikimedia.org/T158193) (owner: 10Marostegui)
[13:58:50] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1060 weight - T158193 (duration: 00m 58s)
[13:58:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:57] <stashbot>	 T158193: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193
[13:59:46] <hashar>	 jouncebot: now
[13:59:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 0 minute(s)
[14:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1400). Please do the needful.
[14:00:04] <jouncebot>	 reedy: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[14:00:33] <Reedy>	 It's my patch, so I'll deploy it :P
[14:00:48] <wikibugs_>	 (03PS2) 10Reedy: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524
[14:00:52] <wikibugs_>	 (03CR) 10Reedy: [C: 032] Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 (owner: 10Reedy)
[14:00:58] <hashar>	 but jouncebot told me there was nothing!
[14:00:59] <hashar>	 :D
[14:01:24] <Zppix>	 hashar:  to be fair it didn't lie, it said 0 minutes, it didn't say seconds
[14:01:40] <hashar>	 it said no deployments
[14:01:46] <hashar>	 guess it needed a refresh
[14:01:56] <Zppix>	 <jouncebot> No deployments scheduled for the next 0 hour(s) and 0 minute(s) 
[14:02:42] <wikibugs_>	 (03Merged) 10jenkins-bot: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 (owner: 10Reedy)
[14:02:55] <wikibugs_>	 (03CR) 10jenkins-bot: Fixup filebackend.php symlinks for noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341524 (owner: 10Reedy)
[14:03:58] <logmsgbot>	 !log reedy@tin Synchronized docroot/: Fixup filebackend symlinks (duration: 00m 41s)
[14:04:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:51] <wikibugs_>	 (03PS4) 10Jcrespo: Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487
[14:06:36] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080104 (10TheDJ) As far as i'm aware, this is NOT due to a typo. Safari simply implements a very limited set of referrer policies: 'no-referrer', 'origin', 'no-referrer-when-d...
[14:08:09] <wikibugs_>	 (03CR) 10Jcrespo: "Question (out of scope)- shouldn't be the whole noc stuff in a different repo?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 10Jcrespo)
[14:10:58] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] Add db-codfw.php to noc.wikimedia.org visible config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340487 (owner: 10Jcrespo)
[14:22:23] <icinga-wm>	 PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:25:45] <wikibugs_>	 (03PS11) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[14:25:47] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532
[14:25:49] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541)
[14:25:51] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541)
[14:25:53] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541)
[14:27:13] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[14:27:24] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[14:27:30] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[14:29:09] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080150 (10TheDJ) The Referrer-Policy HTTP Header is actually quite different from the meta header. I have created a mock request http://www.mocky.io/v2/58bec319260000201bf07c5...
[14:30:31] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] Start refactoring of mariadb config template system [puppet] - 10https://gerrit.wikimedia.org/r/340987 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo)
[14:31:11] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532
[14:31:13] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541)
[14:31:15] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541)
[14:31:17] <wikibugs_>	 (03PS12) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[14:31:19] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541)
[14:31:51] <wikibugs_>	 (03PS3) 10Filippo Giunchedi: performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532
[14:31:56] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[14:32:01] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [V: 032 C: 032] performance: create '/srv/org' directory [puppet] - 10https://gerrit.wikimedia.org/r/341532 (owner: 10Filippo Giunchedi)
[14:32:25] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[14:32:31] <moritzm>	 !log uploaded HHVM 3.18 builds of hhvm-tidy, hhvm-luasandbox and hhvm-wikidiff2 to the experimental section of apt.wikimedia.org (Bug: T158176)
[14:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:36] <stashbot>	 T158176: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176
[14:33:02] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[14:34:14] <wikibugs_>	 (03CR) 10Gehel: [C: 04-2] "Actually, experimental seems already available on deployment-prep nodes. Not sure where this comes from..." [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson)
[14:35:03] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 607 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4217999 keys, up 127 days 6 hours - replication_delay is 607
[14:36:10] <moritzm>	 gehel: ^which host?
[14:36:23] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4217967 keys, up 127 days 6 hours - replication_delay is 650
[14:36:45] * akosiaris looking at the redis replication issues
[14:37:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195777 keys, up 127 days 6 hours - replication_delay is 58
[14:37:13] <gehel>	 moritzm: I checked the elasticsearch and deployment-mediawiki05
[14:37:33] <icinga-wm>	 PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:37:40] <elukey>	 akosiaris: Giuseppe and I have a plan to tackle this issue, reducing the replication of the rdb redis instances..
[14:37:49] <elukey>	 something more trivial like eqiad -> codfw
[14:37:57] <elukey>	 1:1
[14:38:12] <akosiaris>	 yeah it's kind of convoluted right now
[14:38:37] <elukey>	 and afaik the lua stuff sometimes heavily blocks the Redis thread, introducing lag
[14:38:59] <gehel>	 moritzm: actually, all the deployment-mediawiki* seems to have experimental configured
[14:38:59] <akosiaris>	 master_host:10.64.32.18
[14:38:59] <akosiaris>	 master_port:6379
[14:38:59] <akosiaris>	 master_link_status:down
[14:39:08] <wikibugs_>	 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, and 2 others: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#3080156 (10ema)
[14:39:27] <akosiaris>	 hmm so it thinks rdb1007 is down ?
[14:39:50] <akosiaris>	 master_link_down_since_seconds:195
[14:39:53] <akosiaris>	 yeah it's increasing
[14:40:12] <akosiaris>	 should alert again in 5 minutes or so
[14:40:30] <akosiaris>	 elukey: and you say that behavior is because of lua ?
[14:41:49] <elukey>	 akosiaris: no no I am reporting something that Giuseppe told me a while ago, that should be the issue.. I followed the replication like you did without finding why the rdb1007 link was down
[14:42:18] <elukey>	 it goes away after a while
[14:42:28] <elukey>	 I thought it was network blips 
[14:42:33] <elukey>	 but everything looks good
[14:42:51] <elukey>	 so the theory that the master lags because of lua could be a good path of investigation
[14:43:28] <moritzm>	 gehel: ah, indeed. according to git log godog enabled it for the git backport
[14:44:33] <icinga-wm>	 RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[14:44:51] <wikibugs_>	 (03Abandoned) 10DCausse: Add a bash script to fetch and update this repo [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340973 (owner: 10DCausse)
[14:45:34] <akosiaris>	 elukey: hmmm slowlog get 10 returns 9/10 PSYNC commands on rdb1007
[14:45:41] <gehel>	 moritzm: you mean commit 80ba9cc7 ? Isn't that about tin / mira? Not deployment-prep ?
[14:46:23] <elukey>	 akosiaris: what does it mean? 
[14:46:25] <jynus>	 !log restart labsdb1004 for config and data check
[14:46:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:03] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 658 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195777 keys, up 127 days 6 hours - replication_delay is 658
[14:47:17] <akosiaris>	 elukey: that at least 1 slave is trying repeatedly to get the differences
[14:47:20] <moritzm>	 gehel: yeah, you're right, that was only tin/mira
[14:47:32] <akosiaris>	 and is (I allege) blocked on something
[14:48:24] <gehel>	 moritzm: I actually have no idea where this experimental config comes from. I can't see any class on those hosts that would bring it in. But there is probably some labs magic that I don't understand...
[14:48:47] <moritzm>	 ^ hashar: any idea?
[14:49:38] <elukey>	 akosiaris: ahh ok so the last 10 commands received on rdb1007 are psync
[14:50:05] <akosiaris>	 the last 10 slow commands
[14:50:15] <akosiaris>	 where slow > X seconds.. lemme find the value of X
[14:50:23] <icinga-wm>	 RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[14:51:07] <akosiaris>	 config get slowlog-log-slower-than
[14:51:07] <akosiaris>	 1) "slowlog-log-slower-than"
[14:51:07] <akosiaris>	 2) "10000"
[14:51:14] <akosiaris>	 that value is in μsecs
[14:51:23] <akosiaris>	 so 0.01 secs
[14:52:46] <akosiaris>	 this is weird
[14:52:58] <akosiaris>	 looks like those psync are not from codfw
[14:54:27] <akosiaris>	 pfff... the slave reports it's trying to sync, but the master doesn't know it ? Or the output of the "role" command is utter crap
[14:54:36] <wikibugs_>	 (03PS2) 10Muehlenhoff: Enable base::firewall on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341501
[14:56:45] <akosiaris>	 lsof does say they have established TCP connections though
[14:58:11] <akosiaris>	 but the client believes it's not connected ... 
[14:58:17] <akosiaris>	 er the slave
[14:59:23] <icinga-wm>	 PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:59:37] <akosiaris>	 but the server reports it as connected
[14:59:49] <akosiaris>	 elukey: the lua theory is becoming more and more credible to my eyes
[15:00:53] <elukey>	 akosiaris: maybe $sometimes something blocks the only running thread and game over, all blocks
[15:01:18] <akosiaris>	 elukey: wouldn't that block however all the clients (the mediawikis) as well ?
[15:01:52] <akosiaris>	 ah wait we might be gracefully failing there...
[15:02:32] <elukey>	 akosiaris: this is a good point, it could be useful to see if there is impact somewhere
[15:05:48] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Enable base::firewall on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341501 (owner: 10Muehlenhoff)
[15:08:07] <akosiaris>	 elukey: there we go 1932] 07 Mar 15:04:03.086 # Client id=46011339065 addr=10.192.32.133:38222 ... cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
[15:08:08] <akosiaris>	 [1932] 07 Mar 15:04:03.186 # Connection with slave 10.192.32.133:6479 lost.
[15:08:33] <elukey>	 woa
[15:08:41] <elukey>	 nice finding!!
[15:08:49] <elukey>	 where was it??
[15:08:52] <akosiaris>	 logs
[15:08:57] <akosiaris>	 on the master
[15:09:12] <akosiaris>	 there is a setting according to my google-fu
[15:09:20] <akosiaris>	 client-output-buffer-limit
[15:10:03] <elukey>	 so the master drops the connection with the slave since it is filling the output buffers
[15:10:07] <elukey>	 this might rule out lua
[15:10:27] <elukey>	 so the slave lags and the master abandons him
[15:10:32] <akosiaris>	 ouch
[15:10:33] <elukey>	 poor redis
[15:10:34] <akosiaris>	 so 
[15:10:40] <akosiaris>	 [1932] 07 Mar 15:04:04.793 * Background saving started by pid 7152
[15:10:40] <akosiaris>	 [7152] 07 Mar 15:04:32.553 * DB saved on disk
[15:10:44] <akosiaris>	 [7152] 07 Mar 15:04:32.630 * RDB: 367 MB of memory used by copy-on-write
[15:11:04] <akosiaris>	 if I read this correctly the RDB file on this run was 367MB
[15:11:25] <akosiaris>	 config get client-output-buffer-limit
[15:11:25] <akosiaris>	 1) "client-output-buffer-limit"
[15:11:25] <akosiaris>	 2) "normal 0 0 0 slave 536870912 209715200 60 pubsub 33554432 8388608 60"
[15:11:35] <akosiaris>	 which is less than the 512MB in the config file.. no ?
[15:12:22] <akosiaris>	 ah no, it's the 200MB soft limit 
[15:12:29] <akosiaris>	 lol
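The `client-output-buffer-limit` value quoted above is a flat list of four tokens per client class: class name, hard limit in bytes, soft limit in bytes, and soft-limit seconds. A minimal Python sketch for decoding it follows; the helper name `parse_output_buffer_limits` and the `BufferLimit` type are illustrative assumptions, not part of Redis or any client library.

```python
from typing import Dict, NamedTuple


class BufferLimit(NamedTuple):
    """One client class entry (hypothetical helper type)."""
    hard_bytes: int
    soft_bytes: int
    soft_seconds: int


def parse_output_buffer_limits(value: str) -> Dict[str, BufferLimit]:
    """Split the CONFIG GET string into {class name: BufferLimit}."""
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = BufferLimit(int(hard), int(soft), int(seconds))
    return limits


# The exact string returned on rdb1007 in the log above.
raw = "normal 0 0 0 slave 536870912 209715200 60 pubsub 33554432 8388608 60"
limits = parse_output_buffer_limits(raw)
slave = limits["slave"]

MIB = 1024 * 1024
# Hard limit in MiB, soft limit in MiB, soft-limit window in seconds.
print(slave.hard_bytes // MIB, slave.soft_bytes // MIB, slave.soft_seconds)
```

For the slave class this gives a 512 MiB hard limit and a 200 MiB soft limit over 60 seconds, matching the numbers discussed in the channel.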
[15:13:45] <akosiaris>	 the funny thing is previous log lines of today mention different rdb file sizes .. most well below the 200MB limit
[15:14:09] <elukey>	 mmm could it be that the slave tries a full sync when it gets dropped?
[15:14:24] <elukey>	 going to end up in a mess 
[15:14:34] <elukey>	 until it finally auto-recovers
[15:14:45] <akosiaris>	 it supposedly should not
[15:14:51] <akosiaris>	 try the full resync that is
[15:15:09] <akosiaris>	 If you set up a slave, upon connection it sends a PSYNC command. If this is a reconnection and the master has enough backlog, only the difference (what the slave missed) is sent. Otherwise what is called a full resynchronization is triggered.
[15:15:17] <akosiaris>	 well, it's opportunistic of course
[15:15:24] <akosiaris>	 so it could be that you are right
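The rule quoted from the Redis documentation can be modelled roughly as below. This is a simplified sketch, not Redis's actual implementation: the function name `psync_outcome`, the offset arithmetic, and the example numbers are all assumptions made for illustration.

```python
def psync_outcome(slave_offset: int, master_offset: int,
                  backlog_bytes: int, same_master: bool = True) -> str:
    """Return "partial" or "full" for a reconnecting slave (simplified model).

    A partial resync is only possible when the slave was following the same
    master and the bytes it missed are still held in the master's backlog.
    """
    # Oldest replication offset the master can still replay from its backlog.
    oldest_kept = max(0, master_offset - backlog_bytes)
    if same_master and oldest_kept <= slave_offset <= master_offset:
        return "partial"
    return "full"


# Hypothetical numbers: a 1,000,000-byte backlog on the master.
near = psync_outcome(slave_offset=9_500_000, master_offset=10_000_000,
                     backlog_bytes=1_000_000)
far = psync_outcome(slave_offset=5_000_000, master_offset=10_000_000,
                    backlog_bytes=1_000_000)
print(near, far)
```

In this model a slave that falls further behind than the backlog covers has no option but a full resync, which is the opportunistic behaviour described in the channel.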
[15:15:36] <akosiaris>	 I've increased the buffer limits to see what will happen btw
[15:15:53] <akosiaris>	 !log increase client-output-buffer-limit soft-limit to 500MB temporarily on rdb1007
[15:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:30] <elukey>	 akosiaris: thanks!
[15:17:19] <elukey>	 https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer/
[15:17:22] <elukey>	 :D :D :D 
[15:18:24] <elukey>	 akosiaris: so raising the soft-limit means avoiding any throttling?
[15:18:41] <akosiaris>	 for the slaves yeah
[15:18:56] <akosiaris>	 I only changed the slave softlimit
[15:20:12] <elukey>	 super, just wanted to get the change
[15:20:12] <akosiaris>	 [1932] 07 Mar 15:19:47.398 * Full resync requested by slave.
[15:20:18] <akosiaris>	 you were right
[15:20:23] <akosiaris>	 it requests a full resync
[15:20:30] <wikibugs_>	 (03PS1) 10Muehlenhoff: Enable base::firewall in role::test::system by default [puppet] - 10https://gerrit.wikimedia.org/r/341550
[15:20:36] <akosiaris>	 I should have noticed it before
[15:22:24] <elukey>	 but now we know what the issue with Redis is, really nice finding
[15:22:33] <logmsgbot>	 !log joal@tin Started deploy [analytics/aqs/deploy@e0da1bd]: (no justification provided)
[15:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:40] <elukey>	 I am also guilty of having increased the alarm's backoff to see if it was too sensitive
[15:22:49] <elukey>	 before even checking the logs
[15:23:06] <wikibugs_>	 (03PS1) 10Jcrespo: Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551
[15:23:24] <akosiaris>	 I see now rdb2005 doing something with ~50Mbps
[15:23:26] <jynus>	 ^marostegui
[15:23:40] <akosiaris>	 well fetching data at ~50Mbps from rdb1007
[15:23:44] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo)
[15:24:01] <jynus>	 I am going to add moritz so he can give it a second look
[15:24:38] <wikibugs_>	 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3080282 (10Andrew)
[15:24:46] <akosiaris>	 elukey: just failed again.. exact same error
[15:25:17] <elukey>	 maybe we should increase the max limit if possible
[15:25:54] <akosiaris>	 I am wondering if it's trying to sync the aof or the rdb file
[15:26:42] <wikibugs_>	 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3080300 (10Papaul)
[15:26:45] <wikibugs_>	 06Operations, 10ops-codfw: apply hostname labels and update racktables for scb2005 (WMF6466) and scb2006 (WMF6468) - https://phabricator.wikimedia.org/T159487#3080298 (10Papaul) 05Open>03Resolved complete
[15:26:59] <wikibugs_>	 (03PS1) 10Ema: cache_upload: test cookie stripping [puppet] - 10https://gerrit.wikimedia.org/r/341552 (https://phabricator.wikimedia.org/T137609)
[15:27:12] <akosiaris>	 the documentation does not mention AOF at all
[15:27:30] <akosiaris>	 which has me puzzled cause the RDB file sizes are lower than the limits now
[15:28:10] <akosiaris>	 er.. no...
[15:28:19] <akosiaris>	 -rw-r--r-- 1 redis redis 2.9G Mar  7 15:28 rdb1007-6379.aof
[15:28:19] <akosiaris>	 -rw-r--r-- 1 redis redis 1.3G Mar  7 15:25 rdb1007-6379.rdb
[15:28:23] <icinga-wm>	 RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[15:28:24] <akosiaris>	 lol
[15:28:31] <elukey>	 ahaah
[15:28:38] <akosiaris>	 ok so the logs write something irrelevant to the actual RDB file size
[15:28:41] <logmsgbot>	 !log joal@tin Finished deploy [analytics/aqs/deploy@e0da1bd]: (no justification provided) (duration: 06m 08s)
[15:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:46] <akosiaris>	 [12621] 07 Mar 15:28:38.545 * RDB: 80 MB of memory used by copy-on-write
[15:28:59] <elukey>	 akosiaris: AOF should be a sort of journal IRIC
[15:29:02] <elukey>	 *IIRC
[15:29:08] <akosiaris>	 ok, my bad for assuming that number was the actual RDB filesize
[15:29:26] <akosiaris>	 ok so.. lemme retry that limit
[15:29:34] <wikibugs_>	 (03CR) 10Ema: [V: 032 C: 032] cache_upload: test cookie stripping [puppet] - 10https://gerrit.wikimedia.org/r/341552 (https://phabricator.wikimedia.org/T137609) (owner: 10Ema)
[15:30:14] <elukey>	 https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=rdb1007&panelId=14&fullscreen shows tons of space 
[15:30:22] <elukey>	 and it is not even cached by the kernel
[15:32:07] <akosiaris>	 we should know if my change is working in about 3 mins
[15:32:17] <akosiaris>	 I've set the limit to the absurdly high number of 5G 
[15:32:24] <akosiaris>	 both hard and soft limits 
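The interplay of hard and soft limits discussed above can be sketched in Python. This is a simplified model of the behaviour described in the Redis configuration documentation (disconnect as soon as the hard limit is reached, or once the buffer has stayed at or above the soft limit for longer than the configured number of seconds; 0 disables a limit). The function name `first_disconnect` and the buffer growth rate are hypothetical and are not taken from the rdb1007 logs.

```python
from typing import Iterable, Optional, Tuple

MIB = 1024 * 1024


def first_disconnect(samples: Iterable[Tuple[int, int]], hard_bytes: int,
                     soft_bytes: int, soft_seconds: int) -> Optional[int]:
    """Return the time (seconds) at which the client would be dropped, or None.

    `samples` is a sequence of (time_in_seconds, buffer_size_in_bytes).
    """
    soft_since = None  # time at which the soft limit was first exceeded
    for t, size in samples:
        if hard_bytes and size >= hard_bytes:
            return t  # hard limit: dropped immediately
        if soft_bytes and size >= soft_bytes:
            if soft_since is None:
                soft_since = t
            elif t - soft_since > soft_seconds:
                return t  # stayed over the soft limit for too long
        else:
            soft_since = None  # dropped back under the soft limit
    return None


# Hypothetical slave output buffer growing by 20 MiB per second up to 1300 MiB.
samples = [(t, t * 20 * MIB) for t in range(0, 70, 5)]

original = first_disconnect(samples, 512 * MIB, 200 * MIB, 60)
soft_raised = first_disconnect(samples, 512 * MIB, 500 * MIB, 60)
both_raised = first_disconnect(samples, 5 * 1024 * MIB, 5 * 1024 * MIB, 60)
print(original, soft_raised, both_raised)
```

Under these assumed numbers, raising only the soft limit does not help because the hard limit is still reached, whereas raising both limits above the size of the transfer avoids the disconnect.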
[15:33:52] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10Ottomata)
[15:34:02] <wikibugs_>	 (03PS2) 10Jcrespo: Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551
[15:34:16] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Ottomata)
[15:34:19] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10Ottomata)
[15:34:20] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo)
[15:34:44] <elukey>	 akosiaris: +1
[15:34:55] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Ottomata)
[15:34:57] <wikibugs_>	 06Operations, 10ops-eqiad: check stat1004 (or another identical R430) for PCIe expansion space - https://phabricator.wikimedia.org/T151080#3080346 (10Ottomata) 05Open>03declined We will be getting GPU as the stat1002 replacement, instead of installing one in stat1004.   See: T159838
[15:35:32] <wikibugs_>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3080351 (10Papaul) @fgiunchedi thank you. However, there are some steps that I will not be able to perform such as  - disable puppet on host - remove all remaining puppet...
[15:35:37] <wikibugs_>	 (03PS3) 10Jcrespo: Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551
[15:35:43] <akosiaris>	 elukey: ok it actually worked
[15:35:56] <akosiaris>	 I expect icinga to notice it soon
[15:36:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4192429 keys, up 127 days 7 hours - replication_delay is 0
[15:36:09] <akosiaris>	 there we go
[15:36:50] <elukey>	 niceeeeeeeeeeeeeeeeee
[15:36:53] <elukey>	 \o/
[15:37:17] <wikibugs_>	 (03PS1) 10Elukey: Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530)
[15:37:24] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4191759 keys, up 127 days 7 hours - replication_delay is 17
[15:38:38] <wikibugs_>	 (03PS2) 10Elukey: Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530)
[15:39:39] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080355 (10Ottomata)
[15:40:09] <wikibugs_>	 (03PS3) 10Elukey: Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530)
[15:40:20] <wikibugs_>	 06Operations, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata)
[15:41:00] <wikibugs_>	 06Operations, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata)
[15:41:10] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080331 (10Ottomata)
[15:41:38] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3080357 (10Ottomata) p:05Triage>03Normal
[15:41:46] <wikibugs_>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3080375 (10Ottomata) p:05Triage>03Normal
[15:42:09] <wikibugs_>	 (03CR) 10Ottomata: [C: 032] Replace the journal volume name with unused in analytics-flex.cfg [puppet] - 10https://gerrit.wikimedia.org/r/341553 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey)
[15:46:56] <wikibugs_>	 06Operations, 10Ops-Access-Requests: Requesting access to researchers and statistics-users groups for niharika29 - https://phabricator.wikimedia.org/T159780#3080393 (10Ottomata) Hi @Niharika, thanks for the request. You'll need approver from your manager, and @Nuria should approve as well.  FYI, I believe that...
[15:47:11] <wikibugs_>	 06Operations, 10Ops-Access-Requests: Requesting access to researchers and statistics-users groups for niharika29 - https://phabricator.wikimedia.org/T159780#3080395 (10Ottomata) Please have your manager post their approval here.
[15:47:37] <wikibugs_>	 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3080399 (10Nuria) Approved
[15:47:39] <wikibugs_>	 06Operations, 10Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29  - https://phabricator.wikimedia.org/T159780#3080400 (10Ottomata)
[15:47:54] <wikibugs_>	 06Operations, 10Traffic, 13Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#3080401 (10ema) 05Open>03Resolved a:03ema
[15:48:14] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] Fix permissions for /var/run/mysqld dir so that the server can write [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/341551 (owner: 10Jcrespo)
[15:49:58] <wikibugs_>	 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3080405 (10Ottomata) Oh prefect!  We will soon be replacing stat1002 (T159838) and stat1003 (T159839) with newer hardware.  When we do so, we will upgrade these to Jessie.
[15:50:11] <wikibugs_>	 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3080410 (10Ottomata)
[15:53:37] <wikibugs_>	 06Operations, 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3080432 (10matmarex) This is a misconfiguration of Wikipedia's short URLs. Correctly configured MediaWiki does not exhibit this issue, I can't reproduce locally. Sounds like...
[15:56:48] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Separate sanitarium role && monitore it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896)
[15:58:39] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Separate sanitarium role && monitor it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896)
[16:01:03] <godog>	 jouncebot: next
[16:01:03] <jouncebot>	 In 0 hour(s) and 58 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1700)
[16:05:16] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: separate sanitarium role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558
[16:05:34] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:06:06] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: separate sanitarium role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo)
[16:06:15] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558
[16:06:39] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] "For tracking can you include the bugid: https://phabricator.wikimedia.org/T150850" [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo)
[16:07:12] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo)
[16:08:17] <wikibugs_>	 (03CR) 10Jcrespo: "> For tracking can you include the bugid: https://phabricator.wikimedia.org/T150850" [puppet] - 10https://gerrit.wikimedia.org/r/341558 (owner: 10Jcrespo)
[16:08:52] <wikibugs_>	 (03PS3) 10Jcrespo: mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850)
[16:09:22] <wikibugs_>	 (03CR) 10Marostegui: [C: 031] "The key word I normally use to look for this ticket is: decouple :-)" [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[16:09:36] <wikibugs_>	 (03PS3) 10Jcrespo: mariadb: Separate sanitarium role && monitor it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896)
[16:11:22] <wikibugs_>	 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3080480 (10matmarex)
[16:15:44] <wikibugs_>	 06Operations, 10Traffic, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#3080506 (10ema) 05Open>03Resolved a:03ema Fix confirmed by @mobrovac, closing
[16:17:02] <wikibugs_>	 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#3080524 (10ema) 05Open>03Resolved a:03ema
[16:18:30] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5673/db1069.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo)
[16:18:36] <wikibugs_>	 (03PS4) 10Jcrespo: mariadb: Separate sanitarium role && monitor it on prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341557 (https://phabricator.wikimedia.org/T143896)
[16:19:06] <wikibugs_>	 (03PS3) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541)
[16:19:08] <wikibugs_>	 (03PS3) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541)
[16:19:10] <wikibugs_>	 (03PS13) 10Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[16:19:12] <wikibugs_>	 (03PS3) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541)
[16:19:16] <wikibugs_>	 (03Draft1) 10Paladox: Fix gerritbot [puppet] - 10https://gerrit.wikimedia.org/r/341559
[16:19:18] <wikibugs_>	 (03PS2) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689)
[16:20:24] <wikibugs_>	 (03CR) 10Paladox: "Tested on http://gerrit-new.wmflabs.org/r/#/c/58/ with https://phab-01.wmflabs.org/T20" [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[16:20:46] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[16:22:13] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1005.eqiad.wmnet
[16:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:20] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1006.eqiad.wmnet
[16:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:26] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1007.eqiad.wmnet
[16:22:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:33] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/weight=40; selector: name=ms-fe1008.eqiad.wmnet
[16:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:03] <wikibugs_>	 (03PS1) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[16:23:56] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[16:24:14] <wikibugs_>	 (03PS2) 10Filippo Giunchedi: swift: ignore spammy 507s from container-server [puppet] - 10https://gerrit.wikimedia.org/r/340142 (https://phabricator.wikimedia.org/T157237)
[16:26:15] <eddiegp>	 paladox: You may want to update your commit message, you put "better" twice ;)
[16:26:22] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] swift: ignore spammy 507s from container-server [puppet] - 10https://gerrit.wikimedia.org/r/340142 (https://phabricator.wikimedia.org/T157237) (owner: 10Filippo Giunchedi)
[16:26:38] <wikibugs_>	 (03PS3) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689)
[16:26:48] <paladox>	 Oh thanks
[16:26:56] <godog>	 elukey jynus merging your patches too
[16:27:05] <wikibugs_>	 (03PS2) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[16:27:07] <jynus>	 thanks
[16:27:28] <godog>	 de nada
[16:27:30] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080579 (10Nuria) >As far as i'm aware, this is NOT due to a typo. Safari simply implements a very limited set of referrer policies: 'no-referrer', 'origin', 'no-referrer-when-...
[16:28:10] <wikibugs_>	 (03PS4) 10Jcrespo: mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850)
[16:28:13] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[16:28:54] <elukey>	 godog: thanks!
[16:29:36] <godog>	 de nada
[16:29:44] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3080581 (10Ottomata) @JKatzWMF, I think if you really really really want this fixed, you'll need to find a MediaWiki dev to revisit https://meta.wikimedia.org/wiki/Research_tal...
[16:30:24] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "I'd say merge this right now, it has gone through some testing and even if some bug remain, it should not break anything badly (no commit " [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel)
[16:30:29] <wikibugs_>	 (03PS3) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[16:31:19] <wikibugs_>	 (03CR) 10DCausse: [V: 032 C: 032] automate management of elasticsearch plugin repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel)
[16:31:30] <gehel>	 dcausse: thanks!
[16:31:33] <dcausse>	 yw :)
[16:32:12] <wikibugs_>	 (03Abandoned) 10Filippo Giunchedi: swift: add lvs configuration for esams [puppet] - 10https://gerrit.wikimedia.org/r/318145 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi)
[16:33:04] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195378 keys, up 127 days 8 hours - replication_delay is 630
[16:33:24] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4194556 keys, up 127 days 8 hours - replication_delay is 651
[16:34:34] <icinga-wm>	 RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[16:36:51] <wikibugs_>	 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#3080610 (10fgiunchedi)
[16:36:54] <wikibugs_>	 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Experiment with Swift as docker registry backend - https://phabricator.wikimedia.org/T149098#3080608 (10fgiunchedi) 05stalled>03Resolved We're using codfw to test swift as docker registry backend.
[16:37:56] <wikibugs_>	 (03CR) 10Chad: "Minor inline nit, otherwise lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[16:38:04] <icinga-wm>	 PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:39:18] <wikibugs_>	 (03CR) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[16:40:22] <akosiaris>	 !log decrease client-output-buffer-limit soft-limit back to normal values
[16:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:24] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[16:41:25] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: separate sanitarium2 role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341558 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[16:43:04] <icinga-wm>	 RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[16:43:12] <wikibugs_>	 (03CR) 10Chad: Gerrit: Fix bot so it uses the name of user instead of username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[16:43:44] <wikibugs_>	 (03PS4) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[16:43:55] <elukey>	 akosiaris: should we open a task for the redis lag?
[16:44:12] <elukey>	 just to track it (done, todo, etc..)
[16:45:01] <akosiaris>	 yeah we should
[16:45:03] <wikibugs_>	 (03CR) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[16:45:22] <wikibugs_>	 (03PS4) 10Paladox: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689)
[16:45:31] <wikibugs_>	 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3080649 (10RobH) @papaul: You can actually remove the puppet references, but you won't be able to self merge.  You up to doing that?  If not, I'll handle that step for you...
[16:46:09] <robh>	 papaul: just let me know when you are ready to work on the decom for ms-be systems on https://phabricator.wikimedia.org/T159413 and I'll handle the steps you mentioned (or you can prepare the patches and I can merge, whichever you want!)
[16:46:10] <robh>	 =]
[16:46:37] <papaul>	 ms-fe
[16:47:09] <elukey>	 akosiaris: I'll do it later on or tomorrow! 
[16:47:38] <elukey>	 akosiaris: so atm we still have the higher limit but not the soft?
[16:47:49] <akosiaris>	 no, I've reverted everything
[16:47:53] <elukey>	 ah okok
[16:48:03] <elukey>	 would it be good to leave it there for say a day?
[16:48:07] <elukey>	 just to see how it goes
[16:48:09] <akosiaris>	 but it seems like it broke again
[16:48:13] <elukey>	 yep..
[16:51:06] <wikibugs_>	 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080658 (10dr0ptp4kt) Just a note that I've reached out internally to a contact to see if this is okay and achievable.  In searching the internet about imple...
[16:52:14] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Decouple parsercache role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341565 (https://phabricator.wikimedia.org/T150850)
[16:54:11] <wikibugs_>	 (03Abandoned) 10EBernhardson: deployment-prep: Use apt experimental for elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson)
[16:54:13] <wikibugs_>	 (03PS1) 10Gehel: osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631)
[16:54:42] <wikibugs_>	 (03PS5) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[16:54:49] <wikibugs_>	 (03PS1) 10Reedy: Remove EducationProgram config back compat hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341567
[16:55:15] <wikibugs_>	 (03PS3) 10EBernhardson: deployment-prep: Use elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/341372
[16:58:03] <akosiaris>	 !log re-increase temporarily the client-output-buffer-limit for rdb1007, phab task filing to follow
[16:58:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:14] <akosiaris>	 elukey: ^
[16:58:32] <elukey>	 akosiaris: do you mind pasting the exact command? I'll add it to the task
[16:58:39] <elukey>	 (just for the records)
[16:58:41] <akosiaris>	 config set client-output-buffer-limit "normal 0 0 0 slave 2536870912 2536870912 60 pubsub 33554432 8388608 60"
[16:58:45] <elukey>	 super
[16:59:14] <elukey>	 so that's 2.5GB?
[16:59:29] <akosiaris>	 yup.. number straight from an RNG
[16:59:35] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Decouple beta role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850)
[16:59:37] <akosiaris>	 only req is that it is > 1.3G
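
(Editor's note: a minimal sketch for reading the value akosiaris pasted above. It assumes the standard Redis layout for client-output-buffer-limit, which is class, hard limit, soft limit, soft seconds, repeated per client class. The helper name parse_limits is hypothetical and is not part of Redis or of any tool used in this channel. It only does the byte arithmetic behind the "2.5GB?" question.)

```python
# Parse a Redis client-output-buffer-limit value into per-class limits.
# Assumed layout per class: <class> <hard bytes> <soft bytes> <soft seconds>.
# parse_limits is a hypothetical helper written for this note.

def parse_limits(value):
    fields = value.split()
    if len(fields) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(fields), 4):
        name, hard, soft, seconds = fields[i:i + 4]
        limits[name] = (int(hard), int(soft), int(seconds))
    return limits


# The exact string from the log line above.
pasted = ("normal 0 0 0 slave 2536870912 2536870912 60 "
          "pubsub 33554432 8388608 60")
limits = parse_limits(pasted)

for name, (hard, soft, seconds) in limits.items():
    # Show each limit in decimal GB and binary GiB.
    print(f"{name}: hard={hard / 1e9:.2f} GB ({hard / 2**30:.2f} GiB), "
          f"soft={soft / 1e9:.2f} GB ({soft / 2**30:.2f} GiB), "
          f"soft window={seconds}s")
```

(The slave limit of 2536870912 bytes is about 2.54 GB in decimal units, or about 2.36 GiB, so "2.5GB" holds in decimal terms and the value clears the stated 1.3G requirement either way.)
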
[17:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1700). Please do the needful.
[17:00:04] <jouncebot>	 Pchelolo and Smalyshev: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[17:00:24] <icinga-wm>	 PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:01:42] <wikibugs_>	 06Operations, 10DNS, 10Domains, 10Traffic, 13Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#3080701 (10Dzahn) @tomasz From Operations side it's done, but let's have @Croslof confirm for the legal side.
[17:02:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4190860 keys, up 127 days 8 hours - replication_delay is 0
[17:02:11] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: hieradata: make mwlog1001 primary log host [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728)
[17:04:04] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "Anything significant should change. Statistics jobs will see a bunch of changed files in the past due to overlap while rsync'ing from mwlo" [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi)
[17:04:16] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2048 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:11 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T159849
[17:04:20] <wikibugs_>	 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159849#3080704 (10ops-monitoring-bot)
[17:04:24] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4190693 keys, up 127 days 8 hours - replication_delay is 0
[17:05:01] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Decouple beta role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[17:05:49] <jynus>	 ^papaul, is it you changing the disk or did it finally fail? 
[17:06:22] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "Also I'll cleanup "fluorine" in comments and so on in later reviews." [puppet] - 10https://gerrit.wikimedia.org/r/341570 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi)
[17:06:40] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3080713 (10jcrespo)
[17:06:43] <wikibugs_>	 06Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159849#3080715 (10jcrespo)
[17:07:02] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (10jcrespo) It finally failed, see T159849 summary.
[17:07:16] <wikibugs_>	 (03CR) 10BBlack: [C: 031] "puppet-line passes, and compiler output on cp1008 + radon looks good for not impacting non-lint usages.  I don't think there's a good way " [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[17:07:21] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080720 (10jcrespo)
[17:07:55] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (10jcrespo) ``` physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Failed) ```
[17:08:01] <wikibugs_>	 06Operations, 10hardware-requests, 06Services (watching), 15User-mobrovac: Site: 2 hardware access request for SCB@CODFW - https://phabricator.wikimedia.org/T156631#3080727 (10mobrovac)
[17:08:05] <wikibugs_>	 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3080725 (10mobrovac) 05Resolved>03Open >>! In T159486#3079629, @akosiaris wrote: > I 've just applied the puppet configs on the hosts and pooled them for...
[17:08:25] <wikibugs_>	 06Operations, 06Services (doing), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3080729 (10mobrovac)
[17:09:29] <wikibugs_>	 (03CR) 10Chad: [C: 031] "Lgtm, let's get this live" [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[17:09:53] <wikibugs_>	 06Operations: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3080731 (10elukey)
[17:09:58] <elukey>	 akosiaris: --^
[17:10:43] <elukey>	 ah snap wrong logs
[17:10:48] <elukey>	 fixing
[17:12:45] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Decouple beta role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/341569 (https://phabricator.wikimedia.org/T150850)
[17:13:10] <wikibugs_>	 06Operations: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3080776 (10elukey)
[17:13:28] <elukey>	 better
[17:13:59] <wikibugs_>	 06Operations, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3080731 (10elukey)
[17:16:52] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080832 (10Papaul) a:05Papaul>03Marostegui disk replacement complete
[17:17:36] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3080836 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete
[17:20:23] <elukey>	 !Log phab task for Redis rdb1007 client-output-buffer-limit temp increase is T159850
[17:20:24] <stashbot>	 T159850: JobQueue Redis codfw replicas periodically lagging  - https://phabricator.wikimedia.org/T159850
[17:21:27] <wikibugs_>	 (03CR) 10Dzahn: "Error: Could not find template 'phabricator/initscripts/sshd-phab.service.erb' at /mnt/jenkins-workspace/puppet-compiler/5677/change/src/m" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[17:21:58] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3080845 (10Marostegui) Thanks! Disk is rebuilding! ``` root@db2044:~# hpssacli controller all show config  Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337F5EF0)      Port Name: 1I     Port Name: 2I...
[17:22:49] <wikibugs_>	 (03PS1) 10Madhuvishy: nfs: Enable nfs exports in new instance maps-warper2 [puppet] - 10https://gerrit.wikimedia.org/r/341572
[17:22:51] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T159666#3080848 (10Marostegui) Thanks - raid getting rebuilt ``` root@db2048:~# hpssacli controller all show config  Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337E3350)      Gen8 ServBP 12+2 at Port 1I,...
[17:25:12] <wikibugs_>	 (03CR) 10Madhuvishy: [C: 032] nfs: Enable nfs exports in new instance maps-warper2 [puppet] - 10https://gerrit.wikimedia.org/r/341572 (owner: 10Madhuvishy)
[17:25:25] <wikibugs_>	 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080881 (10NahidSultan) I'm closing this task as Google+ policy on this matter has changed since we started this discussion. This tick mark beside the websit...
[17:26:19] <wikibugs_>	 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080883 (10NahidSultan) 05Open>03Invalid
[17:26:51] <wikibugs_>	 06Operations: Verify bn.wikipedia.org via Webmaster Tools to allow linking a bn.wikipedia.org button to G+ page - https://phabricator.wikimedia.org/T109810#3080886 (10dr0ptp4kt) @NahidSultan, thanks for the update.
[17:28:24] <icinga-wm>	 RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[17:32:00] <wikibugs_>	 (03PS1) 10BBlack: linting: remove config-geo-test [dns] - 10https://gerrit.wikimedia.org/r/341573 (https://phabricator.wikimedia.org/T156100)
[17:32:01] <wikibugs_>	 (03PS1) 10BBlack: add first discovery records [dns] - 10https://gerrit.wikimedia.org/r/341574 (https://phabricator.wikimedia.org/T156100)
[17:32:08] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] linting: remove config-geo-test [dns] - 10https://gerrit.wikimedia.org/r/341573 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[17:32:12] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] add first discovery records [dns] - 10https://gerrit.wikimedia.org/r/341574 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[17:38:33] <wikibugs_>	 (03PS1) 10Ema: cache_maps: do not set cookies [puppet] - 10https://gerrit.wikimedia.org/r/341575
[17:44:20] <Pchelolo>	 godog joe Is puppet SWAT happening?
[17:44:27] <wikibugs_>	 (03PS1) 10Ema: cache_misc: set timeout_idle to 120s [puppet] - 10https://gerrit.wikimedia.org/r/341576 (https://phabricator.wikimedia.org/T159429)
[17:46:44] <wikibugs_>	 (03PS4) 10Filippo Giunchedi: facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541)
[17:46:46] <wikibugs_>	 (03PS4) 10Filippo Giunchedi: facilities: add codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541)
[17:46:48] <wikibugs_>	 (03PS14) 10Filippo Giunchedi: prometheus: add snmp_exporter module and role [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[17:46:50] <wikibugs_>	 (03PS4) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541)
[17:46:54] <godog>	 Pchelolo: sigh, I got dragged into other things and forgot, I can do it now though
[17:47:08] <Pchelolo>	 Thank you :)
[17:47:37] <Pchelolo>	 that's the patch carried over from last week. Tue is indeed a better time for it
[17:47:39] <wikibugs_>	 (03PS8) 10Filippo Giunchedi: Enable local logging for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) (owner: 10Ppchelko)
[17:48:29] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[17:52:29] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 032] Enable local logging for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) (owner: 10Ppchelko)
[17:53:20] <godog>	 Pchelolo: merged, I'm trying puppet on restbase1007
[17:53:30] <Pchelolo>	 thank you godog
[17:54:10] <Pchelolo>	 It will be a no-op for now, need to restart RB to pick up the new config, but we have a deploy planned for later today that would make RB restart
[17:54:33] <wikibugs_>	 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3080947 (10Legoktm) >>! In T159618#3079361, @Esc3300 wrote: > I think it might be worth attempting to determine the factors that lead to the rapid raise.  >  >   - The edit ra...
[17:54:56] <mobrovac>	 godog: I will deploy RB in 10 mins or so, which will pick up the change
[17:55:29] <Pchelolo>	 mobrovac: I can deploy too, but a bit later, still need to bike to the office
[17:55:40] <godog>	 mobrovac Pchelolo neat, I'll force a puppet run now
[17:56:43] <mobrovac>	 godog: force for the whole RB cluster?
[17:57:02] <godog>	 mobrovac: staggered but yeah
[17:57:07] <mobrovac>	 :)
[17:58:58] <Pchelolo>	 mobrovac: I guess you have it all under control now, I will take 15 mins afk to bike to the office. Can't resist the temptation to get my coffee
[17:59:28] <mobrovac>	 yes please do :)
[17:59:30] <mobrovac>	 enjoy it
[18:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T1800).
[18:00:15] <subbu>	 no parsoid deploy today
[18:00:22] <wikibugs_>	 (03PS2) 10Bartosz Dziewoński: Turn off patrolling for FlaggedRevs in bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341350 (https://phabricator.wikimedia.org/T158662) (owner: 10DatGuy)
[18:00:55] <halfak>	 No ores today
[18:02:36] <mutante>	 paladox: i can't really explain this compiler fail: http://puppet-compiler.wmflabs.org/5677/phab2001.codfw.wmnet/   except it's because the change itself is moving the template around and there is some kind of race
[18:03:01] <mutante>	 paladox: since it only shows up on phab2001, i will merge anyways to see if it happens or not
[18:03:53] <wikibugs_>	 (03PS7) 10Dzahn: Phabricator: Move sshd-phab.conf.erb and sshd-phab.service.erb into initscripts [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[18:06:35] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "i can't explain why the compiler fails on phab2001 but i assume it must be some kind of race because the template gets renamed in this cha" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[18:09:13] <wikibugs_>	 (03CR) 10Dzahn: "no-op on iridium, but fail on phab2001 is real .. ehmm..." [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[18:09:32] <wikibugs_>	 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3080998 (10Betacommand) >>! In T159618#3080947, @Legoktm wrote: >>>! In T159618#3079361, @Esc3300 wrote: >> I think it might be worth attempting to determine the factors that...
[18:09:34] <icinga-wm>	 PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:10:04] <icinga-wm>	 PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:11:07] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/#/c/339786/7
[18:12:27] <mutante>	 duh, "initscript" vs "initscripts"  ... fixing
[18:14:27] <wikibugs_>	 (03PS1) 10Dzahn: phabricator: fix location of sshd-phab.service template [puppet] - 10https://gerrit.wikimedia.org/r/341579
[18:16:25] <wikibugs_>	 06Operations, 06Discovery, 06Discovery-Search (Current work): remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884#3081045 (10Gehel) The following should be sufficient:  ``` swapoff -a sed -i.bak '/swap/d' fstab ```  This does not recover the 1Go of the swap partition (but we do...
[18:16:49] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] phabricator: fix location of sshd-phab.service template [puppet] - 10https://gerrit.wikimedia.org/r/341579 (owner: 10Dzahn)
[18:17:03] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081047 (10RobH)
[18:18:24] <icinga-wm>	 RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[18:21:06] <wikibugs_>	 (03PS1) 10Muehlenhoff: Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581
[18:21:24] <icinga-wm>	 PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:23:58] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081119 (10RobH) Switch ports disabled, diff below since the port info will be needed once these systems are unracked.  [edit interfaces ge-6/0/0] -   enable; +   disable; [edit interfaces...
[18:24:04] <icinga-wm>	 RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[18:24:30] <wikibugs_>	 (03CR) 10Dzahn: "< icinga-wm> RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures" [puppet] - 10https://gerrit.wikimedia.org/r/341579 (owner: 10Dzahn)
[18:25:06] <wikibugs_>	 (03CR) 10Dzahn: "follow-up fix on https://gerrit.wikimedia.org/r/341579, no-op on both now" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[18:25:21] <wikibugs_>	 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#2912333 (10EBernhardson) Yes it does, i've declined the 2.x task
[18:25:37] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081145 (10RobH)
[18:26:48] <wikibugs_>	 (03PS5) 10Dzahn: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[18:27:19] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581 (owner: 10Muehlenhoff)
[18:27:41] <wikibugs_>	 (03PS2) 10Muehlenhoff: Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581
[18:28:17] <wikibugs_>	 (03PS4) 10MarcoAurelio: Modify add/remove groups for I984157d5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341382
[18:28:24] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[18:28:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f5d1763c950: Failed to establish a new connection: [Errno 111] Connection refused,))
[18:28:34] <icinga-wm>	 PROBLEM - Restbase root url on restbase-dev1002 is CRITICAL: connect to address 10.64.32.112 and port 7231: Connection refused
[18:30:21] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Blacklist n_hdlc kernel module [puppet] - 10https://gerrit.wikimedia.org/r/341581 (owner: 10Muehlenhoff)
[18:31:05] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[18:31:10] <wikibugs_>	 (03PS6) 10Dzahn: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[18:32:00] <wikibugs_>	 06Operations, 10DNS, 10Domains, 10Traffic, 13Patch-For-Review: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#3081194 (10CRoslof) 05Open>03Resolved All good with me.
[18:33:57] <wikibugs_>	 (03PS1) 10RobH: decom of db2001-db2009 [puppet] - 10https://gerrit.wikimedia.org/r/341582
[18:34:19] <wikibugs_>	 (03CR) 10RobH: [C: 032] decom of db2001-db2009 [puppet] - 10https://gerrit.wikimedia.org/r/341582 (owner: 10RobH)
[18:34:31] <wikibugs_>	 (03PS2) 10RobH: decom of db2001-db2009 [puppet] - 10https://gerrit.wikimedia.org/r/341582
[18:35:54] <wikibugs_>	 (03PS1) 10Mholloway: [Android] Create symlink to repo licenses dir in the SDK on CI [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099)
[18:36:05] <RainbowSprinkles>	 mutante: When you merge those template changes for its-* templates, you don't do a full gerrit restart right? It just needs a plugin reload
[18:37:18] <mobrovac>	 !log restbase deploy start of cd53670b
[18:37:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[18:37:34] <icinga-wm>	 RECOVERY - Restbase root url on restbase-dev1002 is OK: HTTP OK: HTTP/1.1 200 - 15500 bytes in 0.017 second response time
[18:38:06] <wikibugs_>	 (03PS2) 10Mholloway: [Android] Create symlink to repo licenses dir in the SDK on CI [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099)
[18:38:34] <icinga-wm>	 RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[18:39:10] <wikibugs_>	 (03PS1) 10RobH: decom of db2001-db2009 [dns] - 10https://gerrit.wikimedia.org/r/341585
[18:39:44] <mutante>	 RainbowSprinkles: yes, it never needed a gerrit restart to change gerrit bot
[18:39:58] <RainbowSprinkles>	 Ok, just double checking :)
[18:40:00] <wikibugs_>	 (03PS7) 10Dzahn: Gerrit: Fix bot so it uses the name of user instead of username [puppet] - 10https://gerrit.wikimedia.org/r/341559 (https://phabricator.wikimedia.org/T159689) (owner: 10Paladox)
[18:40:20] <wikibugs_>	 (03CR) 10RobH: [C: 032] decom of db2001-db2009 [dns] - 10https://gerrit.wikimedia.org/r/341585 (owner: 10RobH)
[18:44:22] <wikibugs_>	 (03PS19) 10Dzahn: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[18:48:14] <icinga-wm>	 RECOVERY - HP RAID on db2044 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[18:49:28] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3081277 (10JKatzWMF) @Ottomata @Nuria Just spoke with @Nuria and between the above comments and our conversation I think I have what we need to figure out next steps....which a...
[18:50:42] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3081287 (10Ottomata) Great! :)
[18:52:50] <volans>	 !log rmmod acpi_pad on baham, was using 100% CPU T137647
[18:52:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:52:56] <stashbot>	 T137647: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647
[18:53:34] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "Error: Could not find template 'phabricator/initscripts/ssh-phab.systemd.erb'" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[18:54:54] <icinga-wm>	 PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:55:37] <wikibugs_>	 (03PS1) 10Chad: Gerrit: Sort config sections alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/341587
[18:55:58] <wikibugs_>	 (03CR) 10Chad: "Technically a no-op, although puppet compiler will disagree. Needs visual check" [puppet] - 10https://gerrit.wikimedia.org/r/341587 (owner: 10Chad)
[18:56:05] <wikibugs_>	 (03CR) 10Dzahn: [C: 04-1] "there are 2 separate issues here: a) "sshd-phab" vs. "ssh-phab"   b) .conf and .service  vs. .systemd and .upstart" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[18:57:27] <wikibugs_>	 (03CR) 10MaxSem: [C: 031] cache_maps: do not set cookies [puppet] - 10https://gerrit.wikimedia.org/r/341575 (owner: 10Ema)
[18:57:43] <wikibugs_>	 06Operations, 10RESTBase, 10service-runner, 13Patch-For-Review, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3081321 (10mobrovac) Something is wrong there. RB is not even creating the file, despite the fact that the directory permissions are correct:...
[18:57:46] <wikibugs_>	 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3081322 (10Beetlebeard) But without the CNAME entry and the verification the e-mails will be redirected to gmail and saved to google servers, but the users a...
[19:02:03] <wikibugs_>	 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3081347 (10Dzahn) I noticed that the existing entry for wikimedia.org using Google is just a CNAME for "google.com." but in this case it is supposed to be "...
[19:02:58] <wikibugs_>	 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3081362 (10Nuria) >you'll need to find a MediaWiki dev to revisit https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_referrer_policy I *think* policy as is works, we will...
[19:08:32] <wikibugs_>	 (03CR) 10Mholloway: [C: 04-1] "Needs update for non-"periodic" CI machines..." [puppet] - 10https://gerrit.wikimedia.org/r/341583 (https://phabricator.wikimedia.org/T147099) (owner: 10Mholloway)
[19:08:46] <bblack>	 !log rebooting baham (ns1) - low cpu frequencies issues like T147905
[19:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:52] <stashbot>	 T147905: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905
[19:09:04] <wikibugs_>	 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3081365 (10RobH)
[19:09:39] <wikibugs_>	 06Operations, 10ops-codfw, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#1998016 (10RobH) a:05RobH>03Papaul Ok, this is now ready for on-site disk wipes of all the systems.  Assigning to @papaul for followup.
[19:09:55] <icinga-wm>	 PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:10:44] <icinga-wm>	 PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100%
[19:11:14] <icinga-wm>	 PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[19:12:54] <icinga-wm>	 RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[19:15:04] <icinga-wm>	 RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms
[19:15:14] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on baham is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[19:15:24] <icinga-wm>	 PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:15:34] <icinga-wm>	 PROBLEM - Auth DNS on baham is CRITICAL: CRITICAL - Plugin timed out while executing system call
[19:15:44] <icinga-wm>	 PROBLEM - Check systemd state on baham is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:16:24] <icinga-wm>	 RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms
[19:17:19] <icinga-wm>	 PROBLEM - Auth DNS on ns1-v6 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[19:17:39] <robh>	 expected.
[19:18:17] <apergos>	 because of reboot eh?
[19:18:44] <robh>	 yep
[19:19:19] <icinga-wm>	 PROBLEM - Auth DNS on ns1-v4 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[19:19:34] <icinga-wm>	 PROBLEM - Check size of conntrack table on baham is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:20:30] <bblack>	 !log rebooting baham (ns1) AGAIN - low cpu frequencies issues like T147905 - checking bios/idrac stuff
[19:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:35] <stashbot>	 T147905: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905
[19:21:14] <icinga-wm>	 PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:44] <icinga-wm>	 PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:22:24] <icinga-wm>	 PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:22:54] <icinga-wm>	 RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[19:23:04] <icinga-wm>	 PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[19:23:39] <twentyafterfour>	 !log branching 1.29.0-wmf.15 refs T158996
[19:23:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:45] <stashbot>	 T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[19:28:00] <wikibugs_>	 06Operations, 10Analytics, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3081445 (10Ottomata) Good news! I think we don't have to rebuild varnishkafka.  Quick test on MWV has varnishkafka working...
[19:28:09] <icinga-wm>	 RECOVERY - Auth DNS on ns1-v4 is OK: DNS OK: 0.046 seconds response time. www.wikipedia.org returns 208.80.154.224
[19:28:14] <icinga-wm>	 RECOVERY - Auth DNS on ns1-v6 is OK: DNS OK: 0.065 seconds response time. www.wikipedia.org returns 208.80.154.224
[19:28:15] <icinga-wm>	 RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
[19:28:15] <icinga-wm>	 RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[19:28:15] <icinga-wm>	 RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[19:28:34] <icinga-wm>	 RECOVERY - Auth DNS on baham is OK: DNS OK: 0.050 seconds response time. www.wikipedia.org returns 208.80.154.224
[19:30:45] <wikibugs_>	 (03PS1) 10Dzahn: phabricator: fix file names of systemd/upstart templates [puppet] - 10https://gerrit.wikimedia.org/r/341589 (https://phabricator.wikimedia.org/T137928)
[19:34:21] <RainbowSprinkles>	 mutante: https://gerrit.wikimedia.org/r/#/c/341587/ will require some rebases for some outstanding patches, but will make our lives easier :)
[19:38:46] <wikibugs_>	 (03PS20) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[19:38:51] <wikibugs_>	 (03PS21) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[19:39:26] <wikibugs_>	 (03PS22) 10Paladox: Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928)
[19:40:01] <wikibugs_>	 (03PS15) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[19:40:38] <wikibugs_>	 06Operations, 10ops-codfw, 10Traffic: baham (ns1) CPU-related issues - https://phabricator.wikimedia.org/T159870#3081551 (10BBlack)
[19:42:38] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "We don't seem to be using any cookies on maps. This change looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/341575 (owner: 10Ema)
[19:52:24] <icinga-wm>	 RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[19:52:43] <wikibugs_>	 (03PS2) 10Gehel: osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631)
[19:55:05] <wikibugs_>	 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3081625 (10Ottomata)
[20:00:04] <jouncebot>	 twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T2000).
[20:00:43] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "no-op in compiler http://puppet-compiler.wmflabs.org/5680/" [puppet] - 10https://gerrit.wikimedia.org/r/341589 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn)
[20:01:36] <mutante>	 RainbowSprinkles: added to queue, just wanna get the phab-ssh stuff done first
[20:02:21] <RainbowSprinkles>	 Ok :)
[20:03:08] <wikibugs_>	 (03CR) 10MaxSem: [C: 031] osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) (owner: 10Gehel)
[20:03:59] <wikibugs_>	 (03PS1) 10Chad: Gerrit: Remove reviewer counts cron, nobody is using it [puppet] - 10https://gerrit.wikimedia.org/r/341593
[20:04:29] <TabbyCat>	 https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.15 <-- no changes?
[20:05:13] <wikibugs_>	 06Operations, 06Discovery, 06Discovery-Search (Current work): remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884#3081652 (10RobH) >>! In T158884#3081045, @Gehel wrote: > The following should be sufficient: >  > ``` > swapoff -a > sed -i.bak '/swap/d' fstab > ``` >  > This does...
[20:06:39] <RainbowSprinkles>	 TabbyCat: More likely someone else decided to create the page
[20:06:55] <RainbowSprinkles>	 With just the boilerplate
[20:07:18] <wikibugs_>	 (03CR) 10Dzahn: "after https://gerrit.wikimedia.org/r/#/c/341589/  has been merged, rebasing this and re-compiling it and it should work now ..." [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[20:08:39] <wikibugs_>	 (03CR) 10Dzahn: "after https://gerrit.wikimedia.org/r/#/c/341589/ has been merged, rebasing this and re-compiling it and it should work now ..." [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[20:09:02] <wikibugs_>	 (03CR) 10Dzahn: "i meant to put this comment on https://gerrit.wikimedia.org/r/#/c/339763/22" [puppet] - 10https://gerrit.wikimedia.org/r/339786 (owner: 10Paladox)
[20:09:16] <paladox>	 ah
[20:09:18] <paladox>	 woops
[20:09:20] <paladox>	 wrong place
[20:10:08] <mutante>	 paladox: what was the change since PS19 on https://gerrit.wikimedia.org/r/#/c/339763/22
[20:10:09] <TabbyCat>	 RainbowSprinkles: I see. Well, we might see some changes when twentyafterfour makes the train depart from the station ;)
[20:10:15] <mutante>	 i don't think there was a need to change anything there, paladox
[20:10:29] <mutante>	 what it needed was the separate fix to be merged
[20:10:53] <paladox>	 oh
[20:10:58] <paladox>	 those are rebases i think
[20:11:23] <mutante>	 and "published edit"
[20:11:44] <mutante>	 rebases only needed if it can also be merged
[20:11:51] <mutante>	 compiles that again
[20:14:24] <paladox>	 mutante ok it re-adds the files now
[20:14:26] <paladox>	 but i get
[20:14:27] <paladox>	 Failed at step EXEC spawning /usr/bin/chown: No such file or directory
[20:15:15] <paladox>	 oh i see
[20:15:22] <paladox>	  /usr/bin/chown does not exist now
[20:15:23] <paladox>	 strange
[20:15:43] <mutante>	 paladox: /bin/chown ?
[20:15:49] <mutante>	 but that sounds really broken
[20:15:56] <paladox>	 thanks
[20:16:04] <paladox>	 patch incoming to fix that
[20:16:15] <mutante>	 i don't know the context, but ok
[20:16:56] <wikibugs_>	 (03PS1) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341599 (https://phabricator.wikimedia.org/T150618)
[20:17:01] <wikibugs_>	 (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5681/" [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[20:17:16] <wikibugs_>	 (03Draft1) 10Paladox: Phabricator: Fix incorrect path to chown [puppet] - 10https://gerrit.wikimedia.org/r/341598
[20:17:19] <wikibugs_>	 (03PS2) 10Paladox: Phabricator: Fix incorrect path to chown [puppet] - 10https://gerrit.wikimedia.org/r/341598
[20:17:21] <wikibugs_>	 06Operations, 10RESTBase, 10service-runner, 13Patch-For-Review, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3081725 (10mobrovac) Ok, it turns out the problem is that firejail doesn't have `/var/log/restbase` whitelisted. RB is actually logging stuff...
[20:17:22] <paladox>	 mutante ^^ :)
[20:17:49] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Phabricator: Migrate to base::service_unit for ssh-phab [puppet] - 10https://gerrit.wikimedia.org/r/339763 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[20:18:35] <mutante>	 ok, added to queue as well, one by one 
[20:18:39] <paladox>	 that fixes it :)
[20:18:41] <paladox>	 tested
[20:19:01] <paladox>	 and ok
[20:19:51] <mutante>	 aaah, yes, i see what you mean. we'll get there in a minute
[20:19:54] <wikibugs_>	 (03PS3) 10Paladox: Phabricator: Fix incorrect path to chown [puppet] - 10https://gerrit.wikimedia.org/r/341598 (https://phabricator.wikimedia.org/T158434)
[20:20:00] <wikibugs_>	 (03PS4) 10Paladox: Phabricator: Fix incorrect path to chown [puppet] - 10https://gerrit.wikimedia.org/r/341598 (https://phabricator.wikimedia.org/T158434)
[20:20:06] <paladox>	 Ok thanks :)
[20:22:29] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Phabricator: Fix incorrect path to chown [puppet] - 10https://gerrit.wikimedia.org/r/341598 (https://phabricator.wikimedia.org/T158434) (owner: 10Paladox)
[20:23:15] <paladox>	 thanks
[20:23:31] <mutante>	 !log iridium - temp disabled puppet - converting phab-ssh service to base::service_unit, systemd on phab2001, upstart on iridium
[20:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:39] <wikibugs_>	 (03PS16) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[20:24:04] <icinga-wm>	 PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[ssh-phab]
[20:24:19] <paladox>	 Hmm, /me wonders why ^^ is failing?
[20:24:50] <mutante>	 because of the thing you just uploaded the fix for :)
[20:25:02] <paladox>	 Oh ah
[20:25:02] <paladox>	 :)
[20:25:04] <icinga-wm>	 RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[20:25:07] <paladox>	 :)
[20:26:26] <ottomata>	 !log installing librdkafka 0.9.4 on cp1045 (cache misc host) via .deb package to try it with varnishkafka in prod (ping bblack, ema, just in case)
[20:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:23] <mutante>	 !log phab2001 - phab-ssh service converted to base::service_unit and with working systemd unit file. 'systemctl status ssh-phab' is active (running) (T158434)
[20:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:29] <stashbot>	 T158434: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434
[20:27:49] <paladox>	 :)
[20:29:27] <mutante>	 !log iridium - re-enabling puppet, ssh-phab service converted to base::service_unit, upstart template moved but unchanged, service restarted just fine. 
[20:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:54] <paladox>	 mutante i wonder if we should do a phd.conf file, since i don't think phd will start if iridium is restarted; see this script: https://secure.phabricator.com/T4181#133830
[20:30:31] <mutante>	 paladox: we should just focus on getting it to work properly with systemd and then reinstall iridium as phab1001 with jessie
[20:30:34] <wikibugs_>	 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3081776 (10Ottomata) @elukey, I've dpkg -i librdkafka 0.9.4 on cp1045 and restarted varnishkafka.  Let's let this ru...
[20:30:37] <paladox>	 ok :)
[20:30:58] <paladox>	 rebooting works too
[20:31:05] <paladox>	 for systemd phd and ssh-phab
[20:31:12] <mutante>	 so for converting things to base::service_unit purposes, keep the upstart part unchanged /no-op
[20:31:34] <mutante>	 easier to merge if prod server isn't affected
[20:31:36] <paladox>	 yep
[20:31:39] <paladox>	 yep
[20:32:02] <mutante>	 other fixes could go separate, but we should also just get rid of the trusty install and then we don't care
[20:32:17] <wikibugs_>	 (03CR) 10Paladox: "Tested and works :)" [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[20:32:23] <paladox>	 ok
[20:32:38] <paladox>	 mutante could you run puppet compiler on ^^ please?
[20:33:04] <wikibugs_>	 (03PS1) 10Rush: nova: up fullstack allowed pool to 7 [puppet] - 10https://gerrit.wikimedia.org/r/341601
[20:33:09] <mutante>	 yes
[20:34:00] <paladox>	 thanks
[20:34:03] <paladox>	 :)
[20:34:05] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 031] nova: up fullstack allowed pool to 7 [puppet] - 10https://gerrit.wikimedia.org/r/341601 (owner: 10Rush)
[20:34:13] <wikibugs_>	 (03PS2) 10Rush: nova: up fullstack allowed pool to 7 [puppet] - 10https://gerrit.wikimedia.org/r/341601
[20:34:58] <wikibugs_>	 (03CR) 10Rush: [V: 032 C: 032] nova: up fullstack allowed pool to 7 [puppet] - 10https://gerrit.wikimedia.org/r/341601 (owner: 10Rush)
[20:35:07] <wikibugs_>	 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3081786 (10Paladox) I believe that we now have full support for debian jessie as far as i can tell....
[20:35:20] <wikibugs_>	 (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5682/" [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[20:35:36] <paladox>	 :)
[20:35:43] <paladox>	 only phab2001 is affected by the change
[20:35:51] <paladox>	 iridium has no changes :_
[20:36:05] <paladox>	 :)
[20:36:11] <paladox>	 thanks for doing that :)
[20:37:19] <logmsgbot>	 !log twentyafterfour@tin Started scap: bump test wikis to 1.29.0-wmf.5 refs T158996
[20:37:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:24] <stashbot>	 T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[20:37:33] * Reedy hands twentyafterfour an extra 1
[20:39:23] <mutante>	 paladox: no change on iridium is good, but  i wish we could avoid  2 x "if $::initsystem == 'systemd'".  an advantage of base::service_unit as we used it for ssh-phab is that we did not need those anymore
[20:40:04] <icinga-wm>	 PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:40:14] <mutante>	 i assume you tried that but it did not work in this case?
[20:40:53] <mutante>	 with "systemd => true" and "upstart => true" in base::service_unit instead of "if"s
[20:41:55] <mutante>	 can we reduce it  from   2 x  to  1 x  if we can't remove both
[20:44:14] <mutante>	 paladox: ideally i want the change for phd to be a nice one like the one for ssh-phab at  https://gerrit.wikimedia.org/r/#/c/339763/22/modules/phabricator/manifests/vcs.pp  where all the  if/else, file, service is gone and only base::service_unit  stays
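A minimal sketch of what mutante is asking for, assuming base::service_unit accepts the "systemd => true" and "upstart => true" flags quoted above. The resource title 'phd' and the 'ensure' value are illustrative assumptions, not taken from the actual patch:

```puppet
# Hypothetical sketch, not the real change in gerrit 340158.
# One declaration replaces the hand-written
# "if $::initsystem == 'systemd' { ... } else { ... }" branches.
base::service_unit { 'phd':
    ensure  => present,  # assumed value
    systemd => true,     # provide a unit on systemd hosts (jessie, e.g. phab2001)
    upstart => true,     # provide a job on upstart hosts (trusty, e.g. iridium)
}
```

Each flag set to true presumably needs a matching init script template shipped in the module, which is why an upstart template for phd comes up later in the discussion.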
[20:48:03] <wikibugs_>	 (03CR) 10Dzahn: "@20after4 do you have an opinion on this? did you need --force before?" [puppet] - 10https://gerrit.wikimedia.org/r/340424 (owner: 10Paladox)
[20:49:07] <wikibugs_>	 (03CR) 10Dzahn: "12:41 < mutante> paladox: no change on iridium is good, but  i wish we could avoid  2 x "if $::initsystem == 'systemd'".  an advantage of " [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[20:50:45] <wikibugs_>	 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434#3081822 (10Dzahn) >>! In T158434#3081786, @Paladox wrote: > I believe that we now have full support...
[20:53:26] <wikibugs_>	 (03PS2) 10Dzahn: Gerrit: Sort config sections alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/341587 (owner: 10Chad)
[20:53:35] <wikibugs_>	 06Operations, 10RESTBase, 10service-runner, 06Services (doing), 15User-mobrovac: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3081824 (10mobrovac)
[20:54:52] <paladox>	 mutante i would need to create the .conf file to do that, since it would cause a lot of problems if i didn't keep that if and else
[20:55:53] <mutante>	 paladox: if it's true that phd service is supposed to be permanently stopped on the server that is not the "hot" one, and i think it is, then we need to get rid of the "PHD should be supervising processes" icinga check
[20:56:15] <paladox>	 Yep, i was meaning for iridium
[20:56:48] <paladox>	 the .conf file is not needed for phab2001 as we will use systemd there. Just systemd is not on iridium
[20:56:50] <mutante>	 i said that without any relation to the .conf file thing :)
[20:57:04] <paladox>	 oh
[20:57:27] <mutante>	 i am not sure yet what you mean will cause a lot of problems, but the best is you just show me with gerrit
[20:58:33] <twentyafterfour>	 ;)
[20:58:37] <paladox>	 oh, wait
[20:58:58] <mutante>	 twentyafterfour: is it true that phd service should always be stopped on the server that is not "hot" ? 
[20:59:04] <wikibugs_>	 (03CR) 1020after4: "I've rarely needed to use --force but it might make sense to have it just in case." [puppet] - 10https://gerrit.wikimedia.org/r/340424 (owner: 10Paladox)
[20:59:05] <paladox>	 twentyafterfour would base::service_unit work for starting the phd service on iridium
[20:59:06] <mutante>	 or could it just run on both
[20:59:19] <paladox>	 i was thinking again.
[20:59:34] <twentyafterfour>	 mutante: phd needs to run on only the primary server until we have phab configured for proper cluster awareness
[21:00:11] <mutante>	 twentyafterfour: ok, i thought so and just wanted to confirm. i will do something to ensure Icinga only adds the check for the primary one
[21:00:12] <paladox>	 mutante i wont be able to test the change that would affect iridium
[21:00:17] <twentyafterfour>	 I believe the ipv6 problems are resolved so we can probably enable clustering now and get phd running on multiple hosts
[21:00:19] <paladox>	 but it may start it
[21:00:32] <paladox>	 yep, ipv6 problems resolved :)
[21:00:50] <wikibugs_>	 (03PS3) 10Paladox: Phabricator: Start and stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424
[21:00:52] <twentyafterfour>	 paladox: I don't know about service_unit on iridium, no idea at all
[21:00:56] <wikibugs_>	 (03PS4) 10Paladox: Phabricator: Start and stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424
[21:01:12] <wikibugs_>	 (03PS5) 10Paladox: Phabricator: Start and stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424
[21:01:23] <mutante>	 twentyafterfour: the good change today: ssh-phab service is converted to base::service_unit and just works on either, upstart or systemd without all the "if-then"
[21:01:36] <paladox>	 twentyafterfour oh, it's used to manage upstart, sysvinit and systemd scripts.
[21:01:40] <twentyafterfour>	 mutante: awesome
[21:01:50] <twentyafterfour>	 paladox: I see
[21:01:51] <paladox>	 mutante i could do the change, though how will we test it?
[21:02:06] <paladox>	 this one https://gerrit.wikimedia.org/r/#/c/340158/16/modules/phabricator/manifests/init.pp
[21:02:49] <wikibugs_>	 (03PS1) 10Rush: etcd: etcd-backup.py needs a type set for argparse [puppet] - 10https://gerrit.wikimedia.org/r/341607
[21:03:47] <wikibugs_>	 (03PS2) 10Rush: etcd: etcd-backup.py needs a type set for argparse 'keep' [puppet] - 10https://gerrit.wikimedia.org/r/341607
[21:04:06] <mutante>	 paladox: ?  that's the one you already tested on a labs instance. did you mean clustering ?
[21:04:14] <paladox>	 Yes
[21:04:20] <mutante>	 wrong link then?
[21:04:43] <mutante>	 the one you linked i already commented on
[21:04:59] <twentyafterfour>	 mutante: the issue with phd running on multiple servers is this:  when phd updates repositories it needs to run the git (push|pull) from the server that owns the repo. With cluster support enabled then phd will know how to schedule the job on the git master for that repo
[21:05:30] <twentyafterfour>	 without clustering it'll assume the repo is local and run the operation on the rsync'd copy of the repo instead of the authoritative master copy
[21:06:17] <mutante>	 aha, yea, that makes sense. i think you told me about the git pull before. thanks for the details
[21:06:46] <mutante>	 ok, let's get the "phd" service converted to base::service_unit  next
[21:06:53] <mutante>	 like we did for ssh-phab 
[21:06:56] <twentyafterfour>	 there may be other issues but I think that's the only one that we need to deal with to get multiple PHDs working
[21:07:05] <twentyafterfour>	 cool
[21:07:09] <mutante>	 then let's re-install iridium maybe :)
[21:07:12] <twentyafterfour>	 ;)
[21:07:32] <twentyafterfour>	 mutante: sounds good
[21:08:04] <icinga-wm>	 RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[21:08:06] <mutante>	 i just would like to try and improve this:  https://gerrit.wikimedia.org/r/#/c/340158/16/modules/phabricator/manifests/init.pp
[21:09:06] <mutante>	 to be more like this:  https://gerrit.wikimedia.org/r/#/c/339763/22/modules/phabricator/manifests/vcs.pp
[21:09:31] <twentyafterfour>	 mutante: indeed, that's a lot cleaner
[21:09:33] <mutante>	 by which i mean "less or no  if $::initsystem"
[21:10:10] <paladox>	 mutante i could do a change
[21:10:20] <paladox>	 though i would not know it's impact on iridium
[21:10:45] <twentyafterfour>	 well we can test it on iridium, if it breaks we can fix it
[21:10:47] <mutante>	 paladox: happy to compile one to find that out
[21:10:52] <paladox>	 ok
[21:10:53] <twentyafterfour>	 not the end of the world if phd goes down for a minute
[21:10:53] <paladox>	 thanks
[21:11:18] <wikibugs_>	 (03PS1) 10BBlack: dns: add 10/8 to geo map [dns] - 10https://gerrit.wikimedia.org/r/341615
[21:11:26] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] dns: add 10/8 to geo map [dns] - 10https://gerrit.wikimedia.org/r/341615 (owner: 10BBlack)
[21:11:28] <wikibugs_>	 (03PS6) 10BBlack: authdns lint support for full puppetized config [puppet] - 10https://gerrit.wikimedia.org/r/341564 (https://phabricator.wikimedia.org/T156100)
[21:11:30] <wikibugs_>	 (03PS1) 10BBlack: authdns: add 10/8 to geo map [puppet] - 10https://gerrit.wikimedia.org/r/341616
[21:11:37] <wikibugs_>	 (03PS17) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[21:11:39] <twentyafterfour>	 phd should work fine with service_unit I think
[21:11:40] <paladox>	 mutante ^^ done
[21:11:48] <paladox>	 twentyafterfour yeh
[21:12:21] <paladox>	 systemd works with phd, but because we symlink phd for iridium (trusty only) it would not be easy to implement. Though i'm hoping it is
[21:12:30] <paladox>	 twentyafterfour ^^
[21:12:48] <paladox>	 service_unit i mean
[21:13:13] <wikibugs_>	 (03PS2) 10BBlack: dns: add 10/8 to geo map [dns] - 10https://gerrit.wikimedia.org/r/341615
[21:13:20] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] dns: add 10/8 to geo map [dns] - 10https://gerrit.wikimedia.org/r/341615 (owner: 10BBlack)
[21:13:36] <wikibugs_>	 (03CR) 1020after4: Phabricator: Migrate to base::service_unit for phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[21:14:07] <mutante>	 paladox: much better but some more things to change there
[21:14:12] <twentyafterfour>	 paladox: what symlink are you talking about?
[21:14:15] <wikibugs_>	 (03CR) 10Paladox: Phabricator: Migrate to base::service_unit for phd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox)
[21:14:17] <mutante>	 paladox: upstart =>  is set to false, but you want "true"
[21:14:23] <twentyafterfour>	 +1
[21:14:46] <paladox>	 twentyafterfour https://github.com/wikimedia/puppet/blob/production/modules/phabricator/manifests/phd.pp#L26
[21:14:51] <paladox>	 ok
[21:14:57] <mutante>	 the "before" line for class ::phabricator::phd ... hmm ...yea
[21:15:08] <mutante>	 i understand why you had a special case there
[21:15:21] <mutante>	 but maybe that can be removed ?
[21:15:51] <paladox>	 mutante if i set it to upstart => true, then it will fail since there will be no template for upstart, see https://github.com/wikimedia/puppet/blob/production/modules/base/manifests/service_unit.pp#L96
[21:16:02] <paladox>	 or can you do upstart => true without needing to define a template
[21:16:22] <wikibugs_>	 (03PS18) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[21:16:44] <mutante>	 paladox: why not the same procedure we just did with ssh-phab?   first move the templates... then make the switch
[21:16:57] <twentyafterfour>	 paladox: I don't think /etc/init.d/phd symlink is important, though we could skip upstart and configure it as sysv init
[21:16:58] <mutante>	 the upstart init file is somewhere in the repo, right
[21:17:13] <twentyafterfour>	 mutante: phd IS the initscript
[21:17:20] <twentyafterfour>	 on iridium
[21:17:33] <twentyafterfour>	 /etc/init.d/phd is a symlink to the phab code
[21:17:36] <paladox>	 Yes
[21:17:40] <paladox>	 lets configure it as a sysvinit
[21:17:40] <twentyafterfour>	 phd is just an initscript written in php
[21:17:41] <mutante>	 uhmm..
[21:17:54] <twentyafterfour>	 upstart handles it because it's in /etc/init.d/
[21:18:02] <mutante>	 i understand what you mean now, paladox
[21:18:09] <paladox>	 yep
[21:18:09] <mutante>	 *nods*
[21:18:18] <paladox>	 theres
[21:18:18] <paladox>	 https://secure.phabricator.com/T4181#133830
[21:18:20] <twentyafterfour>	 so it needs to have an upstart config written or we need to keep it as sysv init
[21:18:21] <paladox>	 we could use 
[21:18:38] <twentyafterfour>	 yeah something like that
[21:18:53] <paladox>	 Ok, i will create a separate patch to introduce that
[21:19:05] <twentyafterfour>	 paladox: that looks like it's close to what we need.
[21:19:09] <mutante>	 earlier i said we should just focus on systemd and replacing iridium.. and keep iridium to "no-op" while converting this.. but now ... yea...
[21:19:10] <paladox>	 :)
[21:19:30] <mutante>	 what you guys said then
[21:19:36] <paladox>	 :) :)
[21:20:24] <twentyafterfour>	 unless service_unit { sysvinit=>true }  would do the trick?
[21:21:30] <mutante>	 i dunno, or i can live with one "if $::initsystem", but one should be enough, not 2 of them
[21:21:45] <mutante>	 and once we are on jessie we can remove that again
[21:21:58] <paladox>	 Yeh
[21:22:22] <mutante>	 that would be fine, i just assumed first we can avoid all of them with base::service_unit
[21:22:29] <mutante>	 like it was true for the other service
[21:22:44] <icinga-wm>	 PROBLEM - HHVM rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:04] <icinga-wm>	 PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:14] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:54] <mutante>	 !log mw1177 - service hhvm restart
[21:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:34] <icinga-wm>	 RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 74544 bytes in 0.134 second response time
[21:25:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.024 second response time
[21:26:04] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.028 second response time
[21:26:08] <wikibugs_>	 (03Draft1) 10Paladox: Phabricator: Add a upstart init phd script [puppet] - 10https://gerrit.wikimedia.org/r/341630
[21:26:12] <wikibugs_>	 (03PS2) 10Paladox: Phabricator: Add a upstart init phd script [puppet] - 10https://gerrit.wikimedia.org/r/341630
[21:26:15] <paladox>	 twentyafterfour mutante ^^
[21:27:20] <wikibugs_>	 (03PS1) 1020after4: group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341632
[21:28:17] <wikibugs_>	 (03CR) 10Dzahn: "i downloaded "before" and "after" file and sorted them with "sort". then diff showed they are identical" [puppet] - 10https://gerrit.wikimedia.org/r/341587 (owner: 10Chad)
[21:28:58] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Gerrit: Sort config sections alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/341587 (owner: 10Chad)
[21:30:08] <paladox>	 ah
[21:30:15] <paladox>	 i can test that on phab-01
[21:30:20] <paladox>	 i can start the instance now
[21:30:36] <logmsgbot>	 !log twentyafterfour@tin Finished scap: bump test wikis to 1.29.0-wmf.5 refs T158996 (duration: 53m 17s)
[21:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:42] <stashbot>	 T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[21:32:45] <wikibugs_>	 (03PS3) 1020after4: Phabricator: Add a upstart init phd script [puppet] - 10https://gerrit.wikimedia.org/r/341630 (owner: 10Paladox)
[21:33:12] <Zppix>	 jouncebot: now
[21:33:12] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170307T2000)
[21:33:37] <wikibugs_>	 (03CR) 1020after4: [C: 031] "this should work" [puppet] - 10https://gerrit.wikimedia.org/r/341630 (owner: 10Paladox)
[21:34:34] <wikibugs_>	 (03CR) 1020after4: [C: 032] group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341632 (owner: 1020after4)
[21:36:51] <wikibugs_>	 (03Merged) 10jenkins-bot: group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341632 (owner: 1020after4)
[21:37:31] <wikibugs_>	 (03PS6) 10Paladox: Phabricator: stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424
[21:37:35] <wikibugs_>	 (03CR) 10Dzahn: "maybe we have reasons to do it for "stop" but for "start" i think we should rather know if there are errors rather than forcing it" [puppet] - 10https://gerrit.wikimedia.org/r/340424 (owner: 10Paladox)
[21:37:47] <wikibugs_>	 (03CR) 10jenkins-bot: group0 to 1.29.0-wmf.15 refs T158996 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341632 (owner: 1020after4)
[21:38:12] <wikibugs_>	 (03PS7) 10Paladox: Phabricator: stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424
[21:38:43] <wikibugs_>	 (03PS8) 10Paladox: Phabricator: stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424
[21:40:23] <wikibugs_>	 (03PS4) 10Paladox: Phabricator: Add a upstart init phd script [puppet] - 10https://gerrit.wikimedia.org/r/341630
[21:40:37] <wikibugs_>	 (03PS5) 10Paladox: Phabricator: Add a upstart init phd script [puppet] - 10https://gerrit.wikimedia.org/r/341630
[21:40:45] <logmsgbot>	 !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.15 refs T158996
[21:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:51] <stashbot>	 T158996: MW-1.29.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T158996
[21:42:27] <wikibugs_>	 (03CR) 10Dzahn: "i think "start on started mysql" will be an issue since mysql isn't running on same machine." [puppet] - 10https://gerrit.wikimedia.org/r/341630 (owner: 10Paladox)
[21:49:15] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Phabricator: stop phd by force [puppet] - 10https://gerrit.wikimedia.org/r/340424 (owner: 10Paladox)
[21:50:48] <wikibugs_>	 (03CR) 10Paladox: "> i think "start on started mysql" will be an issue since mysql isn't" [puppet] - 10https://gerrit.wikimedia.org/r/341630 (owner: 10Paladox)
[21:54:14] <logmsgbot>	 !log mobrovac@tin Started deploy [trending-edits/deploy@f855460]: (no justification provided)
[21:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:02] <logmsgbot>	 !log mobrovac@tin Finished deploy [trending-edits/deploy@f855460]: (no justification provided) (duration: 04m 48s)
[21:59:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:34] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[22:00:48] <wikibugs_>	 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3082031 (10Paladox) Bump, any update on this please?
[22:01:58] <logmsgbot>	 !log mobrovac@tin Started deploy [zotero/translators@35da336]: Update transators for T158675
[22:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:04] <logmsgbot>	 !log mobrovac@tin Finished deploy [zotero/translators@35da336]: Update transators for T158675 (duration: 00m 06s)
[22:02:04] <stashbot>	 T158675: Update zotero translators on gerrit from the zotero repository on github - https://phabricator.wikimedia.org/T158675
[22:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:44] <wikibugs_>	 (03PS19) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:07:48] <wikibugs_>	 (03PS20) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:08:23] <wikibugs_>	 (03PS21) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)
[22:08:47] <logmsgbot>	 !log mobrovac@tin Started deploy [citoid/deploy@5a7e053]: Deploy for T158675 T103478 T159486
[22:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:55] <stashbot>	 T103478: Citoid service should validate ISSN in mediawiki format - https://phabricator.wikimedia.org/T103478
[22:08:56] <stashbot>	 T159486: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486
[22:08:56] <stashbot>	 T158675: Update zotero translators on gerrit from the zotero repository on github - https://phabricator.wikimedia.org/T158675
[22:11:24] <logmsgbot>	 !log mobrovac@tin Finished deploy [citoid/deploy@5a7e053]: Deploy for T158675 T103478 T159486 (duration: 02m 36s)
[22:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:01] <wikibugs_>	 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3082200 (10Papaul)
[22:29:41] <wikibugs_>	 (03PS1) 10Hashar: zuul: the deb packages creates /etc/zuul [puppet] - 10https://gerrit.wikimedia.org/r/341700
[22:30:06] <wikibugs_>	 06Operations, 10ops-eqiad: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082218 (10Jgreen)
[22:32:29] <wikibugs_>	 (03CR) 10Hashar: [V: 031 C: 031] "Puppet compiler https://puppet-compiler.wmflabs.org/5683/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/341700 (owner: 10Hashar)
[22:33:34] <wikibugs_>	 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3082237 (10Jgreen)
[22:33:54] <wikibugs_>	 06Operations, 10ops-eqiad: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082254 (10Jgreen) a:05Jgreen>03None
[22:34:30] <wikibugs_>	 06Operations, 10ops-eqiad: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082218 (10Jgreen)
[22:35:48] <wikibugs_>	 (03PS1) 10Chad: Gerrit: lower heap to 20g [puppet] - 10https://gerrit.wikimedia.org/r/341701
[22:38:17] <wikibugs_>	 (03CR) 10Pnorman: [C: 031] osm - waterline import script fix and adding logging [puppet] - 10https://gerrit.wikimedia.org/r/341566 (https://phabricator.wikimedia.org/T159631) (owner: 10Gehel)
[22:45:42] <papaul>	 !log ms-be2028-ms-be2039 - signing puppet certs, salt-key, initial run
[22:45:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:44] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[23:02:50] <wikibugs_>	 (03PS2) 10Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:03:19] <wikibugs_>	 (03PS3) 10Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:10:28] <wikibugs_>	 (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: 10Dzahn)
[23:11:43] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] "tested on mwdebug1001 with apache-fast-test from tin" [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: 10Dzahn)
[23:14:53] <wikibugs_>	 (03PS4) 10Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:16:02] <wikibugs_>	 (03PS5) 10Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:16:15] <wikibugs_>	 (03CR) 10Dzahn: "PS4: insert literal tab chars that we use here unlike almost everywhere else now" [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: 10Dzahn)
[23:20:34] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] redirect 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) (owner: 10Dzahn)
[23:20:42] <wikibugs_>	 (03PS6) 10Dzahn: redirect 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981)
[23:25:21] <wikibugs_>	 (03PS2) 10Dzahn: zuul: the deb packages creates /etc/zuul [puppet] - 10https://gerrit.wikimedia.org/r/341700 (owner: 10Hashar)
[23:26:42] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] zuul: the deb packages creates /etc/zuul [puppet] - 10https://gerrit.wikimedia.org/r/341700 (owner: 10Hashar)
[23:30:33] <wikibugs_>	 (03CR) 10Dzahn: "confirmed no change on contint1001/2001 as they are already running" [puppet] - 10https://gerrit.wikimedia.org/r/341700 (owner: 10Hashar)
[23:32:41] <RainbowSprinkles>	 jouncebot: next
[23:32:42] <jouncebot>	 In 0 hour(s) and 27 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170308T0000)
[23:34:19] <wikibugs_>	 (03PS2) 10Dzahn: Gerrit: lower heap to 20g [puppet] - 10https://gerrit.wikimedia.org/r/341701 (owner: 10Chad)
[23:35:33] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] Gerrit: lower heap to 20g [puppet] - 10https://gerrit.wikimedia.org/r/341701 (owner: 10Chad)
[23:38:03] <RoanKattouw>	 Gerrit just went down :(
[23:38:07] <wikibugs_>	 (03PS1) 10Krinkle: Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597)
[23:38:08] <RoanKattouw>	 20 minutes before the SWAT
[23:38:15] <RoanKattouw>	 Oh, I see mutante is probably just restarting it
[23:38:23] <mutante>	 !log gerrit restarting for config changes 341701, 341587 
[23:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:46] <Krinkle>	 James_F: ^ config patch can go out ahead at any time  - no-op since the var doesn't exist (it'll just create an unused var).
[23:39:57] * James_F nods. Let's SWAT it now so that it doesn't disrupt Beta Cluster QAers.
[23:41:19] <mutante>	 RoanKattouw: yes, with RainbowSprinkles 
[23:41:27] <RoanKattouw>	 It's back now
[23:41:28] <icinga-wm>	 PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[23:41:42] <RoanKattouw>	 Sorry for freaking out
[23:41:57] <mutante>	 one thing is odd, i still see your merged change in "incoming reviews" 
[23:42:59] <RainbowSprinkles>	 mutante: Lemme force a reindex of that change. I've noticed that sometimes happens with things merged right before a shutdown
[23:43:17] <RainbowSprinkles>	 Index not flushed before shutdown, probably
[23:43:25] <mutante>	 ok,cool. i tested logging out and in again but it stayed
[23:43:42] <RainbowSprinkles>	 Fixed
[23:43:56] <mutante>	 indeed, thanks :)
[23:45:07] <James_F>	 Can anyone else load https://gerrit.wikimedia.org/r/#/c/341708/ ?
[23:45:17] <paladox>	 RainbowSprinkles that is meant to be fixed
[23:45:36] <paladox>	 that sounds like the bug wasn't fixed
[23:45:50] <paladox>	 https://gerrit-review.googlesource.com/#/c/93479/
[23:45:54] <RainbowSprinkles>	 It's not a big deal
[23:45:59] <paladox>	 ok
[23:46:24] <RainbowSprinkles>	 James_F: No, looking
[23:46:29] <James_F>	 Ta.
[23:47:34] <RainbowSprinkles>	 Weirdly indexed as well
[23:47:49] * RainbowSprinkles grumbles something about proper use of lucene
[23:47:57] <RainbowSprinkles>	 James_F: Fixed'd
[23:48:04] <James_F>	 Ta.
[23:48:05] <wikibugs_>	 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3082407 (10Papaul)
[23:50:25] <RainbowSprinkles>	 I went ahead and reindexed about 10 more changes on either side of the restart out of paranoia
[23:51:35] <wikibugs_>	 (03CR) 10Jforrester: [C: 031] Disable wgCiteResponsiveReferences by default for back-compat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341708 (https://phabricator.wikimedia.org/T33597) (owner: 10Krinkle)
[23:53:28] <wikibugs_>	 (03CR) 10Thcipriani: [C: 031] Scap clean: abort if a branch is still in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 (owner: 10Chad)
[23:53:58] <icinga-wm>	 PROBLEM - puppet last run on aqs1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:54:36] <wikibugs_>	 (03PS6) 10Dzahn: Phabricator: Add a upstart init phd script [puppet] - 10https://gerrit.wikimedia.org/r/341630 (owner: 10Paladox)
[23:55:32] <wikibugs_>	 (03PS22) 10Paladox: Phabricator: Migrate to base::service_unit for phd [puppet] - 10https://gerrit.wikimedia.org/r/340158 (https://phabricator.wikimedia.org/T137928)