[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T0000). Please do the needful. [00:00:04] Krinkle: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:09] o/ [00:02:29] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_geowiki-data-private] [00:02:50] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [00:09:41] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2836836 (10fgiunchedi) p:05Triage>03Normal [00:09:47] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2836837 (10fgiunchedi) p:05Triage>03Normal [00:09:57] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2836838 (10fgiunchedi) p:05Triage>03Normal [00:13:26] 06Operations: set git author and committer name when running as root - https://phabricator.wikimedia.org/T86146#2836840 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This was done in Iae3a3824299d [00:16:03] 06Operations, 10DBA: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2836846 (10fgiunchedi) I don't think this applies anymore but moving on to #DBA's radar for confirmation [00:19:58] 06Operations, 10RESTBase, 10RESTBase-API, 10Traffic, and 2 
others: Expose the PDF rendering service via RESTBase - https://phabricator.wikimedia.org/T143132#2836858 (10Addshore) [00:20:01] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#2836857 (10Addshore) [00:24:11] 06Operations: have the ip ranges from modules/ntp/templates/ntp.conf.erb pull from network.pp - https://phabricator.wikimedia.org/T82962#2836861 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This has happened with the `ntp` module refactoring [00:24:50] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:26:05] (03CR) 10BryanDavis: "What part of the configuration ensures that the multi-packet UDP transport for GELF delivers all of the packets from a given origin host t" [puppet] - 10https://gerrit.wikimedia.org/r/324371 (https://phabricator.wikimedia.org/T151971) (owner: 10Filippo Giunchedi) [00:26:43] 06Operations: OSM Server Deployments - Master Tracking Ticket - https://phabricator.wikimedia.org/T82154#2836865 (10fgiunchedi) [00:27:10] 06Operations, 06Discovery, 06Maps: OSM Server Deployments - Master Tracking Ticket - https://phabricator.wikimedia.org/T82154#2836873 (10fgiunchedi) +Maps, is this still relevant? [00:28:34] (03CR) 10BryanDavis: "> What part of the configuration ensures that the multi-packet UDP" [puppet] - 10https://gerrit.wikimedia.org/r/324371 (https://phabricator.wikimedia.org/T151971) (owner: 10Filippo Giunchedi) [00:28:56] bd808: anything to do on https://phabricator.wikimedia.org/T1229 ? [00:29:43] 06Operations, 06Discovery, 06Maps: OSM Server Deployments - Master Tracking Ticket - https://phabricator.wikimedia.org/T82154#2836874 (10MaxSem) Nope, this is closed. 
[00:30:01] rather "We can just close it" [00:31:10] 06Operations, 06Discovery, 06Maps: OSM Server Deployments - Master Tracking Ticket - https://phabricator.wikimedia.org/T82154#2836875 (10MaxSem) 05stalled>03declined [00:32:01] godog: wow. I have no memory of that at all [00:32:16] seems like it can be closed :) [00:32:42] 06Operations, 06Zero: mdot/zerodot webroot Accept-Language redirects for zero-rated access - https://phabricator.wikimedia.org/T1229#2836881 (10fgiunchedi) 05Open>03Invalid Resolving, doesn't apply anymore. [00:33:09] hehe indeed, I'm scanning through #operations oldest tasks, feels like cleaning the attic [00:33:47] 06Operations: ganglia redis plugin reports negatives values for redis total_connections - https://phabricator.wikimedia.org/T94678#2836883 (10fgiunchedi) 05Open>03declined Ganglia is on its way out. [00:34:36] 06Operations, 07HHVM: Enable the usage of `hhvm -m debug --debug-host ::1` from mw1017 so developers can step through code (think gdb) in production to see what is going wrong. - https://phabricator.wikimedia.org/T94951#2836886 (10fgiunchedi) [00:35:27] 07Puppet: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059#2836887 (10scfc) [00:37:18] 06Operations, 10Traffic: Migrate host lists out of cache.pp to reference values in Hiera - https://phabricator.wikimedia.org/T92601#2836900 (10fgiunchedi) 05Open>03Invalid I think this was eventually resolved during various refactoring, adding #traffic just in case. [00:41:07] 07Puppet: Inconsistent groups for Git repositories with role::puppetmaster::standalone - https://phabricator.wikimedia.org/T152060#2836908 (10scfc) [00:45:03] 06Operations, 07HHVM, 07Wikimedia-log-errors: Unexpected N4HPHP13DataBlockFullE - https://phabricator.wikimedia.org/T89958#2836934 (10fgiunchedi) 05Open>03Invalid I don't think we've seen this reoccurring since then, tentatively resolving. 
[00:51:23] 06Operations: Add updating labs/private with $::puppetmaster_autoupdate feature flag - https://phabricator.wikimedia.org/T75904#2836959 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This was done in Id99a5996e08 (cfr {T92756} too) [00:51:25] (03PS22) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [00:51:33] (03PS23) 10Paladox: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 [00:53:31] (03PS4) 10Paladox: Phabricator: Allow us to change the default web domain in apache [puppet] - 10https://gerrit.wikimedia.org/r/324551 [00:53:43] (03PS5) 10Paladox: Phabricator: Allow us to change the default web domain in apache [puppet] - 10https://gerrit.wikimedia.org/r/324551 [00:54:29] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [00:59:02] 06Operations, 10Traffic: Support ESI for ResourceLoader - https://phabricator.wikimedia.org/T78963#2836973 (10fgiunchedi) Copying Traffic since it'd affect Varnish if we choose to do it [00:59:05] 06Operations, 10Traffic: Support ESI for ResourceLoader - https://phabricator.wikimedia.org/T78963#2836975 (10fgiunchedi) [01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T0100). Please do the needful. [01:00:45] Phupdate? [01:01:07] please no [01:01:09] bd808: would you be interested in having stashbot send NOTICE instead of PRIVMSG? if it does already nevermind [01:01:35] I was looking at T101575, hence the question [01:01:35] T101575: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575 [01:02:12] so... yeah. 
My feeling is the exact opposite [01:02:24] my client pings on every notice in every channel [01:02:47] so it would turn the noise for me up to 11 [01:03:22] ah ok, what client? [01:03:40] jouncebot originally used notice and we changed it to privmsg [01:03:47] I'm using textual [01:03:48] I've 'patched' the issue on my side with a list of nicks that will appear as notice [01:04:14] greg-g: Dephloyment? [01:04:28] I guess it'll stay like that privmsg vs notice [01:04:45] James_F: let's not make this uncomfortable for everyone now [01:04:53] * James_F laughs. [01:05:53] 06Operations: irc bots should send NOTICE not PRIVMSG - https://phabricator.wikimedia.org/T101575#2836981 (10fgiunchedi) 05Open>03declined I don't think this has chances of actually getting implemented, people seem to be used to not having bots issue NOTICEs. [01:06:25] godog: *nod* I could do the same in reverse, but folks on irccloud probably can't so easily [01:06:32] Also, did no-one do the SWAT deployment? [01:06:37] :-( [01:07:22] James_F: Krinkle could have done it himself. Evening swatters have been in short supply recently [01:07:59] No problem. I'll roll it out later. [01:08:10] Or now I suppose since there's no phabup [01:08:36] Argh, what have I wrought? [01:08:57] Krinkle: Ideally someone from FR-tech should be involved in the CentralNotice deploy. [01:09:13] 06Operations, 10Monitoring: icinga log rotation wipes out portions of history - https://phabricator.wikimedia.org/T102397#2836984 (10fgiunchedi) 05Open>03Invalid I think this got fixed with the icinga upgrade/refactor ``` einsteinium:/var/log/icinga$ head -3 icinga.log [1480487103] EXTERNAL COMMAND: PROC... [01:09:28] Krinkle: Given the main FR of the year is in full swing and even a few seconds downtime has significant financial penalties. [01:11:06] * Krinkle grabs deployment hammer [01:11:13] James_F: Already taken care of. [01:11:20] I had them test it on mwdebug1001 earlier today [01:11:23] (45 min ago) [01:11:29] Kk. 
[01:12:42] yep, looked totally benign! [01:12:52] Oh, it is. :-) [01:14:11] Krenair: ping [01:14:12] 06Operations: Migrate parsercache away from being a full RDBMS - https://phabricator.wikimedia.org/T84187#2836993 (10fgiunchedi) 05Open>03declined Tentatively declining this since I don't think it is on anyone's radar (or has been lately). Also AFAIK PC hasn't caused issues lately to the site's availability [01:14:12] !log krinkle@tin Synchronized php-1.29.0-wmf.4/extensions/Citoid/extension.json: I022428 (duration: 00m 46s) [01:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:50] ejegg: here goes. [01:15:01] ejegg: note that it's not going to wikipedias yet since those are still on wmf.3 [01:15:14] it'll roll out there tomorrow [01:15:18] !log krinkle@tin Synchronized php-1.29.0-wmf.4/extensions/CentralNotice/extension.json: I0224288 (duration: 00m 45s) [01:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:34] * MaxSem waits for a mushroom cloud [01:17:14] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#2837007 (10fgiunchedi) I believe the master/data nodes split has been done eventually, @Gehel might know for sure [01:17:51] huh, meta's Special:CentralNotice is still giving me the deprecation warning [01:18:04] logged in, so it shouldn't be cached. let's see [01:18:21] 06Operations, 06Discovery, 06Maps, 10Maps-data: Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2837010 (10Pnorman) For import I generally recommend osm2pgsql uses num CPU threads on machines with up to 8 threads, unless there's something else runni... 
[01:19:06] ejegg: RL is still cached for logged in [01:19:28] ah [01:20:03] ejegg: Krinkle: I guess u'll have no trouble triggering the campaign ;p [01:20:12] !log Ran "CREATE TABLE wbc_entity_usage LIKE dewikivoyage.wbc_entity_usage;" for fiwikivoyage on db1075 (s3 master) (Related: T151570) [01:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:28] T151570: Create Wikivoyage Finnish - https://phabricator.wikimedia.org/T151570 [01:22:19] AndyRussG: Yeah, has a short 5min cache for the startup module shared between all users. [01:22:28] startup is where the dependencies are [01:22:34] 06Operations, 07Wikimedia-log-errors: Memcached TIMEOUT error spam from memcached log for global:slave_lag keys - https://phabricator.wikimedia.org/T108982#2837014 (10fgiunchedi) 05Open>03Invalid The timeout levels are normal now, except for a spike on 2016-11-08. The keys involved are multiple and not jus... [01:22:43] shows as fixed for me now though [01:22:58] but you can bypass the cache by using x-wikimedia-debug with any of the listed servers [01:31:38] 06Operations: rsyncd restart unreliable after configuration changes - https://phabricator.wikimedia.org/T112240#2837034 (10fgiunchedi) 05Open>03Invalid AFAIR we haven't run into issues with rsync not reloading on mw when e.g. refactoring the network constants in puppet, which would have triggered a restart.... [01:32:37] ah cool, looks good to me now too. [01:32:43] thanks, Krinkle! 
[01:33:07] * Krinkle drops the hammer [01:40:20] (03PS2) 10Legoktm: Set $wgUserEmailUseReplyTo = true; on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323672 (https://phabricator.wikimedia.org/T66795) [01:40:37] (03CR) 10Legoktm: [C: 032] Set $wgUserEmailUseReplyTo = true; on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323672 (https://phabricator.wikimedia.org/T66795) (owner: 10Legoktm) [01:41:21] (03Merged) 10jenkins-bot: Set $wgUserEmailUseReplyTo = true; on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323672 (https://phabricator.wikimedia.org/T66795) (owner: 10Legoktm) [01:44:32] (03PS1) 10Filippo Giunchedi: nutcracker: listen on localhost for stats [puppet] - 10https://gerrit.wikimedia.org/r/324642 (https://phabricator.wikimedia.org/T111934) [01:46:26] 06Operations, 07Puppet: Move role::otrs into a module - https://phabricator.wikimedia.org/T107670#2837063 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I believe this was done, `./modules/role/manifests/otrs/webserver.pp` [01:47:46] 06Operations, 07Puppet: Move misc::udp2log into a module - https://phabricator.wikimedia.org/T107671#2837067 (10fgiunchedi) 05Open>03Invalid This was done, `misc::udp2log` is no more. [01:48:00] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Set $wgUserEmailUseReplyTo = true; on group0 wikis - T66795 (duration: 00m 46s) [01:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:14] 06Operations, 07Puppet: Move misc::maintenance into a module - https://phabricator.wikimedia.org/T107672#2837072 (10fgiunchedi) 05Open>03Invalid This was done, `misc::maintenance` is no more. 
[01:48:14] T66795: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795 [01:51:33] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837092 (10Legoktm) The proposed fix is now live on test wikis and mediawiki.org. Here's wha... [01:53:07] 06Operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#2837093 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I'm tentatively resolving this because nowadays `/etc/dsh/group` is generated by merging static lists with `conftool` (by `scap::dsh`) on the deployment server... [01:54:59] 06Operations: recommended ssh ciphers/kexalgorithms combination doesn't work for ilo - https://phabricator.wikimedia.org/T111698#2837103 (10fgiunchedi) [01:57:12] 06Operations, 07discovery-system: Make puppet ca certificate world readable - https://phabricator.wikimedia.org/T110020#2837105 (10fgiunchedi) 05Open>03declined I'm fairly sure we add the puppet CA to `/etc/ssl/certs` so it should be available already system-wide. [01:59:44] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186#2837108 (10fgiunchedi) Adding #traffic for visibility [02:01:59] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837110 (10Huji) Should `From` be `Wikipedia ` or something like either... 
[02:03:38] 06Operations: Add monitoring for nutcracker - https://phabricator.wikimedia.org/T95231#2837111 (10fgiunchedi) [02:04:21] 06Operations: Add monitoring for nutcracker - https://phabricator.wikimedia.org/T95231#1183701 (10fgiunchedi) I think part of the underlying issue is that there's very little nutcracker monitoring so far stats-wise, I've updated the task title to reflect that. [02:32:15] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837114 (10Xaosflux) I'd think the more standard noreply@wikimedia.org type address would be... [02:34:55] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837115 (10Xaosflux) But if we do want to have it perhaps something like username-projectnam... [02:38:32] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#685163 (10MaxSem) The problem here is that usernames are case-sensitive and allow more chara... [02:42:36] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837120 (10Legoktm) >>! In T66795#2837115, @Xaosflux wrote: > Perhaps something like usernam... [02:55:09] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:25:09] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [03:28:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 743.08 seconds [03:40:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 168.07 seconds [03:42:59] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:44:23] (03PS5) 10Aude: Move interwiki sorting orders to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) [03:44:32] (03PS6) 10Aude: Move interwiki sorting orders to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323556 (https://phabricator.wikimedia.org/T111023) [03:52:29] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [04:00:29] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:10:59] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [04:13:49] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:17:49] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:18:19] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:21:19] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:22:39] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [04:27:05] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837245 (10Xaosflux) Maybe the footer needs work too - that doesn't tell you if it came from... [04:28:29] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [04:36:47] (03CR) 10Krinkle: [C: 031] Bump $wgJobBackoffThrottling for cache purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324388 (owner: 10Aaron Schulz) [04:40:57] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837268 (10Legoktm) >>! In T66795#2837245, @Xaosflux wrote: > Maybe the footer needs work to... [04:41:49] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [04:52:52] 06Operations, 06Analytics-Kanban, 06Discovery, 06Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2837271 (10Ottomata) YESSHHHHH I think I did it. @mpopov try: ``` CXX=g++-4.8 CXX1X=g++-4.8... [05:27:03] (03CR) 10Yurik: [C: 031] "Looks good. I suspect that deploying it now would allow a much easier deployment build even for the node4." [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) (owner: 10Gehel) [05:27:39] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:37:49] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:54:39] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:06:49] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:11:47] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 2 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2837316 (10native-api) The last few suggestions are only tangentially relevant. Let's not st... [06:18:32] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 526 MB (5% inode=53%) [06:27:31] RECOVERY - MariaDB disk space on silver is OK: DISK OK [06:28:49] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1076.00 Read Requests/Sec=314.40 Write Requests/Sec=7.00 KBytes Read/Sec=39475.20 KBytes_Written/Sec=164.00 [06:30:29] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-mwclient] [06:31:13] 06Operations, 10DBA: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1096483 (10Marostegui) This is quite old indeed and we do not start MySQL everywhere (apart from labs) on purpose. We do not really want Puppet to handle the MySQL servi... 
[06:35:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324658 (https://phabricator.wikimedia.org/T148967) [06:37:49] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=93.20 Read Requests/Sec=0.80 Write Requests/Sec=4.60 KBytes Read/Sec=3.20 KBytes_Written/Sec=132.40 [06:42:17] !log performed apt-get clean and minor log file cleanup on silver [06:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:29] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:49:35] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 3 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#2837385 (10Joe) [06:50:08] !log Deploy alter table wikidatawiki.revision in codfw hosts only - T150644 [06:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:21] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [06:56:41] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 4 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2837406 (10Joe) [06:57:40] (03CR) 10Jcrespo: "Reverse the load." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324658 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [06:59:34] (03PS2) 10Marostegui: db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324658 (https://phabricator.wikimedia.org/T148967) [07:00:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [07:00:19] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [07:01:31] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324658 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [07:01:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324658 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [07:02:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324658 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [07:05:52] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db1070 - T148967 (duration: 02m 31s) [07:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:05] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [07:06:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 - T148967 (duration: 00m 48s) [07:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:39] (03PS1) 10Jcrespo: mariadb: 
Depool es1016 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324665 (https://phabricator.wikimedia.org/T151995) [07:09:36] (03CR) 10Marostegui: [C: 031] mariadb: Depool es1016 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324665 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [07:10:56] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1016 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324665 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [07:12:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1016 (duration: 00m 45s) [07:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:19] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:16:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:16:19] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:22:00] !log mysql upgrade and restart for es1016 T151995 [07:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:15] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [07:24:36] !log Stop replication db1095 (sanitarium2) on s3 instance - T150802 [07:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:43] T150802: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802 [07:28:01] (03PS1) 10Jcrespo: mariadb: Depool es1011 for mysql restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324667 (https://phabricator.wikimedia.org/T151995) [07:32:42] !log Deploy alter table db1070 - dewiki.revision - T148967 
[07:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:53] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [07:33:42] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1011 for mysql restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324667 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [07:34:00] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1016 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324668 [07:35:26] (03CR) 10Jcrespo: [C: 04-2] "Wait until buffer pool warmup https://grafana.wikimedia.org/dashboard/db/mysql?panelId=1&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324668 (owner: 10Jcrespo) [07:37:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1011 (duration: 00m 48s) [07:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:39] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2837458 (10Shoichi) My plan is to add English code comments, also include comments translating those names with Han characters of functions and variables. I sent the pro... 
[07:38:04] !log mysql restart for es1011 T151995 [07:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:14] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [07:40:43] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1011 for mysql restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324669 [07:40:57] (03CR) 10Jcrespo: [C: 04-2] "Wait for buffer pool warmup https://grafana.wikimedia.org/dashboard/db/mysql?panelId=1&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-serv" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324669 (owner: 10Jcrespo) [07:44:19] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:46:51] (03PS1) 10Jcrespo: mariadb: Depool es1014 for mysql restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324670 (https://phabricator.wikimedia.org/T151995) [07:47:58] (03CR) 10Marostegui: [C: 031] mariadb: Depool es1014 for mysql restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324670 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [07:48:27] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1014 for mysql restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324670 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [07:50:27] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1014 (duration: 00m 44s) [07:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:35] !log mysql restart for es1014 T151995 [07:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:46] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [07:52:04] 06Operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#2837495 (10demon) >>! 
In T80395#2837093, @fgiunchedi wrote: > I'm tentatively resolving this because nowadays `/etc/dsh/group` is generated by merging static lists with `conftool` (by `scap::dsh`) on the deployment server... [07:53:15] (03PS1) 10Kaldari: Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) [07:54:27] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1014 for mysql restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324673 [07:54:53] (03CR) 10Jcrespo: [C: 04-2] "Wait until buffer pool warmup https://grafana.wikimedia.org/dashboard/db/mysql?panelId=1&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324673 (owner: 10Jcrespo) [07:56:21] (03CR) 10Marostegui: "this looks fine: https://puppet-compiler.wmflabs.org/4730/" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [07:56:43] (03PS4) 10Marostegui: mariadb: Split backup and otrsbackups classes into a different file [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) [07:57:29] !log elukey@tin Starting deploy [analytics/pivot/deploy@0513a6e]: (no message) [07:57:31] !log elukey@tin Finished deploy [analytics/pivot/deploy@0513a6e]: (no message) (duration: 00m 02s) [07:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:57] (deployed pivot on thorium) [07:58:56] (03CR) 10Jcrespo: "Can you check dbstore1001 and 2, those are the important ones this change affect." [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:00:05] (03CR) 10Jcrespo: "actually, dbstore1001 and es2001 should be the servers that could get affected by this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:02:07] (03CR) 10Marostegui: "NODES='dbstore1001.eqiad.wmnet,dbstore1002.eqiad.wmnet, es2001.codfw.wmnet'" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:03:00] (03CR) 10Marostegui: "NODES='dbstore1001.eqiad.wmnet,dbstore1002.eqiad.wmnet, es2001.codfw.wmnet': https://puppet-compiler.wmflabs.org/4732/ they look good" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:03:45] (03CR) 10Jcrespo: [C: 031] mariadb: Split backup and otrsbackups classes into a different file [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:04:25] (03CR) 10Jcrespo: "Will need follow up: an enable parameter and the dbstore1001 cron puppetization." [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:05:41] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1016 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324668 (owner: 10Jcrespo) [08:05:52] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1016 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324668 [08:08:48] (03PS1) 10Jcrespo: mariadb: Depool es1018 for mysql upgrade and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324674 (https://phabricator.wikimedia.org/T151995) [08:10:23] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1016 (duration: 00m 45s) [08:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:03] (03CR) 10Marostegui: "What do you mean? 
The class has not been touched really" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:14:57] (03CR) 10Jcrespo: "Yes, I voted +1, which means this can go. I am noting some pre-existent problems that we will need to solve before we failover dbstore1001" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:15:33] (03CR) 10Marostegui: "Ah - ok ok. Thanks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:15:38] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 07kubernetes: Load balancing "external" traffic to the Kubernetes cluster in production - https://phabricator.wikimedia.org/T152078#2837522 (10Joe) [08:21:26] (03CR) 10Marostegui: [C: 032] mariadb: Split backup and otrsbackups classes into a different file [puppet] - 10https://gerrit.wikimedia.org/r/320989 (https://phabricator.wikimedia.org/T150851) (owner: 10Marostegui) [08:32:21] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1011 for mysql restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324669 (owner: 10Jcrespo) [08:32:26] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1011 for mysql restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324669 [08:41:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1011 (duration: 00m 48s) [08:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:41] (03PS1) 10ArielGlenn: make sure db_user and db_password config attributes are initialized [dumps] - 10https://gerrit.wikimedia.org/r/324678 [08:48:59] (03PS2) 10ArielGlenn: make sure db_user and db_password config attributes are initialized [dumps] - 10https://gerrit.wikimedia.org/r/324678 [08:49:36] (03CR) 10ArielGlenn: [C: 032] make sure db_user and db_password config attributes are initialized [dumps] - 
10https://gerrit.wikimedia.org/r/324678 (owner: 10ArielGlenn) [08:50:52] !log ariel@tin Starting deploy [dumps/dumps@2b35e77]: less logging, fix regression for db_user/password retrieval [08:50:55] !log ariel@tin Finished deploy [dumps/dumps@2b35e77]: less logging, fix regression for db_user/password retrieval (duration: 00m 03s) [08:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:10] remembered to use the message, w00t [08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:29] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:54:57] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1014 for mysql restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324673 (owner: 10Jcrespo) [08:55:01] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1014 for mysql restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324673 [08:57:24] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1014 (duration: 00m 45s) [08:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:50] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 07kubernetes: Load balancing "external" traffic to the Kubernetes cluster in production - https://phabricator.wikimedia.org/T152078#2837643 (10Joe) p:05Triage>03Normal [09:03:42] (03PS2) 10Jcrespo: mariadb: Depool es1018 for mysql upgrade and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324674 (https://phabricator.wikimedia.org/T151995) [09:04:49] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [09:04:55] <_joe_> uhm [09:05:26] <_joe_> Guest59376: I preferred your old nickname :P [09:05:28] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1018 for mysql upgrade and restart [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324674 
(https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [09:06:45] <_joe_> YuviPanda: I would like your input on https://phabricator.wikimedia.org/T152078 [09:06:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1018 (duration: 00m 48s) [09:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:29] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:08:23] (03PS1) 10Elukey: Add openjdk-8-jdk to the list of statistics packages [puppet] - 10https://gerrit.wikimedia.org/r/324679 (https://phabricator.wikimedia.org/T151896) [09:08:37] hey joe [09:08:46] I'll look in a bit! [09:08:54] <_joe_> no rush :) [09:09:05] <_joe_> just notifying you better than just subscribing you :) [09:12:14] joe: +1, too many subscriptions :) [09:12:24] <_joe_> heh, same problem here [09:13:18] <_joe_> I try to keep up with at least part of those, but it takes a growing amount of time to do so [09:14:19] !log mysql restart and upgrade for es1018 T151995 [09:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:29] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:14:30] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [09:18:08] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1018 for mysql upgrade and restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324681 [09:18:30] (03CR) 10Jcrespo: [C: 04-2] "Wait until buffer pool warmup https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2Fops&var-server=es1018&from=1480" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324681 (owner: 10Jcrespo) [09:24:44] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#2837716 (10Gehel) Actually this has not been done yet. I'm waiting for the new elasticsearch servers to try to dedicate some of... [09:27:07] (03PS1) 10Jcrespo: mariadb: Master switchover on es2 shard (eqiad) es1015 -> es1011 [puppet] - 10https://gerrit.wikimedia.org/r/324683 (https://phabricator.wikimedia.org/T151995) [09:28:44] (03PS1) 10Jcrespo: Master switchover on es3 shard (eqiad) es1019 -> es1014 [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) [09:29:01] (03PS7) 10MarcoAurelio: Re-enable 'centralauth-rename' rights for when maintenance is done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) [09:31:31] !log chaning es2 eqiad replication topology in preparation for master switchover [09:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:29] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:41:53] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 07kubernetes: Load balancing "external" traffic to the Kubernetes cluster in production - 
https://phabricator.wikimedia.org/T152078#2837740 (10Joe) Actually there is an haproxy-based implementation of LoadBalancer [[https://github.com/kub... [09:43:29] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [09:45:40] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 07kubernetes: Load balancing "external" traffic to the Kubernetes cluster in production - https://phabricator.wikimedia.org/T152078#2837743 (10yuvipanda) I think (5) is a great start, given the fact that our list of services is going to b... [09:50:19] (03CR) 10Marostegui: Master switchover on es3 shard (eqiad) es1019 -> es1014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [09:53:22] (03PS1) 10Jcrespo: mariadb: ignore '' in private data check, print results as we get them [puppet] - 10https://gerrit.wikimedia.org/r/324685 (https://phabricator.wikimedia.org/T150802) [09:54:09] PROBLEM - puppet last run on mw1270 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:54:36] !log added --debug to the puppet compiler options in Jenkins [09:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:57] (temporarily, investigating a diff issue) [09:56:02] (03CR) 10Marostegui: [C: 031] mariadb: ignore '' in private data check, print results as we get them [puppet] - 10https://gerrit.wikimedia.org/r/324685 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:58:37] (03CR) 10Jcrespo: [C: 032] mariadb: ignore '' in private data check, print results as we get them [puppet] - 10https://gerrit.wikimedia.org/r/324685 (https://phabricator.wikimedia.org/T150802) (owner: 10Jcrespo) [09:59:42] (03PS2) 10Jcrespo: mariadb: Master switchover on es2 shard (eqiad) es1015 -> es1011 [puppet] - 10https://gerrit.wikimedia.org/r/324683 (https://phabricator.wikimedia.org/T151995) [10:01:29] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 57 probes of 414 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [10:04:28] (03PS1) 10Jcrespo: mariadb: swithover es2 master (eqiad) es1015 -> es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324688 (https://phabricator.wikimedia.org/T151995) [10:04:39] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 84, down: 0, shutdown: 0 [10:07:13] (03CR) 10Marostegui: [C: 031] mariadb: swithover es2 master (eqiad) es1015 -> es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324688 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [10:10:12] (03CR) 10Jcrespo: [C: 032] mariadb: swithover es2 master (eqiad) es1015 -> es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324688 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [10:10:48] (03Merged) 10jenkins-bot: mariadb: swithover es2 master (eqiad) es1015 -> es1011 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324688 (https://phabricator.wikimedia.org/T151995) 
(owner: 10Jcrespo) [10:11:29] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 414 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [10:15:49] (03PS1) 10Addshore: Update docblock [software/grafana/simple-json-datasource-implementation] - 10https://gerrit.wikimedia.org/r/324689 [10:15:59] (03CR) 10Addshore: [C: 032 V: 032] Update docblock [software/grafana/simple-json-datasource-implementation] - 10https://gerrit.wikimedia.org/r/324689 (owner: 10Addshore) [10:16:23] (03Abandoned) 10Gehel: Add 'discovery-stats' technical user to the 'stats' group. [puppet] - 10https://gerrit.wikimedia.org/r/323399 (https://phabricator.wikimedia.org/T149722) (owner: 10Gehel) [10:17:33] (03CR) 10Jcrespo: [C: 032] mariadb: Master switchover on es2 shard (eqiad) es1015 -> es1011 [puppet] - 10https://gerrit.wikimedia.org/r/324683 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [10:19:14] (03PS2) 10Gehel: maps / kartotherian: libmapnik3.0 is required for the upgrade to nodejs 6 [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) [10:19:22] (03CR) 10Gehel: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) (owner: 10Gehel) [10:19:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: switchover es2 master (eqiad) es1015 -> es1011 (duration: 00m 45s) [10:19:52] (03PS3) 10Gehel: maps / kartotherian: libmapnik3.0 is required for the upgrade to nodejs 6 [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) [10:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:08] (03CR) 10Gehel: [C: 032] maps / kartotherian: libmapnik3.0 is required for the upgrade to nodejs 6 [puppet] - 10https://gerrit.wikimedia.org/r/322278 (https://phabricator.wikimedia.org/T150722) (owner: 10Gehel) [10:22:24] jynus: did you just merge my puppet change as well as yours? 
[10:22:28] yes [10:22:37] if I didn't merge mine, we would have an outage [10:22:45] jynus: Ok, thanks! I was wondering ... [10:22:55] jynus: no problem, just making sure [10:23:09] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:24:19] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:24:28] (03PS3) 10Addshore: WIP Add grafana_json_datasource [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) [10:25:03] !log removed --debug flag to the puppet compiler output [10:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:39] (03PS1) 10Gehel: kartotherian: fix package as array [puppet] - 10https://gerrit.wikimedia.org/r/324691 (https://phabricator.wikimedia.org/T150722) [10:25:39] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:25:49] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:26:12] (03CR) 10Addshore: "Is there an example elsewhere in puppet that I can follow for switching this from a ThirdLD to something under g.wm.o?" [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [10:26:16] puppet issues above are mine, fix coming up [10:26:51] (03CR) 10Gehel: [C: 032] kartotherian: fix package as array [puppet] - 10https://gerrit.wikimedia.org/r/324691 (https://phabricator.wikimedia.org/T150722) (owner: 10Gehel) [10:28:09] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:28:39] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:28:48] (03PS1) 10ArielGlenn: really fix db_user and db_password issue [dumps] - 10https://gerrit.wikimedia.org/r/324694 [10:28:55] (03PS1) 10Marostegui: wmnet: Change es2 master [dns] - 10https://gerrit.wikimedia.org/r/324695 (https://phabricator.wikimedia.org/T151995) [10:29:09] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:29:32] PROBLEM - MariaDB Slave SQL: es2 on es1015 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.blobs_cluster24: Duplicate entry 26290612 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log es1011-bin.001013, end_log_pos 235090366 [10:29:52] PROBLEM - MariaDB Slave SQL: es2 on es1011 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table dewiki.blobs_cluster24: Duplicate entry 26290612 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log es1015-bin.000998, end_log_pos 728604542 [10:29:53] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:29:57] worry or not worry? 
[10:30:07] not sure [10:30:08] I guess is you guys changing es2 master [10:30:10] I see the dns change patchset [10:30:18] but it could be a real problem [10:30:19] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:30:23] or a false alarm [10:30:35] it is indeed a real problem [10:30:53] dup key [10:31:21] let's depool es1015 [10:31:30] it should be depooled automatically [10:32:07] the new master es1011 has it too [10:32:35] :-/ [10:34:48] (03CR) 10ArielGlenn: [C: 032] really fix db_user and db_password issue [dumps] - 10https://gerrit.wikimedia.org/r/324694 (owner: 10ArielGlenn) [10:34:52] RECOVERY - MariaDB Slave SQL: es2 on es1011 is OK: OK slave_sql_state not a slave [10:37:12] PROBLEM - MariaDB Slave Lag: es2 on es1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.01 seconds [10:37:35] yes, yes, I am on it [10:37:39] going to ack [10:37:52] no it will only make the alerts worse [10:37:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [10:38:01] ok [10:39:26] 06Operations, 10puppet-compiler: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#2752413 (10elukey) Had the same problem today with: https://puppet-compiler.wmflabs.org/4736/stat1002.eqiad.wmnet/prod.stat1002.eqiad.wmnet.pson https://puppet-c... 
[10:39:40] !log ariel@tin Starting deploy [dumps/dumps@a3801fa]: second try on db_user fixup [10:39:42] !log ariel@tin Finished deploy [dumps/dumps@a3801fa]: second try on db_user fixup (duration: 00m 01s) [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:12] RECOVERY - MariaDB Slave Lag: es2 on es1015 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:40:30] so we may have now 66 orphan rows [10:40:32] RECOVERY - MariaDB Slave SQL: es2 on es1015 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:41:45] did you skip? [10:41:50] nope [10:41:55] SIGYN IS ON THE OTHER SIDE OF A NETSPLIT [10:42:01] THE ANTI SPAM KLINE BOT IS DEAF [10:42:03] *DEAD [10:42:11] LETS SPAM ALL THE CHANS! [10:42:14] only deleted some dewiki rows [10:42:44] ah [10:42:54] did you log them somewhere? [10:43:04] I backed up them [10:43:21] but I will recover them better from the binary log [10:43:37] I will wait first that it doesn't create more issues [10:44:02] sure [10:44:49] dammit sigyn is back [10:44:54] !ops pls ban sigyn [10:45:00] KILL SIGYN [10:45:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [10:47:44] <_joe_> thanks AlexZ [10:47:56] <_joe_> saved me son work :) [10:50:04] lol np _joe_ [10:50:33] eh? [10:50:43] why does AlexZ get all the hugz [10:50:56] I am NOT jelous! 
>:( [10:51:09] :P [11:09:13] * hashar hugs ToAruShiroiNeko [11:10:30] * elukey hugs hashar [11:11:57] ◔_◔ [11:12:12] * apergos hugs elukey [11:12:15] 🤗 🤗 🤗 [11:12:26] no return hugs needed reallythat'sfine [11:12:30] yes there is a hugging face inUnicode ( http://emojipedia.org/hugging-face/ ) [11:13:01] I believe that people needs hugs for their work sometimes :D [11:13:06] * mafk feels a bit jealous [11:13:09] :P [11:13:20] jouncebot: hug | mafk [11:13:25] stupid bots [11:13:28] ahahah [11:13:29] !hug [11:13:43] !hug is hugs $1 [11:13:44] Key was added [11:13:46] !hug mafk [11:13:46] hugs mafk [11:13:48] ;d [11:13:52] hashar: the only bot that hugs people is StewardBot :P [11:15:20] how do we list all the keys it knows? [11:15:42] !help [11:15:42] want docs? ask for "!wm-bot". all keywords? try "@regsearch .*" [11:16:04] !wm-bot [11:16:04] http://meta.wikimedia.org/wiki/WM-Bot [11:16:33] apergos: http://bots.wmflabs.org/~wm-bot/db/%23wikimedia-operations.htm [11:16:37] Results (Found 77): puppet, instance, morebots, git, bang, nagios, bot, labs-home-wm, labs-nagios-wm, labs-morebots, gerrit-wm, wiki, labs, bastion, extension, wm-bot, projects, putty, gerrit, wikitech, revision, monitor, alert, password, unicorn, help, bz, os-change, instancelist, instance-json, leslie's-reset, damianz's-reset, amend, credentials, queue, socks-proxy, info, security, logging, ask, sudo, access, $realm, keys, $site, bug, pageant, blueprint-dns, stucked, pxe, ghsh, group, pathconflict, terminology, rt, erb, regsubst, bots, wt, gerrit-search, change, dn, opshelp, testwiki, sal, task, thx, depool, cluster, hiera, infobot, zuul, jouncebot, jenkins, mira, selfie, hug, [11:16:37] @regsearch .* [11:17:05] hm 77 and it's truncated [11:17:08] that's irc fail [11:17:34] thanks for that link, hashar [11:19:05] !keys [11:19:06] http://bots.wmflabs.org/~petrb/db/ list of infobot keys [11:19:17] ic [11:21:12] that maybe should be updated or removed [11:34:25] (03PS1) 10Yuvipanda: toollabs: Move 
exec_environ package list to hiera [puppet] - 10https://gerrit.wikimedia.org/r/324699 (https://phabricator.wikimedia.org/T152089) [11:38:57] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 07kubernetes: Load balancing "external" traffic to the Kubernetes cluster in production - https://phabricator.wikimedia.org/T152078#2837922 (10Joe) @yuvipanda I see some definite advantages and some big limitations in how Ingress works.... [11:45:05] <_joe_> !unicorn [11:45:05] http://www.ascii-art.de/ascii/uvw/unicorn.txt [11:45:10] <_joe_> lol [11:45:43] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active [11:47:24] (03CR) 10Tim Landscheidt: "Interesting concept :-). It would be nice to add "labs/%{::labsproject}/os/%{::lsbdistcodename}" (or similar) to modules/puppetmaster/fil" [puppet] - 10https://gerrit.wikimedia.org/r/324699 (https://phabricator.wikimedia.org/T152089) (owner: 10Yuvipanda) [11:47:42] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 84, down: 0, shutdown: 0 [11:51:22] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRge-2/0/0: down - Tilaa OOB swap [1Gbps DF]BR [11:53:42] (03PS1) 10MarcoAurelio: Enable $wgAbuseFilterProfile for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324701 (https://phabricator.wikimedia.org/T152087) [11:54:02] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [11:54:40] (03CR) 10Yuvipanda: "ah, yes :) I'm mostly doing this to make it easy to keep 'list of packages' in sync between containers and puppet. 
I'm loathe to add more " [puppet] - 10https://gerrit.wikimedia.org/r/324699 (https://phabricator.wikimedia.org/T152089) (owner: 10Yuvipanda) [11:55:13] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [11:59:12] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 78.78 ms [12:08:15] (03PS1) 10ArielGlenn: allow dumps of private tables to be skipped via config setting [dumps] - 10https://gerrit.wikimedia.org/r/324702 (https://phabricator.wikimedia.org/T152021) [12:26:12] PROBLEM - puppet last run on puppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:37:54] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 07kubernetes: Load balancing "external" traffic to the Kubernetes cluster in production - https://phabricator.wikimedia.org/T152078#2837964 (10akosiaris) `"2. Can scale to large amounts of outgoing traffic"` Don't you mean incoming (as i... 
[12:40:53] (03CR) 10Ema: [C: 031] varnish: make PURGE more efficient [puppet] - 10https://gerrit.wikimedia.org/r/324270 (owner: 10BBlack) [12:43:37] (03PS1) 10Gehel: logstash - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) [12:44:30] (03CR) 10jenkins-bot: [V: 04-1] logstash - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [12:45:01] (03PS2) 10Gehel: logstash - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) [12:45:30] (03CR) 10Elukey: "Update after a chat with Giuseppe:" [puppet] - 10https://gerrit.wikimedia.org/r/323807 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [12:45:45] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2837983 (10Gilles) [12:45:47] 06Operations, 06Performance-Team, 10Thumbor: Investigate source of thumbnail 302 redirects - https://phabricator.wikimedia.org/T148410#2837981 (10Gilles) 05Open>03Resolved @fgiunchedi gave me 2 hours of logs (about 1200 hits) privately and what I've found is: - The overwhelming majority is low-quality bl... 
[12:49:47] !log reedy@tin Synchronized php-1.29.0-wmf.3/api.php: Remove oris bandaid T151702 (duration: 00m 46s) [12:49:55] _joe_: ^ gone [12:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:00] T151702: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702 [12:52:17] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2837991 (10Gilles) [12:53:41] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2838005 (10Gilles) [12:54:12] RECOVERY - puppet last run on puppetmaster1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:58:03] (03PS1) 10Elukey: Set daily logrotation for stats JSON files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324706 [12:58:27] (03CR) 10jenkins-bot: [V: 04-1] Set daily logrotation for stats JSON files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324706 (owner: 10Elukey) [12:58:42] gneee [12:59:17] mmm ERROR: InvocationError: '/home/jenkins/workspace/tox-jessie/.tox/flake8/bin/flake8' [12:59:26] (03CR) 10Elukey: "recheck" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324706 (owner: 10Elukey) [12:59:47] 06Operations, 10Traffic: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093#2838033 (10ema) [13:00:00] 06Operations, 10Traffic: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093#2838046 (10ema) p:05Triage>03High [13:00:40] (03PS2) 10Giuseppe Lavagetto: docker: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/323815 [13:03:49] (03CR) 10Ema: [C: 031] Set daily logrotation for stats JSON files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324706 (owner: 10Elukey) [13:08:58] (03PS1) 10Elukey: Remove Ganglia monitoring for Varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/324708 (https://phabricator.wikimedia.org/T152093) 
[13:10:16] (03PS1) 10Yuvipanda: [WIP] Add a dump of all packages from exec_environ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/324709 [13:10:40] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add a dump of all packages from exec_environ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/324709 (owner: 10Yuvipanda) [13:16:37] (03CR) 10Ema: [C: 031] Remove Ganglia monitoring for Varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/324708 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [13:19:25] !log cr1-eqiad: setting ae4 and its members (links to asw2-d-eqiad) to disable [13:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] !log Upgrading asw2-d-eqiad to JunOS 15.1R5 (T133387) [13:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:43] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [13:21:59] (03PS2) 10Yuvipanda: [WIP] Add a dump of all packages from exec_environ [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/324709 [13:22:01] (03PS1) 10Yuvipanda: Play with more newline things to satisfy pep8 master [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/324712 [13:22:07] expect alerts for mc1033-mc1036 and lvs1007-lvs1012, these are totally expected and nothing to worry about [13:22:15] elukey: ^ [13:23:14] super [13:28:04] (03CR) 10Elukey: [C: 032 V: 032] "tox-jessie complains about:" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324706 (owner: 10Elukey) [13:33:02] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [13:36:30] (03PS2) 10Elukey: Set daily logrotation for stats JSON files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324706 [13:36:32] (03PS1) 10Elukey: Fix varnishkafka_ganglia.py unit test [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324713 [13:37:31] (03CR) 10Elukey: [C: 032] Fix varnishkafka_ganglia.py unit test [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324713 (owner: 10Elukey) [13:38:30] jouncebot: next [13:38:30] In 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1400) [13:38:52] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:40:29] (03CR) 10Hashar: "If I understand it properly, that is blocked on some maintenance script to complete. Namely T148242. Looks like it hasn't completed yet :" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) (owner: 10MarcoAurelio) [13:40:44] (03PS1) 10Elukey: Update varnishkafka submodule SHA [puppet] - 10https://gerrit.wikimedia.org/r/324716 [13:42:27] 07Puppet: On standalone puppetmasters labstore files in /usr/local/sbin get group 998 (gitpuppet) - https://phabricator.wikimedia.org/T152095#2838134 (10scfc) [13:43:44] (03PS2) 10Marostegui: wmnet: Change es2 and es3 master [dns] - 10https://gerrit.wikimedia.org/r/324695 (https://phabricator.wikimedia.org/T151995) [13:43:46] (03CR) 10Elukey: [C: 032 V: 032] "PCC: https://puppet-compiler.wmflabs.org/4741/cp4004.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/324716 (owner: 10Elukey) [13:44:55] (03CR) 10Hashar: [C: 04-1] "I have read T148242. 
The maintenance script choked on the labswiki database, so it is not fully complete :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322667 (https://phabricator.wikimedia.org/T148242) (owner: 10MarcoAurelio) [13:45:26] (03PS2) 10Hashar: Allow contentadmin and sysop to add/remove autopatrolled users on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324401 (owner: 10MarcoAurelio) [13:47:48] !log Nodepool is out of instances due to OpenStack API spurting a nova.exception.ImageNotAuthorized HTTP 500 [13:47:51] (03CR) 10DCausse: [C: 031] logstash - upgrade to Java 8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [13:47:54] CI is stall as a result [13:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:45] T152096 [13:50:45] T152096: OpenStack API refuses to launch new instances || Nodepool is out of instance / CI stalled - https://phabricator.wikimedia.org/T152096 [13:52:18] !log changing es3 eqiad replication topology in preparation for master switchover [13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:49] !next [13:53:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1070 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324717 [13:53:49] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2838170 (10Gilles) [13:53:51] I do not have enough time to do it in 7 minutes [13:54:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1070 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324717 [13:55:26] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1070 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324717 (owner: 10Marostegui) [13:56:00] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1070 for maintenance" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/324717 (owner: 10Marostegui) [13:57:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 - T148967 (duration: 00m 45s) [13:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:21] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1400). Please do the needful. [14:00:04] mafk: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:02:02] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:02:04] well I am postponing SWAT [14:02:16] one of the change apparently depend on a maintenance script to complete [14:02:24] the other is trivial (changes a few rights) [14:04:23] hashar, if you are postponing [14:04:30] I will take the time [14:04:50] hashar: oops, I have forgot about swat 😳 [14:04:59] but looks like you are taking care of it [14:05:33] * elukey is checking stat1002 disk space alarm [14:06:52] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [14:09:11] who's SWATing? [14:09:35] jouncebot: now [14:09:35] For the next 0 hour(s) and 50 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1400) [14:10:20] (03PS1) 10Mobrovac: PDF Render: Increase the concurrency to 8 [puppet] - 10https://gerrit.wikimedia.org/r/324720 [14:10:44] Niharika: rename is being enabled now, is that okay? [14:10:55] mafk: Yes. 
[14:11:07] Niharika: okay [14:11:13] (03CR) 10DCausse: [C: 031] "we would need this soon to activate features that are language dependent. Is there any strong objections with the proposed solution?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319253 (https://phabricator.wikimedia.org/T149755) (owner: 10EBernhardson) [14:11:16] Thanks for taking care of this! [14:11:20] np [14:12:17] hashar: no need to pospone then [14:12:29] maintenance script is no longer running [14:14:16] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2838192 (10Trizek-WMF) [14:14:53] !log rebooting asw2-d-eqiad [14:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:13] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2838193 (10BBlack) I've made this argument before. I'm not fond of Commons/upload images/thumbs being hotlinkable. In my mind, Commons exists to serve the multimedia needs of the encyclopedic content, and hotlinking from it... [14:15:45] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like this in general, but there seems to be a missing file, and some other minor comments." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324210 (owner: 10Alexandros Kosiaris) [14:16:04] zeljkof: can SWAT be done in light of that script not running anymore? 
[14:16:50] mafk: I guess it's a question for hashar, he is doing this swat [14:17:34] (03PS1) 10Jcrespo: mariadb: swithover es3 master (eqiad) es1019 -> es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324723 (https://phabricator.wikimedia.org/T151995) [14:17:35] I'm not sure what is going on, but I don't see why the swat should not continue if the script is finished [14:17:46] !log restarting kafka on kafka200[12] for openjdk upgrades [14:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:53] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:53] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:53] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:02] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:11] these one are expected --^ [14:18:13] not live [14:18:42] PROBLEM - Host asw2-d-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:19:58] !log restarting kafka also on kafka2003 [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:52] zeljkof: yep, I feel the same. Script suspension is linked in the deployments page, maybe hashar didn't read it [14:21:18] mafk: zeljkof: I am busy with another outage [14:21:32] but looks like from the task that the maintenance script has failed overnight on "labswiki" database [14:21:35] and hence it is not completed [14:21:52] Niharika say it's okay, she's the one running the script [14:21:57] so it is no longer running because it failed at some point [14:22:08] hashar: I left a comment. [14:22:21] hashar: https://phabricator.wikimedia.org/E381#3954 [14:22:38] hello :] [14:22:48] hashar: It'll take a few days to get it back in shape, I'm thinking of rescheduling running it next week or so. 
[14:22:50] (03CR) 10Marostegui: [C: 031] Master switchover on es3 shard (eqiad) es1019 -> es1014 [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [14:22:52] How does that sound? [14:23:08] (03CR) 10Marostegui: [C: 031] mariadb: swithover es3 master (eqiad) es1019 -> es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324723 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [14:23:09] hashar: Hello. :) [14:23:12] sounds like it is not complete and we cannot reenable global renames ? :] [14:23:26] I am in the middle of an outage right now so cant really look deeper in it [14:23:40] hashar: should I take over the swat? [14:23:47] feel free to pair the completion with zeljkof [14:23:52] RECOVERY - Host asw2-d-eqiad.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [14:23:56] okay [14:23:58] but looks like we first need to finish the script [14:24:02] and then can reenable [14:24:13] I'm okay with that [14:24:21] but the other change can be merged [14:24:27] the other trivial patch can be landed though. I will do it later unless someone does the deploy [14:24:34] yeah [14:24:41] Okay. I don't mind it either way. Thanks. [14:24:49] Allow sysop and contentadmin users to add and remove autopatrolled users on Wikitech [14:24:52] eg https://gerrit.wikimedia.org/r/#/c/324401/ [14:24:54] that can be pushed [14:24:55] :] [14:24:58] I'm not sure I'll be avalaible later so if it can be done now I'd love to have it deployed [14:25:38] hashar: just to make sure I understood, 322667 is blocked, but I should deploy 324401? [14:25:38] as for the rename, if it can't be done, I don't mind [14:25:52] please confirm and I will start the swat [14:26:06] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2838225 (10Gilles) Except I've looked at the data for 302s and there was no legitimate use. 
Even on the rare instances of a blog that seemed to have an educative purpose, there was no attribution and the text was probably cop... [14:26:26] zeljkof: userrights changes in wikitech can go [14:26:39] the other one is blocked for now [14:28:16] [15:24:49] Allow sysop and contentadmin users to add and remove autopatrolled users on Wikitech [14:28:16] [15:24:52] eg https://gerrit.wikimedia.org/r/#/c/324401/ [14:28:16] [15:24:54] that can be pushed [14:28:16] [15:24:55] :] [14:28:18] zeljkof: ^ [14:28:27] so yeah push 324401 : ] [14:28:39] hashar: ok, taking over the swat then [14:28:43] 322667 depends on the completion of a maintenance script or some other thing [14:28:44] ;D [14:28:45] sorry [14:28:46] ! [14:29:23] no problem, I see you are busy with the outage :) [14:31:35] mafk: can you test 324401 at mwdebug1002? (once it is there? [14:31:51] zeljkof: wikitech is not on silver, can't be tested there [14:32:07] mafk: ok, deploying then directly to the cluster [14:32:19] I'll test there once deployed [14:32:35] (03PS3) 10Zfilipin: Allow contentadmin and sysop to add/remove autopatrolled users on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324401 (owner: 10MarcoAurelio) [14:32:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [14:32:50] you meant wikitech is in silver/is not on the main cluster [14:33:02] PROBLEM - Check systemd state on kafka2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:33:08] yes [14:33:19] checking kafka [14:34:23] problem with mirror maker, restarted [14:34:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324401 (owner: 10MarcoAurelio) [14:34:42] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [14:35:02] RECOVERY - Check systemd state on kafka2001 is OK: OK - running: The system is fully operational [14:35:12] (03Merged) 10jenkins-bot: Allow contentadmin and sysop to add/remove autopatrolled users on Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324401 (owner: 10MarcoAurelio) [14:35:30] !log restbase deployed 91551bf [14:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:14] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Kube-proxy: Amend to support more than labs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324211 (owner: 10Alexandros Kosiaris) [14:37:56] (03PS2) 10Gilles: Nginx timeout should be higher than thumbor subprocess timeout [puppet] - 10https://gerrit.wikimedia.org/r/323403 (https://phabricator.wikimedia.org/T151459) [14:38:08] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:324401|Allow contentadmin and sysop to add/remove autopatrolled users on Wikitech]] (duration: 00m 50s) [14:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:26] mafk: 324401 is live, please test [14:38:51] zeljkof: listgrouprights is fine [14:38:55] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "change the call to the class in role::pdfrender instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/324720 (owner: 10Mobrovac) [14:39:08] mafk: great, in that case ending the swat [14:39:14] !log EU SWAT finished [14:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:27] and I can add/remove it through userrights [14:39:35] so great && many thanks [14:39:43] !hug zeljkof [14:39:43] hugs zeljkof [14:40:26] mafk: thanks you for deploying with us, hope to see you soon ;) [14:40:44] zeljkof: sure, I'm always here ;) [14:41:13] (03PS2) 10Mobrovac: PDF Render: Increase the concurrency to 8 [puppet] - 10https://gerrit.wikimedia.org/r/324720 [14:42:02] zeljkof, mafk taking the deployment window [14:42:07] to do some server maintenance [14:42:17] jynus: I am done with swat [14:42:23] jynus: fine, I don't have anything more to deploy [14:42:24] thank you [14:43:36] (03PS2) 10Jcrespo: Master switchover on es3 shard (eqiad) es1019 -> es1014 [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) [14:43:47] (03PS2) 10Jcrespo: mariadb: swithover es3 master (eqiad) es1019 -> es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324723 (https://phabricator.wikimedia.org/T151995) [14:43:50] (03PS3) 10Gehel: logstash - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) [14:43:54] (03CR) 10Gehel: logstash - upgrade to Java 8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [14:44:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] PDF Render: Increase the concurrency to 8 [puppet] - 10https://gerrit.wikimedia.org/r/324720 (owner: 10Mobrovac) [14:44:50] (03CR) 10Jcrespo: [C: 032] mariadb: swithover es3 master (eqiad) es1019 -> es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324723 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [14:46:37] !log restarting kafka on kafka100[123] (EventBus) for 
openjdk upgrades [14:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: switchover es3 master (eqiad) es1019 -> es1014 (duration: 00m 44s) [14:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:04] (03CR) 10Jcrespo: [C: 032 V: 032] Master switchover on es3 shard (eqiad) es1019 -> es1014 [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [14:49:12] (03PS3) 10Jcrespo: Master switchover on es3 shard (eqiad) es1019 -> es1014 [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) [14:49:15] (03CR) 10Jcrespo: [V: 032] Master switchover on es3 shard (eqiad) es1019 -> es1014 [puppet] - 10https://gerrit.wikimedia.org/r/324684 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [14:51:12] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [14:51:24] <_joe_> uhm [14:51:29] <_joe_> that doesn't sound right [14:51:42] PROBLEM - pdfrender on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 5252: Connection refused [14:51:43] could it be us? [14:51:49] <_joe_> yes [14:51:50] or unrelated? [14:51:58] <_joe_> sorry, it's me [14:52:09] !log Nodepool / CI are processing again [14:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:05] us== this dangerous mediwiki deploys we are doing [14:56:48] (03PS1) 10Tim Landscheidt: puppetmaster: Clone repositories in Labs as root [puppet] - 10https://gerrit.wikimedia.org/r/324727 (https://phabricator.wikimedia.org/T152059) [14:57:17] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2838311 (10BBlack) Keep in mind I fundamentally agree with you from personal POV, but I feel the need to play devil's advocate for the existing stance today here: >>! 
In T152091#2838225, @Gilles wrote: > Except I've looked a... [14:58:34] 07Puppet, 13Patch-For-Review: role::puppetmaster::standalone clones Git repositories as gitpuppet, git-sync-upstream overwrites them as root - https://phabricator.wikimedia.org/T152059#2838317 (10scfc) a:03scfc [14:59:12] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.004 second response time [14:59:34] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2838318 (10Gilles) The 302s are links to thumbnails that have moved because the original was moved. Mediawiki honors those redirects on misses, figuring out what the new thumbnail location is. [15:00:02] (03CR) 10Jcrespo: [C: 031] wmnet: Change es2 and es3 master [dns] - 10https://gerrit.wikimedia.org/r/324695 (https://phabricator.wikimedia.org/T151995) (owner: 10Marostegui) [15:03:41] (03CR) 10Marostegui: [C: 032] wmnet: Change es2 and es3 master [dns] - 10https://gerrit.wikimedia.org/r/324695 (https://phabricator.wikimedia.org/T151995) (owner: 10Marostegui) [15:03:55] (03PS4) 10Gehel: logstash - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) [15:05:29] (03CR) 10Gehel: [C: 032] logstash - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324704 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [15:06:15] !log upgrading logstash to Java 8, including rolling restart - T151325 [15:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:26] T151325: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325 [15:08:34] !log DNS change for es2 and es3 after the master switchovers - T151995 [15:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:41] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [15:09:08] 06Operations, 
10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2838342 (10Gilles) The examples you'e provided for legitimate use cases aren't compelling examples of us providing a free CDN being a necessity. The examples I've seen on blogspot could host the images there, and it would be... [15:09:47] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2838343 (10Aklapper) [15:10:49] (03PS1) 10Tim Landscheidt: labstore: Use explicit groups for file resources [puppet] - 10https://gerrit.wikimedia.org/r/324729 (https://phabricator.wikimedia.org/T152095) [15:12:12] 07Puppet, 13Patch-For-Review: On standalone puppetmasters labstore files in /usr/local/sbin get group 998 (gitpuppet) - https://phabricator.wikimedia.org/T152095#2838350 (10scfc) a:03scfc [15:13:03] (03PS2) 10Tim Landscheidt: labstore: Use explicit groups for file resources [puppet] - 10https://gerrit.wikimedia.org/r/324729 (https://phabricator.wikimedia.org/T152095) [15:13:05] (03CR) 10Ottomata: [C: 031] "+1, but only if we are sure that this won't mess with the default java 7 alternative." [puppet] - 10https://gerrit.wikimedia.org/r/324679 (https://phabricator.wikimedia.org/T151896) (owner: 10Elukey) [15:15:36] (03CR) 10Yuvipanda: [C: 032] Play with more newline things to satisfy pep8 master [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/324712 (owner: 10Yuvipanda) [15:15:40] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2838367 (10ArielGlenn) I've added Chad as a sometime maintainer of these lists on the MW side (please remove self/add other as appropriate). Initial attempt was here: htt... 
[15:16:36] (03PS2) 10Yuvipanda: toollabs: remove host aliases for tools-exec-12[01-11] [puppet] - 10https://gerrit.wikimedia.org/r/324623 (https://phabricator.wikimedia.org/T151980) (owner: 10BryanDavis) [15:16:59] (03CR) 10Yuvipanda: [C: 032 V: 032] "Death to aliases files!" [puppet] - 10https://gerrit.wikimedia.org/r/324623 (https://phabricator.wikimedia.org/T151980) (owner: 10BryanDavis) [15:17:27] (03CR) 10Yuvipanda: "Actually I should apply this when I'm a bit more awake, we've never removed these before and I'm not sure how the grid master takes it." [puppet] - 10https://gerrit.wikimedia.org/r/324623 (https://phabricator.wikimedia.org/T151980) (owner: 10BryanDavis) [15:17:58] (03CR) 10Ottomata: [C: 031] Remove Ganglia monitoring for Varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/324708 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [15:18:49] (03PS2) 10Yuvipanda: Add ruby images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/312033 (https://phabricator.wikimedia.org/T141388) (owner: 10BryanDavis) [15:18:51] ema: --^ let's nuke ganglia! :P [15:19:48] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2838378 (10jcrespo) I am starting to like option #2 more because we could do more than just this, more things like "do not alert if X is depooled" until etcd is introduced,... [15:23:13] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2837991 (10faidon) It seems that you're objecting to this feature on two different grounds: one is the legality of how it's being used by users (copyvios, mainly missing attribution when the content's license requests it) and... 
[15:25:03] (03CR) 10Yuvipanda: [C: 032 V: 032] Add ruby images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/312033 (https://phabricator.wikimedia.org/T141388) (owner: 10BryanDavis) [15:30:47] (03PS1) 10Yuvipanda: Add ruby webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/324730 [15:31:49] (03PS2) 10Elukey: Remove Ganglia monitoring for Varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/324708 (https://phabricator.wikimedia.org/T152093) [15:38:34] !log Stopping mysql and shutting down db2048 for maintenance - T149553 [15:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:44] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [15:40:35] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 3 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#2838435 (10GWicke) See also: - {T97204} - https://www.mediawiki.org/wiki/Rules_of_thumb_for_robust_service_infrastr... 
[15:44:41] 06Operations, 06Discovery, 10Kartotherian, 06Maps, and 2 others: Deploy libmapnik3.0 deb package to all maps servers - https://phabricator.wikimedia.org/T150722#2838441 (10Gehel) 05Open>03Resolved Package is deployed to all maps servers [15:51:09] Deploy alter table wikidatawiki.revision in dbstore2002 -T150644 [15:51:10] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [15:51:17] !log Deploy alter table wikidatawiki.revision in dbstore2002 -T150644 [15:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:55] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1018 for mysql upgrade and restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324681 (owner: 10Jcrespo) [15:54:58] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1018 for mysql upgrade and restart" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324681 [15:56:14] !log rebooting asw2-d-eqiad again [15:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:33] 07Puppet: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#2838473 (10scfc) [15:59:42] PROBLEM - Host asw2-d-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [16:00:05] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2838320 (10Volans) As discussed on IRC another option could be to run an exec on Puppet code that generates a `.cnf` file with only that list and have all the existing opti... 
[16:02:48] (03PS1) 10DCausse: [cirrus] enable BM25 on all but wikis with spaecless languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324738 (https://phabricator.wikimedia.org/T152092) [16:03:59] (03CR) 10Alexandros Kosiaris: [C: 031] docker: add package provider [puppet] - 10https://gerrit.wikimedia.org/r/323815 (owner: 10Giuseppe Lavagetto) [16:05:02] RECOVERY - Host asw2-d-eqiad.mgmt.eqiad.wmnet is UP: PING WARNING - Packet loss = 58%, RTA = 2.19 ms [16:05:52] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:05:52] RECOVERY - Host mc1035 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:05:52] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [16:06:02] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:07:52] PROBLEM - Juniper alarms on asw2-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms [16:07:57] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2838502 (10jcrespo) +1 to that, yes. [16:08:02] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:08:42] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:08:42] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[tcpdump],Package[gdb],Package[lldpd],Package[tshark] [16:08:42] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:09:37] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2599610 (10Nuria) Shouldn't this awesome explanation be on wikitech for future reference? cc @Eevans [16:10:01] (03PS1) 10Jcrespo: Depool es1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324740 (https://phabricator.wikimedia.org/T151995) [16:10:50] (03CR) 10Jcrespo: [C: 032] Depool es1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324740 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [16:11:44] ACKNOWLEDGEMENT - HP RAID on db2041 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor - Failed: 1I:1:10 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152105 [16:11:46] 06Operations, 10ops-codfw: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838512 (10ops-monitoring-bot) [16:12:04] (03PS1) 10Paladox: udp2log: Replace undefined variable with $ensure_monitor_processes [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) [16:12:25] (03PS2) 10Paladox: udp2log: Replace undefined variable with $ensure_monitor_processes [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) [16:12:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1018; depool es1015 (duration: 01m 00s) [16:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:04] (03CR) 10Tim Landscheidt: [C: 04-1] "The patch fixes the puppet-lint errors, so the comments for those should be removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) (owner: 10Paladox) [16:19:02] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:19:13] !log mysql restart for es1015 T151995 [16:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:25] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [16:19:42] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:20:27] funny, I don't see the RAID alarm for db2041 here in the channel but is actually real and the task was opened correctly [16:20:48] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838536 (10Volans) [16:21:00] wasn't that old? [16:21:06] matbe it is in rebuild state? [16:21:10] 8 minutes according to icinga [16:21:23] that sound like a duplicate [16:21:28] https://phabricator.wikimedia.org/T151203 [16:21:33] let me check phab then [16:21:33] that was in predictive failure [16:21:42] was a warning [16:21:59] yeah, it is a duplicate [16:22:00] (03PS3) 10Paladox: udp2log: Replace undefined variable with $ensure_monitor_processes [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) [16:22:15] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838538 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete. 
[16:22:19] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838540 (10Marostegui) The disk failed in the end: https://phabricator.wikimedia.org/T152105 [16:22:32] in fact yes, it is a rebuilding [16:22:35] ^ [16:22:46] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838512 (10Volans) [16:22:49] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838545 (10Volans) [16:22:58] I've merged the two tasks [16:23:01] no no [16:23:06] mark the other one as duplicate [16:23:23] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838546 (10Marostegui) The disk is now rebuilding: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)... [16:23:32] the old one has proper info [16:23:58] right, phab merge into not merge to... I always mix them up, sorry [16:24:11] I have updated both, just in case :) [16:24:11] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838547 (10Marostegui) The disk is now rebuilding: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 3% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive... 
[16:24:14] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2838548 (10Volans) 05duplicate>03Open [16:24:44] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2810386 (10Volans) [16:24:45] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2041 - https://phabricator.wikimedia.org/T152105#2838553 (10Volans) [16:24:53] thank you, volans [16:25:11] hoping that phab now doesn't go crazy with a loop :D [16:25:20] 06Operations, 10ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2838555 (10Papaul) p:05High>03Triage [16:25:25] well, we may tune the workflow with time [16:25:28] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/4744/" [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [16:25:32] but that is already useful [16:25:54] (03PS24) 10Dzahn: Phabricator: Allow us to change the default web domain [puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [16:26:21] so it was papaul, that was changing the disk [16:26:29] taking it out make it fail of course [16:27:04] if it was put in scheduled downtime would not have happened [16:29:46] !log labsdb: maintain-views --databases fiwikivoyage --debug [16:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:26] (03PS6) 10Dzahn: Phabricator: Allow us to the phab server name in hiera [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [16:31:42] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:32:54] (03CR) 10Paladox: "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/324408 (owner: 10Paladox) [16:33:23] (03PS7) 10Dzahn: Phabricator: Allow overriding phab server name in hiera [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [16:33:29] (03PS1) 10Ottomata: Manually specify Kafka api_version in kafka_clusters config [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) [16:34:05] (03CR) 10Paladox: "@Dzahn this may not be needed actually as the setting here actually gets its value from https://gerrit.wikimedia.org/r/#/c/324408/24/modul" [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [16:34:14] (03PS1) 10Jcrespo: Revert "Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324746 [16:34:17] (03CR) 10Paladox: "So we can see weather https://gerrit.wikimedia.org/r/#/c/324408/24/modules/role/manifests/phabricator/main.pp does it for us." [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [16:34:22] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [16:35:10] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/4745/" [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [16:35:21] (03PS8) 10Dzahn: Phabricator: Allow overriding phab server name in hiera [puppet] - 10https://gerrit.wikimedia.org/r/324551 (owner: 10Paladox) [16:35:39] (03CR) 10Jcrespo: [C: 04-2] "Wait for buffer pool warmup https://grafana.wikimedia.org/dashboard/db/mysql?panelId=1&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-serv" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324746 (owner: 10Jcrespo) [16:35:42] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:35:47] 06Operations, 10Traffic: more robust certificate chain creation in puppet - https://phabricator.wikimedia.org/T84543#2838608 (10Dzahn) [16:37:14] 06Operations, 10Traffic: more robust certificate chain creation in puppet - https://phabricator.wikimedia.org/T84543#928592 (10Dzahn) I think this old ticket imported from RT times can be resolved. But i would say the authority on this should be @BBlack [16:40:10] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2838614 (10TheDJ) I just realized something... As emails will now appear to be sent by Wikim... [16:42:47] Hey folks. Did bast1001 change its host key recently? [16:43:37] Probably not: https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/bast1001.wikimedia.org&action=history [16:43:48] Thanks. Was looking for that page. [16:44:07] * halfak checks the fingerprint he's getting again [16:47:51] It looks like I'm getting a weird IP address that doesn't make sense. Anyone interested in looking at my verbose output with me? 
[16:47:51] http://pastebin.ca/3743483 [16:48:43] (03PS1) 10GWicke: Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 [16:49:31] Here's a tracepath. http://pastebin.ca/3743484 [16:49:32] (03PS1) 10Jcrespo: mariadb: depool db1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324748 [16:49:37] (03CR) 10jenkins-bot: [V: 04-1] Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [16:50:06] halfak: line 56 seems to be the correct fingerprint to me [16:50:37] volans, it looks like the IP address is weird/wrong though [16:50:42] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:01] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2838658 (10Trizek-WMF) >>! In T66795#2838614, @TheDJ wrote: > I just realized something... A... [16:51:05] (03CR) 10Jcrespo: [C: 032] mariadb: depool db1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324748 (owner: 10Jcrespo) [16:51:05] When I ping bast1001.wikimedia.org, I get 208.80.154.149 for an IP [16:51:09] you're connecting over IPv6 as far as I can tell [16:51:12] halfak: the fingerprint looks correct [16:51:36] Could it be that SSHing from the university will send me through IPv6 and that is what caused the mismatch? [16:51:41] yes [16:51:43] I don't often connect through the university [16:51:47] it will try IPv6 first if it can [16:51:51] OK cool. Then I'll just update the record. 
[16:51:56] Thanks for looking at it with me [16:51:59] and the offending key is for the IPv6 IP [16:52:04] yes, it could be that line 80 in your known_hosts is the old one [16:52:14] (03CR) 10Alexandros Kosiaris: kubelet: Amend to support more than labs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324210 (owner: 10Alexandros Kosiaris) [16:52:15] if you didn't delete both of them when it was reimaged back in april [16:52:29] maybe check whether it was the old one in the wiki page [16:52:34] if you want to be 100% sure [16:53:07] halfak: ^^^ [16:53:11] Looks like my known_hosts is encrypted. Is that a thing now? [16:54:11] (03PS2) 10DCausse: Add a wiki configuration tag for configured language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319253 (https://phabricator.wikimedia.org/T149755) (owner: 10EBernhardson) [16:54:13] (03PS2) 10DCausse: [cirrus] enable BM25 on all but wikis with spaceless languages [step 1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324738 (https://phabricator.wikimedia.org/T152092) [16:54:15] (03PS1) 10DCausse: [cirrus] enable BM25 on all but wikis with spaceless languages [step 2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324752 (https://phabricator.wikimedia.org/T152092) [16:54:15] latest versions hash by default [16:54:17] (03PS1) 10DCausse: [cirrus] enable BM25 on all but wikis with spaceless languages [step 3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324753 (https://phabricator.wikimedia.org/T152092) [16:54:18] Looks like they are one-way hashed [16:54:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1019 (duration: 00m 45s) [16:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:35] I disable that because autocompletion > privacy for me [16:55:10] * halfak looks to disable that [16:55:38] (03PS2) 10Alexandros Kosiaris: kubelet: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/324210 [16:55:39] halfak: try 
ssh-keygen -H -F $hostname to see it [16:55:40] (03PS2) 10Alexandros Kosiaris: Kube-proxy: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/324211 [16:55:40] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2838671 (10Krenair) The point is that they'll be disclosing to an address other than the sen... [16:55:42] (03PS2) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [16:55:44] (03PS2) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [16:55:56] and the config is in ssh_config, check HashKnownHosts [16:56:30] Interesting. So I have the right key for the host, but a key I totally don't recognize for that IPv6 [16:56:37] O.o [16:56:48] is it not here? https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fbast1001.wikimedia.org&type=revision&diff=435718&oldid=435295 [16:57:12] or in an even earlier version [16:57:15] The key is the second field, right? 
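Editor's note on the hashed known_hosts entries discussed above: with HashKnownHosts enabled, OpenSSH stores the host field as `|1|<base64 salt>|<base64 HMAC-SHA1 of the name>`, so a name cannot be read back from the file but a candidate name or IP can still be checked against an entry, which is what `ssh-keygen -F` does. The following is only a minimal sketch of that check, not OpenSSH's own code; the function names are hypothetical, and the IPv6 address in the demo is a documentation placeholder, not bast1001's real address.

```python
import base64
import hashlib
import hmac


def hash_known_host(host: str, salt: bytes) -> str:
    """Build a hashed host field in the |1|salt|hash layout (hypothetical helper)."""
    digest = hmac.new(salt, host.encode("utf-8"), hashlib.sha1).digest()
    return "|1|{}|{}".format(
        base64.b64encode(salt).decode("ascii"),
        base64.b64encode(digest).decode("ascii"),
    )


def entry_matches(hashed_field: str, candidate: str) -> bool:
    """Return True if the candidate name or IP hashes to the stored field."""
    parts = hashed_field.split("|")
    # A hashed field splits into ['', '1', salt, hash]; anything else is
    # a plain-text entry or an unknown scheme, which this sketch ignores.
    if len(parts) != 4 or parts[0] != "" or parts[1] != "1":
        return False
    salt = base64.b64decode(parts[2])
    expected = base64.b64decode(parts[3])
    actual = hmac.new(salt, candidate.encode("utf-8"), hashlib.sha1).digest()
    return hmac.compare_digest(actual, expected)


if __name__ == "__main__":
    # Fixed salt for a repeatable demo; a real client would use a random one.
    demo_salt = bytes(range(20))
    field = hash_known_host("bast1001.wikimedia.org", demo_salt)
    print(field)
    print(entry_matches(field, "bast1001.wikimedia.org"))
    # Placeholder IPv6 address: the name and each IP are hashed separately,
    # so an entry stored for the hostname does not match the address.
    print(entry_matches(field, "2001:db8::1"))
```

Because the hostname and each address get separate entries, a stale line for the IPv6 address can coexist with a correct line for the hostname, which would be consistent with the mismatch described here.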
[16:58:21] (03CR) 10Elukey: [C: 032] Remove Ganglia monitoring for Varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/324708 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [16:58:26] (03PS3) 10Elukey: Remove Ganglia monitoring for Varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/324708 (https://phabricator.wikimedia.org/T152093) [16:58:31] !log mysql restart for es1019 T151995 [16:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:42] T151995: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995 [16:59:25] (03CR) 10Mobrovac: [C: 04-1] Add fontconfig file for the pdf render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1700). [17:00:36] halfak: ssh-keygen -l -f ~/.ssh/known_hosts -F bast1001.wikimedia.org [17:01:08] (03PS2) 10GWicke: Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 [17:01:11] (03CR) 10GWicke: Add fontconfig file for the pdf render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [17:01:25] volans, yeah so that's right, but the IP has its own line with a different key [17:01:36] (03PS1) 10Jcrespo: Repool es2019 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324754 (https://phabricator.wikimedia.org/T151995) [17:01:41] !log otto@tin Starting deploy [eventlogging/analytics@948765d]: (no message) [17:01:44] !log otto@tin Finished deploy [eventlogging/analytics@948765d]: (no message) (duration: 00m 03s) [17:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:02] if you use the IP in the -F halfak ? 
[17:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:10] it should show you the related fingerprint [17:02:32] Right. And yeah. I don't recognize it from any historical revision of that page. [17:02:50] Also can't find it with search on wikitech [17:02:50] (03CR) 10Jcrespo: [C: 04-2] "Wait for buffer pool warmup https://grafana.wikimedia.org/dashboard/db/mysql?panelId=1&fullscreen&var-dc=eqiad%20prometheus%2Fops&var-serv" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324754 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [17:02:51] hmm [17:02:55] (03PS2) 10Ottomata: Manually specify Kafka api_version in kafka_clusters config [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) [17:02:58] (03CR) 10Mobrovac: [C: 031] Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [17:03:22] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:04:02] (03PS3) 10Ottomata: Manually specify Kafka api_version in kafka_clusters config [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) [17:04:17] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2838705 (10faidon) So first of all, JTAC said there is no ETA for this fix getting into 14.1 and we should really go with 15.1. So, I tried upgrading to 15.1R... 
[17:05:09] !log cr1-eqiad: re-enabling ae4 and its members (links to asw2-d-eqiad) [17:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:27] (03PS1) 10Faidon Liambotis: puppetmaster: remove hiera for the labtest realm [puppet] - 10https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) [17:10:10] 06Operations, 10DBA, 13Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2838717 (10jcrespo) Waiting for es2019 and es2015 to warmup their buffer pools to repool them and I could close this. [17:10:49] 06Operations, 10DBA, 13Patch-For-Review: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029#2838718 (10jcrespo) a:03jcrespo [17:11:12] 06Operations, 10ops-codfw, 10DBA: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2803218 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. You... [17:11:30] 06Operations, 10ops-codfw, 10DBA: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2838723 (10Papaul) a:03Papaul [17:13:02] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2838729 (10akosiaris) Unfortunately 15.1R4.6 has not solved the problem. Just managed to reproduce it with the exact same procedure and results. That is enabli... 
[17:15:04] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2838737 (10faidon) I responded to Juniper with the results of the above test, it's back with them now… [17:15:11] !log mysql restart and general upgrade for pc2004 T152029 [17:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:24] T152029: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029 [17:18:42] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:22:21] !log mobrovac@tin Starting deploy [electron-render/deploy@d6f7044]: (no message) [17:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:42] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.080 second response time [17:22:52] ostriches: this new scap feature is so cool ^^^ :) [17:23:01] new but old [17:23:02] :) [17:23:23] !log mobrovac@tin Finished deploy [electron-render/deploy@d6f7044]: (no message) (duration: 01m 02s) [17:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:44] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/4749/ looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [17:26:03] mobrovac: Messages too! :) [17:26:11] :) [17:26:34] i have to get used to writing messages :P [17:27:30] 06Operations, 10Traffic, 13Patch-For-Review: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093#2838770 (10elukey) Next step is to check if we can use `logster` for `statsv` metrics (and then probably ask to the Performance team). Going to work on it tomorrow! 
[17:28:23] ostriches: for something like https://gerrit.wikimedia.org/r/#/c/322086/ that only touches beta am I free to just merge it whenever? (ie not in a swat / deployment slot)? [17:28:40] jouncebot: next [17:28:40] In 0 hour(s) and 31 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1800) [17:28:44] addshore: go for it [17:28:56] (03CR) 10Addshore: [C: 032] Enable ElectronPdfService extension on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) (owner: 10Addshore) [17:29:02] (03PS4) 10Addshore: Enable ElectronPdfService extension on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) [17:29:24] (03CR) 10Addshore: Enable ElectronPdfService extension on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) (owner: 10Addshore) [17:29:27] (03CR) 10Addshore: [C: 032] Enable ElectronPdfService extension on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) (owner: 10Addshore) [17:29:59] (03Merged) 10jenkins-bot: Enable ElectronPdfService extension on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945) (owner: 10Addshore) [17:33:52] (03PS6) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [17:34:21] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: remove hiera for the labtest realm [puppet] - 10https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [17:35:02] (03CR) 10Dzahn: phab: add cron to clean up old tmp files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324601 
(https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [17:36:44] (03CR) 10Dzahn: ""$ensure_monitor_processes is not defined in Puppet or hieradata/; it seems to have been introduced by @Ottomata in 90fc2e11b3618545523d02" [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) (owner: 10Paladox) [17:43:17] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/4750/" [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) (owner: 10Paladox) [17:46:14] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2838814 (10Papaul) a:03Papaul [17:47:52] 06Operations, 10ops-codfw, 06DC-Ops: ms-be2025 controller failure - https://phabricator.wikimedia.org/T151201#2838820 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Yo... [17:48:16] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetmaster: remove hiera for the labtest realm [puppet] - 10https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [17:52:46] (03PS4) 10Dzahn: udp2log: Replace undefined variable with $ensure_monitor_processes [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) (owner: 10Paladox) [17:55:37] 06Operations, 10DBA: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2838847 (10fgiunchedi) 05Open>03Invalid Thanks @Marostegui, tentatively resolving [17:56:16] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2068 - https://phabricator.wikimedia.org/T151763#2838851 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... 
[17:58:55] 06Operations, 10DBA: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#2838853 (10jcrespo) 05Invalid>03Resolved In fact Mariadb can start automatically for non production hosts (right now beta, dns-labs, and analytics-labs), so this is... [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1800). [18:01:22] no parsoid deploy today [18:02:25] !log mysql restart and general upgrade for pc2005 T152029 [18:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:39] T152029: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029 [18:02:42] RECOVERY - Disk space on stat1002 is OK: DISK OK [18:04:52] (03CR) 10Filippo Giunchedi: [C: 031] "yeah it should DTRT, at least on jessie java-7 is preferred over 8 when in "auto mode"" [puppet] - 10https://gerrit.wikimedia.org/r/324679 (https://phabricator.wikimedia.org/T151896) (owner: 10Elukey) [18:05:02] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:05:48] (03PS3) 10Filippo Giunchedi: Nginx timeout should be higher than thumbor subprocess timeout [puppet] - 10https://gerrit.wikimedia.org/r/323403 (https://phabricator.wikimedia.org/T151459) (owner: 10Gilles) [18:06:02] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[18:11:03] (03CR) 10Filippo Giunchedi: [C: 032] Nginx timeout should be higher than thumbor subprocess timeout [puppet] - 10https://gerrit.wikimedia.org/r/323403 (https://phabricator.wikimedia.org/T151459) (owner: 10Gilles) [18:15:13] !log mysql restart and general upgrade for pc2006 T152029 [18:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:24] T152029: Rolling restart of parsercache servers for TLS certificate update - https://phabricator.wikimedia.org/T152029 [18:16:52] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:16:53] (03CR) 10Jcrespo: [C: 032] Revert "Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324746 (owner: 10Jcrespo) [18:16:56] (03PS2) 10Jcrespo: Revert "Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324746 [18:17:58] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/324741 (https://phabricator.wikimedia.org/T152104) (owner: 10Paladox) [18:18:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 647 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4780858 keys, up 31 days 9 hours - replication_delay is 647 [18:19:19] addshore, it is ok to rebase and not deploy "Enable ElectronPdfService extension on beta sites" on production? 
[18:19:37] I would assume yes, but have to ask [18:20:39] (03CR) 10Filippo Giunchedi: [C: 04-1] Add fontconfig file for the pdf render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [18:20:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1015 (duration: 00m 45s) [18:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:13] (03PS1) 10Gehel: tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) [18:23:07] (03CR) 10jenkins-bot: [V: 04-1] tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [18:25:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4760616 keys, up 31 days 10 hours - replication_delay is 7 [18:25:20] (03PS2) 10Jcrespo: Repool es2019 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324754 (https://phabricator.wikimedia.org/T151995) [18:26:11] 06Operations, 10DBA: mariadb puppet module doesn't start mysql service in labs (possibly anywhere) - https://phabricator.wikimedia.org/T91797#1096483 (10Krenair) >>! In T91797#2838853, @jcrespo wrote: > In fact Mariadb can start automatically for non production hosts (right now beta, dns-labs, and analytics-la... [18:27:05] (03PS1) 10Gehel: wdqs - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324763 [18:29:37] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2838950 (10Elitre) Notified en:WP/VPT as that's the place where such issues were reported in... 
[18:30:07] 06Operations, 10media-storage, 05Goal: expand swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2838955 (10fgiunchedi) [18:30:10] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2838952 (10fgiunchedi) 05Open>03Resolved This is complete! [18:30:59] 06Operations, 10media-storage, 05Goal: expand swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2122471 (10fgiunchedi) 05Open>03Resolved Complete, new hw in place [18:32:20] (03PS1) 10Catrope: Re-enable the Flow beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324764 (https://phabricator.wikimedia.org/T138310) [18:33:08] (03CR) 10Catrope: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324764 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [18:33:21] (03PS11) 10Dzahn: base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) [18:34:12] (03CR) 10jenkins-bot: [V: 04-1] base/ipmi: install freeipmi globally, move to ipmi module [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [18:35:58] (03PS1) 10Alex Monk: Revert "Revert "RESTBase configuration for fi.wikivoyage.org"" [puppet] - 10https://gerrit.wikimedia.org/r/324766 [18:36:37] (03PS2) 10Alex Monk: Revert "Revert "RESTBase configuration for fi.wikivoyage.org"" [puppet] - 10https://gerrit.wikimedia.org/r/324766 (https://phabricator.wikimedia.org/T151570) [18:37:59] (03CR) 10Filippo Giunchedi: "> Is there an example elsewhere in puppet that I can follow for" [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [18:41:13] (03CR) 10Dzahn: "@Volans thanks for compiling and confirming it looks good! 
i just added a "if $::is_virtual == 'false'" around the include so that VMs do" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [18:44:52] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:45:34] (03Abandoned) 10BBlack: rcstream: internal service IP/hostname [dns] - 10https://gerrit.wikimedia.org/r/315536 (https://phabricator.wikimedia.org/T147845) (owner: 10BBlack) [18:45:53] (03Abandoned) 10BBlack: rcstream: internal LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315537 (https://phabricator.wikimedia.org/T147845) (owner: 10BBlack) [18:46:09] (03Abandoned) 10BBlack: cache_misc: use stream LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315538 (https://phabricator.wikimedia.org/T147845) (owner: 10BBlack) [18:47:38] (03PS2) 10BBlack: rcstream: single-backend with manual failover [puppet] - 10https://gerrit.wikimedia.org/r/317132 (https://phabricator.wikimedia.org/T147845) [18:48:12] (03CR) 10Jcrespo: [C: 032] Repool es2019 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324754 (https://phabricator.wikimedia.org/T151995) (owner: 10Jcrespo) [18:48:22] (03CR) 10Smalyshev: [C: 031] wdqs - upgrade to Java 8 [puppet] - 10https://gerrit.wikimedia.org/r/324763 (owner: 10Gehel) [18:48:41] (03CR) 10Filippo Giunchedi: [C: 031] "Not very pretty but it'll do as bandaid!" [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [18:49:36] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1019 (duration: 00m 45s) [18:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:45] 06Operations: Implement a configuration discovery system - https://phabricator.wikimedia.org/T95662#1197052 (10fgiunchedi) Looks like this might be done, anything else to do? 
[18:53:25] 06Operations, 07Documentation: update ServerLifecycle page - https://phabricator.wikimedia.org/T87782#2839171 (10fgiunchedi) a:03RobH [18:56:01] (03PS1) 10Jcrespo: mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) [18:56:39] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [18:56:49] (03PS2) 10Jcrespo: mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) [18:57:24] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [18:58:18] (03PS3) 10Jcrespo: mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) [18:58:52] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2839200 (10fgiunchedi) p:05Triage>03Normal [18:59:39] 06Operations, 10ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2839201 (10fgiunchedi) p:05Triage>03Normal [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T1900). [19:01:22] Nothing to deploy listed. 
[19:02:41] (03PS4) 10Jcrespo: mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) [19:04:29] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839211 (10jcrespo) [19:04:32] 06Operations, 10DBA, 13Patch-For-Review: Rolling restart of external storage servers for TLS certificate update - https://phabricator.wikimedia.org/T151995#2839210 (10jcrespo) 05Open>03Resolved [19:05:06] papaul: re T151427 did the host have two disks before or only one? I see one now [19:05:07] T151427: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427 [19:05:25] (03CR) 10Jcrespo: [C: 032] mariadb: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324770 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:08:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1004 (duration: 00m 45s) [19:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:45] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2837991 (10valhallasw) >>! In T152091#2838342, @Gilles wrote: > The examples you'e provided for legitimate use cases aren't compelling examples of us providing a free CDN being a necessity. The examples I've seen on blogspot... 
[19:10:46] (03CR) 10Andrew Bogott: [C: 04-2] "When Alex refactored all of the labtest realm-specific hiera stuff into host-specific things instead, he moved a bunch of labtest settings" [puppet] - 10https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [19:14:13] (03PS1) 10Jcrespo: Revert "mariadb: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324771 [19:15:44] 06Operations, 10Monitoring: improve redis master/slave monitoring - https://phabricator.wikimedia.org/T101584#2839267 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving, we can reopen if the current checks need improvement [19:15:50] (03PS1) 10Jcrespo: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324772 (https://phabricator.wikimedia.org/T152029) [19:16:05] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool pc1004" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324771 (owner: 10Jcrespo) [19:17:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1004 (duration: 00m 44s) [19:17:37] bblack: ema: Hi! Quick question... Did we experience any site issues today around 8:07 UTC? We have a suspicious dip in FR banner impressions starting at that time.... [19:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:45] (03CR) 10Rush: "@andrew that's my understanding as well. The line:" [puppet] - 10https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [19:20:57] (03Abandoned) 10Jcrespo: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324772 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:22:06] mutante we can setup a domain for phab2001 so that we can test that it works and then we know switching from phabricator.wikimedia.org at a moments notice will work [19:22:07] gwicke, mobrovac: so, how do I get https://gerrit.wikimedia.org/r/#/c/324766/ applied? 
[19:23:15] (03PS1) 10Jcrespo: Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324776 (https://phabricator.wikimedia.org/T152029) [19:24:30] (03PS1) 10Andrew Bogott: Clarify a comment re: labtest private hiera lookups. [puppet] - 10https://gerrit.wikimedia.org/r/324778 [19:24:34] paladox: yea, agreed, that's true [19:24:56] (03CR) 10Andrew Bogott: "Proposed trivial alternative:" [puppet] - 10https://gerrit.wikimedia.org/r/324755 (https://phabricator.wikimedia.org/T148717) (owner: 10Faidon Liambotis) [19:24:58] (03CR) 10Jcrespo: [C: 032] Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324776 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:25:09] Yep :) [19:25:33] mutante twentyafterfour and some of mine changes in gerrit actually now make this possible [19:25:42] to allow different domain [19:25:58] paladox: the thing that i like right now is how we have a working puppet role, on a jessie host in labs [19:26:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1005 (duration: 00m 45s) [19:26:18] Yep :) [19:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:26] let's close that ticket now, then phab2001 is next [19:26:30] Ok [19:27:28] mutante done [19:28:25] http://phabricator-01.wmflabs.org/ works for me now [19:28:46] :) [19:30:50] (03PS5) 10Rush: graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [19:31:58] (03PS1) 10Jcrespo: Revert "Depool pc1005" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324780 [19:33:17] (03CR) 10Jcrespo: [C: 032] Revert "Depool pc1005" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324780 (owner: 10Jcrespo) [19:33:45] (03CR) 10Rush: [C: 031] "small ask but cool" [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [19:34:15] paladox: 
thanks for your work on this [19:34:27] You're welcome :) [19:34:58] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1005 (duration: 00m 46s) [19:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:01] 06Operations, 10Labs-project-Phabricator, 10Phabricator: have a phabricator test instance in labs that uses a working puppet role - https://phabricator.wikimedia.org/T139475#2839337 (10Dzahn) [19:37:45] paladox: i just saw there is another ticket "Phabricator installation guide on labs", i think you can close that too if you add your paste there [19:37:52] Ok [19:38:09] no wait, i'm stupid, that is the same paste [19:38:33] Oh [19:38:34] LOL [19:38:39] saw status: "Active" [19:39:00] Yep :) [19:40:44] (03PS1) 10Jcrespo: Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324781 (https://phabricator.wikimedia.org/T152029) [19:40:51] (03PS7) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [19:41:29] mutante: I will take a look at the ipmi stuff later or tomorrow, but jenkins is not happy there ;) [19:41:37] thanks for the fix [19:42:06] volans: ok, i will compile too later [19:42:40] (03CR) 10Jcrespo: [C: 032] Depool pc1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324781 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:42:47] (03CR) 10Ottomata: [C: 032] Manually specify Kafka api_version in kafka_clusters config [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [19:42:51] (03PS4) 10Ottomata: Manually specify Kafka api_version in kafka_clusters config [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) [19:42:52] well, maybe it is because the compiler itself "is_virtual" [19:42:55] (03CR) 10Jcrespo: [C: 04-1] Depool pc1005 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/324781 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:43:01] (03CR) 10Ottomata: [V: 032] Manually specify Kafka api_version in kafka_clusters config [puppet] - 10https://gerrit.wikimedia.org/r/324745 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [19:43:30] (03PS2) 10Jcrespo: Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324781 (https://phabricator.wikimedia.org/T152029) [19:43:50] (03CR) 10Jcrespo: [C: 032] Depool pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324781 (https://phabricator.wikimedia.org/T152029) (owner: 10Jcrespo) [19:44:25] (03PS1) 10Jcrespo: Revert "Depool pc1006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324783 [19:44:49] (03PS8) 10Dzahn: phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) [19:45:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool pc1006 (duration: 00m 44s) [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:26] (03CR) 10Dzahn: [C: 032] phab: add cron to clean up old tmp files [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [19:47:27] 06Operations, 10Deployment-Systems, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2839375 (10fgiunchedi) p:05High>03Normal ATM there's 5 mediawiki versions on `/var/lib/l10nupdate/caches` so I suspect something/someone is cleaning up, not sure what though [19:47:41] godog: 1 disk [19:47:55] papaul: ok thanks! 
[19:48:24] (03CR) 10Jcrespo: [C: 032] Revert "Depool pc1006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324783 (owner: 10Jcrespo) [19:48:38] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2092.codfw.wmnet [19:48:38] godog: yw [19:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:04] 06Operations, 10ops-codfw: mw2092 - disk issue - https://phabricator.wikimedia.org/T151427#2839392 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Back in service, resolving [19:51:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1006 (duration: 00m 46s) [19:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:54] (03CR) 10Dzahn: "cron has been created on iridium and _also on the labs instance called "phabricator" which uses the same role. (this part is new)" [puppet] - 10https://gerrit.wikimedia.org/r/324601 (https://phabricator.wikimedia.org/T150396) (owner: 10Dzahn) [19:53:26] 06Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2839398 (10fgiunchedi) I don't think we have encryption for kafka yet ? I'd love to be wrong though cc @Ottomata @elukey Once we have kafka encryption we could use that fo... 
[19:55:02] 06Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2839408 (10Ottomata) Timeline for this just discussed and noted in T152015 [19:55:49] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603#2839411 (10fgiunchedi) p:05High>03Normal Back to normal since mw is now to jessie with exception of videoscaler [19:56:52] 06Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2839420 (10Ottomata) Uh, apparently that dashboard doesn't work anymore...(been a long time since I looked at it). Sorry! But it has been done before! :) [19:58:05] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2839428 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving this since other crash-specific tasks have been opened for hhvm [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161201T2000). Please do the needful. 
[20:00:37] (03PS1) 10Chad: group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324785 [20:01:05] 06Operations, 10Phabricator, 06Release-Engineering-Team: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#2839436 (10Dzahn) [20:01:44] 06Operations, 10Phabricator, 06Release-Engineering-Team: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#2839452 (10Dzahn) p:05Triage>03Normal [20:02:27] (03CR) 10Chad: [C: 032] group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324785 (owner: 10Chad) [20:03:11] (03Merged) 10jenkins-bot: group2 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324785 (owner: 10Chad) [20:03:45] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.4 [20:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:00] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839471 (10jcrespo) Out of 157 active hosts responding to salt, 15 host with no TLS deployed, 42 with the old certificate, 100 with the puppet one: ``` $ sudo salt -C 'G@cluster:mysql' cmd... 
[20:06:12] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:13] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:13] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:22] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:22] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:22] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:22] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:22] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: 
/{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:32] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:06:52] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get [20:10:39] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to beta cluster - https://phabricator.wikimedia.org/T150945#2802120 (10Addshore) 05stalled>03Resolved [20:10:45] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2839495 (10Addshore) [20:10:58] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2839497 (10Addshore) [20:11:01] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2802107 (10Addshore) 05stalled>03Open [20:11:04] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2839499 (10Addshore) [20:11:07] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - 
https://phabricator.wikimedia.org/T150943#2802094 (10Addshore) 05stalled>03Open [20:11:09] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2802081 (10Addshore) 05stalled>03Open [20:11:11] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, 07User-notice: Deploy ElectronPdfService Extension to production - https://phabricator.wikimedia.org/T150185#2839501 (10Addshore) [20:12:15] (03PS1) 10Dzahn: phabricator, RT: repeat hostname for each record [dns] - 10https://gerrit.wikimedia.org/r/324788 [20:12:46] 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#2839504 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I'm not seeing the related test failing frequently, h... [20:13:04] mhh mobileapps heh [20:14:27] paravoid, hi, i just stumbled upon an issue with the mapnik - it seems the "geojson" plugin was not compiled in... could it be that it is missing in the native mapnik build? [20:14:34] cc: mobrovac Pchelolo ^ [20:14:48] this would affect our node6 deployment for kartotherian [20:15:02] paravoid: "Could not create datasource for type: 'geojson' (searched for datasource plugins in '/usr/lib/mapnik/3.0/input')" [20:15:26] godog: i told -mobile [20:15:32] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2839512 (10jcrespo) List of eqiad hosts with the old cert: ``` db1015.eqiad.wmnet db1021.eqiad.wmnet db1022.eqiad.wmnet db1036.eqiad.wmnet db1054.eqiad.wmnet db1060.eqiad.wmnet db1063.eqi... [20:16:07] (03PS1) 10Ottomata: Set api_version for eventlogging kafka consumers as well as producers [puppet] - 10https://gerrit.wikimedia.org/r/324790 (https://phabricator.wikimedia.org/T142430) [20:16:41] mutante: thanks! 
[20:16:45] mutante: thanks [20:16:47] (03CR) 10Milimetric: [C: 031] "Ottomata: this is ready to merge now." [puppet] - 10https://gerrit.wikimedia.org/r/322969 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:17:04] (03CR) 10jenkins-bot: [V: 04-1] Set api_version for eventlogging kafka consumers as well as producers [puppet] - 10https://gerrit.wikimedia.org/r/324790 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [20:17:06] mdholloway and I are looking at it [20:18:09] (03PS2) 10Ottomata: Set api_version for eventlogging kafka consumers as well as producers [puppet] - 10https://gerrit.wikimedia.org/r/324790 (https://phabricator.wikimedia.org/T142430) [20:18:17] 06Operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#2839517 (10fgiunchedi) p:05High>03Normal Lowering to 'normal', looks like we aren't under high pressure now for ES compression [20:19:42] (03PS2) 10Dzahn: phabricator, RT: repeat hostname for each record [dns] - 10https://gerrit.wikimedia.org/r/324788 [20:20:28] 06Operations, 10Phabricator, 13Patch-For-Review: Phabricator leaving old files in /tmp - https://phabricator.wikimedia.org/T150396#2839533 (10fgiunchedi) [20:21:16] 06Operations, 10Phabricator, 13Patch-For-Review: Phabricator leaving old files in /tmp - https://phabricator.wikimedia.org/T150396#2784519 (10fgiunchedi) Renaming since the issue has been bandaided, what's left to do is investigate and fix the root cause of why phabricator/apache leaves files behind [20:21:42] 06Operations, 10Labs-project-Phabricator, 06Release-Engineering-Team: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2839539 (10Paladox) [20:22:09] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 10 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2839551 (10Yurik) [20:22:40] (03PS6) 10Filippo Giunchedi: graphite: cleanup labs instances metrics [puppet] - 
10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) [20:22:47] (03CR) 10Filippo Giunchedi: "> small ask but cool" [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [20:22:57] 06Operations, 10Phabricator: Phabricator leaving old files in /tmp - https://phabricator.wikimedia.org/T150396#2839554 (10Dzahn) [20:23:00] (03CR) 10Ottomata: [C: 032] Set api_version for eventlogging kafka consumers as well as producers [puppet] - 10https://gerrit.wikimedia.org/r/324790 (https://phabricator.wikimedia.org/T142430) (owner: 10Ottomata) [20:24:04] 06Operations, 10Labs-project-Phabricator, 06Release-Engineering-Team: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2839556 (10Dzahn) agreed. we did it in a similar way for gerrit with "gerrit-new". but gerrit wasn't behind varnish, unlike phab. [20:24:18] 06Operations, 10Labs-project-Phabricator, 06Release-Engineering-Team: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2839572 (10Paladox) [20:24:20] 06Operations, 10Labs-project-Phabricator, 06Release-Engineering-Team: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2839570 (10Dzahn) please link this to the other phab2001 ticket(s) in some way [20:25:00] (03PS1) 10Catrope: Add b/c for the $wgEchoConfig -> $wgEchoEventLoggingSchema rename in I2f9d5d111f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324792 [20:26:33] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2839607 (10fgiunchedi) [20:26:38] (03CR) 10Rush: [C: 031] graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [20:26:41] (03PS7) 10Rush: graphite: cleanup labs instances metrics [puppet] 
- 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [20:27:00] (03CR) 10Legoktm: [C: 031] Add b/c for the $wgEchoConfig -> $wgEchoEventLoggingSchema rename in I2f9d5d111f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324792 (owner: 10Catrope) [20:27:13] !log bouncing kafka broker on kafka1018 to test config changes to eventlogging analytics kafka clients [20:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:33] mutante: godog: mdholloway: filed T152135 for the mobileapps alerts [20:27:33] T152135: Remove obsolete mobile-summary endpoint in MCS - https://phabricator.wikimedia.org/T152135 [20:28:36] (03PS1) 10Dzahn: add phabricator-new for phab2001 [dns] - 10https://gerrit.wikimedia.org/r/324794 (https://phabricator.wikimedia.org/T152132) [20:29:03] bearND: awesome, thanks! do you know why that has started alerting just now? [20:30:00] ACKNOWLEDGEMENT - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:00] ACKNOWLEDGEMENT - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:01] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:01] ACKNOWLEDGEMENT - 
mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:01] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:01] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:01] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:01] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:02] ACKNOWLEDGEMENT - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:02] ACKNOWLEDGEMENT - mobileapps 
endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page responds with malformed body: NoneType object has no attribute get daniel_zahn https://phabricator.wikimedia.org/T152135 [20:30:07] godog: no. But I think it was not the first time. I vaguely remember seeing something like this before. [20:30:18] bearND: ^ thanks :) [20:30:49] (03CR) 10Paladox: [C: 031] "yay thanks." [dns] - 10https://gerrit.wikimedia.org/r/324794 (https://phabricator.wikimedia.org/T152132) (owner: 10Dzahn) [20:31:44] godog: mutante : I think it's a fairly simple change to fix this. Should we try to deploy that today or can it wait until Monday? [20:32:04] a bit troubling that it would start failing for no apparent reason, I would have guessed a deploy heh [20:32:40] bearND: the problem with leaving it with alerts is that it can mask other real problems, I'd say if the fix is simple enough to deploy it today [20:32:49] I'd like to understand why it started failing now though [20:33:14] (03PS1) 10Dzahn: gerrit/status/wt-static: repeat hostname for multi records [dns] - 10https://gerrit.wikimedia.org/r/324795 [20:33:51] last time sth similar happened it was because of edits to the page itself [20:34:13] i would guess that too, an edit to the page "Dog" ? [20:34:25] or some template [20:34:26] (03CR) 10Paladox: [C: 031] gerrit/status/wt-static: repeat hostname for multi records [dns] - 10https://gerrit.wikimedia.org/r/324795 (owner: 10Dzahn) [20:35:00] (03PS2) 10Ottomata: Add discovery reports [puppet] - 10https://gerrit.wikimedia.org/r/322969 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:35:02] could be yeah [20:35:31] godog: mutante : yeah, that could be it. We'll try to fix this and deploy today. 
//cc:mdholloway [20:35:39] :) [20:35:55] (03CR) 10jenkins-bot: [V: 04-1] Add discovery reports [puppet] - 10https://gerrit.wikimedia.org/r/322969 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:36:32] jouncebot: next [20:36:32] In 3 hour(s) and 23 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161202T0000) [20:36:59] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2839631 (10Gehel) I guess that by frozen indices, we refer to [[ https://wikitech.wikimedia.org/wiki/Search#Pausing_Ind... [20:39:19] (03PS3) 10Ottomata: Add discovery reports [puppet] - 10https://gerrit.wikimedia.org/r/322969 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:41:34] (03CR) 10Ottomata: [C: 032] Add discovery reports [puppet] - 10https://gerrit.wikimedia.org/r/322969 (https://phabricator.wikimedia.org/T147034) (owner: 10MaxSem) [20:42:03] (03CR) 10Rush: "Small thing, but it gave me a jolt at first. 
Can you retitle from 'Keystone: open up firewall for public keystone API' to something like " [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [20:49:30] (03PS1) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [20:49:44] mutante ^^ :) [20:49:54] (03PS1) 10Dzahn: varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) [20:49:56] (03PS2) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [20:50:32] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2839687 (10fgiunchedi) There was a reoccurrence of this today with mobileapps ``` scb1001:~$ /usr/bin/service-checker-sw... 
[20:50:58] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:51:08] (03CR) 10Paladox: [C: 031] varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [20:53:45] !log otto@tin Starting deploy [eventlogging/eventbus@948765d]: (no message) [20:53:54] !log otto@tin Finished deploy [eventlogging/eventbus@948765d]: (no message) (duration: 00m 08s) [20:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:11] (03PS8) 10Andrew Bogott: Keystone: open up firewall to allow labs access to keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [20:55:13] (03PS2) 10Andrew Bogott: Labs: Add observerenv.sh, helper script for read-only creds [puppet] - 10https://gerrit.wikimedia.org/r/320830 (https://phabricator.wikimedia.org/T150092) [20:55:42] !log otto@tin Starting deploy [eventlogging/eventbus@948765d]: accept api_version parameter [20:55:52] !log otto@tin Finished deploy [eventlogging/eventbus@948765d]: accept api_version parameter (duration: 00m 09s) [20:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:21] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789590 (10bearND) BTW, we're going to remove that problematic endpoint today since we don't need it anymore. {T152135} [20:56:28] Hey, anyone from engineering around for a quick private chat? James seems to be out-of-pocket atm. 
[20:56:54] (03PS3) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [20:57:59] Revent, hmm? [20:58:19] PM [20:58:22] Revent: there's like a hundred people in engineering, with different skills and responsibilities. it depends on what you want to talk about… [20:59:43] hmm hmm hmm [21:00:01] I tried testing the new Content Translation version by publishing several articles, [21:00:02] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:00:11] and I get "docserver-http: HTTP 400" [21:00:22] might be a parsoid service problem, but maybe not [21:01:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:01:04] (03CR) 10Dzahn: [C: 032] phabricator, RT: repeat hostname for each record [dns] - 10https://gerrit.wikimedia.org/r/324788 (owner: 10Dzahn) [21:01:34] (03PS9) 10Andrew Bogott: Keystone: open up firewall to allow labs access to keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [21:01:37] !log bouncing kafka broker on kafka1002 to troubleshoot production only missing messages [21:01:37] (03PS3) 10Andrew Bogott: Labs: Add observerenv.sh, helper script for read-only creds [puppet] - 10https://gerrit.wikimedia.org/r/320830 (https://phabricator.wikimedia.org/T150092) [21:01:39] (03PS1) 10Andrew Bogott: keystone: clarifying comment [puppet] - 10https://gerrit.wikimedia.org/r/324799 [21:01:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:04] (03CR) 10Rush: [C: 031] keystone: clarifying comment [puppet] - 10https://gerrit.wikimedia.org/r/324799 (owner: 10Andrew Bogott) [21:02:33] (03PS2) 10Dzahn: add 
phabricator-new for phab2001 [dns] - 10https://gerrit.wikimedia.org/r/324794 (https://phabricator.wikimedia.org/T152132) [21:03:02] (03CR) 10Rush: [C: 031] "Nice. Note for posterity: Previously implemented controls still restrict authentication from the Labs instance range to the novaobserver " [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:03:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:05:57] (03CR) 10Rush: [C: 031] Labs: Add observerenv.sh, helper script for read-only creds [puppet] - 10https://gerrit.wikimedia.org/r/320830 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:06:02] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [21:06:10] (03PS1) 10Ottomata: Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/324801 [21:06:40] (03CR) 10Alex Monk: [C: 031] keystone: clarifying comment [puppet] - 10https://gerrit.wikimedia.org/r/324799 (owner: 10Andrew Bogott) [21:06:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [21:07:06] (03CR) 10jenkins-bot: [V: 04-1] Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/324801 (owner: 10Ottomata) [21:07:12] (03PS2) 10Dzahn: gerrit/status/wt-static: repeat hostname for multi records [dns] - 10https://gerrit.wikimedia.org/r/324795 [21:08:35] ostriches ^^ it seems to say failings 30% [21:08:40] (03CR) 10Dzahn: [C: 032] gerrit/status/wt-static: repeat hostname for multi records [dns] - 10https://gerrit.wikimedia.org/r/324795 (owner: 10Dzahn) [21:10:02] (03PS2) 10Ottomata: Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 
10https://gerrit.wikimedia.org/r/324801 [21:10:44] (03PS3) 10Ottomata: Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/324801 [21:10:51] (03CR) 10Dzahn: [C: 032] add phabricator-new for phab2001 [dns] - 10https://gerrit.wikimedia.org/r/324794 (https://phabricator.wikimedia.org/T152132) (owner: 10Dzahn) [21:11:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:11:43] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:12:14] (03CR) 10Andrew Bogott: [C: 032] keystone: clarifying comment [puppet] - 10https://gerrit.wikimedia.org/r/324799 (owner: 10Andrew Bogott) [21:12:54] (03CR) 10Ottomata: [C: 032] Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/324801 (owner: 10Ottomata) [21:12:59] (03PS4) 10Ottomata: Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/324801 [21:13:01] (03CR) 10Ottomata: [V: 032] Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer [puppet] - 10https://gerrit.wikimedia.org/r/324801 (owner: 10Ottomata) [21:15:20] (03CR) 10Dzahn: [C: 04-1] "the role keyword can only exist once per node, you can't repeat it" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:15:32] !log rolling bounce of main kafka brokers and then eventbus service to pick up api_version change, and to apply min.insync.replicas=1 to kafka [21:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:46] (03PS4) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:17:37] (03CR) 10Dzahn: Phabricator: rsync /srv/repos from 
iridium to phab2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:18:07] (03PS5) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:18:12] (03PS6) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:18:26] (03CR) 10Yurik: tilerator: deploy config with scap3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [21:18:29] (03CR) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:19:08] (03CR) 10Dzahn: [C: 04-1] "hmm, i dont really like that this is adding it on both servers, when it's only needed on one. but that would mean breaking up the node reg" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:19:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [21:20:02] PROBLEM - Check systemd state on kafka2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:20:36] (03CR) 10Paladox: "Wont we need to break it up so we can get phabricator up an running on the test domain otherwise the things we set for iridium will be inh" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:20:42] PROBLEM - Check systemd state on kafka2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:21:02] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [21:21:02] degraded? oh mirro maker [21:21:05] on it [21:21:22] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:21:54] 07Puppet: realm.pp: "Data retrieved from Toolsbeta is String not Hash" if not defined in Hiera - https://phabricator.wikimedia.org/T152142#2839838 (10valhallasw) [21:22:42] RECOVERY - Check systemd state on kafka2002 is OK: OK - running: The system is fully operational [21:23:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [21:23:22] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:34:23] (03CR) 10Paladox: "20after4 says we can use repository clustering for this https://secure.phabricator.com/book/phabricator/article/cluster_repositories/" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:35:10] (03CR) 1020after4: [C: 04-1] "yeah no need for rsync to mirror the repos. 
phabricator takes care of it" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:35:31] (03Abandoned) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:45:42] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [21:46:02] RECOVERY - Check systemd state on kafka2001 is OK: OK - running: The system is fully operational [21:49:52] (03PS1) 1020after4: WIP: phabricator refactor init.pp [puppet] - 10https://gerrit.wikimedia.org/r/324808 [21:54:44] twentyafterfour: A follow-up to that ^ (after it's landed) would be to fix the php.ini setup imho [21:55:05] Rather than having a full copied version from who-knows-when, just use system defaults and write our own ini overrides where it matters [21:55:16] ostriches: good idea [21:58:30] Ping? Looks like we’re shortly going to be replacing an image with about a million cross-wiki transclusions (in templates)…. is this going to make the world explode? [21:59:03] Revent: hi [21:59:11] Revent: "replacing"? you mean like uploading a new version? [21:59:17] Yeah. [21:59:32] https://commons.wikimedia.org/wiki/File:Disambig_gray.svg <- to match the new UI color pallete. [22:00:40] just do it then [22:00:43] 01:47, 9 August 2006 [22:00:48] Talk about an old image :) [22:01:14] legoktm: Okies, just didn’t want to unintentionally hammer the servers with re-rendering a million pages. 
:P [22:01:36] (03CR) 1020after4: [C: 031] varnish misc: add phab2001 as a backend for phab-new [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [22:01:57] Revent: Well, it'll just slowly work its way through the jobqueue unless people start trying to purge things manually and hammer it :) [22:03:01] I rather suspected that, just wanted to verify. [22:04:14] (03PS1) 10Paladox: Phabricator: Remove option metamta.maniphest.public-create-email [puppet] - 10https://gerrit.wikimedia.org/r/324812 [22:04:22] twentyafterfour ^^ :) [22:04:43] (03PS2) 10Paladox: Phabricator: Remove option metamta.maniphest.public-create-email [puppet] - 10https://gerrit.wikimedia.org/r/324812 [22:07:39] !log Dropped ar_usertext_timestamp indexes from the archive tables of all 4 wikis created since September (olowiki, ecwikimedia, projectcomwiki, fiwikivoyage) - replaced with usertext_timestamp index to match all older wikis. MW fix to follow. see -tech [22:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:57] (03CR) 1020after4: [C: 031] Phabricator: Remove option metamta.maniphest.public-create-email [puppet] - 10https://gerrit.wikimedia.org/r/324812 (owner: 10Paladox) [22:08:02] Reedy, ^ [22:08:06] RoanKattouw: ^ [22:12:50] !log upgrade scap to 3.4.1-1 on tin and mira [22:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:52] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:19:40] (03CR) 10Paladox: [C: 031] WIP: phabricator refactor init.pp [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [22:21:42] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [22:23:32] ah dammit that's not going to work, need a puppet patch for scap [22:25:37] (03PS1) 10Filippo Giunchedi: scap: upgrade to 3.4.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/324818 [22:26:51] (03CR) 10Filippo Giunchedi: [C: 032] scap: upgrade to 3.4.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/324818 (owner: 10Filippo Giunchedi) [22:26:56] (03PS2) 10Filippo Giunchedi: scap: upgrade to 3.4.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/324818 [22:29:01] 06Operations, 10Phabricator, 06Release-Engineering-Team, 13Patch-For-Review: Setup test domain for phab2001 - https://phabricator.wikimedia.org/T152132#2840082 (10Paladox) [22:30:05] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2840097 (10fgiunchedi) I took another look at this and it seems related to an edit since no deploys happened. Despite th... [22:35:17] (03CR) 10Tim Landscheidt: "We have done that in the past (cf. 
If0072fdfcd759deec4f51306690ee6a54f1fb813, I9d5c89136df5a48986571e302b97555c92aa1175, I1fd4c0b81e156e83" [puppet] - 10https://gerrit.wikimedia.org/r/324623 (https://phabricator.wikimedia.org/T151980) (owner: 10BryanDavis) [22:35:36] (03PS8) 10Filippo Giunchedi: graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) [22:37:22] (03CR) 10Filippo Giunchedi: [C: 032] graphite: cleanup labs instances metrics [puppet] - 10https://gerrit.wikimedia.org/r/323339 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [22:46:52] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:49:24] (03PS1) 10Filippo Giunchedi: graphite: switch labs instances cleanup to cron [puppet] - 10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) [22:49:41] chasemp: can you take a look ^ ? [22:49:42] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:50:14] (03CR) 10Rush: [C: 031] "agreed" [puppet] - 10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [22:50:24] godog: [22:50:37] (03CR) 10jenkins-bot: [V: 04-1] graphite: switch labs instances cleanup to cron [puppet] - 10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [22:51:55] chasemp: 👍 [22:52:05] show off :) [22:52:21] hahah I'm on osx, gotta use this fancy terminal [22:52:55] no way that'd work with terminus on linux, or maybe urxvt knows how to do font substituion [22:54:32] (03PS2) 10Filippo Giunchedi: graphite: switch labs instances cleanup to cron [puppet] - 10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) [22:56:26] (03CR) 10Filippo Giunchedi: [C: 032] graphite: switch labs instances cleanup to cron [puppet] - 
10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [22:56:30] (03PS3) 10Filippo Giunchedi: graphite: switch labs instances cleanup to cron [puppet] - 10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) [22:56:32] (03CR) 10Filippo Giunchedi: [V: 032] graphite: switch labs instances cleanup to cron [puppet] - 10https://gerrit.wikimedia.org/r/324820 (https://phabricator.wikimedia.org/T143405) (owner: 10Filippo Giunchedi) [23:00:27] 👏 [23:01:21] (03PS3) 10Dzahn: Phabricator: Remove option metamta.maniphest.public-create-email [puppet] - 10https://gerrit.wikimedia.org/r/324812 (owner: 10Paladox) [23:01:42] 🙌 [23:02:43] I should start adding more of that to !log [23:02:59] heh [23:04:44] 06Operations, 10Graphite, 06Labs: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#2840199 (10fgiunchedi) [23:06:26] (03CR) 10Dzahn: "recompiled your result after the last amend:" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:09:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [23:09:12] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [23:09:12] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [23:09:22] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [23:09:22] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [23:09:23] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:09:23] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [23:09:23] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [23:09:25] :) [23:09:32] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy 
[23:09:52] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [23:25:45] (03CR) 10Dzahn: "now bohrium (as a VM) does not get the freeipmi package anymore (as requested). the "check_ipmi_sensor" is a nagios plugin from base monit" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:25:51] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2840250 (10bearND) The issue is not directly in Swagger. It's just that a Swagger spec is used to run a test which exerc... [23:30:23] (03CR) 10Dzahn: [C: 031] "also double checked what you said about db2035, einsteinium and neodymium above and confirm all three, this is as intended" [puppet] - 10https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: 10Dzahn) [23:30:55] (03Draft1) 10Paladox: Phabricator: Set domain for phab2001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) [23:31:01] (03Draft2) 10Paladox: Phabricator: Set domain for phab2001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) [23:31:02] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:31:09] twentyafterfour mutante ^^ :) [23:33:36] (03Draft1) 10Paladox: Phabricator: Set phabricator active server for iridium and phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324833 (https://phabricator.wikimedia.org/T137928) [23:33:38] (03Draft2) 10Paladox: Phabricator: Set phabricator active server for iridium and phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324833 (https://phabricator.wikimedia.org/T137928) [23:33:45] twentyafterfour mutante ^^ :) [23:34:04] hey all, what's the procedure to run a maintenance script on production servers? [23:34:17] 06Operations, 06Discovery, 10Wikimedia-Apache-configuration, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2840271 (10debt) Is there a chance that this will actually be implemented any time soon? [23:35:24] bmansurov: something that will permanently run by itself ? or a one-time thing? [23:35:32] mutante: one time thing [23:36:45] bmansurov: users with mw deploy rights can also ssh to the maintenance servers, it's terbium for eqiad and wasat for codfw [23:36:52] (03PS3) 10Paladox: Phabricator: Set phabricator active server for iridium and phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324833 (https://phabricator.wikimedia.org/T137928) [23:36:57] is it an existing maintenance script? 
[23:37:05] mutante: yes [23:37:18] mutante: it's in the page images extension [23:37:28] https://phabricator.wikimedia.org/T152155 mutante fyi [23:37:34] 06Operations, 06Services (watching): reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2840283 (10fgiunchedi) [23:37:36] 06Operations, 06Services (watching): make ocg role work on labs instances (install deployment-pdf instance with jessie) - https://phabricator.wikimedia.org/T135034#2840281 (10fgiunchedi) 05Open>03declined OCG is on its way out [23:39:00] bmansurov, you convince one of us (deployment or restricted) that it's safe to run [23:39:32] in this extension's case your best bet is Max Semenik [23:39:50] bmansurov: ooh, so that comes with mediawiki in ./maintenance/ but i was talking about maintenance servers that run other scripts across the cluster [23:40:01] so what i said about terbium/wasat does not apply [23:40:11] we tend to run these on terbium anyway mutante [23:40:20] ah, ok [23:40:23] Krenair: so if I convince Max, who will run the script? [23:40:30] I'm not sure if he has the rights [23:40:30] Max [23:40:37] ok [23:40:42] he's not even on IRC [23:40:47] mutante: ok [23:40:49] Max has deployment rights [23:40:52] hm [23:41:16] bah [23:41:20] he logged like three hours ago [23:43:26] (03PS3) 10Paladox: Phabricator: Set domain for phab2001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) [23:43:27] 06Operations, 10DBA, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2687819 (10fgiunchedi) I don't recall seeing this issue on s4 since https://gerrit.wikimedia.org/r/314229 landed, still an issue or we... 
[23:46:57] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/4756/ looks good" [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [23:47:03] bmansurov, so [23:47:23] bmansurov, this issue affects pages that have been parsed since it's deployment? [23:47:29] Krenair: yes [23:47:37] Krenair: i'm running the script locally to verify [23:47:41] because I see this earlier-than option [23:47:55] And wondering if it might be best to do the reverse of that in this case [23:47:59] (03CR) 10Paladox: "Yay :)" [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [23:48:32] (03CR) 10Dzahn: [C: 031] Phabricator: Set domain for phab2001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [23:49:21] Krenair: reverse the patch that caused the issue? [23:49:31] Krenair: there is another patch that temporarily fixes it too:https://gerrit.wikimedia.org/r/#/c/324827/2 [23:49:32] no [23:49:53] run a version of the script with a later-than option [23:50:22] bmansurov: i've been talking to katz and robson about it. A revert won't really fix the problem, because post-revert we would already have non-free images in the field that it thinks is only free [23:50:36] ok [23:50:37] bmansurov: i think the sanest approach is to links update all the articles, over a couple days [23:50:49] it's at least the least dangerous, but the most computationally expensive [23:51:38] ebernhardson: ok so as Krenair says first we need to fix the affected articles [23:51:49] no [23:51:53] why not? [23:51:55] I haven't said anything [23:51:58] like that [23:52:00] bmansurov: the problem is there is no obvious way to fix the articles [23:52:10] what about later-than? 
[23:52:14] see now I'm regretting even looking at this [23:52:27] Krenair: looks like i misunderstood you [23:52:34] bmansurov: we could use some heuristic, like copy the free field to the default field for every page that's been edited since the train rolled forward. That's different for different wiki's [23:52:43] but that seems iffy, and will leave behind edge cases [23:53:08] (03CR) 10Dzahn: [C: 04-1] "no, the point of having the "active_server" setting is to make puppet stop certain things that should not be running on the inactive serve" [puppet] - 10https://gerrit.wikimedia.org/r/324833 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [23:53:31] anyway it sounds like ebernhardson knows more about this than I, so I'm going to leave it to him [23:54:10] ebernhardson: so are we looking at changing the maintenance script? [23:54:22] ebernhardson: or just running it now and waiting out the problem? [23:54:29] (03CR) 10Dzahn: "we need to talk more about what exactly can/should be running on the warm standby server that is not getting traffic and what needs to sta" [puppet] - 10https://gerrit.wikimedia.org/r/324833 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [23:55:19] bmansurov: after talking to katz and dgarry, It sounds like the best plan is deploy a change to allow returning non-free images. Despite the name that will mostly be returning free page images since the pages in the database havn't been updated with non-free [23:55:34] bmansurov: we run the script, populating the free field. 
Once that's done switch the default back to free images [23:55:50] (03CR) 1020after4: [C: 031] Phabricator: Set domain for phab2001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [23:55:51] jouncebot: next [23:55:51] In 0 hour(s) and 4 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161202T0000) [23:56:02] ebernhardson: ok, are you on top of deploying the patch? or should I take care of it? [23:56:35] bmansurov: jdlrobson just merged the patch to allow non-free. Will get that swatted once jenkins merges it. Not sure about the maint script to re-run linksupdate everyhwhere though [23:57:11] ebernhardson: ok thanks, I'll ping MaxSem about it [23:57:19] (03CR) 10Dzahn: [C: 032] "yup, upstream says it's deprecated" [puppet] - 10https://gerrit.wikimedia.org/r/324812 (owner: 10Paladox) [23:57:43] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/324812 (owner: 10Paladox) [23:58:56] !log bsitzmann@tin Starting deploy [mobileapps/deploy@b545699]: Update mobileapps to 04a6e84 [23:59:02] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log