[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161216T0000). [00:00:12] copy from nlwiki :P [00:00:19] Urbanecm: as a follow-up change, add your definitive main page name in the whitelist in InitialiseSettings.php, Hlavní_strana currently [00:00:23] s/nlwiki/dewiki [00:00:33] ok [00:00:42] 06Operations: label mwlog1001/WMF4724 (and update racktables) - https://phabricator.wikimedia.org/T153383#2878971 (10RobH) [00:01:10] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878252 (10RobH) [00:01:13] So for SWAT: that will wait a little bit, train was late, so is wiki creation. [00:01:54] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878986 (10RobH) Assigned over to @fgiunchedi for service implementation. Feel free to resolve this task when no longer needed. 
[00:02:16] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2878988 (10RobH) a:05RobH>03fgiunchedi [00:02:25] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for arbcom-cs.wikipedia.org (duration: 00m 40s) [00:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:06] !log mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php arbcom_cswiki --backend=local-multiwrite --private [00:08:11] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [00:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:07] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2878663 (10Cmjohnson) @Eevans Please follow the step outlined in https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_requests. Once completed there is a 3 day gr... [00:09:14] !log arbcom-cs.wikipedia.org creation is done. [00:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:47] Main Page needs to be adjusted though, but Urbanecm was tasked with it iirc [00:11:00] fwiw his account should be created there, right? [00:11:16] yes I'm grepping the log to find again the username he wants [00:11:27] Martin Urbanec [00:11:41] I tried `lastops | grep Urbanecm | grep -v " has " | grep Martin`, as I'm sure it's Martin ... 
without success [00:12:18] Dereckson: https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec [00:12:29] + he said so in the Task [00:12:29] 06Operations: reinstall/reimage sinistra as mwlog2001 - https://phabricator.wikimedia.org/T153384#2879007 (10RobH) [00:12:38] ok [00:12:49] let me search for more security [00:13:49] https://phabricator.wikimedia.org/T151731#2863051 [00:14:20] !log arbcom_cswiki: Creating and promoting User:Martin Urbanec into bureaucrat [00:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:41] * TabbyCat checks [00:15:07] ostriches: sorry to bother -- what was the magic incantation for gerrit -> github replication you did? still not seeing operations/software/hhvm_exporter in github :( [00:15:21] Member of: Bureaucrats [00:15:23] (03PS1) 10Dzahn: toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) [00:15:30] I guess he can assign sysops himself [00:15:51] godog: It gets triggered when something lands, or `ssh gerrit.wikimedia.org replication start some/repo/name` [00:16:38] ostriches: ack thanks, I've nudged it now [00:16:48] yw [00:17:56] (03PS2) 10Dereckson: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327590 (owner: 10Jdrewniak) [00:18:00] debt: ping?. [00:19:21] (03PS2) 10Dereckson: Revert "Temporarily disable centralauth-rename right" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327657 (owner: 10Kaldari) [00:19:29] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327657 (owner: 10Kaldari) [00:19:38] kaldari: here [00:20:23] (03Merged) 10jenkins-bot: Revert "Temporarily disable centralauth-rename right" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327657 (owner: 10Kaldari) [00:20:51] kaldari: live on mwdebug1002 [00:20:56] checking... [00:21:10] jan_drewniak: ping? 
[00:21:21] (03CR) 10jenkins-bot: [V: 04-1] toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [00:21:53] Dereckson: Looks good on 1002, feel free to sync [00:22:42] I'm here, Dereckson [00:23:06] debt: okay, I'm syncing kaldari change, then you're next [00:23:10] thanks [00:23:37] (03PS3) 10Dereckson: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327590 (owner: 10Jdrewniak) [00:23:39] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Reenable centralauth-rename right (T148242) (duration: 00m 49s) [00:23:44] (03CR) 10Dereckson: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327590 (owner: 10Jdrewniak) [00:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:51] T148242: Fully populate local_user_id and global_user_id fields in production - https://phabricator.wikimedia.org/T148242 [00:24:16] debt: you can test before I sync and purge on mwdebug1002? [00:24:33] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327590 (owner: 10Jdrewniak) [00:24:38] sure [00:25:09] should be good to go [00:25:22] debt: live on mwdebug1002 [00:25:28] 06Operations, 10Domains, 06Labs, 10Pywikibot-core, 10Traffic: pywikipedia.org is not responding; pywikibot.org is not registered - https://phabricator.wikimedia.org/T106311#2879061 (10Dzahn) [00:25:53] looks fine there, Dereckson [00:26:39] Dereckson: looks good on Prod, thanks! 
[00:26:46] debt: syncing [00:27:14] (03CR) 10Dzahn: [C: 04-1] contint: provision the secondary CI master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:27:19] !log dereckson@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 39s) [00:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:00] !log dereckson@tin Synchronized portals: (no message) (duration: 00m 40s) [00:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:32] yurik: I've 327409 to Jenkins, you've the floor to deploy it [00:30:45] I've CR'ed +2 [00:30:59] and already merged [00:31:07] Dereckson, thanks! [00:31:20] do you think we have scap time for i18n today? [00:31:37] Dereckson, also, do we need to bump core, or is v6 ok? [00:33:16] v6 is okay [00:34:51] ok, thx [00:37:39] (03PS3) 10Dzahn: contint: provision the secondary CI master [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:37:50] Dereckson: looks good in production, thanks! [00:37:53] (03CR) 10Dzahn: [C: 031] "compiled. no-op on contint1001 http://puppet-compiler.wmflabs.org/4903/" [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:38:06] debt: you're welcome [00:38:58] (03CR) 10Dzahn: [C: 032] contint: provision the secondary CI master [puppet] - 10https://gerrit.wikimedia.org/r/327594 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:42:11] PROBLEM - Check systemd state on contint2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:11] PROBLEM - puppet last run on contint2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:21] PROBLEM - salt-minion processes on contint2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:42:31] PROBLEM - DPKG on contint2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:01] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational [00:43:11] RECOVERY - salt-minion processes on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:43:22] RECOVERY - DPKG on contint2001 is OK: All packages OK [00:43:37] testing on mw1002 with MaxSem & jgirault [00:43:44] mwdebug1002 [00:43:59] (03PS1) 10Cmjohnson: Adding prometheus1003-4 to netboot.cfg file T152504 [puppet] - 10https://gerrit.wikimedia.org/r/327677 [00:46:01] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [00:47:18] !log yurik@tin Synchronized php-1.29.0-wmf.6/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/327409/ (duration: 00m 41s) [00:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:22] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879130 (10Dzahn) [00:50:02] done [00:50:56] (03PS2) 10Dzahn: zuul: manage service status from hiera [puppet] - 10https://gerrit.wikimedia.org/r/327649 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:53:02] (03PS1) 10Filippo Giunchedi: prometheus: test hhvm/apache exporter on mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/327681 (https://phabricator.wikimedia.org/T147423) [00:56:22] (03CR) 10Dzahn: [] "PS2: manual rebase, compiled -> http://puppet-compiler.wmflabs.org/4904/" [puppet] - 10https://gerrit.wikimedia.org/r/327649 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:56:34] (03CR) 10Dzahn: [C: 032] zuul: manage service status from hiera [puppet] - 10https://gerrit.wikimedia.org/r/327649 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [00:58:30] 
RECOVERY - cassandra-a CQL 10.64.32.130:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.32.130 port 9042 [00:58:57] 06Operations, 06Performance-Team: Upgrade Grafana to 4.0.2 - https://phabricator.wikimedia.org/T152473#2879148 (10fgiunchedi) >>! In T152473#2878554, @Gilles wrote: > That's done now, right? Correct! [00:59:12] (03CR) 10Eevans: [C: 031] "Ready." [puppet] - 10https://gerrit.wikimedia.org/r/327560 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [00:59:20] (03PS2) 10Eevans: enable instance restbase1017-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327560 (https://phabricator.wikimedia.org/T151086) [01:00:10] PROBLEM - jenkins_zmq_publisher on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [01:02:21] (03PS3) 10Dzahn: enable instance restbase1017-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327560 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [01:02:26] (03CR) 10Cmjohnson: [C: 032] Adding prometheus1003-4 to netboot.cfg file T152504 [puppet] - 10https://gerrit.wikimedia.org/r/327677 (owner: 10Cmjohnson) [01:03:27] (03CR) 10Dzahn: [V: 032 C: 032] enable instance restbase1017-b.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327560 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [01:03:53] mutante: thanks! [01:05:20] sure, np [01:06:04] did something related to OAuth tokens change today, or is something broken right now? [01:06:30] ACKNOWLEDGEMENT - jenkins_zmq_publisher on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused daniel_zahn T150771 [01:06:32] ragesoss: It's possible... Did it only break in the last few hours? [01:06:38] yes [01:06:43] .6 [01:07:06] Hmm [01:07:16] https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.6/Changelog shows no oauth changes at all [01:07:50] ragesoss: Wanna be more specific? 
[01:08:45] some of my tests that interact with the api are failing, both locally and on CI [01:08:55] looking into more details now. [01:11:10] (03PS2) 10Filippo Giunchedi: prometheus: test hhvm/apache exporter on mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/327681 (https://phabricator.wikimedia.org/T147423) [01:11:18] (03CR) 10Filippo Giunchedi: [] "PCC https://puppet-compiler.wmflabs.org/4908/" [puppet] - 10https://gerrit.wikimedia.org/r/327681 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [01:12:23] (03PS2) 10Dzahn: contint: add a disabled zuul server on contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [01:13:16] (03CR) 10Dzahn: [C: 032] "PS2: manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [01:14:11] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: test hhvm/apache exporter on mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/327681 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [01:14:16] (03PS3) 10Filippo Giunchedi: prometheus: test hhvm/apache exporter on mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/327681 (https://phabricator.wikimedia.org/T147423) [01:14:20] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:14:25] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] prometheus: test hhvm/apache exporter on mwdebug [puppet] - 10https://gerrit.wikimedia.org/r/327681 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [01:14:35] rebase wars!!11one [01:15:09] (03CR) 10Dzahn: "Error: The debian provider can not handle attribute enable" [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [01:15:17] lol, yes [01:17:00] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[zuul] [01:18:46] ACKNOWLEDGEMENT - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[zuul] daniel_zahn T150771 [01:22:26] (03PS1) 10Filippo Giunchedi: prometheus: fix apache_exporter's port [puppet] - 10https://gerrit.wikimedia.org/r/327682 [01:23:07] hmm... some curl: http://127.0.0.1:9001/stop in the logs.... weird :) [01:23:57] ApiVisualEditor.php line 253 undefined var ... James_F ? [01:24:14] and line 250 [01:24:24] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix apache_exporter's port [puppet] - 10https://gerrit.wikimedia.org/r/327682 (owner: 10Filippo Giunchedi) [01:26:16] 250 is blank [01:28:48] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:28:58] PROBLEM - DPKG on mwdebug1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [01:29:18] PROBLEM - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [01:29:38] PROBLEM - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [01:29:50] yes, because it's supposed to be stopped [01:30:08] we need to add the usual logic to avoid the icinga thing on the non-active server [01:30:18] PROBLEM - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.131 and port 9042: Connection refused [01:30:24] mwdebug1002 - no idea [01:30:46] restbase1017 - yea, that is brand new [01:30:53] on it [01:31:09] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.131 and port 9042: Connection refused eevans Bootstrapping [01:31:23] ACKNOWLEDGEMENT - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused daniel_zahn T150771 [01:31:23] ACKNOWLEDGEMENT - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server daniel_zahn T150771 [01:31:29] cool [01:32:50] mwdebug is me [01:33:00] gotcha [01:33:20] Reedy: I think the problem I'm having might be related to how the API handles requests for non-existent users. So, plausibly related to the API i18n work? Looks like there were some live requests with dummy data in the test suite that I didn't realize were hitting mediawiki. [01:33:28] sound plausible? 
[01:35:16] (03CR) 10Dzahn: contint: add a disabled zuul server on contint2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [01:35:35] (03CR) 10Dzahn: "for "enable" attribute: Valid values are true, false, manual. so "mask" isn't one of them" [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [01:36:44] (03PS1) 10Filippo Giunchedi: Remove duplicate hhvm_tc_used_bytes metric [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/327683 [01:36:58] ragesoss: Yeah, there's been some error message change sand stuff [01:37:24] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Remove duplicate hhvm_tc_used_bytes metric [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/327683 (owner: 10Filippo Giunchedi) [01:38:18] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 37 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-hhvm-exporter] [01:39:12] (03PS1) 10Dzahn: contint: ensure => 'stopped' for zuul instead of 'mask' [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) [01:41:09] (03PS2) 10Dzahn: contint: ensure => 'stopped' for zuul instead of 'mask' [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) [01:42:14] (03PS1) 10Krinkle: tests: Use cp0xxx instead of cp1xxx for sample data [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 [01:42:32] (03PS1) 10Filippo Giunchedi: Release 0.2 [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/327687 [01:42:47] (03CR) 10jenkins-bot: [V: 04-1] tests: Use cp0xxx instead of cp1xxx for sample data [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [01:42:49] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational [01:42:58] (03PS3) 10Dzahn: 
contint: ensure => 'stopped' for zuul instead of 'mask' [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) [01:42:58] RECOVERY - DPKG on mwdebug1002 is OK: All packages OK [01:42:58] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [01:43:18] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [01:43:27] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Release 0.2 [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/327687 (owner: 10Filippo Giunchedi) [01:43:53] (03CR) 10Krinkle: [] "Build is passing at https://travis-ci.org/Krinkle/operations-software-conftool/builds/184422466. It seems other commits in this repo were " [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [01:44:09] (03CR) 10Dzahn: [C: 032] "I see we have one place where we use "mask" but we use an exec for that and not the service attribute. guess that explains why.. (modules/" [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [01:45:06] (03CR) 10Dzahn: [C: 032] "yup, so we have to wait for Puppet 4 (4.2?) 
for this feature" [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [01:47:08] (03CR) 10Dzahn: "Notice: /Stage[main]/Zuul::Server/Service[zuul]/enable: enable changed 'true' to 'false'" [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [01:48:08] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [01:48:15] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/327685/" [puppet] - 10https://gerrit.wikimedia.org/r/327650 (https://phabricator.wikimedia.org/T150771) (owner: 10Hashar) [01:48:29] (03PS1) 10Filippo Giunchedi: Fix tests to not consider hhvm_tc_used_bytes [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/327688 [01:48:44] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Fix tests to not consider hhvm_tc_used_bytes [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/327688 (owner: 10Filippo Giunchedi) [01:49:19] (03CR) 10Dzahn: [] "ack, the regular prod role works in labs, now, let's get rid of the labs role" [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) (owner: 10Hashar) [01:49:56] (03PS2) 10Dzahn: phabricator: fix passing config on labs [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) (owner: 10Hashar) [01:50:18] (03CR) 10Dzahn: [C: 032] "merging anyways, but let's kill this role as soon as possible" [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) (owner: 10Hashar) [01:58:26] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879239 (10Dzahn) rebased/amended/merged/follow-up fix done contint1001/2001 are now including identical roles in site.pp... 
[02:02:22] (03PS1) 10Dzahn: phabricator: delete labs role [puppet] - 10https://gerrit.wikimedia.org/r/327690 (https://phabricator.wikimedia.org/T139475) [02:02:41] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/327690/" [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) (owner: 10Hashar) [02:06:01] (03CR) 10Dzahn: [C: 031] "lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [02:11:01] (03PS1) 10Dzahn: contint: combine contint1001/2001 in a single node regex [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) [02:15:52] Krinkle: re: https://gerrit.wikimedia.org/r/#/c/327686 seems a bit specific to change only cp hostnames and not e.g. mw, not 100% sure what you are attempting to do [02:16:08] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:16:24] godog: Just wanted to avoid those prod host names from showing up in Wikimedia Git searches in all repos [02:16:31] since they're not "real" [02:16:57] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2879244 (10TTO) >>! In T126146#2876690, @thiemowmde wrote: > His -1 could easily have been overruled by someone else,... [02:17:06] I was looking for which cluster cp1055 is in via https://github.com/search?q=org:wikimedia+cp1055&ref=opensearch&type=Code [02:17:10] and noticed the false positive matches [02:18:07] (03PS1) 10Dzahn: contint: simplify includes in site.pp, move things to master role [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) [02:20:23] Krinkle: ah ok, yeah I suspect there's many such examples in git overall :( [02:20:54] godog: not to my knowledge. 
The few that showed up over time I've removed :) [02:21:22] anyhow, I'll update the mw ones as well [02:22:35] Krinkle: hehe e.g. also mw1010 shows up as false positive [02:22:40] maybe I got lucky tho [02:22:45] Lpl [02:22:47] Lol* [02:23:54] (03PS2) 10Krinkle: tests: Use cp0 and srv0 instead of cp1/mw1 for sample data [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 [02:24:16] godog: What's the state of the Jenkins job btw? [02:24:26] It seems no commit to this repo has been in Gerrit until mine just now [02:24:33] But it does have a Jenkins job set up [02:24:36] (03CR) 10jenkins-bot: [V: 04-1] tests: Use cp0 and srv0 instead of cp1/mw1 for sample data [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [02:24:39] I assume this isn't used by ops? [02:25:19] Krinkle: no idea tbh, we use the software but I haven't committed to the repo [02:25:53] https://github.com/wikimedia/operations-software-conftool/graphs/contributors [02:26:43] QED [02:27:03] :) [02:29:23] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [02:30:09] ok I'm off, bye! [02:30:36] (03PS1) 10Mobrovac: Beta: RESTBase: Use the BetaCluster MCS instance [puppet] - 10https://gerrit.wikimedia.org/r/327694 (https://phabricator.wikimedia.org/T149671) [02:31:15] _joe_ elukey hhvm/apache exporters are running on mwdebug100[12] btw if tomorrow you want to apply the role to some more hosts I didn't see any impact whatsoever [02:31:48] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 11m 47s) [02:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Dec 16 02:36:34 UTC 2016 (duration 4m 46s) [02:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:02] (03CR) 10Mobrovac: [C: 031] "Cherry-picked in beta and applies only to beta. 
Good to go." [puppet] - 10https://gerrit.wikimedia.org/r/327694 (https://phabricator.wikimedia.org/T149671) (owner: 10Mobrovac) [02:43:23] (03PS1) 10Dzahn: contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) [02:44:02] (03CR) 10Dzahn: [] "This is similar to how we solve the same thing with Gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [02:44:56] (03PS2) 10Dzahn: role::jsbench: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320547 (owner: 10Muehlenhoff) [02:45:08] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [02:46:46] (03CR) 10Dzahn: [C: 032] Beta: RESTBase: Use the BetaCluster MCS instance [puppet] - 10https://gerrit.wikimedia.org/r/327694 (https://phabricator.wikimedia.org/T149671) (owner: 10Mobrovac) [02:47:18] (03CR) 10Dzahn: [C: 032] role::jsbench: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320547 (owner: 10Muehlenhoff) [02:47:23] (03PS3) 10Dzahn: role::jsbench: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320547 (owner: 10Muehlenhoff) [02:58:25] (03PS3) 10Dzahn: Tools proxy: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/321371 (owner: 10Muehlenhoff) [03:00:22] (03CR) 10Dzahn: [C: 032] Tools proxy: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/321371 (owner: 10Muehlenhoff) [03:02:17] (03CR) 10Dzahn: [C: 04-1] "you should ask people who already work with the "ELK-stack" (Elasticsearch Logstash Kibana) about this. 
i don't know if we really want/nee" [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [03:08:55] !log osmium (role ve): upgrade chromium for https://www.debian.org/security/2016/dsa-3731 [03:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:28] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:22:33] (03CR) 10Krinkle: [C: 032] tests: Clean up PHPUnit tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325054 (owner: 10Krinkle) [03:23:08] (03Merged) 10jenkins-bot: tests: Clean up PHPUnit tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325054 (owner: 10Krinkle) [03:28:58] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [03:29:30] that will be me rolling out security upgrades [03:30:58] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [03:31:16] !log installing apt upgrade for https://www.debian.org/security/2016/dsa-3733 on thumbor, bastion, misc-others group [03:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:38] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:37:54] !log installing apt upgrade for DSA-3733-1 on everything in codfw (dc-codfw group) [03:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:28] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [03:42:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [03:42:39] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:43:38] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [03:44:38] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [03:44:39] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [03:45:38] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [03:46:12] i dont know why it did that, but it's done now [04:01:38] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [04:03:58] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:19:58] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1880.60 Read Requests/Sec=3033.90 Write Requests/Sec=0.60 KBytes Read/Sec=27107.20 KBytes_Written/Sec=28.80 [04:29:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=48.10 Read Requests/Sec=168.70 Write Requests/Sec=146.80 KBytes Read/Sec=1601.60 KBytes_Written/Sec=1541.20 [04:32:58] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [04:57:55] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 07HTTPS, 05codfw-rollout: Enable HTTPS on internal MediaWiki appserver virtual service hostnames - https://phabricator.wikimedia.org/T109315#2879355 (10BBlack) [04:57:58] 06Operations, 10Traffic, 07HHVM, 13Patch-For-Review, and 2 others: Enable TLS termination on the MediaWiki clusters - https://phabricator.wikimedia.org/T153042#2879353 (10BBlack) [04:58:38] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:13:48] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:26:38] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [05:39:28] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 12 failures. Last run 2 minutes ago with 12 failures. Failed resources (up to 3 shown): Package[debian-goodies],Package[apt-listchanges],Package[ethtool],Package[atop] [05:40:48] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:07:28] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:13:38] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:32:10] <_joe_> godog: ack [06:42:38] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:53:58] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:54:38] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:04:35] (03PS1) 10Urbanecm: Enable subpages in NS0 for arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327700 [07:12:38] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:16:31] (03PS1) 10Marostegui: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327701 (https://phabricator.wikimedia.org/T150644) [07:19:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327701 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [07:19:59] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327701 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui) [07:21:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1045 - T150644 (duration: 00m 40s) [07:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:07] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [07:22:08] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:22:38] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:22:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [07:22:58] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:23:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:25:29] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2879751 (10Joe) [07:25:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:27:33] !log upgrading apt via debdeploy on dc-all for DSA-3733-1 apt -- security update [07:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:29:09] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[gdb] [07:34:16] (03CR) 10Marostegui: [C: 031] "server caught up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327505 (owner: 10Marostegui) [07:34:21] godog: niceeeeeee [07:34:24] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327505 [07:35:47] !log Deploy alter table wikidatawiki.revision on db1045 - T150644 [07:35:56] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327505 (owner: 10Marostegui) [07:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:01] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [07:36:29] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327505 (owner: 10Marostegui) [07:37:37] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2068 - T151552 (duration: 00m 39s) [07:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:51] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [07:40:23] (03PS1) 10Marostegui: osc_host.sh: Added skip-ssl for the connection [software] - 10https://gerrit.wikimedia.org/r/327703 (https://phabricator.wikimedia.org/T111654) [07:40:38] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:41:08] (03PS2) 10Marostegui: osc_host.sh: Add skip-ssl for the connection [software] - 10https://gerrit.wikimedia.org/r/327703 (https://phabricator.wikimedia.org/T111654) [07:43:06] !log silver - upgrading apt [07:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:08] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures 
[07:53:18] RECOVERY - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.32.131 port 9042 [07:55:13] (03CR) 10Hashar: "Sorry Danel, I used my local puppet documentation which is some 4.x and has enable => mask :D" [puppet] - 10https://gerrit.wikimedia.org/r/327685 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [07:57:09] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:58:44] (03CR) 10Hashar: [] contint: combine contint1001/2001 in a single node regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [08:10:40] (03CR) 10Hashar: [C: 04-1] contint: simplify includes in site.pp, move things to master role (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [08:10:45] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2879890 (10Joe) [08:17:19] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2879900 (10Joe) So, given we need to close this discussion once and for all, I came up with a change of wording that would have us write the profile cla... [08:18:20] (03CR) 10Hashar: [C: 04-1] "I have to refactor/split contint::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [08:20:57] !log Stop mysql db2048 for maintenance - https://phabricator.wikimedia.org/T149553 [08:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:32] (03CR) 10Hashar: [C: 04-1] "That ties the zuul module to a hiera key under contint::. 
Instead we can skip the include in zuul::server when ever service_enable is fal" [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [08:26:25] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2879938 (10yuvipanda) [08:28:48] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2879965 (10hashar) >>! In T150771#2879239, @Dzahn wrote: > rebased/amended/merged/follow-up fix done What a surprise to hav... [08:40:08] (03CR) 10Jcrespo: [C: 031] osc_host.sh: Add skip-ssl for the connection [software] - 10https://gerrit.wikimedia.org/r/327703 (https://phabricator.wikimedia.org/T111654) (owner: 10Marostegui) [08:48:50] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327708 [08:50:09] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327708 (owner: 10Marostegui) [08:50:33] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2724601 (10ArielGlenn) I have a few scripts that are generated from templates. Any thoughts about what we can do for these cases? 
[08:50:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327708 (owner: 10Marostegui) [08:51:08] Hi gerrit is slow for me [08:51:20] It's not my connection as I can load irc and google [08:51:28] paladox: it is working fine for me so far [08:51:37] Oh [08:51:45] 06Operations, 06Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#2880030 (10Gehel) Changing the IO scheduler to `noop` instead of `deadline` makes mostly no difference: Steps: ``` Dec 15 20:50:38 Writing a byte at a time...done Dec 15... [08:52:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 - T150644 (duration: 00m 56s) [08:52:12] paladox: for the last two hours I haven't noticed anything unusual [08:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:23] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [08:52:37] It's saying safari cannot open [08:52:42] The page [08:54:39] It seems to be only me I guess my mobile provider cached it so it is not loading [08:55:09] It's working now [08:55:14] :-) [08:55:35] :) [08:55:58] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:03:27] (03CR) 10Marostegui: [C: 032] osc_host.sh: Add skip-ssl for the connection [software] - 10https://gerrit.wikimedia.org/r/327703 (https://phabricator.wikimedia.org/T111654) (owner: 10Marostegui) [09:04:30] (03Merged) 10jenkins-bot: osc_host.sh: Add skip-ssl for the connection [software] - 10https://gerrit.wikimedia.org/r/327703 (https://phabricator.wikimedia.org/T111654) (owner: 10Marostegui) [09:08:12] !log Deploy alter table dbstore1002 wikidatawiki.revision - T150644 [09:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:25] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [09:15:12] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2880093 (10thiemowmde) @greg: Looks like the urgent priority in T151247 is because stuff got deleted (and restored by... [09:19:18] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:21:55] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2880142 (10TTO) >>! In T126146#2880093, @thiemowmde wrote: > @TTO: Why do you think this has something to do with "Wi... 
[09:23:58] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:26:33] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2880151 (10Volans) a:03Volans [09:33:12] (03PS1) 10Dereckson: toollabs: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327709 [09:36:06] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2880170 (10thiemowmde) "Sites" is not a first-class Wikibase concept. It was meant to replace core's interwiki map. T... [09:37:55] (03PS1) 10Dereckson: swift: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327710 [09:39:01] (03CR) 10Dereckson: [] "Upstream seems to have fixed that issue in 2011." [puppet] - 10https://gerrit.wikimedia.org/r/327710 (owner: 10Dereckson) [09:39:16] (03PS4) 10Jcrespo: Repool db1073 after maintenance as enwiki extra API node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327668 (https://phabricator.wikimedia.org/T149728) [09:41:00] (03PS1) 10Dereckson: varnish: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327711 [09:42:50] * Dereckson wonders if it's valuable to separate commits by modules to ease migration and testing, or if it's trivial enough to have one commit for every #!/usr/bin/python to #!/usr/bin/env python [09:43:08] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:44:50] yeah and 179 #!/bin/bash [09:48:18] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:50:13] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2880204 (10Volans) @ArielGlenn it's surely depends on the specific cases, but I think that this is usually an anti-pattern. What... [10:11:08] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:12:24] (03CR) 10Hashar: [C: 031] contint: New role for Docker based CI slave (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: 10Dduvall) [10:20:11] 06Operations, 10MediaWiki-Internationalization: Norwegian messages inContentLanguage look for on-wiki overrides at the /nb subpage, not the root page - https://phabricator.wikimedia.org/T126146#2880279 (10jhsoby) >>! In T126146#2880093, @thiemowmde wrote: > The only thing the Wikidata team can do is to **remov... [10:31:18] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [10:34:13] doing a friday deploy https://phabricator.wikimedia.org/T153424 [10:41:22] !log krenair@tin Synchronized php-1.29.0-wmf.6/extensions/OAuth/api/MWOAuthSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/327714/1 (duration: 00m 40s) [10:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:50] !log Deploy alter table dbstore1001 wikidatawiki.revision [10:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:51] that appears to have fixed things [10:43:37] yup [10:45:02] (03PS3) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [10:45:04] (03PS1) 10Yuvipanda: labsdb: Make delete-dbusers work with new config format [puppet] - 10https://gerrit.wikimedia.org/r/327715 [10:45:08] (03PS1) 10ArielGlenn: don't treat missing output files as truncated [dumps] - 10https://gerrit.wikimedia.org/r/327716 [10:45:59] (03CR) 10jenkins-bot: [V: 04-1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [10:46:07] (03CR) 10ArielGlenn: [C: 032] don't treat missing output files as truncated [dumps] - 10https://gerrit.wikimedia.org/r/327716 (owner: 10ArielGlenn) [10:46:14] (03CR) 10jenkins-bot: [V: 04-1] labsdb: Make delete-dbusers work with new config format [puppet] - 10https://gerrit.wikimedia.org/r/327715 (owner: 10Yuvipanda) [10:47:09] !log ariel@tin Starting deploy [dumps/dumps@7118c1b]: fix error checking for reruns of single checkpoint files [10:47:11] !log ariel@tin Finished deploy [dumps/dumps@7118c1b]: fix error checking for reruns of single checkpoint files (duration: 00m 01s) [10:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:38] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:57:04] (03PS2) 10Ema: varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) [10:57:06] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2880407 (10hashar) About spellcheck, Jessie has version 0.3.4 , would it make sense to backport 0.4.4 from testing and add it to... [10:57:57] (03CR) 10Ema: [] varnishmedia: port to cachestats.CacheStatsSender (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [10:58:48] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: run shell inside the chroot on build failures [puppet] - 10https://gerrit.wikimedia.org/r/327497 (owner: 10Ema) [10:59:07] (03CR) 10Alexandros Kosiaris: [C: 032] "tested in labs (packager in packaging project). Seems to work fine! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/327497 (owner: 10Ema) [10:59:14] (03PS2) 10Alexandros Kosiaris: package_builder: run shell inside the chroot on build failures [puppet] - 10https://gerrit.wikimedia.org/r/327497 (owner: 10Ema) [10:59:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: run shell inside the chroot on build failures [puppet] - 10https://gerrit.wikimedia.org/r/327497 (owner: 10Ema) [10:59:18] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:02:39] akosiaris: the shell run on package building failure, that is a good thing for manual building but I am afraid it is going to break the debian-glue jobs? 
:D [11:02:43] ( ref https://gerrit.wikimedia.org/r/#/c/327497/ ) [11:03:12] (I wish it was an option of cowbuilder really, would be quite handy) [11:03:32] ah yes, the jobs are going to get stuck [11:03:55] hashar: what if we guarded it with an ENV var ? [11:04:06] for building Nodepool images I am using diskimage-builder that can break to a shell when an env is set [11:04:13] so that debian-glue jobs issued a NO_C10shell=true ? [11:04:36] so one exports break=after-error and bam [11:04:37] or something similar.. not attached to naming [11:04:40] yeah [11:04:54] (03CR) 10Volans: [] "LGTM, but there is another one." (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/327686 (owner: 10Krinkle) [11:04:55] I would make breaking to shell the default [11:05:00] yes [11:05:10] and pick whatever env name to let us disable that hook/feature on jenkins [11:05:15] ok agreed [11:05:28] thanks for remembering that [11:05:53] (03PS3) 10Ema: varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) [11:06:00] akosiaris: is there a packaging project? I created a package builder in deployment-prep since I needed more than 8GB of ram for a build :) [11:06:02] I was opting for having the shell enabled via an ENV var, but then after discussing it with ema realized that we probably want it to be the default [11:06:22] elukey: yes, let me add you as a user. It's quite the old project.. dated back to at least 2013 [11:06:56] akosiaris: so what ever works for you :} [11:07:11] hashar: ok [11:07:19] will amend the change and let you know of the variable [11:07:27] I am not sure how to pass the envs in debian-glue / cowbuilder etc. I think I need to whitelist the env variable in the labs sudo policy.
But I will figure it out [11:07:48] if I was smarter [11:08:07] (03CR) 10Jcrespo: [C: 032] Repool db1073 after maintenance as enwiki extra API node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327668 (https://phabricator.wikimedia.org/T149728) (owner: 10Jcrespo) [11:08:11] we would have a dedicated private jenkins that auto build packages for us and if build from a tag auto publish them to a central repo :} [11:08:30] and we will have .deb packages full continuous integration/deployment [11:08:38] ema: elukey: I 've added you to the labs packaging project [11:08:48] (which is how one tagging a package ends up magically upgrading and breaking prod) [11:09:02] hehehe [11:09:15] I am not that worried about the last step to be honest [11:09:22] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [11:09:35] if we could get to something that just has the debs somewhere I would be happy [11:09:36] I have run puppet on the two slaves that are building packages (integration-slave-jessie-1001 and integration-slave-1002). So the debian-glue jobs should be broken now [11:09:47] nice db1073 finally back in the mix!! [11:09:51] akosiaris: thanks! [11:10:11] akosiaris: another possibility would be to detect whether the terminal/shell is a tty and if so break to shell else skip [11:10:17] not sure if that is reliable though [11:10:42] akosiaris: thanks! 
[11:11:36] hashar: depends on whether jenkins allocates a pseudotty when initiating the tests [11:12:06] $ ([ -t 1 ] && echo terminal || echo nop) [11:12:08] terminal [11:12:08] $ ([ -t 1 ] && echo terminal || echo nop) | cat [11:12:10] nop [11:12:10] $ [11:12:24] I have no idea about Jenkins behavior :( [11:13:48] https://integration.wikimedia.org/ci/job/hashar-is-a-tty/1/console [11:14:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 (duration: 00m 40s) [11:14:17] has TERM=dumb and apparently stdout is not a terminal [11:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:53] so [ -t 1 ] might be good enough [11:15:12] there's also the tty command [11:15:16] under /usr/bin/tty [11:15:29] returns the tty and exits with 0 [11:15:37] or just says no tty and exits with 1 [11:15:38] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:14] yeah [11:16:20] though my manpage states: The -s option is deprecated in favor of the ``test -t 0'' [11:16:30] (full disclosure, I have been reading http://stackoverflow.com/questions/911168/how-to-detect-if-my-shell-script-is-running-through-a-pipe ) [11:16:49] heh, mine doesn't [11:16:56] -s, --silent, --quiet [11:16:56] print nothing, only return an exit status [11:17:01] debian stretch [11:17:06] and in bash, there is the variable ${-} which contains 'i' if it is interactive [11:17:38] man pages... Mine come from the Mac OS BSD one (date June 6 1993 eek) [11:18:09] anyway -t it is [11:18:11] testing [11:18:24] then the exact behavior is to be figured out [11:18:36] but I guess you want to break to shell by default when run interactively. Would be quite handy [11:18:42] IOPS in labs is astonishing [11:18:55] tmpfs man !! tmpfs!
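Editor's sketch of the three checks mentioned in the exchange above (`[ -t 1 ]`, `tty -s`, and the `i` flag in `$-`), assuming a POSIX shell. The function names are made up for illustration and do not come from any change referenced in this log.

```shell
#!/bin/sh
# Report whether stdout is attached to a terminal, using `test -t 1`.
stdout_kind() {
    if [ -t 1 ]; then
        echo terminal
    else
        echo nop
    fi
}

# Report whether stdin is a terminal, using tty(1) in silent mode.
# `tty -s` prints nothing and only sets the exit status.
stdin_kind() {
    if tty -s; then
        echo terminal
    else
        echo nop
    fi
}

# Report whether the current shell is interactive: "$-" contains "i" then.
shell_kind() {
    case "$-" in
        *i*) echo interactive ;;
        *)   echo non-interactive ;;
    esac
}

stdout_kind          # "terminal" only when run directly in a terminal
stdout_kind | cat    # "nop": stdout is a pipe here
```

Each check inspects a different thing (stdout, stdin, or the shell's own mode), so they can disagree when a script is launched from a terminal but with redirected output.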
[11:19:04] delete a 100kb file takes 4-5 seconds [11:19:26] maybe one of the compute node has some ongoing issue :/ [11:19:40] (03PS2) 10Urbanecm: Enable subpages in NS0 for arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327700 (https://phabricator.wikimedia.org/T151731) [11:19:45] https://grafana.wikimedia.org/dashboard/db/labvirt-node-disk-stats knows [11:21:09] what are read_await ? [11:21:10] seconds ? [11:21:40] no clue [11:21:45] bad dashboard [11:21:46] that would comes from diamond [11:21:55] https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/diskusage/diskusage.py#L240 [11:22:28] the util percentage though is pretty telling [11:22:42] at least 2 boxes in > 60% disk utilization [11:22:45] that explains a lot [11:23:10] yup [11:23:17] been going on for at least 7 days [11:24:38] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [11:24:43] and yeah dashboard is a bit crazy :D [11:24:53] the list of labvirt nodes is hardcoded [11:26:41] !log enabling remote IPMI where it's not enabled T150160 [11:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:53] T150160: Remote IPMI doesn't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160 [11:29:53] akosiaris: I am refactoring that dashboard [11:30:03] hashar ah, I 've done some changes already [11:30:05] might want to reload [11:30:20] namely fixed the await issue [11:30:29] aoeharh [11:30:30] I have saved already, I am backing off now [11:30:36] it's yours! 
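Editor's sketch for the read_await and utilization questions above. It assumes the Linux /proc/diskstats field layout (field 4 reads completed, field 7 milliseconds spent reading, field 13 milliseconds spent doing I/O) and derives an average read wait and a utilization percentage from two samples. The function name and the sample numbers are invented for illustration, and whether the Diamond collector reports the same units is not confirmed by this log.

```shell
#!/bin/sh
# Compute average read wait (ms per completed read) and utilization (%)
# for one device from two /proc/diskstats lines taken $3 milliseconds apart.
disk_stats_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk -v interval_ms="$3" '
        NR == 1 { reads = $4; read_ms = $7; busy_ms = $13 }
        NR == 2 {
            d_reads = $4 - reads
            await = (d_reads > 0) ? ($7 - read_ms) / d_reads : 0
            util = 100 * ($13 - busy_ms) / interval_ms
            printf "read_await_ms=%.2f util_pct=%.1f\n", await, util
        }'
}

# Two made-up samples for a device called sdb, taken 1000 ms apart.
before='8 16 sdb 1000 0 8000 5000 200 0 1600 900 0 4000 5900'
after='8 16 sdb 1100 0 8800 7500 220 0 1760 1000 0 4650 8500'
disk_stats_delta "$before" "$after" 1000
```

With these invented samples, 100 reads took 2500 ms in total and the device was busy for 650 of the 1000 ms, which is the kind of averaged figure a dashboard would plot.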
[11:30:55] ahh yeah the await is an average [11:31:30] we should probably migrate to json+git for most used templates [11:31:54] better to keep track of changes an collaborate on improvements [11:31:58] *and [11:32:13] if we have a staging grafana to easily prototype yeah [11:32:19] (03CR) 10Alexandros Kosiaris: [C: 032] swift: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327710 (owner: 10Dereckson) [11:32:23] you can stage on the same instance [11:32:25] (03PS2) 10Alexandros Kosiaris: swift: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327710 (owner: 10Dereckson) [11:32:27] with another name [11:32:28] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] swift: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327710 (owner: 10Dereckson) [11:32:31] then delete it [11:32:52] haven't thought about that :} sounds good [11:33:13] krinkle created nice documentation here: https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Save_dashboards_in_puppet [11:33:48] most dashboads are not worth it, but for some, we cannot not have it on git [11:33:48] hashar: just be careful with the JSON export from the interface. 
The JSON it generates funnily enough needs some tweaking to work [11:34:04] namely the sources [11:34:23] hashar: so, the [ -t 1 ] approach does not work [11:34:32] :-((( [11:34:43] seems like the hooks is called already in a pipe or something [11:34:44] guess the file descriptor is not passed around [11:34:48] probably in a subshell [11:34:50] yes [11:35:19] back to the ENV approach I 'd say [11:35:31] doing one final test with the tty command just to make sure first though [11:37:07] !log run apt-get autoremove across labs to cleanup old linux kernel images [11:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:11] !log abort apt-get autoremove very quickly, because I don't want to mess with grub menus on a friday [11:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:26] that never ends [11:44:38] the labvirt1010 -- labvirt1014 do not seem to report diamond stats :( [11:44:55] I sense a wrong regex somewhere such as labvirt100.* :D [11:50:06] yeah sites.pp shows they have different config. The 100X ones having an extra openstack::nova::partition{ '/dev/sdb': } [11:51:03] mystery solved [11:54:08] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:03] 06Operations, 10DBA: Create a full backup of all external storage records that would be easy to restore/setup a temporary delayed slave - https://phabricator.wikimedia.org/T153440#2880544 (10jcrespo) [11:57:43] 06Operations, 06Collaboration-Team-Triage, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#1499219 (10jcrespo) a:05jcrespo>03None Focusing on the blocker subtask. 
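Editor's sketch of the failure mode suspected at 11:44:55, namely a pattern such as `labvirt100.*` that silently excludes labvirt1010 to labvirt1014. It only illustrates how such a pattern behaves; the log itself attributes the difference to the node configuration in site.pp. The helper name `matches` and the wider pattern are assumptions made for the example.

```shell
#!/bin/sh
# Return success when host name $2 fully matches extended regex $1.
matches() {
    printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

for host in labvirt1001 labvirt1009 labvirt1010 labvirt1014; do
    if matches 'labvirt100.*' "$host"; then narrow=yes; else narrow=no; fi
    if matches 'labvirt10[01][0-9]' "$host"; then wide=yes; else wide=no; fi
    echo "$host narrow=$narrow wide=$wide"
done
```

The narrow pattern accepts only the 100x hosts, while the hypothetical wider one also covers the 101x range.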
[12:06:15] (03PS1) 10Alexandros Kosiaris: package_builder: if guard the C10shell hook [puppet] - 10https://gerrit.wikimedia.org/r/327720 [12:15:17] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/327720 (owner: 10Alexandros Kosiaris) [12:16:08] !log Stop replication db2033 for maintenance - T151552 [12:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:22] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [12:16:57] !log remove all WMF php5 packages from apt.wikimedia.org for precise-wikimedia. No longer required [12:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:01] (03PS2) 10Alexandros Kosiaris: package_builder: if guard the C10shell hook [puppet] - 10https://gerrit.wikimedia.org/r/327720 [12:21:07] (03PS4) 10Ema: varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) [12:23:08] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:26:34] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: if guard the C10shell hook [puppet] - 10https://gerrit.wikimedia.org/r/327720 (owner: 10Alexandros Kosiaris) [12:32:15] (03PS1) 10Mobrovac: MobileApps: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/327723 (https://phabricator.wikimedia.org/T144598) [12:37:18] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:38] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:56:58] (03CR) 10Ema: [C: 032] varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [12:57:06] (03PS5) 10Ema: varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) [12:57:11] (03CR) 10Ema: [V: 032 C: 032] varnishmedia: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327513 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [13:02:08] (03CR) 10Alexandros Kosiaris: [C: 031] MobileApps: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/327723 (https://phabricator.wikimedia.org/T144598) (owner: 10Mobrovac) [13:04:39] (03CR) 10Alexandros Kosiaris: [C: 032] MobileApps: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/327723 (https://phabricator.wikimedia.org/T144598) (owner: 10Mobrovac) [13:04:44] (03PS2) 10Alexandros Kosiaris: MobileApps: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/327723 (https://phabricator.wikimedia.org/T144598) (owner: 10Mobrovac) [13:04:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] MobileApps: Switch to Scap3 config deploys [puppet] - 10https://gerrit.wikimedia.org/r/327723 (https://phabricator.wikimedia.org/T144598) (owner: 10Mobrovac) [13:05:41] !log disable puppet for mobileapps scap3 migration on scb boxes [13:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:14] !log running puppet on scb200X boxes for mobileapps scap3 migration [13:06:18] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:38] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last 
run 47 seconds ago with 0 failures [13:07:48] akosiaris: let me know when i can deploy to codfw [13:09:30] mobrovac: now [13:12:38] !log mobrovac@tin Starting deploy [mobileapps/deploy@c67a5ef]: Switching MCS in codfw to Scap3 deploys T144598 [13:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:50] T144598: Enable Scap3 config deploys for MCS - https://phabricator.wikimedia.org/T144598 [13:13:41] !log mobrovac@tin Finished deploy [mobileapps/deploy@c67a5ef]: Switching MCS in codfw to Scap3 deploys T144598 (duration: 01m 03s) [13:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:10] akosiaris: tutto bene, we can go on to eqiad [13:19:21] akosiaris: ok, i [13:19:34] i'm enabling puppet in eqiad and running it [13:20:07] (03CR) 10Hashar: "recheck" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/326161 (https://phabricator.wikimedia.org/T152640) (owner: 10Filippo Giunchedi) [13:20:23] !log enabling and running puppet in eqiad scb [13:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:06] !log mobrovac@tin Starting deploy [mobileapps/deploy@c67a5ef]: Switching MCS in eqiad to Scap3 deploys T144598 [13:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:21] T144598: Enable Scap3 config deploys for MCS - https://phabricator.wikimedia.org/T144598 [13:21:40] !log mobrovac@tin Finished deploy [mobileapps/deploy@c67a5ef]: Switching MCS in eqiad to Scap3 deploys T144598 (duration: 00m 34s) [13:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:57] mobrovac: :-) [13:22:09] akosiaris: {{done}} [13:22:19] akosiaris: thnx for your help [13:22:23] (03PS2) 10Yuvipanda: labsdb: Make delete-dbusers work with new config format [puppet] - 10https://gerrit.wikimedia.org/r/327715 [13:22:25] (03PS4) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 
10https://gerrit.wikimedia.org/r/327157 [13:22:27] (03PS1) 10Yuvipanda: labsdb: Change unique index for account_host table [puppet] - 10https://gerrit.wikimedia.org/r/327727 [13:22:40] great [13:22:43] thanks as well [13:23:31] (03CR) 10jenkins-bot: [V: 04-1] [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [13:25:40] akosiaris: apparently the jenkins job does not copy type C hooks ( C10shell.wikimedia.org ) [13:25:50] ? [13:26:12] should it ? [13:26:18] copy ? .. I don't follow [13:27:28] so I though that your C10shell hook would cause the jenkins build to stall/fail somehow [13:28:38] I have rebuild a build that failed ( https://integration.wikimedia.org/ci/job/debian-glue-non-voting/457/console ) and the C10shell is not triggered [13:28:44] gotta dig in the console log [13:29:09] maybe the package got built but the job fails later on because of lintian [13:29:31] yeah that is [13:30:49] 06Operations, 10Ops-Access-Requests, 06Analytics-Kanban, 15User-Elukey: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2880870 (10elukey) [13:35:52] akosiaris: I think it will be fine. I have added the env variable to the job. Thank you! 
[13:36:21] hashar: thanks as well [13:37:13] (03PS1) 10Elukey: Add the new user fdans with basic Analytics group permissions [puppet] - 10https://gerrit.wikimedia.org/r/327730 (https://phabricator.wikimedia.org/T153303) [13:49:19] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2880915 (10hashar) I have updated the labs projects security rules to allow contint2001 to ssh to the labs instances [13:51:11] (03PS1) 10Ema: varnishrls: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327733 (https://phabricator.wikimedia.org/T151643) [13:52:52] (03PS2) 10Ema: varnishrls: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327733 (https://phabricator.wikimedia.org/T151643) [13:55:33] (03PS3) 10Ema: varnishrls: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/327733 (https://phabricator.wikimedia.org/T151643) [13:56:06] (03PS3) 10Yuvipanda: labsdb: Make delete-dbusers work with new config format [puppet] - 10https://gerrit.wikimedia.org/r/327715 [13:56:14] (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Make delete-dbusers work with new config format [puppet] - 10https://gerrit.wikimedia.org/r/327715 (owner: 10Yuvipanda) [13:57:11] (03PS2) 10Yuvipanda: labsdb: Change unique index for account_host table [puppet] - 10https://gerrit.wikimedia.org/r/327727 [13:57:18] (03CR) 10Yuvipanda: [V: 032 C: 032] labsdb: Change unique index for account_host table [puppet] - 10https://gerrit.wikimedia.org/r/327727 (owner: 10Yuvipanda) [14:00:51] (03PS1) 10Elukey: Apply the hhvm/apache prometheus exporter roles to the MW appservers [puppet] - 10https://gerrit.wikimedia.org/r/327734 (https://phabricator.wikimedia.org/T147423) [14:02:52] (03PS5) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [14:03:10] 
(03PS2) 10Elukey: Apply the hhvm/apache prometheus exporter roles to the MW appservers [puppet] - 10https://gerrit.wikimedia.org/r/327734 (https://phabricator.wikimedia.org/T147423) [14:04:43] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2880969 (10mobrovac) [14:11:20] (03PS6) 10Yuvipanda: [WIP] maintain-dbusers.py for maintaining labsdb users [puppet] - 10https://gerrit.wikimedia.org/r/327157 [14:21:53] (03CR) 10Elukey: [] "PCC looks good except a weird error for an api server:" [puppet] - 10https://gerrit.wikimedia.org/r/327734 (https://phabricator.wikimedia.org/T147423) (owner: 10Elukey) [14:22:29] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:43] ^ looks like downtime expired [14:24:08] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 76 failures. Last run 1 minute ago with 76 failures. Failed resources (up to 3 shown): Package[nagios-plugins-basic],Package[apt-transport-https],Package[tree],Package[ngrep] [14:29:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Apply the hhvm/apache prometheus exporter roles to the MW appservers [puppet] - 10https://gerrit.wikimedia.org/r/327734 (https://phabricator.wikimedia.org/T147423) (owner: 10Elukey) [14:30:50] !log disabling puppet on the eqiad appservers to rollout gradually the prometheus apache/hhvm exporters [14:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:41] (03PS3) 10Elukey: Apply the hhvm/apache prometheus exporter roles to the MW appservers [puppet] - 10https://gerrit.wikimedia.org/r/327734 (https://phabricator.wikimedia.org/T147423) [14:37:16] (03CR) 10Elukey: [C: 032] Apply the hhvm/apache prometheus exporter roles to the MW appservers [puppet] - 10https://gerrit.wikimedia.org/r/327734 (https://phabricator.wikimedia.org/T147423) (owner: 10Elukey) [14:39:22] proceeding with the codfw appservers 
[14:45:26] everything seems good [14:48:22] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2881052 (10mobrovac) [14:48:47] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2881067 (10mobrovac) [14:48:49] 06Operations, 10Mobile-Content-Service, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 3 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2772430 (10mobrovac) [14:50:49] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2881052 (10mobrovac) @dr0ptp4kt such a request needs manager approval. Please... [14:50:49] I am going to wait for all the codfw hosts to be completed and then I'll proceed with eqiad [14:52:08] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:55:48] Is eqiad the main wikimedia datacenter? [14:56:07] I noticed that it's mentioned more often than the other datacenters [14:56:28] <_joe_> Zppix: it's the active datacenter for MW [14:56:35] Ah [14:56:46] doesn't labs use it as well? [14:56:51] <_joe_> codfw is the "passive" one, the others are just caching/networking [14:56:57] <_joe_> yes, labs is in eqiad [14:57:09] I see, that's cool. 
[14:57:19] <_joe_> analytics too is eqiad-only [14:57:33] (Is very interested in WMF infrastructure) [14:57:34] <_joe_> but pretty much everything else is built both in eqiad and codfw [14:57:49] I think Gerrit has exploded [14:57:53] <_joe_> Zppix: I think this is all on wikitech, lemme find the link [14:58:03] hashar: tell me something new :P [14:58:07] gerrit sshd is fine [14:58:14] web server is broken [14:58:47] grrrit-wm: hasn't complained about gerrit (it's set to restart if anything) so it seems to be fine for me and it hashar [14:58:48] hm, only https is broken. http still serves redirects [14:59:03] Krenair: ssl issue? [14:59:22] Maybe the cert is on vacation again [14:59:31] integration.wikimedia.org is on the misc cache as well and works fine [14:59:32] more likely to be a simple apache in front of the real service that's still functioning, but the real service is having issues? [14:59:48] so I would say it is Gerrit / cobalt.wikimedia.org related [15:00:00] as well? [15:00:10] <_joe_> Krenair: https serves the actual application [15:00:10] gerrit isn't behind varnish hashar [15:00:12] yeah Gerrit is at 1500% CPU [15:00:16] <_joe_> so it's gerrit the app having an issue [15:00:25] Krenair: oh yeah my bad sorry [15:00:39] Krenair: try restarting the webservice for the gerrit [15:00:39] it has multiple threads at 100% cpu [15:00:58] <_joe_> hashar: are we interested in understanding why it's hanging? 
[15:01:04] <_joe_> if not, it's java, just restart it [15:01:29] show-queue has a *lot* [15:01:33] yeah would be nice to try to get a thread dump [15:01:34] _joe_: im curious considering i will need to know how grrrit-wm will be affected if at all [15:01:51] mostly 'index change of project mediawiki/core' [15:02:39] _joe_ /usr/lib/jvm/java-8-openjdk-amd64/bin/jstack -F 58754 [15:02:56] hashar: sudo -u gerrit_uid jstat -gcutil `pidof java` 1000 3 [15:03:18] <_joe_> not pidof java [15:03:24] Ok i dont speak gerrit can someone translate [15:03:37] it is GerritCodeReview [15:03:48] ? [15:03:52] <_joe_> hashar: are you doing that? [15:03:58] Zppix, let's not distract them from investigating the issue, hmm? [15:04:00] I don't have sudo [15:04:16] iirc Gerrit uses java8 [15:04:27] and the alternatives are not up-to-date / still point to java7 [15:04:38] Krenair: im trying to figure out if grrrit-wm will be affected to know if i will to go in and restart it or something [15:04:50] <_joe_> hashar: did you restart it by any chance? [15:05:05] so would be: sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jstat -gcutil `pidof GerritCodeReview` 1000 3 [15:05:07] we can work that out once gerrit is working again Zppix [15:05:09] <_joe_> because the cpu usage went down [15:05:12] <_joe_> it's ok now [15:05:24] I don't have any access beside my unprivileged shell access [15:05:32] <_joe_> hashar: ok, it self-healed [15:05:43] I am copy pasting dcausse command to the wikitech page [15:05:57] <_joe_> next time instead of pasting commands in IRC, guys, just ask me "take a thread dump please" [15:06:00] <_joe_> :) [15:06:11] <_joe_> I'll understand you're asking me to do it [15:07:45] Krenair: ack sorry for being disruptive [15:08:32] Zppix, okay. try posting a comment on a gerrit change to check the bot is ok? 
[15:08:57] Krenair: i will once gerrit is normal [15:09:04] !log killing excess salt minions on iridium.eqiad.wmnet (there are 3 running) [15:09:12] Cause in theory it should fix itself Krenair [15:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:51] _joe_: sorry :D [15:10:01] I thought gerrit had self-healed, Zppix? [15:10:26] I dont know ive not had time to look yet Krenair [15:13:01] !log prometheus apache and hhvm exporters running on the eqiad MW appservers [15:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:16] godog --^ \o/ [15:33:58] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#2881201 (10hashar) Jenkins just runs `tox` which has the env/commands to run defin... [15:42:28] (03PS1) 10Eevans: enable instance restbase1017-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327745 (https://phabricator.wikimedia.org/T151086) [15:43:17] (03CR) 10Eevans: [C: 031] "Ready!" [puppet] - 10https://gerrit.wikimedia.org/r/327745 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:50:41] urandom: --^ should be super easy [15:50:52] elukey: yup [15:51:10] urandom: merging :) [15:51:21] elukey: thanks! 
[15:51:25] (03CR) 10Elukey: [C: 032] enable instance restbase1017-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327745 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [15:51:52] (03PS7) 10Hashar: [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [15:51:54] (03PS1) 10Hashar: Support flake8 with python3 [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) [15:52:33] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [15:52:36] urandom: done! [15:52:48] (03CR) 10Hashar: [] "Rebased the change on top of:" [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [15:53:12] elukey: yup! [15:53:54] (03CR) 10Hashar: [] "Tested it locally by adding an invalid python file under modules/rolefiles/mariadb. the py27-flake8 env pass fine, the py3-flake8 only ch" [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) (owner: 10Hashar) [15:54:21] (03CR) 10Jcrespo: [C: 031] "I am ok with this solution, but I would like our software engineer to aprove it." [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) (owner: 10Hashar) [15:54:47] Krenair grrrit-wm can self heal, sometimes it wont work so you can run grrrit-wm: ?restart (remove ? i just doint want the bot restarting now as it's working.) which should just restart the ssh connection. 
[15:54:50] 06Operations, 10Ops-Access-Requests, 10Reading-Web-Trending-Service, 06Services (doing), 15User-mobrovac: Allow @Jdlrobson and @bearND to deploy and manage the trending edits service - https://phabricator.wikimedia.org/T153458#2881295 (10dr0ptp4kt) Approved [15:54:58] if that's what your talking about [15:55:03] (03CR) 10Hashar: [] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [15:55:19] Also gerrit cpu is mostly 100% every couple of hours. [15:56:14] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#2881303 (10hashar) a:03hashar https://gerrit.wikimedia.org... [15:58:33] 06Operations, 10Ops-Access-Requests, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Requesting access to Analytics production shell for Francisco Dans - https://phabricator.wikimedia.org/T153303#2875954 (10Cmjohnson) FYI: This will need Ops approval since it grants SUDO and will be discussed du... [15:58:46] (03CR) 10Jcrespo: [] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [16:00:31] (03CR) 10Hashar: [] "I am invoking the pywikibot community ( XZise JayVdb Xqt ). The pywikibot/core repository has some similar issue, namely using flake8 in b" [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) (owner: 10Hashar) [16:00:51] jynus: I have added pywikibot developers to my hacky tox.ini change [16:01:11] jynus: the pywikibot framework had more or less the same issue and they solved it via a similar hack [16:01:29] it is not ideal though. 
Would be nice to have flake8 to magically switch between python2.7 and python3 parser based on the shebang [16:01:52] or [16:02:13] that python3 version would parser lower versions correctly [16:02:15] PROBLEM - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.132 and port 9042: Connection refused [16:02:28] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881321 (10Paladox) [16:02:36] python2.7 does for 3 for most cases [16:03:05] but I assume nothing related to introspection [16:03:40] welcome to CI word where you get hhvm vs zend 55 vs zend 5.3 , nodejs 0.10 / 4.x / 6.x etc :( [16:03:42] thank you for the help [16:03:50] thx for the task! [16:03:51] :D [16:03:53] I didn't add CI because I though [16:04:04] thought it was a puppet-only issue [16:04:05] it really ANNOYS me when folks get blocked on a test failure [16:04:35] the funny thing is that other .py files have a python3 shebang but seems to lint just fine [16:04:51] yes, it is what I said before [16:05:06] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881333 (10Paladox) [16:05:15] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:05:18] I think it works for the most part inter-versions, except object-related stuff [16:05:51] there was lately some discussion about python linting [16:05:57] pywikibot folks would know for sure [16:06:00] hashar, let me show you on ticket [16:06:06] *one [16:06:09] maybe flake8 under python3 manage to find issues that the python2.7 one would not find [16:06:14] 06Operations, 10Gerrit, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2724182 (10Paladox) [16:08:02] hashar, oh, you are already there [16:08:06] 06Operations, 10Gerrit, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881343 (10Paladox) [16:08:12] T144169 [16:08:12] T144169: Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 [16:08:13] (03PS1) 10Mobrovac: Trending Edits: Add the admin group (and add it to SCB) [puppet] - 10https://gerrit.wikimedia.org/r/327754 (https://phabricator.wikimedia.org/T153458) [16:09:27] (03PS1) 10Mobrovac: Add jdlrobson to the deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/327755 (https://phabricator.wikimedia.org/T153458) [16:09:34] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2881348 (10bd808) >>! In T147718#2879900, @Joe wrote: > Which makes me think again that labs VMs should have their own puppet repository distinct from t... 
[16:10:41] 06Operations, 10Gerrit, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881366 (10hashar) For the December 16th issue, there were multiple threads at 100% CPU and HTOP reported ~ 150... [16:11:01] jynus: I am everywhere? :} [16:11:19] jynus: so that one is for python, and there is a similar task for shell script [16:12:56] (03Draft1) 10Paladox: Gerrit: Remove java 7 package [puppet] - 10https://gerrit.wikimedia.org/r/327756 [16:12:58] (03Draft2) 10Paladox: Gerrit: Remove java 7 package [puppet] - 10https://gerrit.wikimedia.org/r/327756 [16:13:15] (03CR) 10Hashar: [] "Another fix would be to have python3 only files to use a '.py3' file extension. This way the python 2.7 flake8 will not find them and we c" [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) (owner: 10Hashar) [16:13:18] the blocker is actually not that ticket you kmention [16:13:38] but that fixing it will generate other, many existing errors [16:13:59] jouncebot: next [16:13:59] In 391 hour(s) and 46 minute(s): HOLIDAY (observed) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170102T0000) [16:14:11] I'm going to deploy some Special:Watchlist fixes [16:14:57] (03PS2) 10Andrew Bogott: Keystone hooks: monkeypatch keystone to change project id to project name [puppet] - 10https://gerrit.wikimedia.org/r/327664 (https://phabricator.wikimedia.org/T150091) [16:15:06] 06Operations, 10Gerrit, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2881399 (10Paladox) @hashar that sounds like the prevous problems described on here, i noticed gerrit's cpu get... [16:16:39] hashar the gerrit problem sounds like gc pausing gerrit. [16:16:55] gc as in java's gc not gc in gerrit which we disabled. 
[16:16:59] (03CR) 10Andrew Bogott: [C: 032] Keystone hooks: monkeypatch keystone to change project id to project name [puppet] - 10https://gerrit.wikimedia.org/r/327664 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [16:17:52] !log mobrovac@tin Starting deploy [citoid/deploy@e7e3a42]: (no message) [16:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:45] 06Operations, 10DNS, 10Traffic, 07Beta-Cluster-reproducible: Ferm/DNS library weirdness on deployment-mediawiki boxes - https://phabricator.wikimedia.org/T153468#2881417 (10Krenair) ```root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolve... [16:19:57] !log mobrovac@tin Finished deploy [citoid/deploy@e7e3a42]: (no message) (duration: 02m 05s) [16:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:59] !log argon, chlorine, cp3012, elastic2006: upgraded apt (these were the few exceptions somehow not covered by 'dc-all' already) [16:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:21] (03CR) 10Volans: [C: 04-1] "I'm not entering in the merit of the code, I'll do a code review in the next days (sorry for the delay) but I want to highlight a major po" [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [16:28:00] (03CR) 10Volans: [C: 04-1] "I agree with the problem and to find a proper solution, but given also my comment in https://gerrit.wikimedia.org/r/#/c/326155 I don't thi" [puppet] - 10https://gerrit.wikimedia.org/r/327746 (https://phabricator.wikimedia.org/T152950) (owner: 10Hashar) [16:31:03] (03CR) 10Chad: [C: 031] "Will need manual cleanup afterwords, but lgtm." 
[puppet] - 10https://gerrit.wikimedia.org/r/327756 (owner: 10Paladox) [16:32:23] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2881476 (10Volans) @Dzahn Thanks for the work! Unfortunately there are a bunch of hosts on which the freeipmi tools are not yet installed and some where although the remote IPMI is alrea... [16:32:31] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~17% of the fleet - https://phabricator.wikimedia.org/T150160#2881477 (10Volans) a:05Volans>03None [16:34:15] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:35:36] !log legoktm@tin Synchronized php-1.29.0-wmf.6/resources/src/mediawiki.special/mediawiki.special.watchlist.js: Revert confirmation button and apply other fixes - T153389 T153438 (duration: 00m 39s) [16:35:39] !log tin, auth1001, auth2001 - upgrade php5 packages [16:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:49] T153438: Marking Watchlist visited requires confirmation - https://phabricator.wikimedia.org/T153438 [16:35:50] T153389: "mark all as visited" appears temporarily grayed/disabled after a refresh following a watchlist reset - https://phabricator.wikimedia.org/T153389 [16:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:11] !log cleaning dangling elasticsearch indices on codfw [16:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:34] !log legoktm@tin Synchronized php-1.29.0-wmf.6/resources/Resources.php: Revert confirmation button and apply other fixes - T153389 T153438 (duration: 00m 39s) [16:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:59] !log krypton, mira, wasat, phab2001, bohrium, notebook[12]001 - upgrade php5 packages [16:43:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:52] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2881502 (10hashar) status: got distracted with other things today. Have to follow up on Daniel follow up patches and refacto... [16:46:11] !log silver - upgraded php5-fss | copper - upgraded php5-dev | einsteinium/tegmen - upgraded php5-gd, php5* [16:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:54] !log repooling elastic2006.codfw.wmnet [16:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:30] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2006.codfw.wmnet [16:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:22] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#2771280 (10Dzahn) And exactly this almost happened. i saw "piwik" as "waiting for upgrade" in servermon and only in the last minute stopped myself when "The following packages will be DOWNG... [16:48:45] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.132 and port 9042: Connection refused eevans Bootstrapping [16:56:14] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2881551 (10jcrespo) 05Open>03Resolved a:05jcrespo>03Cmjohnson The server is now back fully into production after being cloned from db1052 as an extra api node (which... 
[17:00:32] 06Operations, 10DNS, 10Traffic, 07Beta-Cluster-reproducible: Ferm/DNS library weirdness on deployment-mediawiki boxes - https://phabricator.wikimedia.org/T153468#2881588 (10Krenair) Ferm will shut up if you make this change: ```diff --git a/src/ferm b/src/ferm index 67f5c89..5874642 100755 --- a/src/ferm +... [17:03:45] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2881614 (10Cmjohnson) Hi Eric A couple of other things I verified your L3 at https://phabricator.wikimedia.org/legalpad/signatures/3/?after=50 I will need @gwicke to approve... [17:04:18] !log silver (wikitech) - upgrade openssh, openssl [17:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:00] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494#2881618 (10fgiunchedi) >>! In T148494#2880204, @Volans wrote: > @ArielGlenn it's surely depends on the specific cases, but I thi... [17:05:31] !log silver (wikitech) - upgrade apache2, apparmor, python [17:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:52] urandom: you need to run hive queries against webrequest data right? [17:07:10] elukey: yeah [17:07:37] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2878663 (10elukey) analytics-privatedata-users will be fine Chris! [17:07:40] :) [17:09:23] (03PS1) 10ArielGlenn: when rerunning a checkpoint file, use only the relevant prefetch file(s) [dumps] - 10https://gerrit.wikimedia.org/r/327764 [17:09:45] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. 
Failed resources (up to 3 shown): Package[tzdata],Service[zotero],Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members] [17:10:15] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:11:27] (03PS2) 10ArielGlenn: when rerunning a checkpoint file, use only the relevant prefetch file(s) [dumps] - 10https://gerrit.wikimedia.org/r/327764 [17:12:11] !log mw-canary - upgraded php5 packages [17:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:03] (03Draft1) 10Paladox: Gerrit: Enable g1 gc as we now use java 8 [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) [17:14:07] (03Draft2) 10Paladox: Gerrit: Enable g1 gc as we now use java 8 [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) [17:16:36] !log unbanned elastic2006 to complete repooling [17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:13] (03PS3) 10ArielGlenn: when rerunning a checkpoint file, use only the relevant prefetch file(s) [dumps] - 10https://gerrit.wikimedia.org/r/327764 [17:31:12] (03CR) 10ArielGlenn: [C: 032] when rerunning a checkpoint file, use only the relevant prefetch file(s) [dumps] - 10https://gerrit.wikimedia.org/r/327764 (owner: 10ArielGlenn) [17:35:55] !log restart elastic2006.codfw.wmnet [17:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:25] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2881747 (10RobH) Well, if the proposed replacements to stat100[23] will potentially have GPU added at the time of order, can we demonstrate that we kn... 
[17:37:04] 06Operations, 10media-storage: separate swift error logging from request logging - https://phabricator.wikimedia.org/T84348#2881751 (10fgiunchedi) [17:37:21] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Estimate hardware requirements for WDQS upgrade - https://phabricator.wikimedia.org/T148747#2881753 (10RobH) 05stalled>03Resolved As both wdqs expansion systems have been ordered and have setup tasks, I'm resolving... [17:37:46] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:37:49] 06Operations, 10media-storage: separate swift error logging from request logging - https://phabricator.wikimedia.org/T84348#926100 (10fgiunchedi) 05Open>03Resolved This was completed in Iec42489838e [17:38:15] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:39:44] 06Operations, 10Monitoring: Setup HTCP monitoring alerts - https://phabricator.wikimedia.org/T82176#2881761 (10fgiunchedi) [17:45:20] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#2881774 (10fgiunchedi) [17:46:29] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup prometheus200[3-4] - https://phabricator.wikimedia.org/T151338#2881779 (10Papaul) Hello Papaul, It looks like there was a back order on the iDRAC port so our dispatching team had to reissue our dispatch using an alternate part from a different wareho... [17:53:03] 06Operations, 10hardware-requests: codfw: (2) servers request for ORES redis databases - https://phabricator.wikimedia.org/T142190#2881797 (10RobH) We don't really have any in warranty spares in codfw that come near this low a specification. Our current spare hardware in codfw: Two systems with: Dual Intel®... 
[18:01:31] 06Operations, 07Puppet, 06Analytics-Kanban, 13Patch-For-Review: Refactor eventlogging.pp role into multiple files (and maybe get rid of inheritance) - https://phabricator.wikimedia.org/T152621#2881824 (10Nuria) 05Open>03Resolved [18:01:32] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2881825 (10Nuria) [18:01:58] Attention: At 7PM CST UTC-6, grrrit-wm will be going under maintenance and will be down for at most 30 minutes if you have any questions please direct them to me, for more info please go to E424 on Phab! [18:05:43] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2881844 (10fgiunchedi) [18:07:27] 06Operations, 07Puppet, 10Deployment-Systems, 06Release-Engineering-Team, 05Mediawiki SWAT Deployments: mwdebug1002 should have PHP extensions - https://phabricator.wikimedia.org/T153316#2876393 (10greg) Let's do that. It makes sense. [18:08:00] Zppix: thanks for the heads up, could you use !log in front of that [18:08:18] Sure [18:08:23] is it about fixes for the new gerrit version? [18:08:25] !log Attention: At 7PM CST UTC-6, grrrit-wm will be going under maintenance and will be down for at most 30 minutes if you have any questions please direct them to me, for more info please go to E424 on Phab! [18:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:40] No mutante nickserv stuff [18:09:12] oh? to make grrrit-wm identify with nickserv? [18:09:41] We're grouping the test bot to its acct, we don't have any changes for new gerrit ver [18:10:05] That and I'm going to use the time to prune anything that needs it [18:10:49] time zones are hard... [18:10:50] ? 
test bot on same account [18:10:59] not sure i see why yet [18:11:04] but *nod* [18:11:13] apergos: right [18:11:38] apergos: i put utc-6 on it [18:11:48] I know [18:11:55] then I have to remember what utc-thingie I am [18:12:02] then I have to do actual arithmetic [18:12:04] what timezone are you [18:12:11] * paladox is utc + 0 [18:12:12] eet [18:12:22] Let me look [18:12:52] if you put the utc time I can get from there close enough :-D [18:12:56] Utc +2 [18:12:59] I'm usually only an hour off then :-D [18:13:11] Eet is utc +2 [18:14:42] At 4am eet is when [18:15:06] apergos: ^ [18:15:07] so I'm sound asleep then [18:15:09] install eggdrop, add all regulars as users, add timezone field so bot knows who lives where, when making an announcement tell each user in PM "That's in X hours".. :p /me shuts up [18:15:13] 2 am utc, that's the upshot [18:15:15] gotcha [18:15:32] nope, one announcement in utc is good enough [18:15:47] I think we're used to converting within an hour of margin of error for that :-P [18:15:53] (given daylight savings or not and etc) [18:16:04] I did the time i did on perpose [18:16:16] ^ i bery gud erglsh [18:16:17] anyways I will butt out now and let people do actual work [18:16:23] is more interested in what the actual fix is for and why the test bot should not be a separate account [18:16:23] heh [18:16:58] mutante: so i dont waste group contacts time assigning a cloak [18:20:06] ah [18:39:11] 06Operations, 05Prometheus-metrics-monitoring: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#2881894 (10fgiunchedi) [18:40:42] (03CR) 10Tim Landscheidt: [C: 04-1] "https://www.debian.org/doc/packaging-manuals/python-policy/ch-programs.html: "As noted in Interpreter Location, Section 2.4.2, the form #!" [puppet] - 10https://gerrit.wikimedia.org/r/327709 (owner: 10Dereckson) [18:42:57] (03CR) 10Dereckson: [] "The goal is to have a repository as independant to a specific distribution or OS as possible. 
Alternatively, the goal of Debian policies i" [puppet] - 10https://gerrit.wikimedia.org/r/327709 (owner: 10Dereckson) [18:49:18] !log mobrovac@tin Starting deploy [citoid/deploy@d05c8bb]: (no message) [18:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:57] !log mobrovac@tin Finished deploy [citoid/deploy@d05c8bb]: (no message) (duration: 00m 39s) [18:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:27] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494837 (10fgiunchedi) Just noticed today that posts on blog.wikimedia.org try to load `event.gif` beacon from `bits`: ```lang=js var beacon = document.createElement( 'img' ); beacon... [19:01:54] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002.eqiad.wmnet for eevans - https://phabricator.wikimedia.org/T153375#2881985 (10GWicke) Approved. [19:06:29] 06Operations, 10Traffic, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2882009 (10fgiunchedi) >>! In T107430#2881955, @fgiunchedi wrote: > I'm not sure how to change that and how it is set from though @Krenair pointed out this is from the wp theme: https:/... [19:13:35] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:55] (03CR) 10Dzahn: [] contint: combine contint1001/2001 in a single node regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [19:17:35] (03PS2) 10Dzahn: contint: combine contint1001/2001 in a single node regex [puppet] - 10https://gerrit.wikimedia.org/r/327691 (https://phabricator.wikimedia.org/T150771) [19:21:12] !log planet[12]001, rutherfordium (people.wm.org), ununpentium (RT) - upgrade php5-cli/-common, libapache-mod-php5,.. 
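The timezone arithmetic discussed between 18:10 and 18:16 can be checked mechanically. The sketch below assumes the announced maintenance time (7PM CST, UTC-6) falls on the evening of the day this log covers, 2016-12-16, and uses fixed offsets rather than a timezone database. The names `CST`, `EET`, and `announced` are illustrative only. Under those assumptions the result is 01:00 UTC and 03:00 EET on 2016-12-17, an hour earlier than the figures quoted in the channel, so the announcement itself should be treated as authoritative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as stated in the chat: CST is UTC-6, EET is UTC+2 (winter time).
CST = timezone(timedelta(hours=-6), "CST")
EET = timezone(timedelta(hours=2), "EET")

# Assumption: the maintenance is on the evening of the day this log covers.
announced = datetime(2016, 12, 16, 19, 0, tzinfo=CST)

# Convert the same instant into UTC and into EET.
in_utc = announced.astimezone(timezone.utc)
in_eet = announced.astimezone(EET)

print(in_utc.strftime("%Y-%m-%d %H:%M %Z"))
print(in_eet.strftime("%Y-%m-%d %H:%M %Z"))
```

Stating the UTC time once in an announcement, as suggested at 18:15:32, avoids each reader repeating this conversion.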
[19:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:31] Anyone around that’s interested in some ‘commentary’ on the TimedMediaHandler drama? [19:22:37] https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [19:22:49] Admire that graph… [19:23:02] There is context, if anyone is interested. [19:24:50] !log contint1001, tungsten, hafnium - upgrade php5-cli/-common, libapache-mod-php5,.. [19:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:29] (03PS2) 10Cmjohnson: Adding prometheus1003-4 to netboot.cfg file T152504 [puppet] - 10https://gerrit.wikimedia.org/r/327677 [19:27:30] !log rolling out php upgrades on mw-codfw [19:27:33] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding prometheus1003-4 to netboot.cfg file T152504 [puppet] - 10https://gerrit.wikimedia.org/r/327677 (owner: 10Cmjohnson) [19:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:15] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:37:22] (03PS1) 10Andrew Bogott: Add export OS_INTERFACE=public to observerenv [puppet] - 10https://gerrit.wikimedia.org/r/327787 (https://phabricator.wikimedia.org/T150092) [19:39:45] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:41:13] (03Abandoned) 10Paladox: Gerrit: Install filebeat on gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [19:42:35] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:54:00] Hi does anyone know how i can create a patch to add gerrit to logstash please? 
[19:54:06] input is log4j [19:54:29] and output is elasticsearch but we want to customise the index to have the name gerrit in it [19:54:31] please [19:54:37] https://phabricator.wikimedia.org/T141324 [19:54:43] I have tested it locally [19:54:57] and works but now i want to create a patch to add it to prod gerrit [19:57:48] bd808 ^^? [19:58:35] !log carbon (install) apt-get upgrade [19:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:53] (03PS1) 10Urbanecm: Add new page protection level on etwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327789 (https://phabricator.wikimedia.org/T153465) [20:02:40] !log iridium (phabricator) - upgrade openssl, openssh-server, sftp, python2.7 [20:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:15] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:05:28] !log helium (backups) - upgrade apache2, openssl, python, dpkg .. [20:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:08] https://phabricator.wikimedia.org/T153488 BTW, if anyone cares. 
[20:07:45] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [20:08:15] PROBLEM - DPKG on helium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:09:15] RECOVERY - DPKG on helium is OK: All packages OK [20:09:27] !log stat1002 - install various package upgrades [20:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:46] PROBLEM - DPKG on stat1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:14:17] !log stat1002 - /etc/sudoers is puppetized but package upgrades of sudo want to override it and suggest to put local modifications in /etc/sudoers.d/, keeping installed version [20:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:45] RECOVERY - DPKG on stat1002 is OK: All packages OK [20:17:53] !log silver (wikitech) - upgraded openssh-sftp-server, login, firejail [20:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:33] paladox: do you really need a different index to log to? generally we log to the same indices (logstash-2016.12.14 or whatever) and use type's to distinguish where it came from [20:23:44] Oh [20:24:02] nope, we can log to logstash-**** [20:24:31] paladox: not sure if you should be writing direct, or writing to something that logstash reads to though. bd808 might have some better thoughts on that [20:24:41] Oh ok [20:29:11] 06Operations, 10Analytics-Cluster: stat1004 - sync snakebite version with repo - https://phabricator.wikimedia.org/T153493#2882338 (10Dzahn) [20:30:51] (03PS5) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 [20:30:59] (03PS6) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 [20:31:10] ebernhardson ostriches something like https://gerrit.wikimedia.org/r/#/c/326177/ ? 
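The suggestion at 20:23:33, to log to the shared daily indices and distinguish sources with a type field rather than a per-service index, can be sketched roughly as follows. This is only an illustration of the naming idea, not the actual Logstash or Gerrit configuration: the function names, the event layout, and the sample message are assumptions, and only the `logstash-2016.12.14` index pattern and the use of a type come from the conversation.

```python
from datetime import datetime, timezone


def daily_index(when, prefix="logstash"):
    """Return the shared per-day index name, e.g. logstash-2016.12.14."""
    return "{}-{:%Y.%m.%d}".format(prefix, when)


def tag_event(message, source_type, when):
    """Build an event whose 'type' field records where it came from."""
    return {
        "@timestamp": when.isoformat(),
        "type": source_type,
        "message": message,
    }


when = datetime(2016, 12, 14, 20, 23, 33, tzinfo=timezone.utc)
event = tag_event("example log line", "gerrit", when)

# Every source writes to the same daily index; queries filter on 'type'.
print(daily_index(when), event["type"])
```

With this layout a search for Gerrit events filters on `type` within the common index, so no Gerrit-specific index has to be created or maintained.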
[20:31:45] !log eventlog2001 - upgraded scap | bohrium - upgraded salt | multatuli - upgraded snimpy (all one-offs from servermon list) [20:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:27] paladox: what are you trying to do there? Send the log events to elasticsearch by embedding a logstash runtime in gerrit? [20:32:57] Yep, i carn't seem to find how it is done for the other things that do it with logstash [20:33:13] well you could ask [20:33:21] or even look around a bit [20:33:26] Oh, i was looking [20:33:28] on github [20:33:34] https://github.com/wikimedia/operations-puppet/search?p=4&q=logstash&type=Code&utf8=%E2%9C%93 [20:33:38] what you have rigged up there is really weird [20:33:47] oh [20:34:34] you have log4j setup to send to logstash (which makes sense) and then a logstash instance embedded in your jetty server that reads from a socket and writes to elasticsearch directly [20:34:48] the second part looks to be completely useless [20:35:21] And I am really still not sure why you are trying to use the beta cluster logstash [20:35:21] Oh. So i doint need logstash installed on gerrit prod? [20:35:28] Im not [20:35:34] I doint know what to change that too [20:35:40] log4j.appender.tcp.RemoteHost=deployment-logstash2.deployment-prep.eqiad.wmflabs [20:35:41] too = to [20:36:13] Im not = im not using that localy and doint want to do that in my change but i doint know what logstash url to use. [20:36:25] Hey, Phab question…. [20:36:50] paladox: it might help looking in modules/elasticsearch for how gehel setup the pipeline from the prod search cluster to our centralized logging via log4j. 
I'm not super familiar with it though [20:36:53] this https://wikitech.wikimedia.org/wiki/Logstash#Production_Logstash shows you can use many logstash servers [20:36:53] oh [20:36:54] I can’t figure out how (maybe it takes perms I do not have) to set something as a ‘blocking task( [20:36:57] thanks [20:37:35] Revent: a "child" task would be blocking the parent [20:37:38] https://phabricator.wikimedia.org/T153488 <- the things I set as ‘subtasks’ (and any other server-side video uploads) should really be blocked by this. [20:37:39] bd808: Because beta's logstash is a perfectly fine place to use for a beta gerrit [20:37:50] bd808: Oh, lol, wrong parity. [20:37:57] We shouldn't have to setup a whole ELK cluster just for gerrit logging in labs [20:38:11] http://gerrit-log.git.wmflabs.org/ [20:38:17] ostriches: sure. but that patch is for any gerrit deploy isn't it? [20:38:44] Sure, yes, it needs to be configurable by environment :) [20:38:48] according to wikitech it says this [20:38:49] Hostslogstash100[1-6] servers in Eqiad. [20:38:55] which one do i choose? [20:39:14] logstash1002 would probably work paladox [20:39:18] Ok thanks [20:39:24] logstash1002.wikimedia.org? [20:39:52] logstash1002.eqiad.wmnet [20:40:16] thanks [20:40:35] should i remove installing logstash package to on gerrit? [20:41:06] (03PS7) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 [20:42:14] BTW, anyone with some degree of clue about TimedMediaHandler, actually ‘reading’ my comments there would be helpful, tho from what I have been told it’s Brion’s baby and he’s been intending to rewrite it for years. [20:42:48] But a coder type might be able to sort my comments into fixable bugs. [20:43:00] Revent hi, TimedMediaHandler is currently going through replacing it's player with videojs replacing kaultra player [20:43:46] paladox: I’m not sure that’s relevant, this is all more about how it manages itself I think. 
[20:43:58] Oh [20:44:03] ok sorry [20:44:10] paladox: https://phabricator.wikimedia.org/T153488 [20:44:24] (the comment with all the bullet points) [20:44:25] oh [20:45:15] You could try raising the priority of the task if you think it needs attention. [20:46:06] paladox: It’s not massive drama, other than ‘please stop uploading more huge shit for now’, and there is a task about adding more capacity already. [20:46:56] oh [20:47:31] But the system obviously needs work… most of the bugs I listed are only apparent because the system got massively overloaded, and would be avoided just by making the ‘starting tasks’ logic sane enough to avoid overloading the servers. [20:48:12] to rant, YOU CAN’T MULTITASK VIDEO TRANSCODING! :P [20:48:15] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:48:35] oh [20:49:03] I've set the log4j logstash domain to logstash1002.eqiad.wmnet [20:49:31] * paladox is looking on how to setup logstash log4j for gerrit in puppet [20:55:38] (03CR) 10Tim Landscheidt: [C: 04-1] "(Who is "we"?) There is no goal to be OS-agnostic as far as I'm aware. These scripts are run only on Ubuntu and Debian, and they are hig" [puppet] - 10https://gerrit.wikimedia.org/r/327709 (owner: 10Dereckson) [20:58:25] I mean, I guess you ‘can’ multitask video transcodes, the way the system works now proves that, but it is remarkably unhelpful. [21:06:35] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:11:13] BTW, if any ops-type has any suggestions about dealing with this transcoder drama, I would greatly appreciate it. [21:12:04] I’m basically trying to use a bug to curcumvent the effects of another bug. :/ [21:13:09] (03CR) 10Paladox: [] "@bd808 I'm not sure how to setup this as we have logstash1002.eqiad.wmnet set in log4j, should I change it to localhost?" 
[puppet] - 10https://gerrit.wikimedia.org/r/326177 (owner: 10Paladox) [21:13:26] (because the other option is to let the servers waste electricity for weeks, and leave them all in the ‘failed’ queue. [21:17:15] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [21:29:06] it is already setup here https://github.com/wikimedia/operations-puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/logstash/manifests/input/log4j.pp and https://github.com/wikimedia/operations-puppet/blob/918328573c6f9111cd9afb89c8f375310bd026da/modules/role/manifests/logstash/collector.pp#L69 [21:29:07] it seems [21:29:23] (03PS8) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 [21:29:55] (03CR) 10Paladox: [] "log4j for logstash is setup here" [puppet] - 10https://gerrit.wikimedia.org/r/326177 (owner: 10Paladox) [21:30:58] (03PS9) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) [21:33:15] (03PS1) 10Eevans: enable instance restbase1018-a.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327847 (https://phabricator.wikimedia.org/T151086) [21:33:39] (03CR) 10Eevans: [C: 04-1] "Not yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/327847 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [21:33:50] (03CR) 10Chad: [C: 032] Support python 2/3 octals, not just python2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327383 (owner: 10Chad) [21:34:33] (03Merged) 10jenkins-bot: Support python 2/3 octals, not just python2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327383 (owner: 10Chad) [21:34:35] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:34:38] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2882560 (10Tgr) [21:35:40] !log labmon1001 - upgrade apache, gnupg, host, openssh-*, openssl [21:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:54] and since this is now Friday afternoon, i'll stop with the upgrading [21:36:15] What is it illegal to upgrade on fridays lol [21:36:25] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89774.81 seconds [21:36:56] It's almost early morning saturday here :) [21:37:00] $dayofweek is a factor, yea [21:37:25] mutante there seems to be 4 spikes on the cpu (cobalt) https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [21:37:42] mutante: one of the ten commandments of wikimedia thou shalt not upgrade on thy Fridays [21:38:54] out of respect for everyone's Saturday [21:39:10] lol [21:39:13] paladox: uhm.. so where's the log we were supposed to check when it happens [21:39:47] /var/lib/gerrit2/review_site/logs/* maybe it's in /var/lib/gerrit2/review_site/logs/error_log [21:39:58] thanks [21:40:03] your welcome [21:40:29] wasnt there an extra file for just gc ? 
[21:40:34] custom log [21:42:02] maybe there is some logs in /srv/gerrit/jvm/ [21:42:05] yeh [21:42:10] mutante ^^ [21:42:12] paladox: the last error in error_log is from 22 hours ago or so [21:42:17] Oh [21:42:19] doesnt match the spikes [21:42:24] mutante: gerrit kubectl? [21:42:25] /srv/gerrit/jvm/ [21:42:28] Thats ^^ gc [21:42:45] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [21:43:14] hmm, that seems new [21:43:17] "admin tool" part [21:43:25] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 4.808 second response time [21:44:03] paladox: ah, took me a second , it's almost that, /srv/gerrit/jvmlogs/ [21:44:17] Oh yeh my mistake [21:44:54] (03CR) 10BryanDavis: [] [WIP] maintain-dbusers.py for maintaining labsdb users (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327157 (owner: 10Yuvipanda) [21:44:55] mutante: admin tool isnt new [21:44:58] Maybe it could be something else on cobalt running it? [21:45:11] Zppix: i mean the part that icinga says this [21:45:36] It usually did it in labs wonder why it sent that here [21:47:50] 06Operations, 10Analytics, 10EventBus, 06Services (done), 15User-mobrovac: EventBus HTTP proxy service's syslog entries should be readable - https://phabricator.wikimedia.org/T153028#2882574 (10mobrovac) 05Open>03Resolved The imminent problem has been resolved, let's track log files consolidation in... [21:48:15] RECOVERY - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.32.132 port 9042 [21:48:30] Now thats new ^ [21:49:44] 06Operations, 10Gerrit, 13Patch-For-Review: Gerrit: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 / 30/11/2016 / 16/12/2016 - https://phabricator.wikimedia.org/T148478#2882577 (10Dzahn) @paladox pointed out these spikes today {F5076613} Here is the newest of the custom gc log f... 
[21:50:00] paladox: https://phabricator.wikimedia.org/T148478#2882577 [21:50:43] Zppix: that's new as in "just installed" , yes [21:50:55] thanks [21:50:56] Wow [21:53:11] (03CR) 10Eevans: [C: 031] "Ready!" [puppet] - 10https://gerrit.wikimedia.org/r/327847 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [21:53:48] (03CR) 10Dzahn: [C: 032] enable instance restbase1018-a.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/327847 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [21:54:01] Zppix: ^ that's gonna be the next one [21:54:11] mutante http://stackoverflow.com/questions/27003451/jvm-consumes-100-cpu-with-a-lot-of-gc [21:54:24] apparently if we used all the ram it can trigger java's gc more easily. [21:54:38] + the high cpu i find is when the ram is full [21:57:34] (03PS1) 10BryanDavis: wikitech: Add oathauth group with oathauth-api-all right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/327852 (https://phabricator.wikimedia.org/T153487) [21:57:39] paladox: yea, it's windows and tomcat though, but could be [21:57:53] i thought windows is an os? [21:58:10] the link you pasted talks about tomcat on windows [21:58:13] oh [21:58:18] but maybe that doesnt matter. 
i dont know [21:58:22] i was reading the second answer [22:00:06] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#2882600 (10mobrovac) [22:00:26] (03PS2) 10Andrew Bogott: Add export OS_INTERFACE=public to observerenv [puppet] - 10https://gerrit.wikimedia.org/r/327787 (https://phabricator.wikimedia.org/T150092) [22:00:50] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services (watching): Confirm attribution needs - https://phabricator.wikimedia.org/T150875#2882619 (10mobrovac) [22:01:18] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services (watching): Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#2882633 (10mobrovac) [22:01:53] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#2882636 (10mobrovac) [22:01:54] (03CR) 10Andrew Bogott: [C: 032] Add export OS_INTERFACE=public to observerenv [puppet] - 10https://gerrit.wikimedia.org/r/327787 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [22:02:54] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2882660 (10mobrovac) [22:03:51] 06Operations, 06Services (blocked): Set warning thresholds for average cluster utilization - https://phabricator.wikimedia.org/T76306#2882664 (10mobrovac) [22:08:18] 06Operations, 10MediaWiki-API, 10Monitoring, 10Parsoid, and 3 others: API action=parsoid-batch not available on Graphite - https://phabricator.wikimedia.org/T152776#2882677 
(10mobrovac) [22:09:21] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Current state and next steps for RESTBase storage - https://phabricator.wikimedia.org/T152724#2882684 (10mobrovac) [22:10:44] 06Operations, 10ChangeProp, 10Mobile-Content-Service, 06Parsing-Team, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2882685 (10mobrovac) [22:12:45] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 5 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#2882706 (10mobrovac) [22:13:23] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): Evaluate ScyllaDB as a near-term replacement to Cassandra - https://phabricator.wikimedia.org/T150811#2882707 (10mobrovac) [22:13:45] (03CR) 10Dzahn: [] contint: simplify includes in site.pp, move things to master role (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [22:14:35] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services (done): establish new thresholds for cassandra alarms after switching restbase to dtcs - https://phabricator.wikimedia.org/T118976#2882711 (10mobrovac) Could be time to close this one perhaps? [22:16:36] (03PS2) 10Dzahn: contint: fix/move 'backup'-includes, move from node to role [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) [22:17:54] mutante: did you see my comment on https://gerrit.wikimedia.org/r/#/c/327847? [22:18:34] mutante: i.e. that this is the first instance for this machine, and has that chicken-egg issue we discussed a while back [22:19:28] urandom: oh, i missed it indeed. 
one sec [22:19:42] (03PS3) 10Dzahn: contint: fix/move 'backup'-includes, move from node to role [puppet] - 10https://gerrit.wikimedia.org/r/327693 (https://phabricator.wikimedia.org/T150771) [22:23:38] on it, had to get the install-console access [22:24:48] !log restbase1018 - signing puppet cert, initial run [22:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:08] it's installing all the things now, urandom [22:26:15] mutante: cool [22:26:37] mutante: fwiw, this is the last host [22:26:47] ah :)) [22:26:51] so this won't be needed again for a while :) [22:29:33] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:29:34] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [22:29:56] !log salt master: deleting unacceptd keys for decom'ed hosts neon and palladium, accepting key for restbase1018 [22:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:15] thanks mutante for taking care of that! [22:30:23] there we go, it's done [22:30:23] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 58 seconds ago with 4 failures. Failed resources (up to 3 shown): File[/etc/cassandra-instances.d],Package[scap],Package[cassandra/metrics-collector],Package[restbase/deploy] [22:30:28] urandom: uid=11774(eevans) gid=500(wikidev) groups=500(wikidev),744(restbase-roots) [22:30:33] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [22:30:34] go ahead and ssh :) [22:30:36] mutante: ty! 
[22:30:44] works [22:31:03] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:32:23] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused [22:32:45] ACKNOWLEDGEMENT - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn new install [22:32:45] ACKNOWLEDGEMENT - NTP on restbase1018 is CRITICAL: NTP CRITICAL: Offset unknown daniel_zahn new install [22:32:46] ACKNOWLEDGEMENT - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused daniel_zahn new install [22:32:46] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused daniel_zahn new install [22:32:46] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn new install [22:32:46] ACKNOWLEDGEMENT - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed daniel_zahn new install [22:32:46] ACKNOWLEDGEMENT - puppet last run on restbase1018 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. 
Failed resources (up to 3 shown): File[/etc/cassandra-instances.d],Package[scap],Package[cassandra/metrics-collector],Package[restbase/deploy] daniel_zahn new install [22:32:46] ACKNOWLEDGEMENT - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) daniel_zahn new install [22:33:24] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [22:33:33] RECOVERY - cassandra-a service on restbase1018 is OK: OK - cassandra-a is active [22:33:34] yw, and we'll see the recoveries either way, difference is just it won't repeat the message [22:33:43] RECOVERY - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is OK: SSL OK - Certificate restbase1018-a valid until 2017-12-13 00:15:55 +0000 (expires in 361 days) [22:33:47] well that was quick anyways :) [22:34:03] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational [22:34:26] ^ yay [22:36:07] !log eevans@tin Starting deploy [cassandra/twcs@0b0c838]: (no message) [22:36:17] !log eevans@tin Finished deploy [cassandra/twcs@0b0c838]: (no message) (duration: 00m 10s) [22:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:46] (03CR) 10Dzahn: "logged into with "install-console" special key from puppet master, enabled puppet, signed cert request on master, initial puppet run.. tha" [puppet] - 10https://gerrit.wikimedia.org/r/327847 (https://phabricator.wikimedia.org/T151086) (owner: 10Eevans) [22:37:03] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:37:12] !log eevans@tin Starting deploy [cassandra/twcs@0b0c838]: (no message) [22:37:16] !log eevans@tin Finished deploy [cassandra/twcs@0b0c838]: (no message) (duration: 00m 04s) [22:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:33] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:37:34] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [22:38:23] RECOVERY - Restbase root url on restbase1018 is OK: HTTP OK: HTTP/1.1 200 - 15450 bytes in 0.016 second response time [22:38:30] !log eevans@tin Starting deploy [cassandra/twcs@0b0c838]: (no message) [22:38:33] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [22:38:34] !log eevans@tin Finished deploy [cassandra/twcs@0b0c838]: (no message) (duration: 00m 04s) [22:38:36] !log restbase deployed the latest code and pooled restbase1018 [22:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:13] !log eevans@tin Starting deploy [cassandra/twcs@0b0c838]: (no message) [22:39:18] !log eevans@tin Finished deploy [cassandra/twcs@0b0c838]: (no message) (duration: 00m 05s) [22:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:25] mobrovac: the new machines aren't in conftool yet so the pooling can't happen btw [22:40:33] RECOVERY - cassandra-a service on restbase1018 is OK: OK - cassandra-a is active [22:40:43] RECOVERY - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is OK: 
SSL OK - Certificate restbase1018-a valid until 2017-12-13 00:15:55 +0000 (expires in 361 days) [22:40:48] urandom: everything ok? why the friday deploys? [22:40:56] oh right godog lol [22:41:03] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational [22:41:11] let's mitigate that then [22:41:20] greg-g: just a rb cluster expansion [22:41:33] greg-g: i'm bootstrapping a new Cassandra instance, the 'deploy' is a dependant jar file [22:41:43] got it, thanks [22:41:58] Attention: Scheduled grrrit-wm maintaince in 3 hours [22:42:08] greg-g: what alarm did i trip? :) [22:42:27] urandom: fyi, when you scap deploy, you can specify a message at the end that appears in SAL: scap deploy 'my fancy message here' [22:42:49] oh, i see the messages above now [22:43:14] godog: should we add them to conftool? [22:43:25] frustrating, for some reason, everyone once in a while, git-fat doesn't hydrate the jar [22:43:34] s/everyone/every/ [22:44:32] urandom: the "greg noticing deploys on a friday" alarm :) [22:44:51] especially the friday before the deploy freeze, makes me wonder what's going on [22:44:58] yeah, i see the logmsgbot noise now [22:45:14] urandom: perhaps ping thcipriani on T147856 [22:45:14] T147856: Scap deploy failed to sync git-fat artifacts - https://phabricator.wikimedia.org/T147856 [22:45:19] its "new" in scap3 to do that :) [22:45:27] haha [22:45:33] * thcipriani perks up [22:45:45] scap deploy "getting out to push" [22:45:55] thcipriani: the dead has risen [22:46:30] thcipriani: git-fat hates me [22:46:35] :( [22:46:45] mobrovac: my plan was to do the cassandra layer first and then move on to restbase, codfw is in the same state (not yet in conftool) [22:46:49] which host did it fail to hydrate on? 
[22:46:56] if (user === 'urandom') { ; } [22:46:59] I can see if there's anything in the logs [22:47:08] thcipriani: restbase1018.eqiad.wmnet [22:47:14] *this time* [22:47:28] seems like 1/10 or so it fails for me [22:47:32] yeah, I remember seeing the ticket. [22:48:11] If (user === Zppix) { Print 200 million check} [22:48:14] git-fat is implemented the same way it was in trebuchet (where I also remember this being an issue, so that probably wasn't a bright idea) [22:48:40] godog: but those are separate things, the new nodes don't need to be in rb's seed list for it to pick them up [22:48:50] thcipriani: i was going to mention that it was a problem then too, but didn't want to use the T-word [22:49:05] not on a friday, anyway [22:49:15] urandom: I appreciate that :) [22:49:56] :) [22:50:34] git-fat hates everyone [22:50:52] thcipriani: And yes, it wasn't the *best* choice, but it was the "it'll mostly-ish work and we need it now" choice [22:51:00] Always wanted to revisit [22:51:50] hrm. Well, the pull happened on that host is what the logs tell me. I bet it's something about the failure mode of git-fat that we aren't able to catch... [22:52:10] (suspicion: exiting 0 even on failure) [22:53:25] Yep [22:53:40] https://static01.nyt.com/images/2016/08/05/us/05onfire1_xp/05onfire1_xp-master768-v2.jpg [22:56:20] "Which is always indicative of good software" [22:56:22] lol thcipriani [22:56:48] yarp :( [22:57:22] urandom: sorry about the continued problems. Raised the priority on T147856 -- will dig. [22:57:22] T147856: Scap deploy failed to sync git-fat artifacts - https://phabricator.wikimedia.org/T147856 [22:57:38] thcipriani: no worries; thanks! [23:04:53] PROBLEM - MD RAID on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
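Editor's sketch, not part of the channel log: the suspicion raised above is that git-fat may exit 0 even when it fails to hydrate an artifact, so the exit status alone cannot be trusted. One way to catch that after a deploy is to inspect the files themselves. The example below is a minimal sketch under stated assumptions, not how scap or git-fat actually verify anything: it assumes an un-hydrated file is a small text stub beginning with the cookie `#$# git-fat `, and the names `find_unhydrated` and `check_deploy` are hypothetical.

```python
import os

# Assumption: a git-fat placeholder (an un-hydrated file) is a small text stub
# that starts with this cookie, followed by a SHA-1 digest and a byte count.
GIT_FAT_COOKIE = b"#$# git-fat "


def is_unhydrated(path):
    """Return True if the file at `path` still looks like a git-fat stub."""
    with open(path, "rb") as handle:
        return handle.read(len(GIT_FAT_COOKIE)) == GIT_FAT_COOKIE


def find_unhydrated(root, suffix=".jar"):
    """Walk `root` and list files ending in `suffix` that are still stubs."""
    stubs = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git metadata; only the checked-out working tree matters here.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                if is_unhydrated(path):
                    stubs.append(path)
    return sorted(stubs)


def check_deploy(root):
    """Return a non-zero status when any artifact was left as a stub."""
    stubs = find_unhydrated(root)
    for path in stubs:
        print("not hydrated:", path)
    return 1 if stubs else 0
```

A check like this could run as a post-deploy step on each target host, so that a stub left in place of a jar fails loudly instead of surfacing later as a service that will not start.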
[23:10:13] mobrovac: I know they are separate, no reason to do it together either [23:13:53] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [23:19:17] !log upgrading php5 to 5.6.29 on mw canary (DSA-3737-1) [23:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:23] PROBLEM - DPKG on mw1264 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:34] PROBLEM - DPKG on mw1262 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:43] PROBLEM - DPKG on mw1265 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:53] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:03] PROBLEM - DPKG on mw1263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:22:00] crap [23:22:06] but that's why canary [23:22:23] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:23] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [23:31:33] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [23:33:23] PROBLEM - puppet last run on mw1264 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[php-pear] [23:34:43] ok, that's what Moritz warned about [23:34:47] but it only happened one version later [23:35:13] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 1 failures [23:40:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:40:13] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 121 seconds ago with 0 failures [23:40:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [23:41:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:41:23] RECOVERY - DPKG on tungsten is OK: All packages OK [23:41:39] !log tungsten - fixed hanging dpkg install, killed, dpkg-reconfigure libapache2-mod-php5 [23:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:03] mutante: anything I can help with? [23:42:18] so i think tungsten is ok now [23:42:18] (PS1) Filippo Giunchedi: prometheus: add aggregation rules for varnish_exporter [puppet] - https://gerrit.wikimedia.org/r/327873 (https://phabricator.wikimedia.org/T147424) [23:42:23] and caused that above [23:42:30] now i will do the same fix with mw canary [23:42:38] and i won't touch the non-canary stuff [23:42:53] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[php-pear] [23:43:18] godog: thanks, let me try fixing those mw, i'll let you know [23:43:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:43:38] the graphite alarms are because tungsten itself had the issue [23:43:46] because it was in the group of "misc"s i upgraded [23:44:17] it only happens where you have php5 and also php-pear apparently [23:45:23] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [23:45:43] RECOVERY - DPKG on mw1262 is OK: All packages OK [23:45:54] !log mw1262 - killed dpkg, dpkg --configure -a , apt-get install php5 [23:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:46:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [23:46:58] ah ok [23:47:33] Attention: 1 hour until scheduled grrrit-wm maintainence [23:47:43] RECOVERY - DPKG on mw1265 is OK: All packages OK [23:49:03] RECOVERY - DPKG on mw1263 is OK: All packages OK [23:49:45] yea, that is the same fix one by one [23:49:53] RECOVERY - DPKG on mw1261 is OK: All packages OK [23:50:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:50:24] RECOVERY - DPKG on mw1264 is OK: All packages OK [23:50:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:53:31] !log same fix for other broken 'mw-canary' mw1261 - 1268 - killed dpkg, dpkg --configure -a , apt-get install php5 (after upgrade to 5.6.29 in combination with php-pear hangs at postinst) [23:53:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:24] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [23:55:33] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [23:56:23] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [23:56:24] RECOVERY - puppet last run on mw1264 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:57:53] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures