[00:00:26] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 05Mediawiki SWAT Deployments: Clarify SWAT process for testing maintence script changes (to not use mwdebug* hosts) - https://phabricator.wikimedia.org/T153316#2963567 (10greg) [00:02:38] (03PS10) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) [00:03:35] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:06:44] (03PS11) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) [00:07:36] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:07:43] what now :p [00:11:04] (03PS12) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) [00:12:00] (03CR) 10Jcrespo: [C: 032] mariadb: repool db1065 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333812 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [00:14:40] (03CR) 10jerkins-bot: [V: 04-1] aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:14:54] (03Merged) 10jenkins-bot: mariadb: repool db1065 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333812 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [00:15:12] (03CR) 10jenkins-bot: mariadb: repool db1065 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333812 (https://phabricator.wikimedia.org/T156005) 
(owner: 10Jcrespo) [00:17:56] (03PS13) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) [00:26:15] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:26:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 with low load after reimage (duration: 00m 45s) [00:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:31] (03CR) 10Gergő Tisza: "This would disable account creation, not login. IIRC account creation on loginwiki has already been disabled for a long time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29) [00:28:35] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [00:29:48] (03CR) 10Gergő Tisza: "On second thought you'll have to do something with canAuthenticateNow() whether your remove login providers or not, since that's what cont" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29) [00:54:15] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [00:55:15] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:55:45] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:03:27] (03PS4) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) [01:03:30] (03CR) 10Dzahn: "good point. doing that!" 
[puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [01:04:19] (03CR) 10Dzahn: [C: 04-1] "hold on .. rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [01:04:37] (03CR) 10jerkins-bot: [V: 04-1] hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [01:06:47] (03CR) 10Dzahn: [C: 04-1] "seems base module changed not long ago" [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [01:11:15] PROBLEM - Check whether ferm is active by checking the default input chain on db1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [01:12:15] RECOVERY - Check whether ferm is active by checking the default input chain on db1026 is OK: OK ferm input default policy is set [01:12:37] (03PS1) 10Dzahn: typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 [01:13:38] (03CR) 10jerkins-bot: [V: 04-1] typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 (owner: 10Dzahn) [01:18:58] !log mwscript deleteEqualMessages.php --wiki gotwiki (T45917) [01:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:02] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [01:21:55] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:23:15] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [01:24:45] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [01:37:24] (03PS14) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) [01:38:15] (03CR) 10Dzahn: [V: 032 C: 032] typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 (owner: 10Dzahn) [01:38:42] (03PS2) 10Dzahn: typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 [01:38:54] (03CR) 10Dzahn: [V: 032 C: 032] typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 (owner: 10Dzahn) [01:42:59] (03PS1) 10Dzahn: typos: fix "rysnc", "wikimeda" [puppet] - 10https://gerrit.wikimedia.org/r/333825 [01:43:29] (03PS4) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) [01:43:47] (03PS2) 10Dzahn: typos: fix "rysnc", "wikimeda" [puppet] - 10https://gerrit.wikimedia.org/r/333825 [01:44:22] (03CR) 10Volans: "@godog: thanks for the review!" 
(031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [01:47:13] (03PS3) 10Dzahn: gerrit/lists/microsite/rolematcher: fix "rysnc", "wikimeda" typos [puppet] - 10https://gerrit.wikimedia.org/r/333825 [01:47:25] (03CR) 10Dzahn: [C: 032] gerrit/lists/microsite/rolematcher: fix "rysnc", "wikimeda" typos [puppet] - 10https://gerrit.wikimedia.org/r/333825 (owner: 10Dzahn) [01:50:55] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [01:51:35] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:56:49] (03PS15) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) [01:59:31] (03CR) 10Dzahn: [C: 032] "as intended, it adds config on carbon/install2001, but not on install1001 http://puppet-compiler.wmflabs.org/5192/" [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [02:02:04] eh. "nice" SERVER: Invalid relationship: [02:02:19] not caught by compiler [02:02:43] but i see the problem [02:03:35] PROBLEM - puppet last run on install1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:04:09] ACKNOWLEDGEMENT - puppet last run on install1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn WIP [02:04:10] ACKNOWLEDGEMENT - Check systemd state on install2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn WIP [02:11:25] mutante: if you need help just ping me [02:11:59] (03CR) 10Krinkle: [C: 04-1] "Not sure that one makes sense. We usually use the Wikimedia icon instead of the Meta-Wiki icon. Except for community projects. E.g. 
doc.wi" [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad) [02:12:14] volans: thank you, i got it [02:12:30] ok :) [02:13:05] (03PS1) 10Dzahn: aptrepo:rsync: fix 'Invalid relationship' and ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/333830 [02:13:11] (03CR) 10Krinkle: [C: 04-1] "(Same for NOC)" [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad) [02:14:44] (03PS2) 10Dzahn: aptrepo:rsync: fix 'Invalid relationship' and ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/333830 [02:15:50] (03CR) 10Dzahn: [C: 032] aptrepo:rsync: fix 'Invalid relationship' and ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/333830 (owner: 10Dzahn) [02:18:13] icinga-wm: sup [02:18:35] RECOVERY - puppet last run on install1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [02:18:38] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.8) (duration: 06m 40s) [02:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:35] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [02:21:42] (03CR) 10Krinkle: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/332707 (owner: 10Chad) [02:23:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 24 02:23:01 UTC 2017 (duration 4m 23s) [02:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:47] (03PS5) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) [02:46:50] (03PS6) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) [02:49:48] (03PS7) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 
(https://phabricator.wikimedia.org/T151632) [02:50:39] volans: ^ but there is the unrelated thing, i had to change it because base became a "profile" meanwhile. so it wasn't in init.pp either anymore, but good point to move it [02:50:48] be back later for now [02:51:03] ok, I'll take a look [02:51:13] thx [02:54:13] (03CR) 10Volans: "Much nicer. I'm usually not a fan of true defaults and prefer false as a default (like skip_monitoring), but is a personal habit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [02:54:49] (03PS1) 10Dzahn: delete dumps.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940) [02:56:12] (03CR) 10Dzahn: hiera override to skip base icinga for test/decom hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [02:56:28] (03PS8) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) [03:22:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.79 seconds [03:28:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 119.94 seconds [03:35:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1823.533416 Seconds [03:36:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 41.761426 Seconds [03:40:25] (03PS2) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack) [03:46:01] (03PS3) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack) [04:11:07] (03CR) 10NehalDaveND: "I am very sorry for this. But I forgot how to review patch. Can someone tell me how can I review this patch?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/333640 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [04:11:10] (03CR) 10Niharika29: [C: 04-1] "Hmm, this seems like something which makes sense as a global. Do you think it'd be better off as a global? Perhaps $wgDisableLogin. I saw " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29) [04:17:09] (03PS4) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack) [04:24:35] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:34:51] (03PS5) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack) [04:53:35] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [05:05:15] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1810.51684 Seconds [05:06:15] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [05:16:55] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#2964028 (10Volans) [05:30:55] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:39:45] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:42:25] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:59:55] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:08:45] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:10:25] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:20:21] <_joe_> !log repooling mw2098 after scap pull [06:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:55] (03PS6) 10Volans: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [06:24:37] (03PS7) 10Volans: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [06:24:55] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:37:55] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:39:05] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [06:43:06] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:43:15] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:44:04] (03PS8) 10Volans: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [06:46:11] (03CR) 10Volans: "Puppet compiler result:" [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack) [06:46:35] RECOVERY - mediawiki-installation DSH group on mw2098 is OK: OK [06:49:55] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:53:56] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [07:06:36] (03PS1) 10Marostegui: Revert "site.pp: db1052's binlog changed to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333849 [07:06:58] (03CR) 10jerkins-bot: [V: 04-1] Revert "site.pp: db1052's binlog changed to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333849 (owner: 10Marostegui) [07:08:05] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:10:15] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:11:50] (03PS1) 10Marostegui: site.pp: Disable RBR on db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) [07:11:59] (03Abandoned) 
10Marostegui: Revert "site.pp: db1052's binlog changed to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333849 (owner: 10Marostegui) [07:13:15] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:15:15] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:19:39] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) [07:21:15] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:28:31] (03PS2) 10Marostegui: site.pp: Disable RBR on db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) [07:30:30] (03CR) 10Marostegui: "This compiles fine: https://puppet-compiler.wmflabs.org/5199/" [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [07:45:24] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2964236 (10Dzahn) I moved the eqiad Ganglia aggregator from carbon to install1001 today. This part is unblocked. 
[07:48:28] 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2964240 (10hashar) 05Open>03Resolved [07:50:24] 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964243 (10Dzahn) [07:50:48] (03CR) 10Hashar: "Thanks :-}" [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar) [07:51:15] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:56:07] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2964262 (10Marostegui) >>! In T155769#2962307, @matmarex wrote: >>>! In T155769#2960504, @Marostegui wrote: >> If you guys consider it is safe to delete,... [07:58:54] (03PS4) 10Marostegui: mariadb: Split dbstore role classes [puppet] - 10https://gerrit.wikimedia.org/r/332228 (https://phabricator.wikimedia.org/T130128) [08:03:45] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:10:37] (03PS2) 10Muehlenhoff: Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836) [08:10:48] (03CR) 10Marostegui: [C: 032] mariadb: Split dbstore role classes [puppet] - 10https://gerrit.wikimedia.org/r/332228 (https://phabricator.wikimedia.org/T130128) (owner: 10Marostegui) [08:16:46] (03PS3) 10Muehlenhoff: Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836) [08:20:15] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:21:35] (03CR) 10Muehlenhoff: [C: 032] Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [08:22:48] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [08:24:19] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [08:24:33] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [08:25:16] 06Operations, 10ops-codfw: mw2098 drac offline - system unreachable - https://phabricator.wikimedia.org/T155688#2964302 (10MoritzMuehlenhoff) I've repooled the host. [08:25:39] <_joe_> moritzm: I already repooled it this morning [08:25:44] <_joe_> did I miss something?
[08:26:19] !log marostegui@tin Synchronized wmf-config/db-codfw.php: wmf-config/db-eqiad.php Add rack positions - T155999 (duration: 00m 50s) [08:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:23] T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999 [08:28:44] (03PS1) 10Ema: Revert "Temporarily depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/333854 [08:28:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions - T155999 (duration: 00m 41s) [08:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:23] _joe_: no, you're right, re-looking at the confctl output it changed from yes to yes, gonna make some coffee :-) [08:29:48] <_joe_> heh ok I thought I brainfarted earlier [08:31:55] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [08:32:45] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:34:36] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) [08:35:20] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Docker installation for production kubernetes - https://phabricator.wikimedia.org/T147181#2964318 (10Joe) [08:35:23] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2964317 (10Joe) 05stalled>03Resolved [08:36:04] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 15User-Joe, 07Wikimedia-Multiple-active-datacenters: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009#2961483 (10Joe) a:03Joe [08:40:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) (owner: 10Marostegui) [08:42:34] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) (owner: 10Marostegui) [08:42:45] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) (owner: 10Marostegui) [08:44:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1065 original weight - T156005 (duration: 00m 39s) [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:10] T156005: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005 [08:54:55] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:13:26] 06Operations, 10hardware-requests: hardware request for netmon1001 - https://phabricator.wikimedia.org/T156040#2962228 (10faidon) Thanks for being thorough @RobH and actually double-checking the disk usage :) Disk space usage is indeed minimal, but this box holds a lot of RRDs (for LibreNMS and currently Torru... [09:13:53] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2964439 (10Gilles) 05Open>03Resolved Fixes for the 404 log coming on a different task. I'm not seeing /temp 404s anymore in the swift logs. 
[09:16:08] 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#2964444 (10Gilles) Might be related to the iowait issues investigated in T151851 [09:16:09] (03PS1) 10Marostegui: db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) [09:17:15] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [09:17:34] (03CR) 10DCausse: [C: 031] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel) [09:17:35] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:17:57] (03CR) 10Marostegui: [C: 032] db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [09:19:47] (03Merged) 10jenkins-bot: db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [09:19:57] (03CR) 10jenkins-bot: db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui) [09:20:56] (03PS1) 10MarcoAurelio: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) [09:21:06] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2054 - T153300 (duration: 00m 39s) [09:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:10] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [09:21:50] !log Alter table db2054 metawiki.pagelinks - T153300 [09:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:55] 
RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:24:04] (03PS2) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 [09:24:10] marostegui: please give me a ping once you're done with mediawiki-config deploys as I would like to get https://gerrit.wikimedia.org/r/#/c/332917 out (without getting in your way) [09:24:36] addshore: hey! I am done :) [09:24:45] marostegui: awesome! [09:24:48] (03CR) 10Addshore: [C: 032] Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [09:24:51] at least for the next couple of hours I think :) [09:24:52] (03PS8) 10Addshore: Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) [09:25:01] (03CR) 10Addshore: [C: 032] Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [09:25:03] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel) [09:25:18] cool, this should only take a few mins (noop) [09:25:46] 06Operations, 10ops-eqiad, 10DBA: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2964458 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done.
[09:26:20] (03PS3) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 [09:26:50] (03Merged) 10jenkins-bot: Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [09:27:04] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2964460 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done. [09:28:08] (03CR) 10jenkins-bot: Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [09:28:37] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964461 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2029.codfw.wmnet'] ``` T... 
[09:29:05] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [09:30:51] !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 1/4 noop (duration: 00m 53s) [09:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:55] T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995 [09:32:00] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 2/4 noop (duration: 00m 41s) [09:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:00] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 3/4 noop (duration: 00m 40s) [09:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:57] !log addshore@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 4/4 noop (duration: 00m 38s) [09:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:18] All done there, and all looks good! [09:35:05] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:35:18] (03CR) 10Hashar: "Indeed "bundle exec rake puppetlint" process the whole tree + submodules and choke on them. 
I already have patches to fix the submodules:" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar) [09:35:39] contint2001 I havent touched it [09:36:15] Attempt to assign to a reserved variable name: "trusted" [09:36:18] !log add /dev/sdb partitions to md RAID device on mw2251 [09:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:30] hashar: yeah.. known. just rerun puppet [09:36:42] it's a damn puppet+puppetdb bug [09:36:46] ;-D [09:36:59] indeed it is alll fine now [09:37:00] thanks! [09:37:02] it's happening randomly [09:37:05] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:37:30] IIRC the upstream bug has been closed as WONTFIX [09:37:32] <_joe_> that happens when a connection to puppetdb fails IIRC [09:37:41] <_joe_> yes, that too [09:38:09] ah yes, puppet inserting the "trusted" fact on the local yaml cache [09:38:30] WONTFIX cause "we shouldn't sanitize that" or something [09:38:38] need to reread the damn bug [09:41:41] (03PS1) 10DCausse: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) [09:46:35] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:47:55] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [09:48:03] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2029.codfw.wmnet'] ``` and were **ALL** successful. 
[09:49:22] (03PS1) 10Faidon Liambotis: raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 [09:50:01] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2964509 (10akosiaris) Yeah this has been happening for days. The disk is not yet kicked out of the array, which buffles me since the dmesg has many ``` [1636325.780704] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0... [09:51:05] (03CR) 10Alexandros Kosiaris: [C: 031] raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis) [09:53:17] !log mark /dev/sdb as faulty on md devices on bast3001 T154603 [09:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:21] T154603: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603 [09:54:04] (03PS1) 10Muehlenhoff: Remove access credentials for junikowski [puppet] - 10https://gerrit.wikimedia.org/r/333868 (https://phabricator.wikimedia.org/T152957) [09:55:15] PROBLEM - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 3, Spare: 0 [09:55:16] ACKNOWLEDGEMENT - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 3, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T156116 [09:55:20] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T156116#2964513 (10ops-monitoring-bot) [09:55:27] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2964517 (10akosiaris) Forced the disk as failed. I suppose we should schedule a replacement. In the meantime bast3001 will work at reduced redundancy, which is fine given we got another 3 bast boxes [09:55:54] hmmm ops-monitoring-bot decided to create a new task.. 
let's merge it in [09:56:33] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2964519 (10akosiaris) [09:56:36] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T156116#2964521 (10akosiaris) [10:04:41] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964533 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2030.codfw.wmnet'] ``` T... [10:05:02] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for junikowski [puppet] - 10https://gerrit.wikimedia.org/r/333868 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [10:05:14] RECOVERY - Elasticsearch HTTPS on elastic2029 is OK: SSL OK - Certificate elastic2029.codfw.wmnet valid until 2022-01-23 10:04:08 +0000 (expires in 1824 days) [10:07:44] 06Operations: Optional expiry date for user accounts - https://phabricator.wikimedia.org/T142816#2964535 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:14:34] (03PS1) 10Muehlenhoff: Add account expiry dates for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/333872 (https://phabricator.wikimedia.org/T142816) [10:16:38] (03PS4) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 [10:17:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) [10:19:34] (03CR) 10Marostegui: [C: 04-2] "wait until around 13:00UTC. 
There is SWAT at 14:00UTC so we need to push this before that, as the move is scheduled for 14:00UTC with Chri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [10:20:17] (03CR) 10DCausse: [C: 031] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel) [10:25:26] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964547 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2030.codfw.wmnet'] ``` and were **ALL** successful. [10:26:34] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:27:22] RECOVERY - Elasticsearch HTTPS on elastic2030 is OK: SSL OK - Certificate elastic2030.codfw.wmnet valid until 2022-01-23 10:26:18 +0000 (expires in 1824 days) [10:30:09] (03PS10) 10Juniorsys: mediawiki module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645) [10:30:17] (03PS11) 10Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) [10:38:49] (03PS5) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 [10:40:10] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964568 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2031.codfw.wmnet'] ``` T... 
[10:41:12] (03CR) 10Gehel: [C: 032] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel) [10:41:24] (03CR) 10Muehlenhoff: [C: 032] Add account expiry dates for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/333872 (https://phabricator.wikimedia.org/T142816) (owner: 10Muehlenhoff) [10:41:30] (03PS2) 10Muehlenhoff: Add account expiry dates for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/333872 (https://phabricator.wikimedia.org/T142816) [10:41:47] (03PS1) 10Alexandros Kosiaris: redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878 [10:48:29] (03PS1) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [10:48:52] 06Operations, 10Monitoring, 10Traffic, 07Wikimedia-Incident: Plot number of cached objects on a per-server per-DC basis - https://phabricator.wikimedia.org/T154864#2964613 (10ema) 05Open>03Resolved @fgiunchedi added per-host stats as well: https://grafana.wikimedia.org/dashboard/db/varnish-machine-sta... [10:49:28] (03CR) 10jerkins-bot: [V: 04-1] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [10:50:43] sigh [10:51:08] jerkins-bot lol [10:51:26] (03PS1) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 [10:51:44] (03PS2) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) [10:52:11] woa operations-puppet-typos is very nice [10:52:13] (03CR) 10Ema: raid: also check for State: degraded in md arrays (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis) [10:52:55] equiad! 
[10:55:32] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:56:31] (03PS3) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) [10:56:52] (03PS2) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [10:58:03] (03CR) 10jerkins-bot: [V: 04-1] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [10:59:04] elukey: almost there! :) [11:00:17] ema: for some reason the first time puppet parser validate and puppet-lint were fine on my laptop, then syntax error. Now I tried to fix it, puppet-lint warnings :P [11:00:27] the main issue is behind the keyboard [11:00:33] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964645 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2031.codfw.wmnet'] ``` and were **ALL** successful. 
[11:01:34] and also pcc remembered to me that I forgot the memcached prometheus exporter [11:02:50] (03PS2) 10Alexandros Kosiaris: redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878 [11:03:29] elukey: wikilove to you [11:03:50] ahhaha [11:04:58] (03PS1) 10Addshore: Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) [11:05:13] (03PS4) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) [11:06:06] <_joe_> ahahahahahah [11:09:04] (03PS1) 10Addshore: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) [11:09:25] (03PS2) 10Addshore: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) [11:11:27] ostriches: ping [11:11:43] TabbyCat: he is sleeping for sure [11:12:03] hashar: didn't knew, sorry, will look for another phab admin then [11:12:41] ori maybe ? [11:12:46] RECOVERY - Elasticsearch HTTPS on elastic2031 is OK: SSL OK - Certificate elastic2031.codfw.wmnet valid until 2022-01-23 11:11:12 +0000 (expires in 1824 days) [11:15:02] or greg-g ? [11:15:08] TabbyCat: they are all sleeping [11:15:25] TabbyCat: and ori is no more working for the wmf :( Your best chance is to fill in a task [11:15:37] hashar: he's still a phab admin [11:15:58] TabbyCat: add in #Project-Admins / #Repository-Admins I guess [11:16:03] and that should spam the proper set of folks [11:16:04] I think I'll mail AKlapper and ask him to disable an account [11:16:58] what if he is not around ? 
:] [11:17:21] anyway lunch time for me & [11:26:46] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:28:45] (03PS3) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [11:29:07] <_joe_> elukey: I'll take a look later [11:29:27] * elukey sees incoming -1s :D [11:29:31] thanks! [11:29:42] still running pcc to figure out if I am missing anything [11:35:04] (03PS4) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [11:42:22] (03PS5) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [11:43:49] (03CR) 10Jcrespo: [C: 04-1] "You need to change a new master of db1095 to ROW first." [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [11:43:52] (03CR) 10Tobias Gritschacher: "* image template replacement from I2b9cef3d71 has been merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: 10Addshore) [11:54:56] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:56:00] (03CR) 10Addshore: [C: 032] Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [11:56:20] (03PS6) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [11:57:51] (03Merged) 10jenkins-bot: Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [11:58:02] (03CR) 10jenkins-bot: Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [12:03:46] (03Abandoned) 10Elukey: [WIP] Add temporary dc to Redis config to allow a eqiad replica [puppet] - 10https://gerrit.wikimedia.org/r/323807 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [12:03:59] (03Abandoned) 10Elukey: WIP - Add base Redis instance if no MW shard is configured. [puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [12:05:00] !log addshore@tin Synchronized wmf-config/extension-list-labs: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] 1/4 noop (duration: 00m 39s) [12:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:05] T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995 [12:05:52] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] 2/4 noop (duration: 00m 39s) [12:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:14] Dereckson: around? [12:06:36] Hi. Yes. [12:06:37] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] 3/4 noop (duration: 00m 39s) [12:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:09] Dereckson: hi, I wonder if you could run a server script in dry-mode only and paste the output? 
[12:07:31] !log addshore@tin Synchronized wmf-config/CommonSettings.php: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] & [[gerrit:333884|Populate InterwikiSortingInterwikiSortOrders with WB Client]] 4/4 noop (duration: 00m 39s) [12:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:50] Dereckson: it'd be for https://phabricator.wikimedia.org/T147915#2961853 [12:11:23] (03CR) 10Addshore: [C: 032] Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [12:11:26] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [12:11:30] This would output a list of global accounts, only logins, it seems okay on a privacy basis. [12:12:17] (03PS1) 10Muehlenhoff: Add more email addresses and contacts for account extensions [puppet] - 10https://gerrit.wikimedia.org/r/333892 [12:12:26] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3092101 keys, up 85 days 3 hours - replication_delay is 0 [12:12:54] (03Merged) 10jenkins-bot: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [12:13:04] (03CR) 10jenkins-bot: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [12:14:24] Dereckson: yep, nothing that special listusers wouldn't show you [12:14:32] TabbyCat: no dry run will need to coordinate with j.ynus and m.arostegui as it needs to iterate among 49 millions of accounts [12:14:47] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155995 
[[gerrit:333882|Copy InterwikiSorting settings from wmgWikibaseClientSettings]] noop (duration: 00m 39s) [12:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:51] T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995 [12:14:53] uh [12:15:04] that's bad [12:15:22] Dereckson: subtask with dba? [12:15:47] https://phabricator.wikimedia.org/diffusion/ECAU/browse/master/maintenance/deleteEmptyAccounts.php;86ce123406becbfe9e60e9b7e6aa7785b6e81061$48 [12:16:21] fatal error on https://de.wikipedia.org/wiki/Wikipedia:Festivalsommer/Galerie [12:16:35] "Typs „ConfigException“ [12:16:39] (03CR) 10Muehlenhoff: [C: 032] Add more email addresses and contacts for account extensions [puppet] - 10https://gerrit.wikimedia.org/r/333892 (owner: 10Muehlenhoff) [12:16:39] addshore: ping ^ [12:16:45] reverting [12:17:02] https://nl.wikipedia.org/wiki/Wikipedia:Te_beoordelen_pagina%27s/Toegevoegd_20170111 [12:17:20] syncing [12:17:21] Planned upgrade ? [12:17:41] <_joe_> nope, a problem in a deploy [12:17:47] Got this when trying to save -" [WIdFrgpAADsAAj6FvFYAAABG] 2017-01-24 12:16:47: Fatal exception of type "ConfigException" " [12:17:55] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: Revert last (duration: 00m 39s) [12:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:05] <_joe_> ShakespeareFan00: try again now? [12:18:13] looks like its back [12:18:23] GlobalVarConfig::get: undefined option: 'InterwikiSortingAlwaysSort' [12:18:59] could be a sync issue [12:19:15] <_joe_> Dereckson: I don't think so? [12:19:18] Dereckson: ah, no, I see what the issue is. [12:19:28] addshore: you forget wgGlobalVarConfig::get: undefined option: 'InterwikiSortingAlwaysSort' [12:19:36] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:19:38] Also can I make a request for a 'font-deployment'? 
[12:19:40] addshore: you forget wgInterwikiSortingAlwaysSort? [12:19:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [12:19:50] <_joe_> Dereckson: did it work on mwdebug? [12:19:52] Dereckson: that has been delibertly remove [12:19:57] Iam trying to get support for FiraSans to be supported across Wikimedia projects? [12:20:05] _joe_: addshore is reverting [12:20:27] But wikibase checks for the existance of 1 global and if that exists it will load the rest. so in adding these it tried loading the other which is actually not being added at all [12:20:32] <_joe_> Dereckson: I know, I was trying to understand how we got to have an outage [12:20:34] Dereckson: _joe_ already reverted [12:20:48] addshore: did you test it on mwdebug1002 before syncing to prod? [12:20:54] Dereckson: yup [12:21:36] but it could be not all code paths hit this, I can write something up in a bit! [12:21:42] this is filled as https://phabricator.wikimedia.org/T156123 already [12:21:46] 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964809 (10akosiaris) Done. Now esams+eqiad use install1001 as DHCP server and ulsfo+codfw use install2001 as DHCP server. [12:21:46] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [12:22:31] 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964819 (10Dereckson) Caused by https://gerrit.wikimedia.org/r/#/c/333882/. Immediately reverted. [12:22:34] 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964822 (10matmarex) I think someone botched a deployment. 
[12:23:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:23:35] 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964840 (10matmarex) [12:23:46] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [12:23:55] !log switch all networks to use install1001, install2001 as DHCP relay endpoint. T156109 [12:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:59] T156109: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109 [12:24:08] 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964848 (10MarcoAurelio) [12:24:35] 06Operations, 10Wikimedia-General-or-Unknown, 07Spike: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964849 (10Dereckson) [12:25:16] 06Operations, 10Wikimedia-General-or-Unknown, 07Spike, 07Wikimedia-Incident: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964774 (10Dereckson) [12:25:19] Dereckson: as I just reverted on tin I'll put it on gerrit now [12:26:03] (03CR) 10Addshore: [C: 04-1] "more pending changes needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore) [12:26:05] 06Operations, 10Wikimedia-General-or-Unknown, 07Spike, 07Wikimedia-Incident: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964854 (10He7d3r) [Copying from the duplicated task] When I opened the following link today I got > [WIdFjApAAEUAAewxpqMAAABK] 2017-01-24 12:16:12: E... 
[12:26:10] 06Operations, 10Wikimedia-General-or-Unknown, 07Spike, 07Wikimedia-Incident: wgGlobalVarConfig::get: undefined option: 'InterwikiSortingAlwaysSort' exception - https://phabricator.wikimedia.org/T156123#2964855 (10Dereckson) [12:26:12] (03PS1) 10Addshore: Revert "Copy InterwikiSorting settings from wmgWikibaseClientSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 [12:26:27] (03CR) 10Addshore: [C: 032] "Already reverted on tin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 (owner: 10Addshore) [12:27:05] 06Operations, 10DBA: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2964856 (10Marostegui) [12:27:49] (03Merged) 10jenkins-bot: Revert "Copy InterwikiSorting settings from wmgWikibaseClientSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 (owner: 10Addshore) [12:27:52] 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964871 (10akosiaris) @dzahn, I think that part is done, please do some tests and then we can resolve [12:28:11] (03CR) 10jenkins-bot: Revert "Copy InterwikiSorting settings from wmgWikibaseClientSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 (owner: 10Addshore) [12:28:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:29:56] addshore: b635f7075731f vs a678bc86b61 -> probablyt useful to reset the branch like try git fetch ; git log b635f7075731f..a678bc86b61 and if void: git reset a678bc86b61 ; git status [12:30:16] sorry I meant `git diff b635f7075731f a678bc86b61` [12:30:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:30:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:30:48] Dereckson: ack, just done! 
[12:31:07] TabbyCat: 47M/49M [12:31:18] First time I have had to revert something directly on tin and push it out fast.. [12:31:24] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964878 (10Marostegui) [12:31:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:31:48] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (10Marostegui) [12:32:31] addshore: oh yes, you were right: revert it on Tin, then sync is the more urgent. Gerrit, etc. can wait aftwerwards. [12:35:44] Dereckson: I'm guessing that was big enough to warrent a https://wikitech.wikimedia.org/wiki/Incident_documentation ? [12:36:41] TabbyCat: so, the script would delete 2148 accounts [12:36:57] addshore > yes, seems so [12:37:10] Dereckson: still 2148 empty global accounts? [12:37:13] wow [12:37:23] results could be posted? [12:37:28] phab paste? [12:37:44] if concerned with something, make it visible just to you and me [12:39:23] addshore: are you guys going to write an incident report? [12:39:36] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [12:40:26] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3092089 keys, up 85 days 4 hours - replication_delay is 0 [12:40:34] Dereckson: I'm leaving now but you can reach me through phab conpherence if you need to, au revoir [12:41:07] elukey: yup, I will [12:41:18] (03PS1) 10Yuvipanda: tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897 [12:41:30] thanks :) [12:41:39] elukey: my first one D: [12:42:12] it happens! 
[12:42:26] addshore: are you done so I can push a depool to mediawikiconfig? [12:42:41] marostegui: yup! everything is done & clean [12:42:46] addshore: thanks! :) [12:43:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [12:43:11] (03PS2) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) [12:48:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [12:48:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T156004 (duration: 00m 39s) [12:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:37] T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004 [12:49:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) [12:51:09] (03CR) 10Hashar: [C: 04-1] contint/zuul: skip Icinga monitoring if server not master (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [12:51:16] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964931 (10Marostegui) [12:51:29] (03PS2) 10Hashar: contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [12:51:48] (03CR) 10jerkins-bot: [V: 04-1] contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [12:51:52] !log 
installing pcsc-lite security updates on trusty hosts (jessie already fixed a while ago) [12:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [12:52:06] (03PS3) 10Hashar: contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [12:53:46] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [12:53:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [12:54:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [12:55:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T156004 (duration: 00m 39s) [12:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:05] T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004 [12:56:29] !log Shutdown mysql on db1051 for maintenance - T156004 [12:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:31] (03PS1) 10Cmjohnson: Updating dns for db1051 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333899 [13:00:44] !log Shutdown db1051 for maintenance - T156004 [13:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:48] T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004 [13:01:14] (03CR) 10Cmjohnson: [C: 032] Updating dns for db1051 to coincide with rack change T156004 
[dns] - 10https://gerrit.wikimedia.org/r/333899 (owner: 10Cmjohnson) [13:03:36] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:05:43] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) [13:09:50] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [13:11:22] (03CR) 10Hashar: "Puppet compile is https://puppet-compiler.wmflabs.org/5209/" [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [13:11:32] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [13:11:37] (03PS1) 10Hoo man: Log time and shard number on Wikidata dump failure [puppet] - 10https://gerrit.wikimedia.org/r/333901 [13:11:42] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [13:13:56] (03CR) 10Alexandros Kosiaris: [C: 031] tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897 (owner: 10Yuvipanda) [13:14:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: wmf-config/db-codfw.php Change db1051 IP - T156004 (duration: 00m 39s) [13:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:04] T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004 [13:15:47] (03PS2) 10Faidon Liambotis: raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 [13:16:00] !log marostegui@tin 
Synchronized wmf-config/db-codfw.php: Change db1051 IP - T156004 (duration: 00m 39s) [13:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:55] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965005 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2032.codfw.wmnet'] ``` T... [13:28:17] (03CR) 10ArielGlenn: [C: 032] Log time and shard number on Wikidata dump failure [puppet] - 10https://gerrit.wikimedia.org/r/333901 (owner: 10Hoo man) [13:33:02] (03PS1) 10Yuvipanda: tools: Use packages in k8s bastions [puppet] - 10https://gerrit.wikimedia.org/r/333904 [13:33:31] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965022 (10Marostegui) [13:33:34] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2965019 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated Thanks... [13:33:36] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:33:49] Dereckson: elukey https://wikitech.wikimedia.org/wiki/Incident_documentation/20170124-WikibaseClient-InterwikiSorting In a meeting now but will post it around after [13:36:01] thanks! 
[13:37:36] !log Shutdown mysql on db1052 for maintenance - T156006 [13:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:41] T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006 [13:37:51] (03PS1) 10Yuvipanda: tools: Switch workers to using debs [puppet] - 10https://gerrit.wikimedia.org/r/333906 [13:41:04] !log Shutdown db1052 for maintenance - T156006 [13:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] (03PS1) 10Cmjohnson: Updating dns for db1052 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333907 [13:42:38] (03CR) 10Cmjohnson: [C: 032] Updating dns for db1052 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333907 (owner: 10Cmjohnson) [13:43:21] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) [13:45:18] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [13:47:12] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [13:47:15] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965117 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2032.codfw.wmnet'] ``` and were **ALL** successful. 
[13:48:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db1052 IP - T156006 (duration: 00m 39s) [13:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:27] T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006 [13:48:58] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [13:49:16] (03PS1) 10Gilles: Fix mechanism to disable default nginx configuration [puppet/nginx] - 10https://gerrit.wikimedia.org/r/333909 (https://phabricator.wikimedia.org/T154270) [13:49:17] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db1052 IP - T156006 (duration: 00m 39s) [13:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:21] (03CR) 10jerkins-bot: [V: 04-1] Fix mechanism to disable default nginx configuration [puppet/nginx] - 10https://gerrit.wikimedia.org/r/333909 (https://phabricator.wikimedia.org/T154270) (owner: 10Gilles) [13:51:41] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965131 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2033.codfw.wmnet'] ``` T... [13:52:36] RECOVERY - Elasticsearch HTTPS on elastic2032 is OK: SSL OK - Certificate elastic2032.codfw.wmnet valid until 2022-01-23 13:50:49 +0000 (expires in 1824 days) [13:57:03] jouncebot: next [13:57:04] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1400) [13:57:35] dcausse: go go go :) [13:57:40] o/ [13:57:44] I can swat? :) [13:57:58] guess we can start yeah :] [13:58:06] zeljkof: I will do the swat :] [13:58:43] dcausse: wanna do the magic CR+2 / scap pull / scap sync-file dance? 
[13:58:54] hashar: sure I can do that [13:58:58] great! [13:59:08] I am around if you need assistance [13:59:28] (03PS3) 10DCausse: [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1400). [14:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:11] about that one, I think that namespaces have a property to define whether they are content [14:00:20] so in theory CirrusSearch could auto prioritize such namespaces [14:00:41] hashar: yes... but I still don't know if I should do that [14:01:02] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2010 is OK: OK ferm input default policy is set [14:01:03] I'd like to find some usecases where such low boost were actually useful [14:01:32] yup [14:01:47] hashar, dcausse: great, good luck with swat :) [14:01:54] zeljkof: thanks :) [14:03:41] (03CR) 10DCausse: [C: 032] [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) (owner: 10DCausse) [14:05:22] (03Merged) 10jenkins-bot: [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) (owner: 10DCausse) [14:05:36] (03CR) 10jenkins-bot: [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) (owner: 10DCausse) [14:07:42] PROBLEM - YARN 
NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:42] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [14:10:31] 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965171 (10Marostegui) [14:10:34] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2965168 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated thanks... [14:10:52] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965175 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2033.codfw.wmnet'] ``` and were **ALL** successful. [14:13:40] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T155142 [cirrus] Increase weigths for content namespaces on mw.org (duration: 00m 39s) [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:44] T155142: Pages in the "Manual" namespace are ranked very poorly in MediaWiki.org search results - https://phabricator.wikimedia.org/T155142 [14:15:40] (03PS2) 10DCausse: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) [14:15:59] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965194 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2034.codfw.wmnet'] ``` T... 
[14:16:12] RECOVERY - Elasticsearch HTTPS on elastic2033 is OK: SSL OK - Certificate elastic2033.codfw.wmnet valid until 2022-01-23 14:14:34 +0000 (expires in 1824 days) [14:17:40] (03CR) 10DCausse: [C: 032] [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse) [14:19:16] (03Merged) 10jenkins-bot: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse) [14:19:26] (03CR) 10jenkins-bot: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse) [14:21:12] (03PS1) 10Marostegui: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) [14:23:21] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T155515 [cirrus] properly set wgCirrusSearchUseIcuFolding (duration: 00m 39s) [14:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:26] T155515: Reindex el, en, fr and he wikis to enable ICU folding - https://phabricator.wikimedia.org/T155515 [14:26:03] !log EU SWAT Done [14:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:01] \O/ [14:29:16] (03PS1) 10Elukey: Increase retry wait time for Hadoop Yarn Nodemanager checks [puppet] - 10https://gerrit.wikimedia.org/r/333912 [14:33:39] (03PS1) 10Yuvipanda: tools: Use packages for kube-proxy on webproxies [puppet] - 10https://gerrit.wikimedia.org/r/333913 [14:36:11] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965225 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` 
['elastic2034.codfw.wmnet'] ``` and were **ALL** successful. [14:36:22] (03PS2) 10Marostegui: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) [14:38:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [14:39:45] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [14:39:56] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui) [14:40:51] (03PS25) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [14:40:53] (03PS25) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [14:40:55] (03PS26) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [14:40:57] (03PS10) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925) [14:41:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 with less weight - T156004 (duration: 00m 41s) [14:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:17] T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004 [14:43:41] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - 
https://phabricator.wikimedia.org/T154251#2965249 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2035.codfw.wmnet'] ``` T... [14:44:05] (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) [14:44:23] RECOVERY - Elasticsearch HTTPS on elastic2034 is OK: SSL OK - Certificate elastic2034.codfw.wmnet valid until 2022-01-23 14:42:45 +0000 (expires in 1824 days) [14:45:01] (03PS2) 10Filippo Giunchedi: scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728) [14:45:28] (03PS2) 10Yuvipanda: tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897 [14:45:34] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897 (owner: 10Yuvipanda) [14:45:52] (03PS2) 10Yuvipanda: tools: Switch workers to using debs [puppet] - 10https://gerrit.wikimedia.org/r/333906 [14:45:58] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Switch workers to using debs [puppet] - 10https://gerrit.wikimedia.org/r/333906 (owner: 10Yuvipanda) [14:47:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [14:49:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [14:49:16] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui) [14:50:25] (03PS1) 10Filippo Giunchedi: prometheus: add memcached aggregation and additional rules [puppet] - 
10https://gerrit.wikimedia.org/r/333915 [14:50:39] (03PS3) 10Marostegui: site.pp: Disable RBR on db1052 enable it on db1073 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) [14:50:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T155999 (duration: 00m 39s) [14:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:48] T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999 [14:51:02] 06Operations, 10media-storage: Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136#2965275 (10ema) [14:53:11] 06Operations, 10media-storage, 07Wikimedia-Incident: Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136#2965277 (10ema) p:05Triage>03Normal [14:53:45] cmjohnson1: I'm going to depool ms-fe1001 [14:53:54] okay [14:54:51] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1001.eqiad.wmnet [14:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:11] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5212/ this compiles fine and changes db1052 to STATEMENT and db1073 to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [14:55:38] !log Stop replication on db1052 and db1073 for maintenance - T156006 [14:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:42] T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006 [14:56:01] it'll take maybe 3/5 minutes to fully drain [14:56:24] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2965288 (10Cmjohnson) added a secondary switch, asw2-c2-eqiad. 
accessible via scs port 48 [14:56:47] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2965291 (10jcrespo) [14:58:03] RECOVERY - NTP on ms-be2010 is OK: NTP OK: Offset -0.0006507337093 secs [15:00:33] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:01:13] mhh doesn't look like ms-fe1001 is being depooled, checking [15:02:26] (03PS2) 10Yuvipanda: tools: Use packages for kube-proxy on webproxies [puppet] - 10https://gerrit.wikimedia.org/r/333913 [15:03:10] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965316 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2035.codfw.wmnet'] ``` and were **ALL** successful. [15:04:08] !log recabling labstore1004/1005 eth1 [15:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:43] RECOVERY - Elasticsearch HTTPS on elastic2035 is OK: SSL OK - Certificate elastic2035.codfw.wmnet valid until 2022-01-23 15:04:24 +0000 (expires in 1824 days) [15:07:32] !log drbdadm adjust test for 1004/1005 w/ 192.168.0.0/30 [15:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:44] yep looks like low-traffic primary lvs1003 didn't pick up the etcd change [15:08:05] I'll try again [15:08:09] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1001.eqiad.wmnet [15:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:25] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Use packages for kube-proxy on webproxies [puppet] - 10https://gerrit.wikimedia.org/r/333913 (owner: 10Yuvipanda) [15:09:30] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1001.eqiad.wmnet [15:09:33] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:45] godog: it could be that pybal crashed on lvs1003 -> T134893 [15:09:46] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893 [15:10:02] yeah looks like only 1012 1006 and 1009 see the change [15:10:09] ema: most likely [15:10:24] !log drbdadm adjust misc for 1004/1005 w/ 192.168.0.0/30 [15:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:46] ema: so the "fix" is turning off and on again [15:11:04] godog: yep [15:12:11] isn't that always the fix? [15:12:16] Jan 24 11:45:28 lvs1003 pybal[6642]: Unhandled error in Deferred: [15:12:16] Jan 24 11:45:28 lvs1003 pybal[6642]: Unhandled Error [15:12:16] Jan 24 11:45:28 lvs1003 pybal[6642]: Traceback (most recent call last): [15:12:19] Jan 24 11:45:28 lvs1003 pybal[6642]: Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly. [15:12:30] that's a good reason to explode! [15:12:51] twisted makes network programming fun again :) [15:13:20] internet.error is also great [15:14:23] !log bounce pybal on lvs1003 - T134893 [15:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:57] elukey: waiting for .error to be a TLD [15:17:39] (03Abandoned) 10Marostegui: site.pp: Disable RBR on db1052 enable it on db1073 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [15:24:40] 06Operations: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965447 (10ema) [15:25:47] cmjohnson1: you can unplug ms-fe1001 production interface, depooled now [15:25:53] I'll shut icinga [15:25:58] great..thx [15:28:12] godog: success, i plugged in the fiber from ms-fe1001 to fe1005 and i have a connection....on the reverse side the fiber to ms-fe1005 did not establish a link. 
it's not the server or the nic card [15:28:33] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:30:59] cmjohnson1: ok thanks! I think you can plug ms-fe1001 back in [15:31:28] godog: give me another couple of mins plz [15:31:32] ok! [15:31:53] (03PS1) 10Muehlenhoff: Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 [15:33:18] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965468 (10ema) [15:36:50] (03CR) 10Muehlenhoff: [C: 032] Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 (owner: 10Muehlenhoff) [15:36:55] (03PS2) 10Muehlenhoff: Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 [15:38:31] (03PS2) 10Yuvipanda: tools: Use packages in k8s bastions [puppet] - 10https://gerrit.wikimedia.org/r/333904 [15:38:41] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Use packages in k8s bastions [puppet] - 10https://gerrit.wikimedia.org/r/333904 (owner: 10Yuvipanda) [15:38:45] 06Operations, 10hardware-requests: hardware request for netmon1001 - https://phabricator.wikimedia.org/T156040#2965499 (10RobH) a:05mark>03RobH We don't have any spare systems with SSDs, so we would have to order the machine specifically to house them. Since it seems this spare won't do, I'll go ahead and... 
[15:41:29] (03PS3) 10Muehlenhoff: Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 [15:42:29] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 (owner: 10Muehlenhoff) [15:49:51] !log installing tomcat7 security updates on trusty hosts (jessie already fixed a while ago) [15:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:05] Is anyone able to give a review on this patch? https://gerrit.wikimedia.org/r/#/c/333158/ [15:54:29] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-7] - https://phabricator.wikimedia.org/T155095#2965547 (10Cmjohnson) I was able to confirm the servers and NIC cards were good and ms-fe1005 and 1006 are now up and accessible. [15:54:54] !log drbdadm adjust tools for 1004/1005 w/ 192.168.0.0/30 [15:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:18] !log shutting down ms-be2002 for maintenance [15:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:49] !log upgraded nodejs on thorium to 6.9 / restarted pivot [15:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 [15:59:31] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2965555 (10RobH) My understanding is they don't expire like that, unless they weren't ever loaded with the proper firmware. So is there a way to flash when its expired? 
[16:00:43] PROBLEM - Host ms-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:03] 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2965556 (10Papaul) it is not allowing to upload the firmware at all. [16:02:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 (owner: 10Marostegui) [16:03:31] (03PS3) 10Alexandros Kosiaris: redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878 [16:03:39] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 (owner: 10Marostegui) [16:03:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 (owner: 10Marostegui) [16:04:11] (03Abandoned) 10Alex Monk: labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it [puppet] - 10https://gerrit.wikimedia.org/r/313034 (owner: 10Alex Monk) [16:04:24] (03PS1) 10Alexandros Kosiaris: Add passwords::redis::ores_password [labs/private] - 10https://gerrit.wikimedia.org/r/333924 [16:04:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - T155999 (duration: 00m 48s) [16:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:55] T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999 [16:05:39] (03CR) 10Alexandros Kosiaris: [C: 032] redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878 (owner: 10Alexandros Kosiaris) [16:06:01] 06Operations, 10media-storage: high CPU usage from swift-proxy on frontend machines - https://phabricator.wikimedia.org/T156143#2965565 (10fgiunchedi) [16:06:36] (03PS1) 10Cmjohnson: Adding dns entries for frpm1001.frack both mgmt and production [dns] - 
10https://gerrit.wikimedia.org/r/333925 [16:06:55] (03PS1) 10Marostegui: site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) [16:07:17] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1001.eqiad.wmnet [16:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:53] 06Operations, 10media-storage: High CPU usage from swift-proxy on frontend machines - https://phabricator.wikimedia.org/T156143#2965581 (10fgiunchedi) p:05Triage>03Normal [16:09:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) [16:09:41] RECOVERY - Redis replication status tcp_6379 on oresrdb1002 is OK: OK: REDIS 2.8.17 on 10.64.0.10:6379 has 1 databases (db0) with 2394417 keys, up 12 days 6 hours - replication_delay is 0 [16:10:00] yay [16:10:00] RECOVERY - Redis replication status tcp_6380 on oresrdb1002 is OK: OK: REDIS 2.8.17 on 10.64.0.10:6380 has 1 databases (db0) with 22322657 keys, up 12 days 6 hours - replication_delay is 0 [16:10:03] paravoid: ^ [16:10:05] fixed finally [16:10:21] took a while... 
had to refactor our redis monitoring a bit [16:10:49] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for frpm1001.frack both mgmt and production [dns] - 10https://gerrit.wikimedia.org/r/333925 (owner: 10Cmjohnson) [16:11:40] PROBLEM - Redis status tcp_6378 on rdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.32.76 on port 6378 [16:11:40] PROBLEM - Redis status tcp_6381 on rdb1005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.24 on port 6381 [16:11:40] PROBLEM - Redis status tcp_6379 on rdb1003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.201 on port 6379 [16:11:40] PROBLEM - Redis status tcp_6380 on rdb1005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.24 on port 6380 [16:11:40] PROBLEM - Redis status tcp_6381 on rdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.32.76 on port 6381 [16:11:41] PROBLEM - Redis status tcp_6378 on rdb1003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.201 on port 6378 [16:11:41] PROBLEM - Redis status tcp_6379 on mc1003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.182 on port 6379 [16:11:42] PROBLEM - Redis status tcp_6379 on mc1002 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.181 on port 6379 [16:11:46] damn [16:11:49] all these are me [16:11:50] PROBLEM - Redis status tcp_6379 on mc1015 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.48.103 on port 6379 [16:11:50] PROBLEM - Redis status tcp_6379 on mc1006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.185 on port 6379 [16:11:50] PROBLEM - Redis status tcp_6379 on mc1009 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.32.163 on port 6379 [16:11:51] PROBLEM - Redis status tcp_6379 on mc1017 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.48.95 on port 6379 [16:11:51] PROBLEM - Redis status tcp_6379 on mc1004 is CRITICAL: CRITICAL ERROR - Redis 
Library - can not ping 10.64.0.183 on port 6379 [16:12:11] need to revert I suppose... lemme see if I can fix it first though [16:12:32] !log kill stray swift-proxy processes from ms-fe1* T156143 [16:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:36] T156143: High CPU usage from swift-proxy on frontend machines - https://phabricator.wikimedia.org/T156143 [16:12:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [16:14:26] (03PS1) 10Alexandros Kosiaris: Fix typo for check_redis definition [puppet] - 10https://gerrit.wikimedia.org/r/333928 [16:14:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [16:14:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [16:14:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix typo for check_redis definition [puppet] - 10https://gerrit.wikimedia.org/r/333928 (owner: 10Alexandros Kosiaris) [16:15:13] (03PS2) 10Marostegui: site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) [16:15:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T156006 (duration: 00m 47s) [16:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:36] T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006 [16:16:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add passwords::redis::ores_password [labs/private] - 10https://gerrit.wikimedia.org/r/333924 (owner: 10Alexandros Kosiaris) [16:16:48] PROBLEM - Redis replication status tcp_6381 on rdb2002 is CRITICAL: CRITICAL ERROR - 
Redis Library - can not ping 10.192.0.120 on port 6381 [16:16:48] RECOVERY - Redis status tcp_6379 on mc1005 is OK: OK: REDIS 2.8.17 on 10.64.0.184:6379 has 1 databases (db0) with 521586 keys, up 159 days 8 hours [16:16:48] RECOVERY - Redis replication status tcp_6381 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6381 has 1 databases (db0) with 3108759 keys, up 279 days 3 hours - replication_delay is 0 [16:16:48] RECOVERY - Redis replication status tcp_6379 on mc2009 is OK: OK: REDIS 2.8.17 on 10.192.16.39:6379 has 1 databases (db0) with 422527 keys, up 76 days 15 hours - replication_delay is 0 [16:16:48] RECOVERY - Redis replication status tcp_6380 on mc2016 is OK: OK: REDIS 2.8.17 on 10.192.32.23:6380 has 1 databases (db0) with 519544 keys, up 76 days 18 hours - replication_delay is 0 [16:16:48] RECOVERY - Redis replication status tcp_6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6381 has 1 databases (db0) with 3101866 keys, up 85 days 7 hours - replication_delay is 0 [16:16:58] RECOVERY - Redis replication status tcp_6379 on mc2002 is OK: OK: REDIS 2.8.17 on 10.192.0.35:6379 has 1 databases (db0) with 523447 keys, up 76 days 14 hours - replication_delay is 1 [16:16:58] RECOVERY - Redis replication status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 7811063 keys, up 85 days 7 hours - replication_delay is 0 [16:16:58] RECOVERY - Redis replication status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 3102928 keys, up 85 days 7 hours - replication_delay is 0 [16:16:58] RECOVERY - Redis replication status tcp_6478 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6478 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 4 [16:16:58] RECOVERY - Redis replication status tcp_6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6481 has 1 databases (db0) with 3106394 keys, up 85 days 7 hours - replication_delay is 0 [16:16:58] RECOVERY 
- Redis replication status tcp_6379 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6379 has 1 databases (db0) with 3108710 keys, up 85 days 7 hours - replication_delay is 0 [16:16:59] RECOVERY - Redis replication status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 3103227 keys, up 85 days 7 hours - replication_delay is 0 [16:16:59] RECOVERY - Redis status tcp_6379 on mc1009 is OK: OK: REDIS 2.8.17 on 10.64.32.163:6379 has 1 databases (db0) with 422519 keys, up 159 days 8 hours [16:17:00] RECOVERY - Redis status tcp_6379 on mc1004 is OK: OK: REDIS 2.8.17 on 10.64.0.183:6379 has 1 databases (db0) with 449508 keys, up 159 days 8 hours [16:17:00] RECOVERY - Redis status tcp_6379 on oresrdb1001 is OK: OK: REDIS 2.8.17 on 10.64.48.129:6379 has 1 databases (db0) with 2394948 keys, up 12 days 5 hours [16:17:01] RECOVERY - Redis status tcp_6379 on mc1007 is OK: OK: REDIS 2.8.17 on 10.64.32.161:6379 has 1 databases (db0) with 500025 keys, up 159 days 8 hours [16:17:01] RECOVERY - Redis status tcp_6380 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6380 has 1 databases (db0) with 7813425 keys, up 278 days 1 hours [16:17:02] RECOVERY - Redis status tcp_6379 on mc1001 is OK: OK: REDIS 2.8.17 on 10.64.0.180:6379 has 1 databases (db0) with 474661 keys, up 159 days 8 hours [16:17:02] RECOVERY - Redis replication status tcp_6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6379 has 1 databases (db0) with 3108660 keys, up 85 days 7 hours - replication_delay is 0 [16:17:08] akosiaris: \o/ [16:17:12] ok fixed [16:17:18] RECOVERY - Redis replication status tcp_6380 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6380 has 1 databases (db0) with 3104678 keys, up 85 days 7 hours - replication_delay is 0 [16:17:18] RECOVERY - Redis replication status tcp_6379 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6379 has 1 databases (db0) with 7810934 keys, up 279 days 3 hours - replication_delay is 0 [16:17:18] RECOVERY - Redis 
status tcp_6380 on oresrdb1001 is OK: OK: REDIS 2.8.17 on 10.64.48.129:6380 has 1 databases (db0) with 22337099 keys, up 12 days 5 hours [16:17:18] RECOVERY - Redis status tcp_6379 on rdb1001 is OK: OK: REDIS 2.8.17 on 10.64.32.76:6379 has 1 databases (db0) with 7810929 keys, up 278 days 1 hours [16:17:18] RECOVERY - Redis status tcp_6381 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6381 has 1 databases (db0) with 7721205 keys, up 278 days 1 hours [16:17:24] damn typo, sorry [16:17:28] RECOVERY - Redis replication status tcp_6478 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6478 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 8 [16:17:28] RECOVERY - Redis replication status tcp_6378 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6378 has 1 databases (db0) with 15 keys, up 85 days 7 hours - replication_delay is 0 [16:17:28] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3099430 keys, up 85 days 7 hours - replication_delay is 0 [16:17:28] RECOVERY - Redis replication status tcp_6378 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6378 has 1 databases (db0) with 15 keys, up 279 days 3 hours - replication_delay is 0 [16:17:28] RECOVERY - Redis replication status tcp_6380 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6380 has 1 databases (db0) with 3102726 keys, up 279 days 3 hours - replication_delay is 0 [16:17:29] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5214/ compiles fine and changes only db0172" [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [16:17:38] RECOVERY - Redis replication status tcp_6381 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 3108636 keys, up 85 days 7 hours - replication_delay is 0 [16:17:38] RECOVERY - Redis status tcp_6378 on rdb1001 is OK: OK: REDIS 2.8.17 on 10.64.32.76:6378 has 1 databases 
(db0) with 15 keys, up 278 days 1 hours [16:17:38] RECOVERY - Redis status tcp_6379 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6379 has 1 databases (db0) with 7811804 keys, up 278 days 1 hours [16:17:38] RECOVERY - Redis status tcp_6381 on rdb1001 is OK: OK: REDIS 2.8.17 on 10.64.32.76:6381 has 1 databases (db0) with 3108596 keys, up 278 days 1 hours [16:17:38] RECOVERY - Redis status tcp_6378 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6378 has 1 databases (db0) with 4705607 keys, up 278 days 1 hours [16:17:51] you should voice the bot, it get throttled a bit [16:18:17] (03PS3) 10Marostegui: site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) [16:18:38] RECOVERY - Redis replication status tcp_6378 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6378 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 10 [16:18:41] !log removing lvs4002_T151273 policy from cr1/2-ulsfo [16:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:48] RECOVERY - Redis replication status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 3106260 keys, up 85 days 7 hours - replication_delay is 0 [16:19:22] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 13Patch-For-Review: Set up monitoring for ORES redis database - https://phabricator.wikimedia.org/T155482#2965632 (10akosiaris) 05Open>03Resolved And with https://gerrit.wikimedia.org/r/#/c/333878/ this is now done. Had to refactor the current moni... 
[16:20:12] (03CR) 10Marostegui: [C: 032] site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui) [16:20:28] RECOVERY - Redis replication status tcp_6379 on mc2012 is OK: OK: REDIS 2.8.17 on 10.192.16.42:6379 has 1 databases (db0) with 444670 keys, up 76 days 16 hours - replication_delay is 0 [16:20:28] RECOVERY - Redis replication status tcp_6378 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6378 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 7 [16:21:37] (03CR) 10Filippo Giunchedi: [C: 032] scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [16:21:45] (03PS3) 10Filippo Giunchedi: scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728) [16:21:48] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:22:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T156006 (duration: 00m 41s) [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:08] T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006 [16:23:37] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2965652 (10faidon) Just in: > Engineering has fixed PR 1238906 has been fixed through master PR 1205416, and the fix would be available 14.1X53-D42 onwards, sc... 
[16:26:36] !log Restart mysql db1072 [16:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:44] RECOVERY - Redis status tcp_6379 on mc1002 is OK: OK: REDIS 2.8.17 on 10.64.0.181:6379 has 1 databases (db0) with 523118 keys, up 159 days 8 hours [16:26:44] RECOVERY - Redis replication status tcp_6379 on mc2001 is OK: OK: REDIS 2.8.17 on 10.192.0.34:6379 has 1 databases (db0) with 474546 keys, up 76 days 14 hours - replication_delay is 0 [16:26:54] RECOVERY - Redis replication status tcp_6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6381 has 1 databases (db0) with 7721449 keys, up 85 days 7 hours - replication_delay is 0 [16:26:54] RECOVERY - Redis replication status tcp_6378 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6378 has 1 databases (db0) with 4705607 keys, up 85 days 7 hours - replication_delay is 4 [16:26:54] RECOVERY - Redis status tcp_6379 on mc1006 is OK: OK: REDIS 2.8.17 on 10.64.0.185:6379 has 1 databases (db0) with 502785 keys, up 159 days 8 hours [16:26:54] RECOVERY - Redis status tcp_6379 on mc1017 is OK: OK: REDIS 2.8.17 on 10.64.48.95:6379 has 1 databases (db0) with 483532 keys, up 159 days 8 hours [16:26:55] RECOVERY - Redis replication status tcp_6379 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6379 has 1 databases (db0) with 3108682 keys, up 279 days 2 hours - replication_delay is 0 [16:26:55] RECOVERY - Redis status tcp_6379 on mc1018 is OK: OK: REDIS 2.8.17 on 10.64.48.96:6379 has 1 databases (db0) with 519430 keys, up 159 days 8 hours [16:27:04] RECOVERY - Redis replication status tcp_6380 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6380 has 1 databases (db0) with 3104900 keys, up 279 days 2 hours - replication_delay is 0 [16:27:04] RECOVERY - Redis replication status tcp_6379 on mc2004 is OK: OK: REDIS 2.8.17 on 10.192.0.37:6379 has 1 databases (db0) with 449576 keys, up 76 days 15 hours - replication_delay is 0 [16:27:04] RECOVERY - Redis replication status tcp_6379 on 
rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6379 has 1 databases (db0) with 7812011 keys, up 85 days 7 hours - replication_delay is 0 [16:27:04] RECOVERY - Redis replication status tcp_6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6380 has 1 databases (db0) with 7813481 keys, up 85 days 7 hours - replication_delay is 0 [16:27:14] RECOVERY - Redis replication status tcp_6378 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6378 has 1 databases (db0) with 3 keys, up 279 days 2 hours - replication_delay is 1 [16:27:14] RECOVERY - Redis status tcp_6379 on mc1016 is OK: OK: REDIS 2.8.17 on 10.64.48.104:6379 has 1 databases (db0) with 595442 keys, up 159 days 8 hours [16:27:15] RECOVERY - Redis replication status tcp_6380 on mc2001 is OK: OK: REDIS 2.8.17 on 10.192.0.34:6380 has 1 databases (db0) with 483500 keys, up 76 days 14 hours - replication_delay is 0 [16:27:15] RECOVERY - Redis replication status tcp_6379 on mc2014 is OK: OK: REDIS 2.8.17 on 10.192.32.21:6379 has 1 databases (db0) with 528374 keys, up 76 days 17 hours - replication_delay is 0 [16:27:15] RECOVERY - Redis replication status tcp_6379 on mc2005 is OK: OK: REDIS 2.8.17 on 10.192.0.38:6379 has 1 databases (db0) with 521327 keys, up 76 days 15 hours - replication_delay is 0 [16:27:24] RECOVERY - Redis status tcp_6379 on mc1008 is OK: OK: REDIS 2.8.17 on 10.64.32.162:6379 has 1 databases (db0) with 436959 keys, up 159 days 8 hours [16:27:24] RECOVERY - Redis status tcp_6379 on mc1011 is OK: OK: REDIS 2.8.17 on 10.64.32.165:6379 has 1 databases (db0) with 522481 keys, up 159 days 8 hours [16:27:25] RECOVERY - Redis replication status tcp_6381 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6381 has 1 databases (db0) with 3101761 keys, up 279 days 2 hours - replication_delay is 0 [16:27:34] RECOVERY - Redis status tcp_6379 on mc1012 is OK: OK: REDIS 2.8.17 on 10.64.32.166:6379 has 1 databases (db0) with 444484 keys, up 159 days 8 hours [16:27:59] <_joe_> uh what happened there? 
[16:28:19] <_joe_> oh alex happened [16:31:24] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#2965674 (10Papaul) {F5350097} {F5350101} I switch the IDRAC from Dedicated to NIC2 to access the server in case there is something to do. This is just a temporary fix. [16:32:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333930 [16:33:00] (03PS4) 10Andrew Bogott: labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk) [16:34:44] RECOVERY - Host ms-be2002 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [16:35:25] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:37:05] !log upgrading aqs1004 to node6 [16:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:34] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [16:42:24] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [16:43:04] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:43:14] PROBLEM - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms [16:44:24] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [16:45:03] paravoid: --^ [16:45:16] hey [16:45:21] looking, thanks [16:45:34] from the alerts smells like a power outage [16:45:40] yup indeed [16:45:49] how did you check? 
(curious) [16:45:55] faidon@asw-ulsfo> show chassis alarms [16:46:03] 2017-01-24 16:40:39 UTC Major FPC 2 PEM 1 is not powered [16:46:05] ah nice [16:46:12] but also we lost cp4012 and ripe-atlas-ulsfo [16:46:23] yeah [16:47:34] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6 [16:47:54] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6 [16:47:54] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6 [16:47:54] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6 [16:47:54] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6 [16:47:55] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6 [16:48:04] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6 [16:48:04] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6 [16:48:20] spam from cp4012 [16:48:44] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:49:53] indeed [16:50:54] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965727 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2036.codfw.wmnet'] ``` T... 
[16:52:15] !log planet2001 - reinstalling to test DHCP/TFTP from install2001 [16:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:24] jouncebot, next [16:53:25] In 0 hour(s) and 6 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1700) [16:54:03] this looks like a bad time [16:54:11] !log tools deleting tools-mail-01 [16:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:08] ostriches, thcipriani: dunno if this is going ahead now [16:56:22] RECOVERY - Redis replication status tcp_6379 on mc2011 is OK: OK: REDIS 2.8.17 on 10.192.16.41:6379 has 1 databases (db0) with 522729 keys, up 76 days 17 hours - replication_delay is 0 [16:56:22] RECOVERY - Redis replication status tcp_6379 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6379 has 1 databases (db0) with 3109514 keys, up 279 days 2 hours - replication_delay is 0 [16:56:23] RECOVERY - Redis replication status tcp_6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6379 has 1 databases (db0) with 7812138 keys, up 85 days 8 hours - replication_delay is 0 [16:56:23] RECOVERY - Redis status tcp_6379 on mc1014 is OK: OK: REDIS 2.8.17 on 10.64.48.102:6379 has 1 databases (db0) with 528541 keys, up 159 days 8 hours [16:56:23] RECOVERY - Redis replication status tcp_6379 on mc2013 is OK: OK: REDIS 2.8.17 on 10.192.32.20:6379 has 1 databases (db0) with 516876 keys, up 76 days 17 hours - replication_delay is 0 [16:56:23] RECOVERY - Redis replication status tcp_6380 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6380 has 1 databases (db0) with 7814334 keys, up 279 days 3 hours - replication_delay is 0 [16:56:23] RECOVERY - Redis status tcp_6380 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6380 has 1 databases (db0) with 3104029 keys, up 278 days 2 hours [16:56:24] RECOVERY - Redis replication status tcp_6379 on mc2003 is OK: OK: REDIS 2.8.17 on 10.192.0.36:6379 has 1 
databases (db0) with 532024 keys, up 76 days 15 hours - replication_delay is 0 [16:56:24] RECOVERY - Redis status tcp_6380 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6380 has 1 databases (db0) with 3105767 keys, up 278 days 1 hours [16:56:25] RECOVERY - Redis status tcp_6379 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6379 has 1 databases (db0) with 3109599 keys, up 278 days 2 hours [16:56:32] RECOVERY - Redis replication status tcp_6381 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6381 has 1 databases (db0) with 7722312 keys, up 279 days 3 hours - replication_delay is 0 [16:56:32] RECOVERY - Redis status tcp_6379 on mc1003 is OK: OK: REDIS 2.8.17 on 10.64.0.182:6379 has 1 databases (db0) with 532017 keys, up 159 days 8 hours [16:56:32] RECOVERY - Redis replication status tcp_6380 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6380 has 1 databases (db0) with 3104103 keys, up 279 days 2 hours - replication_delay is 0 [16:56:32] RECOVERY - Redis replication status tcp_6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6381 has 1 databases (db0) with 3109889 keys, up 85 days 8 hours - replication_delay is 0 [16:56:42] RECOVERY - Redis replication status tcp_6381 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6381 has 1 databases (db0) with 7722305 keys, up 85 days 8 hours - replication_delay is 0 [16:56:52] RECOVERY - Redis status tcp_6379 on mc1015 is OK: OK: REDIS 2.8.17 on 10.64.48.103:6379 has 1 databases (db0) with 498436 keys, up 159 days 8 hours [16:56:52] RECOVERY - Redis replication status tcp_6379 on mc2007 is OK: OK: REDIS 2.8.17 on 10.192.16.37:6379 has 1 databases (db0) with 500287 keys, up 76 days 16 hours - replication_delay is 0 [16:56:52] RECOVERY - Redis replication status tcp_6379 on mc2015 is OK: OK: REDIS 2.8.17 on 10.192.32.22:6379 has 1 databases (db0) with 498426 keys, up 76 days 18 hours - replication_delay is 0 [16:56:52] RECOVERY - Redis replication status tcp_6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 
10.192.0.119:6380 has 1 databases (db0) with 3103955 keys, up 85 days 8 hours - replication_delay is 0 [16:57:02] RECOVERY - Redis status tcp_6379 on mc1013 is OK: OK: REDIS 2.8.17 on 10.64.48.101:6379 has 1 databases (db0) with 516835 keys, up 159 days 8 hours [16:57:02] RECOVERY - Redis status tcp_6378 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6378 has 1 databases (db0) with 3 keys, up 278 days 2 hours [16:57:02] RECOVERY - Redis status tcp_6381 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6381 has 1 databases (db0) with 3107495 keys, up 278 days 2 hours [16:57:02] RECOVERY - Redis status tcp_6379 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6379 has 1 databases (db0) with 3109531 keys, up 278 days 2 hours [16:57:02] RECOVERY - Redis status tcp_6379 on mc1010 is OK: OK: REDIS 2.8.17 on 10.64.32.164:6379 has 1 databases (db0) with 520367 keys, up 159 days 8 hours [16:57:03] RECOVERY - Redis replication status tcp_6379 on mc2006 is OK: OK: REDIS 2.8.17 on 10.192.0.39:6379 has 1 databases (db0) with 503450 keys, up 76 days 15 hours - replication_delay is 0 [16:57:03] RECOVERY - Redis replication status tcp_6379 on mc2010 is OK: OK: REDIS 2.8.17 on 10.192.16.40:6379 has 1 databases (db0) with 520370 keys, up 76 days 17 hours - replication_delay is 0 [16:57:04] RECOVERY - Redis replication status tcp_6378 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6378 has 1 databases (db0) with 4705607 keys, up 279 days 3 hours - replication_delay is 1 [16:57:04] RECOVERY - Redis replication status tcp_6381 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6381 has 1 databases (db0) with 3107516 keys, up 279 days 2 hours - replication_delay is 0 [16:57:05] RECOVERY - Redis replication status tcp_6378 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6378 has 1 databases (db0) with 3 keys, up 279 days 2 hours - replication_delay is 8 [16:57:05] RECOVERY - Redis replication status tcp_6378 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6378 has 1 databases 
(db0) with 15 keys, up 85 days 8 hours - replication_delay is 0 [16:57:12] RECOVERY - Redis status tcp_6378 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6378 has 1 databases (db0) with 3 keys, up 278 days 2 hours [16:57:12] RECOVERY - Redis replication status tcp_6379 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6379 has 1 databases (db0) with 7813015 keys, up 85 days 8 hours - replication_delay is 0 [16:57:12] RECOVERY - Redis replication status tcp_6379 on mc2008 is OK: OK: REDIS 2.8.17 on 10.192.16.38:6379 has 1 databases (db0) with 437176 keys, up 76 days 16 hours - replication_delay is 0 [16:57:12] RECOVERY - Redis status tcp_6381 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6381 has 1 databases (db0) with 3102673 keys, up 278 days 2 hours [16:57:12] RECOVERY - Redis replication status tcp_6379 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6379 has 1 databases (db0) with 7812946 keys, up 279 days 3 hours - replication_delay is 0 [16:57:12] RECOVERY - Redis replication status tcp_6380 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6380 has 1 databases (db0) with 7814359 keys, up 85 days 8 hours - replication_delay is 0 [16:57:13] RECOVERY - Redis replication status tcp_6378 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6378 has 1 databases (db0) with 4705607 keys, up 85 days 8 hours - replication_delay is 3 [16:57:25] Krenair: my patch is not critical to get out now. The functionality will only be used by the next version of scap coming Soon™ so doesn't have to be today ;\ [16:57:30] er :\ [16:57:42] none of the stuff on there is critical [16:58:19] (03PS1) 10RobH: lost a PDU tower in ulsfo 1.22 [dns] - 10https://gerrit.wikimedia.org/r/333931 [16:58:41] lots more redis recoveries than there were original alerts? [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1700). Please do the needful. 
[17:00:04] ostriches, Krenair, and thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:07] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2965751 (10Gehel) Current elasticsearch nodes in eqiad are as follow: * **A / A3**: elastic10(30|31|32|33|34|35) - //6 nodes// * **A / A6**: elasti... [17:00:12] RECOVERY - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [17:03:10] 06Operations, 15User-Elukey: Cronspam from mwlog* - https://phabricator.wikimedia.org/T156151#2965779 (10fgiunchedi) [17:03:33] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [17:03:42] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 78.63 ms [17:03:52] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [17:03:52] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [17:04:02] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK [17:04:02] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [17:04:02] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK [17:04:02] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [17:04:02] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [17:05:32] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.72 ms [17:08:11] I'm looking at puppet swat patches btw [17:09:15] has the "installer hangs at 21% during 'Configuring apt' - Retrieving file 4 or 9" issue .. 
and it feels too familiar [17:09:22] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last) [17:10:00] (03PS5) 10Filippo Giunchedi: docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [17:11:12] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:11:24] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965796 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2036.codfw.wmnet'] ``` and were **ALL** successful. [17:13:02] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:14:22] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 431 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [17:14:42] RECOVERY - Elasticsearch HTTPS on elastic2036 is OK: SSL OK - Certificate elastic2036.codfw.wmnet valid until 2022-01-23 17:13:25 +0000 (expires in 1824 days) [17:16:19] ostriches: 👍 [17:16:48] (03PS1) 10Addshore: DNM "Copy InterwikiSorting settings from wmgWikibaseClientSettings"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 [17:17:02] (03CR) 10Addshore: [C: 04-2] DNM "Copy InterwikiSorting settings from wmgWikibaseClientSettings"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 (owner: 10Addshore) [17:17:27] godog: Yay thx! [17:17:32] (03PS2) 10Addshore: DNM!!!
Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333936 [17:17:48] (03PS2) 10Addshore: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) [17:17:57] (03PS3) 10Addshore: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995) [17:19:47] Krenair: looking at yours now [17:21:09] (03PS4) 10Filippo Giunchedi: ssh: Don't add IPv6 address as an alias in exported resource if it's undefined [puppet] - 10https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792) (owner: 10Alex Monk) [17:21:14] 06Operations, 10ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2965818 (10RobH) [17:21:17] 06Operations, 10ops-ulsfo, 10netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965816 (10RobH) 05Resolved>03Open I'm reopening this. LVS4002 had its power supply fail again, the exact same PSU slot that died before, PSU2. I had taken another power supply out of cp4012 an... [17:26:10] (03CR) 10Filippo Giunchedi: [C: 032] ssh: Don't add IPv6 address as an alias in exported resource if it's undefined [puppet] - 10https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792) (owner: 10Alex Monk) [17:28:18] 06Operations, 10ops-eqiad, 10ops-ulsfo: ship R620 power supplies to ulsfo - https://phabricator.wikimedia.org/T156154#2965844 (10RobH) [17:29:11] 06Operations, 10ops-ulsfo, 10netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965860 (10RobH) So when we get the replacement power supplies mentioned on T156154, we should move the power ports used by lvs4002 with another system. Then if the other system has a psu failure, w... 
[17:30:31] (03PS1) 10Yuvipanda: tools: Get rid of hacky k8s deployment bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/333943 [17:30:54] (03CR) 10Gehel: [C: 031] "LGTM. Thanks for taking care of our tech debt!" [puppet] - 10https://gerrit.wikimedia.org/r/329328 (owner: 10Tim Landscheidt) [17:31:16] (03PS2) 10Gehel: Remove gehel from elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/333240 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [17:31:18] (03CR) 10jerkins-bot: [V: 04-1] tools: Get rid of hacky k8s deployment bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/333943 (owner: 10Yuvipanda) [17:32:02] (03PS2) 10Yuvipanda: tools: Get rid of hacky k8s deployment bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/333943 [17:32:40] Krenair: I was looking at https://puppet-compiler.wmflabs.org/5170/mw1161.eqiad.wmnet/ did you look into why mw1161 has no diff? [17:32:54] (03CR) 10Gehel: [C: 032] Remove gehel from elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/333240 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [17:33:42] (03CR) 10Gehel: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: 10Muehlenhoff) [17:33:50] could someone from ops / with access please submit https://gerrit.wikimedia.org/r/#/c/324689/ (apparently I can't) and it has been sitting there for nearly 2 months now! :) [17:34:29] godog, I think because it's a jobrunner [17:34:32] addshore: done [17:34:36] not quite sure [17:34:38] yuvipanda: cheers! 
[17:34:50] (03PS4) 10Gehel: Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - 10https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: 10Muehlenhoff) [17:34:53] (03CR) 10Yuvipanda: [C: 032] tools: Get rid of hacky k8s deployment bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/333943 (owner: 10Yuvipanda) [17:35:01] (03PS3) 10Yuvipanda: tools: Get rid of hacky k8s deployment bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/333943 [17:35:08] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Get rid of hacky k8s deployment bash scripts [puppet] - 10https://gerrit.wikimedia.org/r/333943 (owner: 10Yuvipanda) [17:35:44] 06Operations, 10ops-eqiad, 10ops-ulsfo: ship R620 power supplies to ulsfo - https://phabricator.wikimedia.org/T156154#2965882 (10Cmjohnson) We do not have any decommissioned R620s in eqiad. [17:36:01] (03PS1) 10Chad: Remove myself from elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/333946 [17:36:20] moritzm: Heh, reminded me of something I'd been meaning to do ^ [17:37:07] (03PS1) 10Yuvipanda: tools: Get rid of kubebuilder [puppet] - 10https://gerrit.wikimedia.org/r/333947 [17:37:55] godog, yeah looks like jobrunners don't get those apache configs [17:38:25] krenair@mw1161:~$ ls -l /etc/apache2/sites-enabled/ [17:38:25] total 0 [17:38:25] lrwxrwxrwx 1 root root 42 Oct 14 08:18 00-dummy.conf -> /etc/apache2/sites-available/00-dummy.conf [17:38:25] lrwxrwxrwx 1 root root 51 Oct 14 08:15 01-hhvm-jobrunner.conf -> /etc/apache2/sites-available/01-hhvm-jobrunner.conf [17:38:25] lrwxrwxrwx 1 root root 47 Oct 14 08:18 50-hhvm-admin.conf -> /etc/apache2/sites-available/50-hhvm-admin.conf [17:38:28] krenair@mw1161:~$ [17:38:45] (03CR) 10Gehel: [C: 032] Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - 10https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: 10Muehlenhoff) [17:38:47] (03PS1) 
10Marostegui: Revert "site.pp: Enable RBR on db1072" [puppet] - 10https://gerrit.wikimedia.org/r/333948 [17:39:03] indeed, looks like it [17:39:12] (03PS5) 10Gehel: Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - 10https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: 10Muehlenhoff) [17:39:48] (03PS2) 10Yuvipanda: tools: Get rid of kubebuilder [puppet] - 10https://gerrit.wikimedia.org/r/333947 [17:39:56] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Get rid of kubebuilder [puppet] - 10https://gerrit.wikimedia.org/r/333947 (owner: 10Yuvipanda) [17:40:26] (03CR) 10Filippo Giunchedi: "Patch LGTM, I've added joe and elukey as they routinely work on apache for opinions too" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [17:40:32] Krenair: ^ [17:41:02] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:41:07] godog: are you doing puppet swat now? [17:41:09] ok [17:41:26] (03PS2) 10Marostegui: Revert "site.pp: Enable RBR on db1072" [puppet] - 10https://gerrit.wikimedia.org/r/333948 [17:41:26] marostegui: yeah, one patch left to go but not intrusive, feel free to merge [17:41:38] godog: ok thanks :) [17:41:48] godog: i am not pushing just yet though [17:41:51] so you can go ahead if you like [17:42:30] (03PS5) 10Filippo Giunchedi: Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: 10Thcipriani) [17:42:41] marostegui: ok! 
waiting on jenkins [17:42:47] :) [17:44:05] (03CR) 10Filippo Giunchedi: [C: 032] Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: 10Thcipriani) [17:44:53] thcipriani: ^ [17:45:12] marostegui: I'm done SWATting [17:45:19] godog: thanks! [17:45:21] godog: thanks! I'll give it a go on tin here in a few to make sure it works :) [17:45:55] (works as expected, that is, won't break anything :)) [17:46:21] thcipriani: ok, let me know if things are borked, I'm going to dinner soonish but I'll be around later too [17:46:29] yup, will do [17:48:03] (03PS6) 10Gehel: Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - 10https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: 10Muehlenhoff) [17:48:06] (03CR) 10Gehel: [V: 032 C: 032] Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - 10https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: 10Muehlenhoff) [17:51:13] (03PS2) 10Filippo Giunchedi: tlsproxy: add nginx_bootstrap define [puppet] - 10https://gerrit.wikimedia.org/r/333247 [17:51:15] (03PS9) 10Filippo Giunchedi: swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [17:51:20] ostriches: thanks, I'll merge that tomorrow morning [17:52:09] (03CR) 10Filippo Giunchedi: "> Can we use something other than 443, so we don't run into the same" [puppet] - 10https://gerrit.wikimedia.org/r/333247 (owner: 10Filippo Giunchedi) [17:54:03] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1001 is CRITICAL: connect to address 10.64.0.35 and port 7231: Connection refused Filippo Giunchedi restbase deployment TODO [17:54:03] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1001 is CRITICAL: Generic error: Generic connection error: 
HTTPConnectionPool(host=10.64.0.35, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc5e8601950: Failed to establish a new connection: [Errno 111] Connection refused,)) Filippo Giunchedi restbase deployment TODO [17:54:03] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1002 is CRITICAL: connect to address 10.64.32.112 and port 7231: Connection refused Filippo Giunchedi restbase deployment TODO [17:54:03] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa409335950: Failed to establish a new connection: [Errno 111] Connection refused,)) Filippo Giunchedi restbase deployment TODO [17:54:03] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused Filippo Giunchedi restbase deployment TODO [17:54:03] ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.46, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f68ec14e950: Failed to establish a new connection: [Errno 111] Connection refused,)) Filippo Giunchedi restbase deployment TODO [17:54:08] sorry about the spam [17:54:29] mobrovac: ^ re: restbase on restbase-dev [17:54:42] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:55:32] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [17:58:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333950 [17:59:12] 06Operations, 10ops-codfw, 06Discovery, 10Elasticsearch, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965921 (10Gehel) [18:00:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333950 (owner: 10Marostegui) [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1800). [18:00:43] (03PS1) 10Jcrespo: mariadb: Move db1072 back to a normal slave [puppet] - 10https://gerrit.wikimedia.org/r/333952 (https://phabricator.wikimedia.org/T155999) [18:01:14] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333950 (owner: 10Marostegui) [18:01:24] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333950 (owner: 10Marostegui) [18:02:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T156006 (duration: 00m 49s) [18:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:26] T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006 [18:03:15] (03PS1) 10Jcrespo: MariaDB: Setting db1065 as the new master of sanitarium2 [puppet] - 10https://gerrit.wikimedia.org/r/333953 (https://phabricator.wikimedia.org/T155999) [18:05:28] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1072 back to a normal slave [puppet] - 10https://gerrit.wikimedia.org/r/333952 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo) [18:05:39] (03CR) 10Jcrespo: [C: 032] MariaDB: Setting db1065 as the new master 
of sanitarium2 [puppet] - 10https://gerrit.wikimedia.org/r/333953 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo) [18:07:08] !log planet2001 - re-adding to puppet, revoke old cert, sign new cert, initial run [18:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:02] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:10:19] !log restart mysql db1065 maintenance - https://phabricator.wikimedia.org/T155999) [18:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:12] 06Operations, 10ops-eqiad, 10ops-ulsfo: ship R620 power supplies to ulsfo - https://phabricator.wikimedia.org/T156154#2965941 (10RobH) 05Open>03Resolved Thanks for checking, I'll note on related tasks. [18:11:17] 06Operations, 10ops-ulsfo, 10netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965943 (10RobH) I asked Chris if we had any decommissioned R620s in eqiad so we can steal power supplies, but we do not. >>! In T156154#2965882, @Cmjohnson wrote: > We do not have any decommissione... [18:17:08] twentyafterfour: hi! have you done the branch cut for mw train yet? [18:17:37] AndyRussG: I'm just starting it, should I hold off? [18:18:13] (03PS2) 10Chad: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330712 [18:18:29] twentyafterfour: hmm mmmmmaybe, one sec... thanks! [18:19:01] !log arlolra@tin Starting deploy [parsoid/deploy@c1a14c0]: Updating Parsoid to d000fdb4 [18:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:09] twentyafterfour: K just consulted, if u can wait 5 min for us to merge some stuff into the CentralNotice deploy branch? thx!!!! 
[18:20:48] (03CR) 10Chad: [C: 032] Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330712 (owner: 10Chad) [18:22:42] (03Merged) 10jenkins-bot: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330712 (owner: 10Chad) [18:22:52] (03CR) 10jenkins-bot: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330712 (owner: 10Chad) [18:23:16] AndyRussG: no problem [18:24:38] !log demon@tin Synchronized docroot: Removing old wikidata docroot (duration: 00m 46s) [18:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:05] 06Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011#2965980 (10MoritzMuehlenhoff) These are fully rolled out: audiofile automake-1.14 clamav cmake exim4 file javatools libxml2 python-django python2.7 unbound systemd [18:25:42] twentyafterfour: great! Just waiting for Jenkins... 
https://gerrit.wikimedia.org/r/#/c/333955 [18:26:18] After that merges, the submodule pointer for CentralNotice should update automatically [18:31:49] (03PS1) 10Chad: beta: standardize deployment.wikimedia.beta.wmflabs.org docroot [puppet] - 10https://gerrit.wikimedia.org/r/333958 [18:32:52] twentyafterfour: merged, just checking that the submodule pointer in core is up to date [18:33:47] (03CR) 10Chad: [C: 032] Remove extra layer of symlink indirection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323999 (owner: 10Chad) [18:35:22] (03Merged) 10jenkins-bot: Remove extra layer of symlink indirection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323999 (owner: 10Chad) [18:35:36] (03CR) 10jenkins-bot: Remove extra layer of symlink indirection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323999 (owner: 10Chad) [18:36:15] (03PS1) 10Chad: Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 [18:37:02] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:37:47] !log demon@tin Synchronized docroot: tidying up mobileportal docroot stuff (duration: 00m 41s) [18:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:03] twentyafterfour: aarg now I'm confused, I don't think we have to do any updating of submodule pointers in core master, because the submodules are only in the core deploy branches, right? [18:38:16] CentralNotice's deploy branch is up-to-date now.... [18:38:38] Master doesn't have submodules :) [18:38:41] So the CN commit that we'd like to put on the train is 24e8419d587681ee26e420ee6ba9313ea32a3ed1 [18:38:44] right.... [18:40:22] ostriches: remind me how the branch cut gets the right submodule pointers for extensions (if you're not busy...) 
[18:40:25] * AndyRussG jostles brain [18:40:29] !log arlolra@tin Finished deploy [parsoid/deploy@c1a14c0]: Updating Parsoid to d000fdb4 (duration: 21m 28s) [18:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:06] AndyRussG: So for non-special extensions (ie: 95% of them), we create a new branch from master (wmf/blablahblah), then add that branch to the new branch we've made for core [18:41:19] !log arlolra@tin Starting deploy [parsoid/deploy@c1a14c0]: Retry updating Parsoid to d000fdb4 [18:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:28] For the "special" extensions, we don't do the branching, we just add the branch/tag you already have defined as your submodule [18:41:40] in CN's case, it should just pull in whatever's in that wmf_deploy or w/e branch [18:42:06] ostriches: ah right, it's all coming back to me now [18:42:08] thx!!!! [18:42:23] https://phabricator.wikimedia.org/diffusion/MREL/browse/master/make-wmf-branch/config.json;HEAD$171 [18:43:32] twentyafterfour: so I think we're good to go :) [18:45:13] ostriches: interesting! 
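The branch-cut rule ostriches describes above can be sketched roughly as follows. This is only an illustration, not the actual make-wmf-branch code: the function name `plan_submodules`, the `special` mapping, and the example branch names (`wmf_deploy`, `wmf/1.29.0-wmf.9`) are assumptions made up for the sketch.

```python
def plan_submodules(extensions, special, new_branch):
    """Return (extension, ref, create_branch) tuples for a core branch cut.

    Hypothetical helper: `special` maps an extension name to the branch or
    tag it already maintains itself; everything else is branched from master.
    """
    plan = []
    for ext in sorted(extensions):
        if ext in special:
            # "Special" extension: no new branch, reuse its existing ref.
            plan.append((ext, special[ext], False))
        else:
            # Regular extension: cut new_branch from master and pin that.
            plan.append((ext, new_branch, True))
    return plan


if __name__ == "__main__":
    # Illustrative names only; the real list lives in the release config.
    for row in plan_submodules(
        ["CentralNotice", "Cite"],
        {"CentralNotice": "wmf_deploy"},
        "wmf/1.29.0-wmf.9",
    ):
        print(row)
```

Under this reading, an extension like CentralNotice only needs its own deploy branch to be current before the cut; nothing has to change in core master.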
[18:45:33] !log arlolra@tin Finished deploy [parsoid/deploy@c1a14c0]: Retry updating Parsoid to d000fdb4 (duration: 04m 14s) [18:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:42] PROBLEM - parsoid on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:10] (03PS1) 10Chad: Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 [18:51:12] PROBLEM - salt-minion processes on planet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:52:42] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 4.098 second response time [18:53:10] godog: np, i will deploy rb there soon-ish :) [18:55:27] (03Abandoned) 10Dduvall: Check .scap-master-ready file before syncing scap masters [puppet] - 10https://gerrit.wikimedia.org/r/267934 (owner: 10Dduvall) [18:58:06] !log Updated Parsoid to version d000fdb4 (T58846, T154804, T152633) [18:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:13] T152633: TypeError: Cannot read property 'length' of undefined - https://phabricator.wikimedia.org/T152633 [18:58:13] T58846: Review failing sanitizer bugs - https://phabricator.wikimedia.org/T58846 [18:58:13] T154804: TypeError in parsoid gallery module - https://phabricator.wikimedia.org/T154804 [18:58:21] AndyRussG: thanks [18:58:22] PROBLEM - Check systemd state on db2060 is CRITICAL: CRITICAL - Failed to get D-Bus connection: Connection refused: unexpected [18:59:07] PROBLEM - MariaDB disk space on db2060 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [18:59:07] PROBLEM - MariaDB Slave IO: s6 on db2060 is CRITICAL: CRITICAL slave_io_state could not connect [18:59:07] PROBLEM - MariaDB Slave SQL: s6 on db2060 is CRITICAL: CRITICAL slave_sql_state could not connect [18:59:16] PROBLEM - mysqld processes on db2060 is CRITICAL: PROCS CRITICAL: 
0 processes with command name mysqld [18:59:16] PROBLEM - Disk space on db2060 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [18:59:25] * volans looking [18:59:28] ^ checking [18:59:30] did it crash? [18:59:34] oh you're still here [18:59:50] yes, we were having fun at -databases [19:00:04] Deploy window Changed: No SWAT window at this time on Tuesdays going forward (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1900) [19:00:31] eheheh [19:00:43] [612121.400194] sd 0:1:0:0: rejecting I/O to offline device [19:00:43] [612121.425964] sd 0:1:0:0: rejecting I/O to offline device [19:00:43] db2060 login: [19:00:50] /srv is not accessible [19:00:56] so probably RAID went down [19:01:17] that host had issues before if i recall correctly [19:01:21] is it a master or a regular slave? [19:01:22] https://phabricator.wikimedia.org/T154031 [19:01:34] if it is a slave, let's create a ticket and fix it tomorrow [19:01:36] api slave [19:01:39] ok [19:01:42] will take care of that [19:02:02] "Firmware update complete." [19:02:04] yay! [19:02:07] \o/ [19:02:09] lovely [19:02:11] no excuses from the vendor [19:02:46] (03CR) 10Chad: [C: 032] Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad) [19:03:42] PROBLEM - puppet last run on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:03:58] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2966089 (10jcrespo) [19:05:12] PROBLEM - MD RAID on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:05:32] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [19:05:42] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag could not connect [19:05:48] I will silence this host [19:06:06] (03Merged) 10jenkins-bot: Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad) [19:07:12] RECOVERY - MD RAID on ruthenium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:07:18] !log demon@tin Synchronized docroot/foundation/logos: rm some old junk logos (duration: 00m 42s) [19:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:56] !log change replication master of db1095 to db1052 [19:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:09] (03CR) 10jenkins-bot: Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad) [19:08:42] PROBLEM - puppet last run on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:10:42] PROBLEM - Check systemd state on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:11:12] PROBLEM - MD RAID on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:11:17] 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2966096 (10Dzahn) @akosiaris Thank you! I have reinstalled planet2001 using install2001 and it worked fine. I will do some more tests for eqiad soon. [19:12:32] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [19:15:42] PROBLEM - Check systemd state on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:15:58] oom_killer in action on ruthenium [19:16:13] PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:02] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [19:19:02] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:19:12] RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:19:12] PROBLEM - configured eth on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:19:12] PROBLEM - DPKG on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:21:12] PROBLEM - salt-minion processes on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:21:44] subbu: any special activity on ruthenium ^^^ ? It's swapping heavily and there are tons of /srv/visualdiff/node_modules/phantomjs/lib/phantom/bin/phantomjs processes [19:22:02] RECOVERY - salt-minion processes on ruthenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:22:12] PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:32] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused [19:25:42] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 6.221 second response time [19:26:01] volans: arlolra is not running tests, we don't know if TimStarling or subbu have started any [19:26:32] mobrovac: any easy way to check? [19:27:12] RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:27:33] volans: not that i know of :/ [19:27:36] arlolra: ^ ? 
[19:28:11] volans: can't even ssh in there now [19:28:18] I'm in it [19:28:20] so it must be really busy [19:28:32] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused [19:28:38] yes, memory full, swap full [19:29:32] PROBLEM - dhclient process on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:29:33] the multiple phantomjs processes is probably a good indication that one of them was running a test [19:29:46] i think it's fine to stop it, if you're in there [19:29:58] !log change replication master of db1095 to db1065 [19:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:05] arlolra: ok [19:30:20] they probably need to cleanup the result of visualdiff runs [19:30:22] RECOVERY - dhclient process on ruthenium is OK: PROCS OK: 0 processes with command name dhclient [19:30:38] old runs [19:31:06] arlolra: you mean sending a SIGTERM to all phantomjs processes? they are not child of a common process [19:31:37] and I probably need to restart parsoid, their child died and was not able to restart them due to not available memory [19:32:12] PROBLEM - salt-minion processes on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:33:02] RECOVERY - salt-minion processes on ruthenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:33:17] hmm, they should have been spawned by testreduce [19:33:48] but, sure, send a signal to them all if you have to [19:33:59] arlolra: there is a /usr/bin/nodejs client-cluster.js -c 8 /etc/testreduce/parsoid-rt-client.config.js process with some childs [19:34:13] but the "node /srv/visualdiff/node_modules/phantomjs/bin/phantomjs" are not child of it [19:34:22] PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:32] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.110 second response time [19:35:32] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [19:35:33] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 50 minutes ago with 0 failures [19:35:52] !log killed 822 "/srv/visualdiff/node_modules/phantomjs/lib/phantom/bin/phantomjs" processes on ruthenium. 
RAM and swap full, host unresponsive [19:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:02] RECOVERY - configured eth on ruthenium is OK: OK - interfaces up [19:36:02] RECOVERY - MD RAID on ruthenium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:36:02] RECOVERY - DPKG on ruthenium is OK: All packages OK [19:36:12] RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:36:46] i see [19:37:09] !log branching 1.29.0-wmf.9 refs T154683 [19:37:09] it recovered immediately [19:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:12] T154683: MW-1.29.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T154683 [19:37:47] it would be this one /usr/bin/nodejs client-cluster.js -c 4 /etc/testreduce/parsoid-vd-client.config.js [19:37:58] -vd [19:38:38] https://www.mediawiki.org/wiki/Parsoid/Visual_Diffs_Testing [19:38:56] says we want sudo service parsoid-vd-client stop [19:39:11] and sudo service parsoid-vd stop [19:39:43] !log sudo service parsoid-vd stop on ruthenium [19:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:57] arlolra: done, and actually it was restarting swaping the subprocesses [19:40:14] s/swaping/spawning/ [19:40:31] 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966240 (10jcrespo) [19:40:57] so, something new was merged in the testreduce that is exploding? [19:41:40] pid 1661 looks like it shouldn't be there anymore [19:42:22] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:42:32] yeah I was checking the systemd unit, because it's back running [19:43:37] testreduce is a general thing to run tests of a large set of pages. 
we use it for parsoid roundtrip testing and, separately, for visualdiff'ing ... the latter produces a lot of large images on disk [19:44:00] 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966280 (10jcrespo) Adding @TTO and @Cenarium because they may know the actual right people to add to this ticket (probably not them) for the mediawiki bug side... [19:44:04] that would be https://github.com/wikimedia/integration-visualdiff [19:44:34] so the client is stopped, the server is still running "/usr/bin/nodejs server.js --config /etc/testreduce/parsoid-vd.settings.js", but that seems to be ok, I don't see any more spawning of processes [19:44:45] TimStarling or subbu were probably running it for https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy [19:45:44] it was killing the server, so if it's something that we run often we might have some regression, otherwise maybe it was just too aggressive [19:45:55] * volans brb [19:48:10] volans: they've respawned! as long as there are jobs queued with the server this'll probably continue [19:49:00] arlolra: should I stop the server too? 
[19:49:09] parsoid-vd I mean [19:50:10] yes [19:50:33] let's just stop it all and i'll let them know [19:50:51] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#2966308 (10EBernhardson) [19:51:08] (03PS1) 10EBernhardson: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) [19:51:54] !log ruthenium: stopped parsoid-vd and parsoid-vd-client to avoid uncontrolled spawning of phantomjs childs [19:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:11] (03CR) 10EBernhardson: "The log4j2 properties file was tested in vagrant against a 5.x instance. It looks to do as necessary, but i've never worked with log4j2 be" [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [19:52:28] (03PS1) 10Jcrespo: mariadb: Set binlog_format to STATEMENT for db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333970 (https://phabricator.wikimedia.org/T156008) [19:52:43] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#2966323 (10EBernhardson) [19:53:16] arlolra: done, notifying them, thanks for the help! [19:53:57] np, thank you, glad it's under control [19:54:12] I'll keep an eye on it for a bit [20:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T2000). 
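For the ruthenium cleanup discussed above, the stray phantomjs processes did not share a common parent, so they had to be picked out by command line rather than by process tree. A minimal sketch of that idea follows, assuming a Linux host with a POSIX-style `ps`; the helper names `matching_pids` and `terminate_matching` are hypothetical, and this is not the command volans actually ran.

```python
import os
import signal
import subprocess


def matching_pids(ps_output, needle):
    """Return PIDs whose command line contains `needle`.

    `ps_output` is text as printed by `ps -e -o pid= -o args=`: one process
    per line, PID first, then the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and needle in parts[1]:
            pids.append(int(parts[0]))
    return pids


def terminate_matching(needle, dry_run=True):
    """Send SIGTERM to every matching process; only list them if dry_run."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        check=True, capture_output=True, text=True,
    ).stdout
    # Skip our own PID in case the needle appears in this script's arguments.
    pids = [p for p in matching_pids(out, needle) if p != os.getpid()]
    if not dry_run:
        for pid in pids:
            os.kill(pid, signal.SIGTERM)
    return pids
```

Calling `terminate_matching("phantom/bin/phantomjs")` with the default `dry_run=True` only reports what would be signalled. As the log shows, killing the children is not enough while the parsoid-vd services (or puppet) keep respawning them, so the services have to be stopped as well.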
[20:00:54] (03CR) 10Jcrespo: [C: 032] mariadb: Set binlog_format to STATEMENT for db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333970 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo) [20:01:27] (03PS1) 10Chad: Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 [20:04:52] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Service[apparmor] [20:05:03] (03PS1) 10Chad: Swap wmfwiki docroot to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/333974 [20:05:28] (03CR) 10BearND: "Generally, this looks good to me from a regex perspective, just a minor nit inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema) [20:05:32] (03CR) 10Chad: [C: 04-1] "Also depends on finishing cleaning up existing docroot/foundation" [puppet] - 10https://gerrit.wikimedia.org/r/333974 (owner: 10Chad) [20:06:31] (03CR) 10Chad: [C: 032] Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 (owner: 10Chad) [20:08:02] (03Merged) 10jenkins-bot: Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 (owner: 10Chad) [20:08:16] (03CR) 10jenkins-bot: Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 (owner: 10Chad) [20:09:40] !log demon@tin Synchronized docroot: Adding new wikimediafoundation.org docroot (duration: 01m 05s) [20:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:12] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:10:22] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:10:37] twentyafterfour: I'll stop with my random docroot fixes, forgot it's train time :) [20:11:04] oh [20:11:06] yeah [20:11:14] I just ran `scap prep` [20:11:39] (03CR) 10Chad: [C: 031] "Furthermore, this is already a symlink to wikimedia.org, so we're just removing a layer of indirection :)" [puppet] - 10https://gerrit.wikimedia.org/r/333958 (owner: 10Chad) [20:17:08] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2966457 (10Dzahn) @bbogaert ``` -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 I can confirm that Riccard (volans) should have access to the Yubikey laptop referenced in Phab Ticket T123818, Zen Desk #9727. - -- Da... [20:17:31] (03PS1) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) [20:17:52] ostriches: all patches failed :-/ [20:18:50] No surprise :( [20:18:58] Need some help? [20:19:22] 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services (watching): Confirm attribution needs - https://phabricator.wikimedia.org/T150875#2966460 (10ZhouZ) Just as an updated reminder to this task. Our Terms of Use allows for attribution to text contr... [20:19:30] (03PS2) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) [20:19:32] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:23:43] godog: I got a very small beta-only docroot thing. Should I put it down for thurs' puppetswat? 
https://gerrit.wikimedia.org/r/#/c/333958/ [20:23:50] (or is it small enough we can jfdi?) [20:24:07] ostriches: I think I've got it [20:24:26] 👍 [20:25:06] (03PS3) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) [20:26:05] (03PS2) 10Dzahn: delete dumps.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940) [20:29:12] PROBLEM - MD RAID on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:42] PROBLEM - puppet last run on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:42] PROBLEM - parsoid on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:42] PROBLEM - Check systemd state on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:31:22] PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:54] damn... probably puppet restarted them [20:32:10] 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2966486 (10bbogaert) @Dzahn Ricard has the laptop. ``` byronicle:~ bbogaert$ gpg --verify confirm-volans.sig gpg: Signature made Tue Jan 24 12:14:38 2017 PST using RSA key ID F5F6A067 gpg: Good signature from "Daniel Za... 
[20:32:12] RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [20:32:52] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:33:32] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [20:33:32] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.408 second response time [20:33:33] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [20:34:02] RECOVERY - MD RAID on ruthenium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:34:49] (03PS1) 10Rush: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 [20:35:35] (03PS2) 10Rush: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 [20:36:37] ostriches: yeah if it is beta-only we can jfdi [20:37:12] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:38:01] (03Abandoned) 10Andrew Bogott: Revert "wmf_sink: Remove all ldap handling" [puppet] - 10https://gerrit.wikimedia.org/r/333660 (owner: 10Andrew Bogott) [20:38:31] (03CR) 10Yuvipanda: [C: 031] "presidented seal of approval" [puppet] - 10https://gerrit.wikimedia.org/r/333978 (owner: 10Rush) [20:38:49] (03PS3) 10BryanDavis: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 (owner: 10Rush) [20:39:29] yuvipanda: http://knowyourmeme.com/memes/seal-of-approval ? [20:39:40] (03PS4) 10BryanDavis: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [20:39:53] godog: Sweet. 
It's https://gerrit.wikimedia.org/r/#/c/333958/ :) [20:40:09] mutante: more like https://www.theguardian.com/us-news/2016/dec/19/unpresidented-trump-word-definition [20:40:09] (03CR) 10BryanDavis: [C: 031] "done messing with commit message" [puppet] - 10https://gerrit.wikimedia.org/r/333978 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [20:40:36] (03CR) 10Rush: [V: 032 C: 032] tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush) [20:40:47] yuvipanda: oh wow, word of the year even [20:40:57] (03PS2) 10Filippo Giunchedi: beta: standardize deployment.wikimedia.beta.wmflabs.org docroot [puppet] - 10https://gerrit.wikimedia.org/r/333958 (owner: 10Chad) [20:47:32] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:48:05] 06Operations, 06Parsing-Team: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#2966544 (10Volans) [20:49:08] (03PS1) 10Andrew Bogott: Move labtestweb openstack::version to newton [puppet] - 10https://gerrit.wikimedia.org/r/333979 [20:49:18] !log disabled puppet on ruthenium to avoid the restart of parsoid-vd and parsoid-vd-client processes T156177 [20:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:22] T156177: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177 [20:49:23] !log twentyafterfour@tin Started scap: test wikis to 1.29.0-wmf.9 refs T155525 [20:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:27] T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525 [20:50:08] (03Abandoned) 10Gilles: Fix mechanism to disable default nginx configuration [puppet/nginx] - 10https://gerrit.wikimedia.org/r/333909 (https://phabricator.wikimedia.org/T154270) (owner: 10Gilles) 
[20:51:21] (03PS2) 10Andrew Bogott: Move labtestweb openstack::version to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/333979 [20:51:39] (03CR) 10Filippo Giunchedi: [C: 032] beta: standardize deployment.wikimedia.beta.wmflabs.org docroot [puppet] - 10https://gerrit.wikimedia.org/r/333958 (owner: 10Chad) [20:52:10] ostriches: ^ [20:53:30] godog: cool thanks. I'll verify in beta [20:53:45] ostriches: np [20:54:10] This adventure is nearing completion :) [20:54:58] (03PS1) 10Yuvipanda: tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 [20:55:00] (03CR) 10Andrew Bogott: [C: 032] Move labtestweb openstack::version to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/333979 (owner: 10Andrew Bogott) [20:55:02] https://www.quora.com/What-are-the-best-Phabricator-macros-memes [20:55:06] (03PS3) 10Andrew Bogott: Move labtestweb openstack::version to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/333979 [20:55:44] https://www.quora.com/What-are-the-best-Phabricator-Pokemon-for-use-in-code-reviews?redirected_qid=1319946 [20:56:05] brion, did you see my last comment? https://phabricator.wikimedia.org/T155750 [20:56:15] should I open a new report? 
[20:56:30] (03CR) 10Andrew Bogott: [C: 031] tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 (owner: 10Yuvipanda) [20:56:32] (03PS2) 10Yuvipanda: tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 [20:56:33] mutante: all the copyrights \o/ [20:56:39] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 (owner: 10Yuvipanda) [20:57:06] p858snake|: well, i just traced back the origin of "seal of approval" to a Flickr account, and Flickr should be fine to import to commons right :p [20:57:30] if the licensing for the upload on flickr allows it [20:57:58] "Fixes for latent bugs which don't manifest in impactful ways should be accepted with Metapod or Kakuna." [20:58:02] (03PS3) 10Yuvipanda: tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 [20:58:07] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 (owner: 10Yuvipanda) [20:58:08] flickr uploads aren't CC- by default iirc [21:01:35] p858snake|: right.. and sad as it is, this one has "All rights reserved" on it [21:01:36] godog: Force-ran puppet on the beta apaches, picked up the change, everything working just fine :D [21:02:32] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 52 seconds ago with 6 failures. 
Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/openstack_auth/backend.py],File[/etc/openstack-dashboard/keystone_policy.json],File[/usr/share/openstack-dashboard/openstack_dashboard/local/enabled/_1925_puppet_prefix_panel.py],File[/usr/share/openstack-dashboard/openstack_das [21:02:56] ostriches: nice \o/ [21:03:02] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [21:03:03] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [21:03:19] is there any mediawiki deployment in progress? [21:03:23] p858snake|: of course it's on reddit, deviantart, twitter, imgur and > 3,800 other pages anyways [21:03:38] ostriches^ ? [21:03:51] twentyafterfour is conducting the train [21:04:00] sorry [21:04:05] still on it, I assume [21:04:23] Probably, I'll let him give a more exact status :) [21:04:28] no need [21:04:46] will got away, twentyafterfour ping me when done (I may be away) [21:05:24] I am not in hurry, I just do not want to collide [21:07:38] /13/8 [21:07:46] oh, it is 2 hour window [21:07:56] that is my mistake [21:16:04] (03PS3) 10Brian Wolff: Expand Content-Security-Policy on upload test to fr. [puppet] - 10https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618) [21:20:46] hhvm on mw1290 is unhappy [21:20:55] Syntax Error: Couldn't find trailer dictionary [21:21:09] Syntax Error: Couldn't read xref table [21:21:40] Eh, info-level, but seems specific to 1290 [21:22:01] !log twentyafterfour@tin Finished scap: test wikis to 1.29.0-wmf.9 refs T155525 (duration: 32m 37s) [21:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:08] T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525 [21:22:59] should I restart hhvm there? 
[21:23:14] I dunno, I can't find it [21:23:18] Might've been transient [21:25:15] (03PS1) 10Andrew Bogott: Horizon: Forward some custom files from liberty [puppet] - 10https://gerrit.wikimedia.org/r/333983 [21:25:52] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [21:26:07] jynus: Seems to have just been a brief thing at 21:22, stopped completely. Transient :) [21:26:44] (03CR) 10Chad: [C: 032] Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 (owner: 10Chad) [21:27:01] (03CR) 10Andrew Bogott: [C: 032] Horizon: Forward some custom files from liberty [puppet] - 10https://gerrit.wikimedia.org/r/333983 (owner: 10Andrew Bogott) [21:28:27] (03Merged) 10jenkins-bot: Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 (owner: 10Chad) [21:28:52] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [21:28:58] (03CR) 10jenkins-bot: Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 (owner: 10Chad) [21:30:55] (03PS1) 10Chad: Add .bash_profile to my homedir so my .bashrc works [puppet] - 10https://gerrit.wikimedia.org/r/333984 [21:31:00] !log demon@tin Synchronized docroot: Drop labs docroot, unused in prod (duration: 00m 44s) [21:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:19] (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 [21:32:21] (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 (owner: 1020after4) [21:32:37] (03PS1) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 [21:32:46] jynus: almost done here [21:33:10] 
(03PS3) 10Dzahn: delete dumps.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940) [21:33:50] (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 (owner: 1020after4) [21:34:01] (03CR) 10jenkins-bot: group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 (owner: 1020after4) [21:34:44] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.9 refs T155525 [21:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:48] T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525 [21:34:58] (03PS2) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 [21:35:11] (03PS1) 10Andrew Bogott: Horizon: Add mitaka version of the puppetpanel. [puppet] - 10https://gerrit.wikimedia.org/r/333987 [21:35:42] (03CR) 10Dzahn: [C: 032] "key deleted from private repo" [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn) [21:37:01] 06Operations, 07Puppet, 10Horizon, 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2966807 (10Andrew) - I will double-check the caching, although I'm pretty sure I verified that the cache was working previously. - I'm currently experimenting with the next rev of Hor... [21:37:33] (03PS2) 10Dzahn: add netmon1002 to site [puppet] - 10https://gerrit.wikimedia.org/r/333780 [21:37:56] (03CR) 10Andrew Bogott: [C: 032] Horizon: Add mitaka version of the puppetpanel. [puppet] - 10https://gerrit.wikimedia.org/r/333987 (owner: 10Andrew Bogott) [21:38:03] (03PS2) 10Andrew Bogott: Horizon: Add mitaka version of the puppetpanel. [puppet] - 10https://gerrit.wikimedia.org/r/333987 [21:38:10] jynus: all done [21:41:40] twentyafterfour, thanks! 
[21:42:25] (03PS4) 10Dzahn: openstack: instancersync not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/332954 [21:42:32] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [21:43:48] !log Finished group0 to wmf/1.29.0-wmf.9 (refs T15525) Changelog: https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.9/Changelog [21:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:52] T15525: Category Sorting Incorrectly - https://phabricator.wikimedia.org/T15525 [21:44:04] (03PS3) 10Chad: dumps: Add a favicon (using the wmf one) [puppet] - 10https://gerrit.wikimedia.org/r/333080 [21:44:10] (03CR) 10Dzahn: [C: 032] openstack: instancersync not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn) [21:44:39] (03PS3) 10Dzahn: openstack: designate/glance/keystone not in autoload module [puppet] - 10https://gerrit.wikimedia.org/r/332955 [21:45:26] ugh I've been ref'ing the wrong tasks :-/ [21:45:26] (03PS4) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) [21:46:59] (03CR) 10Jcrespo: [C: 032] mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo) [21:47:12] PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:48:47] (03Merged) 10jenkins-bot: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo) [21:48:58] (03CR) 10jenkins-bot: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) (owner: 
10Jcrespo) [21:49:00] ^ andrewbogott labtestweb is puking on itself a bit, is that you? [21:49:12] definitely me [21:49:40] (03PS3) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 [21:50:12] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK [21:50:33] (03CR) 10Dzahn: [C: 032] openstack: designate/glance/keystone not in autoload module [puppet] - 10https://gerrit.wikimedia.org/r/332955 (owner: 10Dzahn) [21:50:46] andrewbogott: kk [21:51:06] it should clear in a minute or two, everything looks fine locally [21:51:54] (03CR) 10Rush: [C: 032] labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 (owner: 10Rush) [21:52:01] (03PS4) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 [21:52:05] (03CR) 10Rush: [V: 032 C: 032] labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 (owner: 10Rush) [21:52:52] (03CR) 10Dzahn: [C: 032] dumps: Add a favicon (using the wmf one) [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad) [21:52:58] (03PS4) 10Dzahn: dumps: Add a favicon (using the wmf one) [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad) [21:53:10] andrewbogott: no worries then just wasn't sure [21:55:08] !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1065 as dump/vslow & clean up s1 comments (duration: 00m 43s) [21:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:12] mutante: Thx, I see our new favicon now :) [21:56:15] no more 404 [21:56:59] Database::ping, that is new to me [21:57:15] Oldddddd function in MW :) [21:57:35] Falls back on mysqli_ping (or similar, depending on php extension you're using) [21:57:42] ostriches: :) i noticed the 404s well. nice! 
[22:07:30] (03PS1) 10Jcrespo: mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) [22:08:12] PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:08:33] mutante: Trivial thing for my homedir, if you've got a sec... https://gerrit.wikimedia.org/r/#/c/333984/ [22:09:18] (03PS2) 10Dzahn: Add .bash_profile to my homedir so my .bashrc works [puppet] - 10https://gerrit.wikimedia.org/r/333984 (owner: 10Chad) [22:09:46] (03CR) 10Dzahn: [V: 032 C: 032] Add .bash_profile to my homedir so my .bashrc works [puppet] - 10https://gerrit.wikimedia.org/r/333984 (owner: 10Chad) [22:10:57] ACKNOWLEDGEMENT - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages andrew bogott Upstream packages seem broken... work in progress. [22:12:32] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/demon/.bash_profile] [22:13:58] !log update RESTBase to 69065e2: staging [22:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:32] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. 
Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/openstack_auth/plugin/wmtotp.py],File[/usr/lib/python2.7/dist-packages/openstack_auth/backend.py],File[/usr/lib/python2.7/dist-packages/openstack_auth/forms.py],Package[openstack-dashboard] [22:16:57] greg-g: I just took a window from 18:00Z-19:00Z tomorrow for a Striker deploy [22:19:26] !log update RESTBase to 69065e2: canary on restbase1007 [22:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:10] !log update RESTBase to 69065e2 [22:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:05] (03PS2) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) [22:31:04] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [22:32:29] (03Merged) 10jenkins-bot: mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [22:32:39] (03CR) 10jenkins-bot: mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo) [22:33:21] bd808: neat :) [22:40:32] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:41:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 for reimage (duration: 00m 55s) [22:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:23] I guess that hhvm syntax thingie is wider than I thought.... [22:48:24] https://logstash.wikimedia.org/goto/daf20a1752e93bcb1186bd08916a01ec [22:49:58] Hmm, that error isn't hhvm, it's something with pdfs. 
[22:51:15] Definitely picked up in last few hours https://logstash.wikimedia.org/goto/6ff9b22efdd3ad556969e8806efb090a [22:57:42] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.199 second response time [22:58:42] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.390 second response time [23:04:02] !log reimage db1066 [23:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:47] (03Restored) 10Thcipriani: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani) [23:08:28] (03PS2) 10Thcipriani: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 [23:09:11] (03CR) 10Dzahn: "when i touched the deployment keys in private repo to change the passphrases, the file names disappeared from the comment column in ssh-ad" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani) [23:09:53] !log ebernhardson@tin Synchronized php-1.29.0-wmf.9/includes/specials/SpecialSearch.php: Update special:search security patch to not fatal (duration: 00m 44s) [23:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:19] (03PS3) 10Dzahn: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 (https://phabricator.wikimedia.org/T154943) (owner: 10Thcipriani) [23:10:55] jouncebot: next [23:10:55] In 0 hour(s) and 49 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T0000) [23:14:51] (03CR) 10Paladox: [C: 031] Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 (https://phabricator.wikimedia.org/T154943) (owner: 10Thcipriani) [23:15:01] (03CR) 10Dzahn:
[C: 032] "already cherry-picked on beta and tested on tin" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (https://phabricator.wikimedia.org/T154943) (owner: 10Thcipriani) [23:26:44] !log restarting db1052 for kernel upgrade [23:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:42] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.218 second response time [23:30:42] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.303 second response time [23:45:16] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2967328 (10Dzahn) I used these 2 boxes to test install from install1001 (instead of carbon). The installer started fine on 1003, then the install just fails at grub install for unknown an... [23:46:17] 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2967330 (10Dzahn) I also tested with prometheus1003 if the installer starts. It does.. (fails later at grub install but not related to this here). 
[23:49:15] !log analytics1015 (unused spare system) - use for test OS install [23:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:57] !log carbon stopping DHCP [23:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:56] !log carbon - stopping puppet, stopping atftpd [23:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:02] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:35] ACKNOWLEDGEMENT - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstall [23:53:42] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [23:54:32] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [23:55:42] PROBLEM - puppet last run on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:02] PROBLEM - dhclient process on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:02] PROBLEM - configured eth on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:12] PROBLEM - DPKG on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:17] (03CR) 10Gergő Tisza: "Yet another configuration setting is a pretty ugly solution, I don't have any better one though. 
(Ideally we would just detect that $authM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29) [23:56:22] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:22] PROBLEM - salt-minion processes on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:22] PROBLEM - MD RAID on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:32] PROBLEM - Check size of conntrack table on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:56:32] PROBLEM - Disk space on analytics1015 is CRITICAL: Return code of 255 is out of bounds [23:58:16] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2967350 (10Dzahn) [23:58:20] 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2967348 (10Dzahn) 05Open>03Resolved finally tested with analytics1015 (unused spare system), installed trusty image from install1001. services on carbon were down too. resolving now [23:58:42] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.589 second response time [23:59:42] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.895 second response time