[00:00:26] <wikibugs>	 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 05Mediawiki SWAT Deployments: Clarify SWAT process for testing maintence script changes (to not use mwdebug* hosts) - https://phabricator.wikimedia.org/T153316#2963567 (10greg)
[00:02:38] <wikibugs>	 (03PS10) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380)
[00:03:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn)
[00:06:44] <wikibugs>	 (03PS11) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380)
[00:07:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn)
[00:07:43] <mutante>	 what now :p
[00:11:04] <wikibugs>	 (03PS12) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380)
[00:12:00] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: repool db1065 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333812 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo)
[00:14:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn)
[00:14:54] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: repool db1065 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333812 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo)
[00:15:12] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: repool db1065 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333812 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo)
[00:17:56] <wikibugs>	 (03PS13) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380)
[00:26:15] <icinga-wm>	 PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:26:25] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 with low load after reimage (duration: 00m 45s)
[00:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:31] <wikibugs>	 (03CR) 10Gergő Tisza: "This would disable account creation, not login. IIRC account creation on loginwiki has already been disabled for a long time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29)
[00:28:35] <icinga-wm>	 RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[00:29:48] <wikibugs>	 (03CR) 10Gergő Tisza: "On second thought you'll have to do something with canAuthenticateNow() whether you remove login providers or not, since that's what cont" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29)
[00:54:15] <icinga-wm>	 RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[00:55:15] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:55:45] <icinga-wm>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:27] <wikibugs>	 (03PS4) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632)
[01:03:30] <wikibugs>	 (03CR) 10Dzahn: "good point. doing that!" [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn)
[01:04:19] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "hold on .. rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn)
[01:04:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn)
[01:06:47] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "seems base module changed not long ago" [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn)
[01:11:15] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on db1026 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[01:12:15] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on db1026 is OK: OK ferm input default policy is set
[01:12:37] <wikibugs>	 (03PS1) 10Dzahn: typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822
[01:13:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 (owner: 10Dzahn)
[01:18:58] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki gotwiki (T45917)
[01:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:02] <stashbot>	 T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[01:21:55] <icinga-wm>	 PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:23:15] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:24:45] <icinga-wm>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[01:37:24] <wikibugs>	 (03PS14) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380)
[01:38:15] <wikibugs>	 (03CR) 10Dzahn: [V: 032 C: 032] typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 (owner: 10Dzahn)
[01:38:42] <wikibugs>	 (03PS2) 10Dzahn: typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822
[01:38:54] <wikibugs>	 (03CR) 10Dzahn: [V: 032 C: 032] typos: add rysnc, rsnyc, wikimeda [puppet] - 10https://gerrit.wikimedia.org/r/333822 (owner: 10Dzahn)
[01:42:59] <wikibugs>	 (03PS1) 10Dzahn: typos: fix "rysnc", "wikimeda" [puppet] - 10https://gerrit.wikimedia.org/r/333825
[01:43:29] <wikibugs>	 (03PS4) 10Volans: Initial import with the first version [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588)
[01:43:47] <wikibugs>	 (03PS2) 10Dzahn: typos: fix "rysnc", "wikimeda" [puppet] - 10https://gerrit.wikimedia.org/r/333825
[01:44:22] <wikibugs>	 (03CR) 10Volans: "@godog: thanks for the review!" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans)
[01:47:13] <wikibugs>	 (03PS3) 10Dzahn: gerrit/lists/microsite/rolematcher: fix "rysnc", "wikimeda" typos [puppet] - 10https://gerrit.wikimedia.org/r/333825
[01:47:25] <wikibugs>	 (03CR) 10Dzahn: [C: 032] gerrit/lists/microsite/rolematcher: fix "rysnc", "wikimeda" typos [puppet] - 10https://gerrit.wikimedia.org/r/333825 (owner: 10Dzahn)
[01:50:55] <icinga-wm>	 RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[01:51:35] <icinga-wm>	 PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:56:49] <wikibugs>	 (03PS15) 10Dzahn: aptrepo: setup rsync between 2 APT servers [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380)
[01:59:31] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "as intended, it adds config on carbon/install2001, but not on install1001  http://puppet-compiler.wmflabs.org/5192/" [puppet] - 10https://gerrit.wikimedia.org/r/333676 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn)
[02:02:04] <mutante>	 eh. "nice" SERVER: Invalid relationship: 
[02:02:19] <mutante>	 not caught by compiler
[02:02:43] <mutante>	 but i see the problem
[02:03:35] <icinga-wm>	 PROBLEM - puppet last run on install1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:04:09] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on install1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn WIP
[02:04:10] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on install2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn WIP
[02:11:25] <volans>	 mutante: if you need help just ping me
[02:11:59] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "Not sure that one makes sense. We usually use the Wikimedia icon instead of the Meta-Wiki icon. Except for community projects. E.g. doc.wi" [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad)
[02:12:14] <mutante>	 volans: thank you, i got it
[02:12:30] <volans>	 ok :)
[02:13:05] <wikibugs>	 (03PS1) 10Dzahn: aptrepo:rsync: fix 'Invalid relationship' and ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/333830
[02:13:11] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "(Same for NOC)" [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad)
[02:14:44] <wikibugs>	 (03PS2) 10Dzahn: aptrepo:rsync: fix 'Invalid relationship' and ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/333830
[02:15:50] <wikibugs>	 (03CR) 10Dzahn: [C: 032] aptrepo:rsync: fix 'Invalid relationship' and ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/333830 (owner: 10Dzahn)
[02:18:13] <mutante>	 icinga-wm: sup
[02:18:35] <icinga-wm>	 RECOVERY - puppet last run on install1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[02:18:38] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.8) (duration: 06m 40s)
[02:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:35] <icinga-wm>	 RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[02:21:42] <wikibugs>	 (03CR) 10Krinkle: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/332707 (owner: 10Chad)
[02:23:01] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jan 24 02:23:01 UTC 2017 (duration 4m 23s)
[02:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:47] <wikibugs>	 (03PS5) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632)
[02:46:50] <wikibugs>	 (03PS6) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632)
[02:49:48] <wikibugs>	 (03PS7) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632)
[02:50:39] <mutante>	 volans: ^ but there is the unrelated thing, i had to change it because base became a "profile" meanwhile. so it wasn't in init.pp either anymore, but good point to move it
[02:50:48] <mutante>	 be back later for now
[02:51:03] <volans>	 ok, I'll take a look 
[02:51:13] <mutante>	 thx
[02:54:13] <wikibugs>	 (03CR) 10Volans: "Much nicer. I'm usually not a fan of true defaults and prefer false as a default (like skip_monitoring), but is a personal habit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn)
[02:54:49] <wikibugs>	 (03PS1) 10Dzahn: delete dumps.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940)
[02:56:12] <wikibugs>	 (03CR) 10Dzahn: hiera override to skip base icinga for test/decom hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn)
[02:56:28] <wikibugs>	 (03PS8) 10Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - 10https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632)
[03:22:05] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.79 seconds
[03:28:05] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 119.94 seconds
[03:35:25] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1823.533416 Seconds
[03:36:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 41.761426 Seconds
[03:40:25] <wikibugs>	 (03PS2) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack)
[03:46:01] <wikibugs>	 (03PS3) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack)
[04:11:07] <wikibugs>	 (03CR) 10NehalDaveND: "I am very sorry for this. But I forgot how to review patch. Can someone tell me how can I review this patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333640 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson)
[04:11:10] <wikibugs>	 (03CR) 10Niharika29: [C: 04-1] "Hmm, this seems like something which makes sense as a global. Do you think it'd be better off as a global? Perhaps $wgDisableLogin. I saw " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29)
[04:17:09] <wikibugs>	 (03PS4) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack)
[04:24:35] <icinga-wm>	 PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:34:51] <wikibugs>	 (03PS5) 10Volans: [WIP] discovery stuff [puppet] - 10https://gerrit.wikimedia.org/r/331789 (owner: 10BBlack)
[04:53:35] <icinga-wm>	 RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[05:05:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1810.51684 Seconds
[05:06:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[05:16:55] <wikibugs>	 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#2964028 (10Volans)
[05:30:55] <icinga-wm>	 PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:39:45] <icinga-wm>	 PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:42:25] <icinga-wm>	 PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:59:55] <icinga-wm>	 RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:08:45] <icinga-wm>	 RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:10:25] <icinga-wm>	 RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:20:21] <_joe_>	 !log repooling mw2098 after scap pull
[06:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:55] <wikibugs>	 (03PS6) 10Volans: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[06:24:37] <wikibugs>	 (03PS7) 10Volans: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[06:24:55] <icinga-wm>	 PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:37:55] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:39:05] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[06:43:06] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:43:15] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:44:04] <wikibugs>	 (03PS8) 10Volans: [WIP] DNS: service discovery [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[06:46:11] <wikibugs>	 (03CR) 10Volans: "Puppet compiler result:" [puppet] - 10https://gerrit.wikimedia.org/r/331789 (https://phabricator.wikimedia.org/T156100) (owner: 10BBlack)
[06:46:35] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2098 is OK: OK
[06:49:55] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:53:56] <icinga-wm>	 RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:06:36] <wikibugs>	 (03PS1) 10Marostegui: Revert "site.pp: db1052's binlog changed to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333849
[07:06:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "site.pp: db1052's binlog changed to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333849 (owner: 10Marostegui)
[07:08:05] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:10:15] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:11:50] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Disable RBR on db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006)
[07:11:59] <wikibugs>	 (03Abandoned) 10Marostegui: Revert "site.pp: db1052's binlog changed to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333849 (owner: 10Marostegui)
[07:13:15] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:15:15] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:19:39] <wikibugs>	 (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999)
[07:21:15] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:28:31] <wikibugs>	 (03PS2) 10Marostegui: site.pp: Disable RBR on db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006)
[07:30:30] <wikibugs>	 (03CR) 10Marostegui: "This compiles fine: https://puppet-compiler.wmflabs.org/5199/" [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[07:45:24] <wikibugs>	 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2964236 (10Dzahn) I moved the eqiad Ganglia aggregator from carbon to install1001 today. This part is unblocked.
[07:48:28] <wikibugs>	 06Operations, 10Continuous-Integration-Config, 06Operations-Software-Development, 13Patch-For-Review: tox-jessie is failing on operations/software - https://phabricator.wikimedia.org/T152549#2964240 (10hashar) 05Open>03Resolved
[07:50:24] <wikibugs>	 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964243 (10Dzahn)
[07:50:48] <wikibugs>	 (03CR) 10Hashar: "Thanks :-}" [software] - 10https://gerrit.wikimedia.org/r/325762 (https://phabricator.wikimedia.org/T152549) (owner: 10Hashar)
[07:51:15] <icinga-wm>	 PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:56:07] <wikibugs>	 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#2964262 (10Marostegui) >>! In T155769#2962307, @matmarex wrote: >>>! In T155769#2960504, @Marostegui wrote: >> If you guys consider it is safe to delete,...
[07:58:54] <wikibugs>	 (03PS4) 10Marostegui: mariadb: Split dbstore role classes [puppet] - 10https://gerrit.wikimedia.org/r/332228 (https://phabricator.wikimedia.org/T130128)
[08:03:45] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:10:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836)
[08:10:48] <wikibugs>	 (03CR) 10Marostegui: [C: 032] mariadb: Split dbstore role classes [puppet] - 10https://gerrit.wikimedia.org/r/332228 (https://phabricator.wikimedia.org/T130128) (owner: 10Marostegui)
[08:16:46] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836)
[08:20:15] <icinga-wm>	 RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:21:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove otto and elukey from eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/333242 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff)
[08:22:48] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui)
[08:24:19] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui)
[08:24:33] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333851 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui)
[08:25:16] <wikibugs>	 06Operations, 10ops-codfw: mw2098 drac offline - system unreachable - https://phabricator.wikimedia.org/T155688#2964302 (10MoritzMuehlenhoff) I've repooled the host.
[08:25:39] <_joe_>	 moritzm: I already repooled it this morning
[08:25:44] <_joe_>	 did I miss something?
[08:26:19] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: wmf-config/db-eqiad.php Add rack positions -  T155999 (duration: 00m 50s)
[08:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:23] <stashbot>	 T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999
[08:28:44] <wikibugs>	 (03PS1) 10Ema: Revert "Temporarily depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/333854
[08:28:56] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions -  T155999 (duration: 00m 41s)
[08:29:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:23] <moritzm>	 _joe_: no, you're right, re-looking at the confctl output it changed from yes to yes, gonna make some coffee :-)
[08:29:48] <_joe_>	 heh ok I thought I brainfarted earlier
[08:31:55] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[08:32:45] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:34:36] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005)
[08:35:20] <wikibugs>	 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Docker installation for production kubernetes - https://phabricator.wikimedia.org/T147181#2964318 (10Joe)
[08:35:23] <wikibugs>	 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2964317 (10Joe) 05stalled>03Resolved
[08:36:04] <wikibugs>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 15User-Joe, 07Wikimedia-Multiple-active-datacenters: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009#2961483 (10Joe) a:03Joe
[08:40:36] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) (owner: 10Marostegui)
[08:42:34] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) (owner: 10Marostegui)
[08:42:45] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333856 (https://phabricator.wikimedia.org/T156005) (owner: 10Marostegui)
[08:44:05] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1065 original weight - T156005 (duration: 00m 39s)
[08:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:10] <stashbot>	 T156005: Reimage db1065 and db1066 - https://phabricator.wikimedia.org/T156005
[08:54:55] <icinga-wm>	 PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:13:26] <wikibugs>	 06Operations, 10hardware-requests: hardware request for netmon1001 - https://phabricator.wikimedia.org/T156040#2962228 (10faidon) Thanks for being thorough @RobH and actually double-checking the disk usage :) Disk space usage is indeed minimal, but this box holds a lot of RRDs (for LibreNMS and currently Torru...
[09:13:53] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2964439 (10Gilles) 05Open>03Resolved Fixes for the 404 log coming  on a different task. I'm not seeing /temp 404s anymore in the swift logs.
[09:16:08] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Thumbor inexplicably 504s intermittently on files that render fine later - https://phabricator.wikimedia.org/T150746#2964444 (10Gilles) Might be related to the iowait issues investigated in T151851
[09:16:09] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300)
[09:17:15] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[09:17:34] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel)
[09:17:35] <icinga-wm>	 PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:17:57] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui)
[09:19:47] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui)
[09:19:57] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php Depool db2054 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333859 (https://phabricator.wikimedia.org/T153300) (owner: 10Marostegui)
[09:20:56] <wikibugs>	 (03PS1) 10MarcoAurelio: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729)
[09:21:06] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2054 - T153300 (duration: 00m 39s)
[09:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:10] <stashbot>	 T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300
[09:21:50] <marostegui>	 !log Alter table db2054 metawiki.pagelinks - T153300
[09:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:55] <icinga-wm>	 RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:24:04] <wikibugs>	 (03PS2) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696
[09:24:10] <addshore>	 marostegui: please give me a ping once you're done with mediawiki-config deploys as I would like to get https://gerrit.wikimedia.org/r/#/c/332917 out (without getting in your way)
[09:24:36] <marostegui>	 addshore: hey! I am done :)
[09:24:45] <addshore>	 marostegui: awesome!
[09:24:48] <wikibugs>	 (03CR) 10Addshore: [C: 032] Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[09:24:51] <marostegui>	 at least for the next couple of hours I think :)
[09:24:52] <wikibugs>	 (03PS8) 10Addshore: Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995)
[09:25:01] <wikibugs>	 (03CR) 10Addshore: [C: 032] Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[09:25:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel)
[09:25:18] <addshore>	 cool, this should only take a few mins (noop)
[09:25:46] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2964458 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done.
[09:26:20] <wikibugs>	 (03PS3) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696
[09:26:50] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[09:27:04] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2964460 (10Marostegui) Alerts silenced for 24 hours - I will re-enable them once the move is done.
[09:28:08] <wikibugs>	 (03CR) 10jenkins-bot: Prepare to enable InterwikiSorting on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332917 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[09:28:37] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964461 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2029.codfw.wmnet'] ``` T...
[09:29:05] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[09:30:51] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 1/4 noop (duration: 00m 53s)
[09:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:55] <stashbot>	 T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995
[09:32:00] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 2/4 noop (duration: 00m 41s)
[09:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:00] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 3/4 noop (duration: 00m 40s)
[09:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:57] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:332917|T155995 Prepare to enable InterwikiSorting on beta cluster]] 4/4 noop (duration: 00m 38s)
[09:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:18] <addshore>	 All done there, and all looks good!
[09:35:05] <icinga-wm>	 PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:35:18] <wikibugs>	 (03CR) 10Hashar: "Indeed "bundle exec rake puppetlint" process the whole tree + submodules and choke on them.  I already have patches to fix the submodules:" [puppet] - 10https://gerrit.wikimedia.org/r/331239 (https://phabricator.wikimedia.org/T154915) (owner: 10Hashar)
[09:35:39] <hashar>	 contint2001: I haven't touched it
[09:36:15] <hashar>	 Attempt to assign to a reserved variable name: "trusted"
[09:36:18] <akosiaris>	 !log add /dev/sdb partitions to md RAID device on mw2251
[09:36:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:30] <akosiaris>	 hashar: yeah.. known. just rerun puppet
[09:36:42] <akosiaris>	 it's a damn puppet+puppetdb bug
[09:36:46] <hashar>	 ;-D
[09:36:59] <hashar>	 indeed it is all fine now
[09:37:00] <hashar>	 thanks!
[09:37:02] <akosiaris>	 it's happening randomly
[09:37:05] <icinga-wm>	 RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:37:30] <akosiaris>	 IIRC the upstream bug has been closed as WONTFIX
[09:37:32] <_joe_>	 that happens when a connection to puppetdb fails IIRC
[09:37:41] <_joe_>	 yes, that too
[09:38:09] <akosiaris>	 ah yes, puppet inserting the "trusted" fact on the local yaml cache
[09:38:30] <akosiaris>	 WONTFIX cause "we shouldn't sanitize that" or something
[09:38:38] <akosiaris>	 need to reread the damn bug
[09:41:41] <wikibugs>	 (03PS1) 10DCausse: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515)
[09:46:35] <icinga-wm>	 RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:47:55] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[09:48:03] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2029.codfw.wmnet'] ```  and were **ALL** successful.
[09:49:22] <wikibugs>	 (03PS1) 10Faidon Liambotis: raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866
[09:50:01] <wikibugs>	 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2964509 (10akosiaris) Yeah this has been happening for days. The disk is not yet kicked out of the array, which buffles me since the dmesg has many  ``` [1636325.780704] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0...
[09:51:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis)
[09:53:17] <akosiaris>	 !log mark /dev/sdb as faulty on md devices on bast3001 T154603
[09:53:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:21] <stashbot>	 T154603: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603
[09:54:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access credentials for junikowski [puppet] - 10https://gerrit.wikimedia.org/r/333868 (https://phabricator.wikimedia.org/T152957)
[09:55:15] <icinga-wm>	 PROBLEM - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 3, Spare: 0
[09:55:16] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on bast3001 is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 3, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T156116
[09:55:20] <wikibugs>	 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T156116#2964513 (10ops-monitoring-bot)
[09:55:27] <wikibugs>	 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2964517 (10akosiaris) Forced the disk as failed. I suppose we should schedule a replacement. In the meantime bast3001 will work at reduced redundancy, which is fine given we got another 3 bast boxes
[09:55:54] <akosiaris>	 hmmm ops-monitoring-bot decided to create a new task.. let's merge it in
[09:56:33] <wikibugs>	 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#2964519 (10akosiaris)
[09:56:36] <wikibugs>	 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T156116#2964521 (10akosiaris)
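The patch announced above ("raid: also check for State: degraded in md arrays") is not shown in the log. As a rough illustration of the idea only, here is a minimal Python sketch that looks for a `degraded` flag in the `State :` line of an `mdadm --detail`-style report. The function names and the sample text are assumptions made for this example; they are not taken from the actual check.

```python
import re

# Illustrative sample shaped like an `mdadm --detail` report (not real output from bast3001).
SAMPLE_REPORT = """\
/dev/md0:
     Raid Level : raid1
          State : clean, degraded
 Active Devices : 1
"""


def md_array_state(report):
    """Return the state flags listed on the 'State :' line, or [] if there is none."""
    match = re.search(r"^\s*State\s*:\s*(.+)$", report, re.MULTILINE)
    if match is None:
        return []
    return [flag.strip() for flag in match.group(1).split(",")]


def is_degraded(report):
    """True when the array reports itself as degraded, even if no disk is marked failed."""
    return "degraded" in md_array_state(report)


if __name__ == "__main__":
    print(md_array_state(SAMPLE_REPORT), is_degraded(SAMPLE_REPORT))
```

Checking the state flag in addition to the failed-device count would catch the situation described for bast3001, where the disk was misbehaving but had not yet been kicked out of the array.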
[10:04:41] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964533 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2030.codfw.wmnet'] ``` T...
[10:05:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for junikowski [puppet] - 10https://gerrit.wikimedia.org/r/333868 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff)
[10:05:14] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2029 is OK: SSL OK - Certificate elastic2029.codfw.wmnet valid until 2022-01-23 10:04:08 +0000 (expires in 1824 days)
[10:07:44] <wikibugs>	 06Operations: Optional expiry date for user accounts - https://phabricator.wikimedia.org/T142816#2964535 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[10:14:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Add account expiry dates for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/333872 (https://phabricator.wikimedia.org/T142816)
[10:16:38] <wikibugs>	 (03PS4) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696
[10:17:59] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004)
[10:19:34] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "wait until around 13:00UTC. There is SWAT at 14:00UTC so we need to push this before that, as the move is scheduled for 14:00UTC with Chri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[10:20:17] <wikibugs>	 (03CR) 10DCausse: [C: 031] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel)
[10:25:26] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964547 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2030.codfw.wmnet'] ```  and were **ALL** successful.
[10:26:34] <icinga-wm>	 PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:27:22] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2030 is OK: SSL OK - Certificate elastic2030.codfw.wmnet valid until 2022-01-23 10:26:18 +0000 (expires in 1824 days)
[10:30:09] <wikibugs>	 (03PS10) 10Juniorsys: mediawiki module: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645)
[10:30:17] <wikibugs>	 (03PS11) 10Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[10:38:49] <wikibugs>	 (03PS5) 10Gehel: elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696
[10:40:10] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964568 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2031.codfw.wmnet'] ``` T...
[10:41:12] <wikibugs>	 (03CR) 10Gehel: [C: 032] elasticsearch - increase size of GC logs [puppet] - 10https://gerrit.wikimedia.org/r/333696 (owner: 10Gehel)
[10:41:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add account expiry dates for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/333872 (https://phabricator.wikimedia.org/T142816) (owner: 10Muehlenhoff)
[10:41:30] <wikibugs>	 (03PS2) 10Muehlenhoff: Add account expiry dates for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/333872 (https://phabricator.wikimedia.org/T142816)
[10:41:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878
[10:48:29] <wikibugs>	 (03PS1) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880
[10:48:52] <wikibugs>	 06Operations, 10Monitoring, 10Traffic, 07Wikimedia-Incident: Plot number of cached objects on  a per-server per-DC basis - https://phabricator.wikimedia.org/T154864#2964613 (10ema) 05Open>03Resolved @fgiunchedi added per-host stats as well: https://grafana.wikimedia.org/dashboard/db/varnish-machine-sta...
[10:49:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey)
[10:50:43] <elukey>	 sigh
[10:51:08] <TabbyCat>	 jerkins-bot lol
[10:51:26] <wikibugs>	 (03PS1) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882
[10:51:44] <wikibugs>	 (03PS2) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995)
[10:52:11] <elukey>	 woa operations-puppet-typos is very nice
[10:52:13] <wikibugs>	 (03CR) 10Ema: raid: also check for State: degraded in md arrays (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333866 (owner: 10Faidon Liambotis)
[10:52:55] <addshore>	 equiad!
[10:55:32] <icinga-wm>	 RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:56:31] <wikibugs>	 (03PS3) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995)
[10:56:52] <wikibugs>	 (03PS2) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880
[10:58:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey)
[10:59:04] <ema>	 elukey: almost there! :)
[11:00:17] <elukey>	 ema: for some reason the first time puppet parser validate and puppet-lint were fine on my laptop, then syntax error. Now I tried to fix it, puppet-lint warnings :P
[11:00:27] <elukey>	 the main issue is behind the keyboard
[11:00:33] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2964645 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2031.codfw.wmnet'] ```  and were **ALL** successful.
[11:01:34] <elukey>	 and also pcc reminded me that I forgot the memcached prometheus exporter
[11:02:50] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878
[11:03:29] <ema>	 elukey: wikilove to you
[11:03:50] <elukey>	 ahhaha
[11:04:58] <wikibugs>	 (03PS1) 10Addshore: Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995)
[11:05:13] <wikibugs>	 (03PS4) 10Addshore: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995)
[11:06:06] <_joe_>	 ahahahahahah
[11:09:04] <wikibugs>	 (03PS1) 10Addshore: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995)
[11:09:25] <wikibugs>	 (03PS2) 10Addshore: Enable InterwikiSorting on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995)
[11:11:27] <TabbyCat>	 ostriches: ping
[11:11:43] <hashar>	 TabbyCat: he is sleeping for sure
[11:12:03] <TabbyCat>	 hashar: didn't know, sorry, will look for another phab admin then
[11:12:41] <TabbyCat>	 ori maybe ?
[11:12:46] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2031 is OK: SSL OK - Certificate elastic2031.codfw.wmnet valid until 2022-01-23 11:11:12 +0000 (expires in 1824 days)
[11:15:02] <TabbyCat>	 or greg-g ?
[11:15:08] <hashar>	 TabbyCat: they are all sleeping
[11:15:25] <hashar>	 TabbyCat: and ori no longer works for the wmf :(     Your best chance is to file a task
[11:15:37] <TabbyCat>	 hashar: he's still a phab admin
[11:15:58] <hashar>	 TabbyCat: add in #Project-Admins  / #Repository-Admins  I guess
[11:16:03] <hashar>	 and that should spam the proper set of folks
[11:16:04] <TabbyCat>	 I think I'll mail AKlapper and ask him to disable an account
[11:16:58] <hashar>	 what if he is not around ? :]
[11:17:21] <hashar>	 anyway lunch time for me &
[11:26:46] <icinga-wm>	 PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:28:45] <wikibugs>	 (03PS3) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880
[11:29:07] <_joe_>	 elukey: I'll take a look later
[11:29:27] * elukey sees incoming -1s :D 
[11:29:31] <elukey>	 thanks!
[11:29:42] <elukey>	 still running pcc to figure out if I am missing anything
[11:35:04] <wikibugs>	 (03PS4) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880
[11:42:22] <wikibugs>	 (03PS5) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880
[11:43:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "You need to change a new master of db1095 to ROW first." [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[11:43:52] <wikibugs>	 (03CR) 10Tobias Gritschacher: "* image template replacement from I2b9cef3d71 has been merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332904 (https://phabricator.wikimedia.org/T150184) (owner: 10Addshore)
[11:54:56] <icinga-wm>	 RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[11:56:00] <wikibugs>	 (03CR) 10Addshore: [C: 032] Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[11:56:20] <wikibugs>	 (03PS6) 10Elukey: Refactor role memcached in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880
[11:57:51] <wikibugs>	 (03Merged) 10jenkins-bot: Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[11:58:02] <wikibugs>	 (03CR) 10jenkins-bot: Populate InterwikiSortingInterwikiSortOrders with WB Client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333884 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[12:03:46] <wikibugs>	 (03Abandoned) 10Elukey: [WIP] Add temporary dc to Redis config to allow a eqiad replica [puppet] - 10https://gerrit.wikimedia.org/r/323807 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey)
[12:03:59] <wikibugs>	 (03Abandoned) 10Elukey: WIP - Add base Redis instance if no MW shard is configured. [puppet] - 10https://gerrit.wikimedia.org/r/332983 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey)
[12:05:00] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/extension-list-labs: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] 1/4 noop (duration: 00m 39s)
[12:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:05] <stashbot>	 T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995
[12:05:52] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] 2/4 noop (duration: 00m 39s)
[12:05:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:14] <TabbyCat>	 Dereckson: around?
[12:06:36] <Dereckson>	 Hi. Yes.
[12:06:37] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] 3/4 noop (duration: 00m 39s)
[12:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:09] <TabbyCat>	 Dereckson: hi, I wonder if you could run a server script in dry-mode only and paste the output?
[12:07:31] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/CommonSettings.php: T155995 [[gerrit:332917|Prepare to enable InterwikiSorting on beta cluster]] & [[gerrit:333884|Populate InterwikiSortingInterwikiSortOrders with WB Client]] 4/4 noop (duration: 00m 39s)
[12:07:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:50] <TabbyCat>	 Dereckson: it'd be for https://phabricator.wikimedia.org/T147915#2961853
[12:11:23] <wikibugs>	 (03CR) 10Addshore: [C: 032] Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[12:11:26] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[12:11:30] <Dereckson>	 This would output a list of global accounts, only logins, it seems okay on a privacy basis.
[12:12:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Add more email addresses and contacts for account extensions [puppet] - 10https://gerrit.wikimedia.org/r/333892
[12:12:26] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3092101 keys, up 85 days 3 hours - replication_delay is 0
[12:12:54] <wikibugs>	 (03Merged) 10jenkins-bot: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[12:13:04] <wikibugs>	 (03CR) 10jenkins-bot: Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333882 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[12:14:24] <TabbyCat>	 Dereckson: yep, nothing that special listusers wouldn't show you
[12:14:32] <Dereckson>	 TabbyCat: no, a dry run will need to be coordinated with j.ynus and m.arostegui, as it needs to iterate over 49 million accounts
[12:14:47] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T155995 [[gerrit:333882|Copy InterwikiSorting settings from wmgWikibaseClientSettings]] noop (duration: 00m 39s)
[12:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:51] <stashbot>	 T155995: Deploy InterwikiSorting extension to beta - https://phabricator.wikimedia.org/T155995
[12:14:53] <TabbyCat>	 uh
[12:15:04] <TabbyCat>	 that's bad
[12:15:22] <TabbyCat>	 Dereckson: subtask with dba?
[12:15:47] <Dereckson>	 https://phabricator.wikimedia.org/diffusion/ECAU/browse/master/maintenance/deleteEmptyAccounts.php;86ce123406becbfe9e60e9b7e6aa7785b6e81061$48
[12:16:21] <Raymond_>	 fatal error on https://de.wikipedia.org/wiki/Wikipedia:Festivalsommer/Galerie
[12:16:35] <Raymond_>	 "of type "ConfigException"
[12:16:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add more email addresses and contacts for account extensions [puppet] - 10https://gerrit.wikimedia.org/r/333892 (owner: 10Muehlenhoff)
[12:16:39] <Dereckson>	 addshore: ping ^
[12:16:45] <addshore>	 reverting
[12:17:02] <sjoerddebruin>	 https://nl.wikipedia.org/wiki/Wikipedia:Te_beoordelen_pagina%27s/Toegevoegd_20170111
[12:17:20] <addshore>	 syncing
[12:17:21] <ShakespeareFan00>	 Planned upgrade ?
[12:17:41] <_joe_>	 nope, a problem in a deploy
[12:17:47] <ShakespeareFan00>	 Got this when trying to save -" [WIdFrgpAADsAAj6FvFYAAABG] 2017-01-24 12:16:47: Fatal exception of type "ConfigException" "
[12:17:55] <logmsgbot>	 !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: Revert last (duration: 00m 39s)
[12:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:05] <_joe_>	 ShakespeareFan00: try again now?
[12:18:13] <addshore>	 looks like its back
[12:18:23] <Dereckson>	 GlobalVarConfig::get: undefined option: 'InterwikiSortingAlwaysSort'
[12:18:59] <Dereckson>	 could be a sync issue
[12:19:15] <_joe_>	 Dereckson: I don't think so?
[12:19:18] <addshore>	 Dereckson: ah, no, I see what the issue is.
[12:19:28] <Dereckson>	 addshore: you forget wgGlobalVarConfig::get: undefined option: 'InterwikiSortingAlwaysSort'
[12:19:36] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:19:38] <ShakespeareFan00>	 Also can I make a request for a 'font-deployment'?
[12:19:40] <Dereckson>	 addshore: you forget wgInterwikiSortingAlwaysSort?
[12:19:46] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[12:19:50] <_joe_>	 Dereckson: did it work on mwdebug?
[12:19:52] <addshore>	 Dereckson: that has been deliberately removed
[12:19:57] <ShakespeareFan00>	 I am trying to get FiraSans supported across Wikimedia projects
[12:20:05] <Dereckson>	 _joe_: addshore is reverting
[12:20:27] <addshore>	 But wikibase checks for the existence of 1 global, and if that exists it will load the rest. So in adding these, it tried loading the other one, which is actually not being added at all
[12:20:32] <_joe_>	 Dereckson: I know, I was trying to understand how we got to have an outage
[12:20:34] <addshore>	 Dereckson: _joe_ already reverted
[12:20:48] <Dereckson>	 addshore: did you test it on mwdebug1002 before syncing to prod?
[12:20:54] <addshore>	 Dereckson: yup
[12:21:36] <addshore>	 but it could be not all code paths hit this, I can write something up in a bit!
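The failure mode addshore describes (one global is checked for existence, and its presence triggers reads of other options that were never defined) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Wikibase or MediaWiki code: the classes below only mimic the error text quoted in the log, and `InterwikiSortingSort` is a hypothetical name for the option used as the existence check.

```python
class ConfigException(Exception):
    """Raised when a requested option is undefined (named after the error in the log)."""


class GlobalVarConfig:
    """Toy stand-in for a config object backed by prefixed global variables."""

    def __init__(self, global_vars, prefix="wg"):
        self._globals = global_vars
        self._prefix = prefix

    def has(self, name):
        return self._prefix + name in self._globals

    def get(self, name):
        if not self.has(name):
            raise ConfigException(
                f"GlobalVarConfig::get: undefined option: '{name}'"
            )
        return self._globals[self._prefix + name]


def load_sorting_settings(config):
    """Hypothetical loader: one sentinel option gates reading all the others."""
    if not config.has("InterwikiSortingSort"):  # hypothetical sentinel option
        return None  # settings absent, nothing is read
    return {
        "sort": config.get("InterwikiSortingSort"),
        # Read unconditionally once the sentinel exists; fails if it was never set.
        "always_sort": config.get("InterwikiSortingAlwaysSort"),
    }
```

With no globals defined the loader returns `None`; defining only the sentinel makes it raise `ConfigException` for `InterwikiSortingAlwaysSort`. That is also consistent with addshore's remark that a test on one host can pass if the code path performing the check is never exercised there.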
[12:21:42] <MatmaRex>	 this is filed as https://phabricator.wikimedia.org/T156123 already
[12:21:46] <wikibugs>	 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964809 (10akosiaris) Done. Now esams+eqiad use install1001 as DHCP server and ulsfo+codfw use install2001 as DHCP server.
[12:21:46] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:22:31] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964819 (10Dereckson) Caused by https://gerrit.wikimedia.org/r/#/c/333882/. Immediately reverted.
[12:22:34] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964822 (10matmarex) I think someone botched a deployment.
[12:23:26] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[12:23:35] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964840 (10matmarex)
[12:23:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
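The alerts above report a percentage of datapoints above a threshold (for example "11.11% of data above the critical threshold [1000.0]"). The log does not show how the check computes this, so the following Python sketch is only one plausible reading: count the non-empty datapoints in a window that exceed the threshold and compare that share with a percentage limit. The function names and the 1% limit are assumptions for illustration.

```python
def percent_over(datapoints, threshold):
    """Share (in percent) of non-empty datapoints strictly above the threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def classify(datapoints, critical_threshold, critical_percent=1.0):
    """Return (status, message); critical_percent is an assumed limit, not a known setting."""
    pct = percent_over(datapoints, critical_threshold)
    status = "CRITICAL" if pct >= critical_percent else "OK"
    message = f"{pct:.2f}% of data above the critical threshold [{critical_threshold}]"
    return status, message


if __name__ == "__main__":
    # Nine samples with one spike: 1 of 9 is above 1000.0.
    window = [120.0, 95.0, 110.0, 130.0, 1500.0, 140.0, 100.0, 90.0, 105.0]
    print(classify(window, 1000.0))
```

Under this reading, a single spike in a nine-sample window is enough to produce the 11.11% figure, which would explain why a short outage trips several of these alerts at once.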
[12:23:55] <akosiaris>	 !log switch all networks to use install1001, install2001 as DHCP relay endpoint. T156109
[12:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:59] <stashbot>	 T156109: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109
[12:24:08] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964848 (10MarcoAurelio)
[12:24:35] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown, 07Spike: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964849 (10Dereckson)
[12:25:16] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown, 07Spike, 07Wikimedia-Incident: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964774 (10Dereckson)
[12:25:19] <addshore>	 Dereckson: as I just reverted on tin I'll put it on gerrit now
[12:26:03] <wikibugs>	 (03CR) 10Addshore: [C: 04-1] "more pending changes needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995) (owner: 10Addshore)
[12:26:05] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown, 07Spike, 07Wikimedia-Incident: Fatal error on the French Wikibooks - https://phabricator.wikimedia.org/T156123#2964854 (10He7d3r) [Copying from the duplicated task] When I opened the following link today I got > [WIdFjApAAEUAAewxpqMAAABK] 2017-01-24 12:16:12: E...
[12:26:10] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown, 07Spike, 07Wikimedia-Incident: wgGlobalVarConfig::get: undefined option: 'InterwikiSortingAlwaysSort' exception - https://phabricator.wikimedia.org/T156123#2964855 (10Dereckson)
[12:26:12] <wikibugs>	 (03PS1) 10Addshore: Revert "Copy InterwikiSorting settings from wmgWikibaseClientSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895
[12:26:27] <wikibugs>	 (03CR) 10Addshore: [C: 032] "Already reverted on tin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 (owner: 10Addshore)
[12:27:05] <wikibugs>	 06Operations, 10DBA: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2964856 (10Marostegui)
[12:27:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Copy InterwikiSorting settings from wmgWikibaseClientSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 (owner: 10Addshore)
[12:27:52] <wikibugs>	 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964871 (10akosiaris) @dzahn, I think that part is done, please do some tests and then we can resolve
[12:28:11] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Copy InterwikiSorting settings from wmgWikibaseClientSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333895 (owner: 10Addshore)
[12:28:36] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:29:56] <Dereckson>	 addshore: b635f7075731f vs a678bc86b61 -> probably useful to reset the branch, like: try git fetch ; git log b635f7075731f..a678bc86b61 and if void: git reset a678bc86b61 ; git status
[12:30:16] <Dereckson>	 sorry I meant `git diff b635f7075731f a678bc86b61`
[12:30:27] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:30:46] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:30:48] <addshore>	 Dereckson: ack, just done!
[12:31:07] <Dereckson>	 TabbyCat: 47M/49M
[12:31:18] <addshore>	 First time I have had to revert something directly on tin and push it out fast..
[12:31:24] <wikibugs>	 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964878 (10Marostegui)
[12:31:46] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[12:31:48] <wikibugs>	 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (10Marostegui)
[12:32:31] <Dereckson>	 addshore: oh yes, you were right: reverting it on Tin and then syncing is the most urgent. Gerrit, etc. can wait until afterwards.
[12:35:44] <addshore>	 Dereckson: I'm guessing that was big enough to warrant a https://wikitech.wikimedia.org/wiki/Incident_documentation ?
[12:36:41] <Dereckson>	 TabbyCat: so, the script would delete 2148 accounts
[12:36:57] <Dereckson>	 addshore > yes, seems so
[12:37:10] <TabbyCat>	 Dereckson: still 2148 empty global accounts?
[12:37:13] <TabbyCat>	 wow
[12:37:23] <TabbyCat>	 results could be posted?
[12:37:28] <TabbyCat>	 phab paste?
[12:37:44] <TabbyCat>	 if concerned with something, make it visible just to you and me
[12:39:23] <elukey>	 addshore: are you guys going to write an incident report?
[12:39:36] <icinga-wm>	 PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[12:40:26] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3092089 keys, up 85 days 4 hours - replication_delay is 0
[12:40:34] <TabbyCat>	 Dereckson: I'm leaving now but you can reach me through phab conpherence if you need to, au revoir
[12:41:07] <addshore>	 elukey: yup, I will
[12:41:18] <wikibugs>	 (03PS1) 10Yuvipanda: tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897
[12:41:30] <elukey>	 thanks :)
[12:41:39] <addshore>	 elukey: my first one D:
[12:42:12] <elukey>	 it happens!
[12:42:26] <marostegui>	 addshore: are you done so I can push a depool to mediawiki-config?
[12:42:41] <addshore>	 marostegui: yup! everything is done & clean
[12:42:46] <marostegui>	 addshore: thanks! :)
[12:43:06] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[12:43:11] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004)
[12:48:02] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333873 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[12:48:33] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T156004 (duration: 00m 39s)
[12:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:37] <stashbot>	 T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004
[12:49:43] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004)
[12:51:09] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] contint/zuul: skip Icinga monitoring if server not master (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn)
[12:51:16] <wikibugs>	 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964931 (10Marostegui)
[12:51:29] <wikibugs>	 (03PS2) 10Hashar: contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn)
[12:51:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn)
[12:51:52] <moritzm>	 !log installing pcsc-lite security updates on trusty hosts (jessie already fixed a while ago)
[12:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:04] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[12:52:06] <wikibugs>	 (03PS3) 10Hashar: contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn)
[12:53:46] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[12:53:48] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[12:54:02] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333898 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[12:55:01] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T156004 (duration: 00m 39s)
[12:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:05] <stashbot>	 T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004
[12:56:29] <marostegui>	 !log Shutdown mysql on db1051 for maintenance - T156004
[12:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:31] <wikibugs>	 (03PS1) 10Cmjohnson: Updating dns for db1051 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333899
[13:00:44] <marostegui>	 !log Shutdown db1051 for maintenance - T156004
[13:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:48] <stashbot>	 T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004
[13:01:14] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Updating dns for db1051 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333899 (owner: 10Cmjohnson)
[13:03:36] <icinga-wm>	 PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:05:43] <wikibugs>	 (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004)
[13:09:50] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[13:11:22] <wikibugs>	 (03CR) 10Hashar: "Puppet compile is https://puppet-compiler.wmflabs.org/5209/" [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn)
[13:11:32] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[13:11:37] <wikibugs>	 (03PS1) 10Hoo man: Log time and shard number on Wikidata dump failure [puppet] - 10https://gerrit.wikimedia.org/r/333901
[13:11:42] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Update db1051 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333900 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[13:13:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897 (owner: 10Yuvipanda)
[13:14:00] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: wmf-config/db-codfw.php Change db1051 IP - T156004 (duration: 00m 39s)
[13:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:04] <stashbot>	 T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004
[13:15:47] <wikibugs>	 (03PS2) 10Faidon Liambotis: raid: also check for State: degraded in md arrays [puppet] - 10https://gerrit.wikimedia.org/r/333866
[13:16:00] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db1051 IP - T156004 (duration: 00m 39s)
[13:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:55] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965005 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2032.codfw.wmnet'] ``` T...
[13:28:17] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] Log time and shard number on Wikidata dump failure [puppet] - 10https://gerrit.wikimedia.org/r/333901 (owner: 10Hoo man)
[13:33:02] <wikibugs>	 (03PS1) 10Yuvipanda: tools: Use packages in k8s bastions [puppet] - 10https://gerrit.wikimedia.org/r/333904
[13:33:31] <wikibugs>	 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965022 (10Marostegui)
[13:33:34] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004#2965019 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely.  tendril updated  Thanks...
[13:33:36] <icinga-wm>	 RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[13:33:49] <addshore>	 Dereckson: elukey https://wikitech.wikimedia.org/wiki/Incident_documentation/20170124-WikibaseClient-InterwikiSorting In a meeting now but will post it around after
[13:36:01] <elukey>	 thanks!
[13:37:36] <marostegui>	 !log Shutdown mysql on db1052 for maintenance - T156006
[13:37:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:41] <stashbot>	 T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006
[13:37:51] <wikibugs>	 (03PS1) 10Yuvipanda: tools: Switch workers to using debs [puppet] - 10https://gerrit.wikimedia.org/r/333906
[13:41:04] <marostegui>	 !log Shutdown db1052 for maintenance - T156006
[13:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:12] <wikibugs>	 (03PS1) 10Cmjohnson: Updating dns for db1052 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333907
[13:42:38] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Updating dns for db1052 to coincide with rack change T156004 [dns] - 10https://gerrit.wikimedia.org/r/333907 (owner: 10Cmjohnson)
[13:43:21] <wikibugs>	 (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006)
[13:45:18] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[13:47:12] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[13:47:15] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965117 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2032.codfw.wmnet'] ```  and were **ALL** successful.
[13:48:23] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db1052 IP - T156006 (duration: 00m 39s)
[13:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:27] <stashbot>	 T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006
[13:48:58] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1052 IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333908 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[13:49:16] <wikibugs>	 (03PS1) 10Gilles: Fix mechanism to disable default nginx configuration [puppet/nginx] - 10https://gerrit.wikimedia.org/r/333909 (https://phabricator.wikimedia.org/T154270)
[13:49:17] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db1052 IP - T156006 (duration: 00m 39s)
[13:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix mechanism to disable default nginx configuration [puppet/nginx] - 10https://gerrit.wikimedia.org/r/333909 (https://phabricator.wikimedia.org/T154270) (owner: 10Gilles)
[13:51:41] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965131 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2033.codfw.wmnet'] ``` T...
[13:52:36] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2032 is OK: SSL OK - Certificate elastic2032.codfw.wmnet valid until 2022-01-23 13:50:49 +0000 (expires in 1824 days)
[13:57:03] <hashar>	 jouncebot: next
[13:57:04] <jouncebot>	 In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1400)
[13:57:35] <hashar>	 dcausse: go go go :)
[13:57:40] <dcausse>	 o/
[13:57:44] <dcausse>	 I can swat? :)
[13:57:58] <hashar>	 guess we can start yeah :]
[13:58:06] <hashar>	 zeljkof: I will do the swat :]
[13:58:43] <hashar>	 dcausse: wanna do the magic CR+2 / scap pull / scap sync-file dance?
[13:58:54] <dcausse>	 hashar: sure I can do that
[13:58:58] <hashar>	 great!
[13:59:08] <hashar>	 I am around if you need assistance
[13:59:28] <wikibugs>	 (03PS3) 10DCausse: [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142)
[14:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1400).
[14:00:04] <jouncebot>	 dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[14:00:11] <hashar>	 about that one, I think that namespaces have a property to define whether they are content
[14:00:20] <hashar>	 so in theory CirrusSearch could auto prioritize such namespaces
[14:00:41] <dcausse>	 hashar: yes... but I still don't know if I should do that
[14:01:02] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2010 is OK: OK ferm input default policy is set
[14:01:03] <dcausse>	 I'd like to find some use cases where such low boosts were actually useful
[14:01:32] <hashar>	 yup
[14:01:47] <zeljkof>	 hashar, dcausse: great, good luck with swat :)
[14:01:54] <dcausse>	 zeljkof: thanks :)
[14:03:41] <wikibugs>	 (03CR) 10DCausse: [C: 032] [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) (owner: 10DCausse)
[14:05:22] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) (owner: 10DCausse)
[14:05:36] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] Increase weigths for content namespaces on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332513 (https://phabricator.wikimedia.org/T155142) (owner: 10DCausse)
[14:07:42] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:42] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[14:10:31] <wikibugs>	 06Operations, 10DBA, 06Labs, 10netops, 13Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965171 (10Marostegui)
[14:10:34] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006#2965168 (10Marostegui) 05Open>03Resolved a:03Cmjohnson db1051 has been moved. DNS updated db-eqiad,codfw files updated mysql and replication started finely.  tendril updated  thanks...
[14:10:52] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965175 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2033.codfw.wmnet'] ```  and were **ALL** successful.
[14:13:40] <logmsgbot>	 !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T155142 [cirrus] Increase weigths for content namespaces on mw.org (duration: 00m 39s)
[14:13:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:44] <stashbot>	 T155142: Pages in the "Manual" namespace are ranked very poorly in MediaWiki.org search results - https://phabricator.wikimedia.org/T155142
[14:15:40] <wikibugs>	 (03PS2) 10DCausse: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515)
[14:15:59] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965194 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2034.codfw.wmnet'] ``` T...
[14:16:12] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2033 is OK: SSL OK - Certificate elastic2033.codfw.wmnet valid until 2022-01-23 14:14:34 +0000 (expires in 1824 days)
[14:17:40] <wikibugs>	 (03CR) 10DCausse: [C: 032] [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse)
[14:19:16] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse)
[14:19:26] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] properly set wgCirrusSearchUseIcuFolding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333863 (https://phabricator.wikimedia.org/T155515) (owner: 10DCausse)
[14:21:12] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004)
[14:23:21] <logmsgbot>	 !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: T155515 [cirrus] properly set wgCirrusSearchUseIcuFolding (duration: 00m 39s)
[14:23:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:26] <stashbot>	 T155515: Reindex el, en, fr and he wikis to enable ICU folding - https://phabricator.wikimedia.org/T155515
[14:26:03] <dcausse>	 !log EU SWAT Done
[14:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:01] <hashar>	 \O/
[14:29:16] <wikibugs>	 (03PS1) 10Elukey: Increase retry wait time for Hadoop Yarn Nodemanager checks [puppet] - 10https://gerrit.wikimedia.org/r/333912
[14:33:39] <wikibugs>	 (03PS1) 10Yuvipanda: tools: Use packages for kube-proxy on webproxies [puppet] - 10https://gerrit.wikimedia.org/r/333913
[14:36:11] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965225 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2034.codfw.wmnet'] ```  and were **ALL** successful.
[14:36:22] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004)
[14:38:22] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[14:39:45] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[14:39:56] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Repool db1051 with less weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333911 (https://phabricator.wikimedia.org/T156004) (owner: 10Marostegui)
[14:40:51] <wikibugs>	 (03PS25) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717)
[14:40:53] <wikibugs>	 (03PS25) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717)
[14:40:55] <wikibugs>	 (03PS26) 10BBlack: cache_misc req_handling: subpaths, cache policy, defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717)
[14:40:57] <wikibugs>	 (03PS10) 10BBlack: cache_misc: stream.wm.o subpathing for eventstreams [puppet] - 10https://gerrit.wikimedia.org/r/327550 (https://phabricator.wikimedia.org/T143925)
[14:41:14] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 with less weight - T156004 (duration: 00m 41s)
[14:41:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:17] <stashbot>	 T156004: Move db1051 to row B3 - https://phabricator.wikimedia.org/T156004
[14:43:41] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965249 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2035.codfw.wmnet'] ``` T...
[14:44:05] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999)
[14:44:23] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2034 is OK: SSL OK - Certificate elastic2034.codfw.wmnet valid until 2022-01-23 14:42:45 +0000 (expires in 1824 days)
[14:45:01] <wikibugs>	 (03PS2) 10Filippo Giunchedi: scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728)
[14:45:28] <wikibugs>	 (03PS2) 10Yuvipanda: tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897
[14:45:34] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Switch to using packages for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/333897 (owner: 10Yuvipanda)
[14:45:52] <wikibugs>	 (03PS2) 10Yuvipanda: tools: Switch workers to using debs [puppet] - 10https://gerrit.wikimedia.org/r/333906
[14:45:58] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Switch workers to using debs [puppet] - 10https://gerrit.wikimedia.org/r/333906 (owner: 10Yuvipanda)
[14:47:30] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui)
[14:49:05] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui)
[14:49:16] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333914 (https://phabricator.wikimedia.org/T155999) (owner: 10Marostegui)
[14:50:25] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add memcached aggregation and additional rules [puppet] - 10https://gerrit.wikimedia.org/r/333915
[14:50:39] <wikibugs>	 (03PS3) 10Marostegui: site.pp: Disable RBR on db1052 enable it on db1073 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006)
[14:50:44] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T155999 (duration: 00m 39s)
[14:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:48] <stashbot>	 T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999
[14:51:02] <wikibugs>	 06Operations, 10media-storage: Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136#2965275 (10ema)
[14:53:11] <wikibugs>	 06Operations, 10media-storage, 07Wikimedia-Incident: Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136#2965277 (10ema) p:05Triage>03Normal
[14:53:45] <godog>	 cmjohnson1: I'm going to depool ms-fe1001 
[14:53:54] <cmjohnson1>	 okay
[14:54:51] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1001.eqiad.wmnet
[14:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:11] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5212/ this compiles fine and changes db1052 to STATEMENT and db1073 to ROW" [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[14:55:38] <marostegui>	 !log Stop replication on db1052 and db1073 for maintenance - T156006
[14:55:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:42] <stashbot>	 T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006
[14:56:01] <godog>	 it'll take maybe 3-5 minutes to fully drain
[14:56:24] <wikibugs>	 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2965288 (10Cmjohnson) added a secondary switch, asw2-c2-eqiad.  accessible via scs port 48
[14:56:47] <wikibugs>	 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#2965291 (10jcrespo)
[14:58:03] <icinga-wm>	 RECOVERY - NTP on ms-be2010 is OK: NTP OK: Offset -0.0006507337093 secs
[15:00:33] <icinga-wm>	 PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:01:13] <godog>	 mhh doesn't look like ms-fe1001 is being depooled, checking
[15:02:26] <wikibugs>	 (03PS2) 10Yuvipanda: tools: Use packages for kube-proxy on webproxies [puppet] - 10https://gerrit.wikimedia.org/r/333913
[15:03:10] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965316 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2035.codfw.wmnet'] ```  and were **ALL** successful.
[15:04:08] <chasemp>	 !log recabling labstore1004/1005 eth1
[15:04:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:43] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2035 is OK: SSL OK - Certificate elastic2035.codfw.wmnet valid until 2022-01-23 15:04:24 +0000 (expires in 1824 days)
[15:07:32] <chasemp>	 !log drbdadm adjust test for 1004/1005 w/ 192.168.0.0/30
[15:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:44] <godog>	 yep looks like low-traffic primary lvs1003 didn't pick up the etcd change
[15:08:05] <godog>	 I'll try again
[15:08:09] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1001.eqiad.wmnet
[15:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:25] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Use packages for kube-proxy on webproxies [puppet] - 10https://gerrit.wikimedia.org/r/333913 (owner: 10Yuvipanda)
[15:09:30] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe1001.eqiad.wmnet
[15:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:45] <ema>	 godog: it could be that pybal crashed on lvs1003 -> T134893 
[15:09:46] <stashbot>	 T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893
[15:10:02] <godog>	 yeah looks like only 1012 1006 and 1009 see the change
[15:10:09] <godog>	 ema: most likely
[15:10:24] <chasemp>	 !log drbdadm adjust misc for 1004/1005 w/ 192.168.0.0/30
[15:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:46] <godog>	 ema: so the "fix" is turning off and on again
[15:11:04] <ema>	 godog: yep
[15:12:11] <bblack>	 isn't that always the fix?
[15:12:16] <ema>	 Jan 24 11:45:28 lvs1003 pybal[6642]: Unhandled error in Deferred:
[15:12:16] <ema>	 Jan 24 11:45:28 lvs1003 pybal[6642]: Unhandled Error
[15:12:16] <ema>	 Jan 24 11:45:28 lvs1003 pybal[6642]: Traceback (most recent call last):
[15:12:19] <ema>	 Jan 24 11:45:28 lvs1003 pybal[6642]: Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
[15:12:30] <ema>	 that's a good reason to explode!
[15:12:51] <bblack>	 twisted makes network programming fun again :)
[15:13:20] <elukey>	 internet.error is also great
[15:14:23] <godog>	 !log bounce pybal on lvs1003 - T134893
[15:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:57] <godog>	 elukey: waiting for .error to be a TLD
[15:17:39] <wikibugs>	 (03Abandoned) 10Marostegui: site.pp: Disable RBR on db1052 enable it on db1073 [puppet] - 10https://gerrit.wikimedia.org/r/333850 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[15:24:40] <wikibugs>	 06Operations: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965447 (10ema)
[15:25:47] <godog>	 cmjohnson1: you can unplug ms-fe1001 production interface, depooled now
[15:25:53] <godog>	 I'll shut icinga
[15:25:58] <cmjohnson1>	 great..thx
[15:28:12] <cmjohnson1>	 godog: success, I plugged in the fiber from ms-fe1001 to fe1005 and I have a connection... on the reverse side the fiber to ms-fe1005 did not establish a link. It's not the server or the NIC
[15:28:33] <icinga-wm>	 RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[15:30:59] <godog>	 cmjohnson1: ok thanks! I think you can plug ms-fe1001 back in
[15:31:28] <cmjohnson1>	 godog: give me another couple of mins plz
[15:31:32] <godog>	 ok!
[15:31:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919
[15:33:18] <wikibugs>	 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965468 (10ema)
[15:36:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 (owner: 10Muehlenhoff)
[15:36:55] <wikibugs>	 (03PS2) 10Muehlenhoff: Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919
[15:38:31] <wikibugs>	 (03PS2) 10Yuvipanda: tools: Use packages in k8s bastions [puppet] - 10https://gerrit.wikimedia.org/r/333904
[15:38:41] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Use packages in k8s bastions [puppet] - 10https://gerrit.wikimedia.org/r/333904 (owner: 10Yuvipanda)
[15:38:45] <wikibugs>	 06Operations, 10hardware-requests: hardware request for netmon1001 - https://phabricator.wikimedia.org/T156040#2965499 (10RobH) a:05mark>03RobH We don't have any spare systems with SSDs, so we would have to order the machine specifically to house them.  Since it seems this spare won't do, I'll go ahead and...
[15:41:29] <wikibugs>	 (03PS3) 10Muehlenhoff: Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919
[15:42:29] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 032 C: 032] Add contact email addresses and account expiry dates for fr contractors [puppet] - 10https://gerrit.wikimedia.org/r/333919 (owner: 10Muehlenhoff)
[15:49:51] <moritzm>	 !log installing tomcat7 security updates on trusty hosts (jessie already fixed a while ago)
[15:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:05] <coreyfloyd>	 Is anyone able to give a review on this patch? https://gerrit.wikimedia.org/r/#/c/333158/
[15:54:29] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up ms-fe100[5-7] - https://phabricator.wikimedia.org/T155095#2965547 (10Cmjohnson) I was able to confirm the servers and NIC cards were good and  ms-fe1005 and 1006 are now up and accessible.
[15:54:54] <chasemp>	 !log drbdadm adjust tools for 1004/1005 w/ 192.168.0.0/30
[15:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:18] <papaul>	 !log shutting down ms-be2002 for maintenance
[15:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:49] <moritzm>	 !log upgraded nodejs on thorium to 6.9 / restarted pivot
[15:58:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:17] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923
[15:59:31] <wikibugs>	 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2965555 (10RobH) My understanding is they don't expire like that, unless they weren't ever loaded with the proper firmware.  So is there a way to flash when its expired?
[16:00:43] <icinga-wm>	 PROBLEM - Host ms-be2002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:02:03] <wikibugs>	 06Operations, 10ops-codfw: troubleshoot drac on ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#2965556 (10Papaul) It is not allowing the firmware to be uploaded at all.
[16:02:05] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 (owner: 10Marostegui)
[16:03:31] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878
[16:03:39] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 (owner: 10Marostegui)
[16:03:56] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1073" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333923 (owner: 10Marostegui)
[16:04:11] <wikibugs>	 (03Abandoned) 10Alex Monk: labs nfsclient: Require /mnt/nfs's existence before trying to mount underneath it [puppet] - 10https://gerrit.wikimedia.org/r/313034 (owner: 10Alex Monk)
[16:04:24] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Add passwords::redis::ores_password [labs/private] - 10https://gerrit.wikimedia.org/r/333924
[16:04:49] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 - T155999 (duration: 00m 48s)
[16:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:55] <stashbot>	 T155999: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999
[16:05:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] redis: Allow specifying credential file for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/333878 (owner: 10Alexandros Kosiaris)
[16:06:01] <wikibugs>	 06Operations, 10media-storage: high CPU usage from swift-proxy on frontend machines - https://phabricator.wikimedia.org/T156143#2965565 (10fgiunchedi)
[16:06:36] <wikibugs>	 (03PS1) 10Cmjohnson: Adding dns entries for frpm1001.frack both mgmt and production [dns] - 10https://gerrit.wikimedia.org/r/333925
[16:06:55] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006)
[16:07:17] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1001.eqiad.wmnet
[16:07:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:53] <wikibugs>	 06Operations, 10media-storage: High CPU usage from swift-proxy on frontend machines - https://phabricator.wikimedia.org/T156143#2965581 (10fgiunchedi) p:05Triage>03Normal
[16:09:29] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006)
[16:09:41] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on oresrdb1002 is OK: OK: REDIS 2.8.17 on 10.64.0.10:6379 has 1 databases (db0) with 2394417 keys, up 12 days 6 hours - replication_delay is 0
[16:10:00] <akosiaris>	 yay
[16:10:00] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on oresrdb1002 is OK: OK: REDIS 2.8.17 on 10.64.0.10:6380 has 1 databases (db0) with 22322657 keys, up 12 days 6 hours - replication_delay is 0
[16:10:03] <akosiaris>	 paravoid: ^
[16:10:05] <akosiaris>	 fixed finally
[16:10:21] <akosiaris>	 took a while... had to refactor our redis monitoring a bit
[16:10:49] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Adding dns entries for frpm1001.frack both mgmt and production [dns] - 10https://gerrit.wikimedia.org/r/333925 (owner: 10Cmjohnson)
[16:11:40] <icinga-wm>	 PROBLEM - Redis status tcp_6378 on rdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.32.76 on port 6378
[16:11:40] <icinga-wm>	 PROBLEM - Redis status tcp_6381 on rdb1005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.24 on port 6381
[16:11:40] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on rdb1003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.201 on port 6379
[16:11:40] <icinga-wm>	 PROBLEM - Redis status tcp_6380 on rdb1005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.24 on port 6380
[16:11:40] <icinga-wm>	 PROBLEM - Redis status tcp_6381 on rdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.32.76 on port 6381
[16:11:41] <icinga-wm>	 PROBLEM - Redis status tcp_6378 on rdb1003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.201 on port 6378
[16:11:41] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.182 on port 6379
[16:11:42] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1002 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.181 on port 6379
[16:11:46] <akosiaris>	 damn
[16:11:49] <akosiaris>	 all these are me
[16:11:50] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1015 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.48.103 on port 6379
[16:11:50] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.185 on port 6379
[16:11:50] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1009 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.32.163 on port 6379
[16:11:51] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1017 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.48.95 on port 6379
[16:11:51] <icinga-wm>	 PROBLEM - Redis status tcp_6379 on mc1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.0.183 on port 6379
[16:12:11] <akosiaris>	 need to revert I suppose... lemme see if I can fix it first though
[16:12:32] <godog>	 !log kill stray swift-proxy processes from ms-fe1* T156143
[16:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:36] <stashbot>	 T156143: High CPU usage from swift-proxy on frontend machines - https://phabricator.wikimedia.org/T156143
[16:12:47] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[16:14:26] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Fix typo for check_redis definition [puppet] - 10https://gerrit.wikimedia.org/r/333928
[16:14:28] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[16:14:52] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333927 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[16:14:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix typo for check_redis definition [puppet] - 10https://gerrit.wikimedia.org/r/333928 (owner: 10Alexandros Kosiaris)
[16:15:13] <wikibugs>	 (03PS2) 10Marostegui: site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006)
[16:15:32] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T156006 (duration: 00m 47s)
[16:15:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:36] <stashbot>	 T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006
[16:16:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add passwords::redis::ores_password [labs/private] - 10https://gerrit.wikimedia.org/r/333924 (owner: 10Alexandros Kosiaris)
[16:16:48] <icinga-wm>	 PROBLEM - Redis replication status tcp_6381 on rdb2002 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.0.120 on port 6381
[16:16:48] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1005 is OK: OK: REDIS 2.8.17 on 10.64.0.184:6379 has 1 databases (db0) with 521586 keys, up 159 days 8 hours
[16:16:48] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6381 has 1 databases (db0) with 3108759 keys, up 279 days 3 hours - replication_delay is 0
[16:16:48] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2009 is OK: OK: REDIS 2.8.17 on 10.192.16.39:6379 has 1 databases (db0) with 422527 keys, up 76 days 15 hours - replication_delay is 0
[16:16:48] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on mc2016 is OK: OK: REDIS 2.8.17 on 10.192.32.23:6380 has 1 databases (db0) with 519544 keys, up 76 days 18 hours - replication_delay is 0
[16:16:48] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6381 has 1 databases (db0) with 3101866 keys, up 85 days 7 hours - replication_delay is 0
[16:16:58] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2002 is OK: OK: REDIS 2.8.17 on 10.192.0.35:6379 has 1 databases (db0) with 523447 keys, up 76 days 14 hours - replication_delay is 1
[16:16:58] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6379 has 1 databases (db0) with 7811063 keys, up 85 days 7 hours - replication_delay is 0
[16:16:58] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 3102928 keys, up 85 days 7 hours - replication_delay is 0
[16:16:58] <icinga-wm>	 RECOVERY - Redis replication status tcp_6478 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6478 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 4
[16:16:58] <icinga-wm>	 RECOVERY - Redis replication status tcp_6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6481 has 1 databases (db0) with 3106394 keys, up 85 days 7 hours - replication_delay is 0
[16:16:58] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6379 has 1 databases (db0) with 3108710 keys, up 85 days 7 hours - replication_delay is 0
[16:16:59] <icinga-wm>	 RECOVERY - Redis replication status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 3103227 keys, up 85 days 7 hours - replication_delay is 0
[16:16:59] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1009 is OK: OK: REDIS 2.8.17 on 10.64.32.163:6379 has 1 databases (db0) with 422519 keys, up 159 days 8 hours
[16:17:00] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1004 is OK: OK: REDIS 2.8.17 on 10.64.0.183:6379 has 1 databases (db0) with 449508 keys, up 159 days 8 hours
[16:17:00] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on oresrdb1001 is OK: OK: REDIS 2.8.17 on 10.64.48.129:6379 has 1 databases (db0) with 2394948 keys, up 12 days 5 hours
[16:17:01] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1007 is OK: OK: REDIS 2.8.17 on 10.64.32.161:6379 has 1 databases (db0) with 500025 keys, up 159 days 8 hours
[16:17:01] <icinga-wm>	 RECOVERY - Redis status tcp_6380 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6380 has 1 databases (db0) with 7813425 keys, up 278 days 1 hours
[16:17:02] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1001 is OK: OK: REDIS 2.8.17 on 10.64.0.180:6379 has 1 databases (db0) with 474661 keys, up 159 days 8 hours
[16:17:02] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6379 has 1 databases (db0) with 3108660 keys, up 85 days 7 hours - replication_delay is 0
[16:17:08] <godog>	 akosiaris: \o/
[16:17:12] <akosiaris>	 ok fixed
[16:17:18] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6380 has 1 databases (db0) with 3104678 keys, up 85 days 7 hours - replication_delay is 0
[16:17:18] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6379 has 1 databases (db0) with 7810934 keys, up 279 days 3 hours - replication_delay is 0
[16:17:18] <icinga-wm>	 RECOVERY - Redis status tcp_6380 on oresrdb1001 is OK: OK: REDIS 2.8.17 on 10.64.48.129:6380 has 1 databases (db0) with 22337099 keys, up 12 days 5 hours
[16:17:18] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on rdb1001 is OK: OK: REDIS 2.8.17 on 10.64.32.76:6379 has 1 databases (db0) with 7810929 keys, up 278 days 1 hours
[16:17:18] <icinga-wm>	 RECOVERY - Redis status tcp_6381 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6381 has 1 databases (db0) with 7721205 keys, up 278 days 1 hours
[16:17:24] <akosiaris>	 damn typo, sorry
[16:17:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6478 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6478 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 8
[16:17:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6378 has 1 databases (db0) with 15 keys, up 85 days 7 hours - replication_delay is 0
[16:17:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3099430 keys, up 85 days 7 hours - replication_delay is 0
[16:17:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6378 has 1 databases (db0) with 15 keys, up 279 days 3 hours - replication_delay is 0
[16:17:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb1002 is OK: OK: REDIS 2.8.17 on 10.64.32.77:6380 has 1 databases (db0) with 3102726 keys, up 279 days 3 hours - replication_delay is 0
[16:17:29] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5214/ compiles fine and changes only db1072" [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[16:17:38] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 3108636 keys, up 85 days 7 hours - replication_delay is 0
[16:17:38] <icinga-wm>	 RECOVERY - Redis status tcp_6378 on rdb1001 is OK: OK: REDIS 2.8.17 on 10.64.32.76:6378 has 1 databases (db0) with 15 keys, up 278 days 1 hours
[16:17:38] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6379 has 1 databases (db0) with 7811804 keys, up 278 days 1 hours
[16:17:38] <icinga-wm>	 RECOVERY - Redis status tcp_6381 on rdb1001 is OK: OK: REDIS 2.8.17 on 10.64.32.76:6381 has 1 databases (db0) with 3108596 keys, up 278 days 1 hours
[16:17:38] <icinga-wm>	 RECOVERY - Redis status tcp_6378 on rdb1003 is OK: OK: REDIS 2.8.17 on 10.64.0.201:6378 has 1 databases (db0) with 4705607 keys, up 278 days 1 hours
[16:17:51] <niko>	 you should voice the bot, it gets throttled a bit
[16:18:17] <wikibugs>	 (03PS3) 10Marostegui: site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006)
[16:18:38] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6378 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 10
[16:18:41] <paravoid>	 !log removing lvs4002_T151273 policy from cr1/2-ulsfo
[16:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:48] <icinga-wm>	 RECOVERY - Redis replication status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 3106260 keys, up 85 days 7 hours - replication_delay is 0
[16:19:22] <wikibugs>	 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 13Patch-For-Review: Set up monitoring for ORES redis database - https://phabricator.wikimedia.org/T155482#2965632 (10akosiaris) 05Open>03Resolved And with https://gerrit.wikimedia.org/r/#/c/333878/ this is now done. Had to refactor the current moni...
[16:20:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] site.pp: Enable RBR on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/333926 (https://phabricator.wikimedia.org/T156006) (owner: 10Marostegui)
[16:20:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2012 is OK: OK: REDIS 2.8.17 on 10.192.16.42:6379 has 1 databases (db0) with 444670 keys, up 76 days 16 hours - replication_delay is 0
[16:20:28] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6378 has 1 databases (db0) with 3 keys, up 85 days 7 hours - replication_delay is 7
[16:21:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi)
[16:21:45] <wikibugs>	 (03PS3) 10Filippo Giunchedi: scholarships: move udp2log to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/333235 (https://phabricator.wikimedia.org/T123728)
[16:21:48] <icinga-wm>	 PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:22:03] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T156006 (duration: 00m 41s)
[16:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:08] <stashbot>	 T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006
[16:23:37] <wikibugs>	 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2965652 (10faidon) Just in: > Engineering has fixed PR 1238906 has been fixed through master PR 1205416, and the fix would be available 14.1X53-D42 onwards, sc...
[16:26:36] <marostegui>	 !log Restart mysql db1072
[16:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:44] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1002 is OK: OK: REDIS 2.8.17 on 10.64.0.181:6379 has 1 databases (db0) with 523118 keys, up 159 days 8 hours
[16:26:44] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2001 is OK: OK: REDIS 2.8.17 on 10.192.0.34:6379 has 1 databases (db0) with 474546 keys, up 76 days 14 hours - replication_delay is 0
[16:26:54] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6381 has 1 databases (db0) with 7721449 keys, up 85 days 7 hours - replication_delay is 0
[16:26:54] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6378 has 1 databases (db0) with 4705607 keys, up 85 days 7 hours - replication_delay is 4
[16:26:54] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1006 is OK: OK: REDIS 2.8.17 on 10.64.0.185:6379 has 1 databases (db0) with 502785 keys, up 159 days 8 hours
[16:26:54] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1017 is OK: OK: REDIS 2.8.17 on 10.64.48.95:6379 has 1 databases (db0) with 483532 keys, up 159 days 8 hours
[16:26:55] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6379 has 1 databases (db0) with 3108682 keys, up 279 days 2 hours - replication_delay is 0
[16:26:55] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1018 is OK: OK: REDIS 2.8.17 on 10.64.48.96:6379 has 1 databases (db0) with 519430 keys, up 159 days 8 hours
[16:27:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6380 has 1 databases (db0) with 3104900 keys, up 279 days 2 hours - replication_delay is 0
[16:27:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2004 is OK: OK: REDIS 2.8.17 on 10.192.0.37:6379 has 1 databases (db0) with 449576 keys, up 76 days 15 hours - replication_delay is 0
[16:27:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6379 has 1 databases (db0) with 7812011 keys, up 85 days 7 hours - replication_delay is 0
[16:27:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 10.192.16.122:6380 has 1 databases (db0) with 7813481 keys, up 85 days 7 hours - replication_delay is 0
[16:27:14] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6378 has 1 databases (db0) with 3 keys, up 279 days 2 hours - replication_delay is 1
[16:27:14] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1016 is OK: OK: REDIS 2.8.17 on 10.64.48.104:6379 has 1 databases (db0) with 595442 keys, up 159 days 8 hours
[16:27:15] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on mc2001 is OK: OK: REDIS 2.8.17 on 10.192.0.34:6380 has 1 databases (db0) with 483500 keys, up 76 days 14 hours - replication_delay is 0
[16:27:15] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2014 is OK: OK: REDIS 2.8.17 on 10.192.32.21:6379 has 1 databases (db0) with 528374 keys, up 76 days 17 hours - replication_delay is 0
[16:27:15] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2005 is OK: OK: REDIS 2.8.17 on 10.192.0.38:6379 has 1 databases (db0) with 521327 keys, up 76 days 15 hours - replication_delay is 0
[16:27:24] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1008 is OK: OK: REDIS 2.8.17 on 10.64.32.162:6379 has 1 databases (db0) with 436959 keys, up 159 days 8 hours
[16:27:24] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1011 is OK: OK: REDIS 2.8.17 on 10.64.32.165:6379 has 1 databases (db0) with 522481 keys, up 159 days 8 hours
[16:27:25] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb1006 is OK: OK: REDIS 2.8.17 on 10.64.48.55:6381 has 1 databases (db0) with 3101761 keys, up 279 days 2 hours - replication_delay is 0
[16:27:34] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1012 is OK: OK: REDIS 2.8.17 on 10.64.32.166:6379 has 1 databases (db0) with 444484 keys, up 159 days 8 hours
[16:27:59] <_joe_>	 uh what happened there?
[16:28:19] <_joe_>	 oh alex happened
[16:31:24] <wikibugs>	 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#2965674 (10Papaul)  {F5350097}  {F5350101}  I switched the IDRAC from Dedicated to NIC2 to access the server in case there is something to do. This is just a temporary fix.
[16:32:13] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333930
[16:33:00] <wikibugs>	 (03PS4) 10Andrew Bogott: labstore: Don't use wikitech API to find labs instances in nfs-exportd [puppet] - 10https://gerrit.wikimedia.org/r/328609 (https://phabricator.wikimedia.org/T104575) (owner: 10Alex Monk)
[16:34:44] <icinga-wm>	 RECOVERY - Host ms-be2002 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[16:35:25] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[16:37:05] <elukey>	 !log upgrading aqs1004 to node6
[16:37:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:34] <icinga-wm>	 PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100%
[16:42:24] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[16:43:04] <icinga-wm>	 PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:43:14] <icinga-wm>	 PROBLEM - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms
[16:44:24] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[16:45:03] <elukey>	 paravoid: --^
[16:45:16] <paravoid>	 hey
[16:45:21] <paravoid>	 looking, thanks
[16:45:34] <paravoid>	 from the alerts smells like a power outage
[16:45:40] <paravoid>	 yup indeed
[16:45:49] <elukey>	 how did you check? (curious)
[16:45:55] <paravoid>	 faidon@asw-ulsfo> show chassis alarms 
[16:46:03] <paravoid>	 2017-01-24 16:40:39 UTC  Major  FPC 2 PEM 1 is not powered
[16:46:05] <elukey>	 ah nice
[16:46:12] <paravoid>	 but also we lost cp4012 and ripe-atlas-ulsfo
[16:46:23] <elukey>	 yeah
[16:47:34] <icinga-wm>	 PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6
[16:47:54] <icinga-wm>	 PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6
[16:47:54] <icinga-wm>	 PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6
[16:47:54] <icinga-wm>	 PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6
[16:47:54] <icinga-wm>	 PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6
[16:47:55] <icinga-wm>	 PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp4012_v4, cp4012_v6
[16:48:04] <icinga-wm>	 PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6
[16:48:04] <icinga-wm>	 PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4012_v4, cp4012_v6
[16:48:20] <elukey>	 spam from cp4012
[16:48:44] <icinga-wm>	 RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[16:49:53] <paravoid>	 indeed
[16:50:54] <wikibugs>	 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965727 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['elastic2036.codfw.wmnet'] ``` T...
[16:52:15] <mutante>	 !log planet2001 - reinstalling to test DHCP/TFTP from install2001
[16:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:24] <Krenair>	 jouncebot, next
[16:53:25] <jouncebot>	 In 0 hour(s) and 6 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1700)
[16:54:03] <Krenair>	 this looks like a bad time
[16:54:11] <andrewbogott>	 !log tools deleting tools-mail-01
[16:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:08] <Krenair>	 ostriches, thcipriani: dunno if this is going ahead now
[16:56:22] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2011 is OK: OK: REDIS 2.8.17 on 10.192.16.41:6379 has 1 databases (db0) with 522729 keys, up 76 days 17 hours - replication_delay is 0
[16:56:22] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6379 has 1 databases (db0) with 3109514 keys, up 279 days 2 hours - replication_delay is 0
[16:56:23] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6379 has 1 databases (db0) with 7812138 keys, up 85 days 8 hours - replication_delay is 0
[16:56:23] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1014 is OK: OK: REDIS 2.8.17 on 10.64.48.102:6379 has 1 databases (db0) with 528541 keys, up 159 days 8 hours
[16:56:23] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2013 is OK: OK: REDIS 2.8.17 on 10.192.32.20:6379 has 1 databases (db0) with 516876 keys, up 76 days 17 hours - replication_delay is 0
[16:56:23] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6380 has 1 databases (db0) with 7814334 keys, up 279 days 3 hours - replication_delay is 0
[16:56:23] <icinga-wm>	 RECOVERY - Redis status tcp_6380 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6380 has 1 databases (db0) with 3104029 keys, up 278 days 2 hours
[16:56:24] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2003 is OK: OK: REDIS 2.8.17 on 10.192.0.36:6379 has 1 databases (db0) with 532024 keys, up 76 days 15 hours - replication_delay is 0
[16:56:24] <icinga-wm>	 RECOVERY - Redis status tcp_6380 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6380 has 1 databases (db0) with 3105767 keys, up 278 days 1 hours
[16:56:25] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6379 has 1 databases (db0) with 3109599 keys, up 278 days 2 hours
[16:56:32] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6381 has 1 databases (db0) with 7722312 keys, up 279 days 3 hours - replication_delay is 0
[16:56:32] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1003 is OK: OK: REDIS 2.8.17 on 10.64.0.182:6379 has 1 databases (db0) with 532017 keys, up 159 days 8 hours
[16:56:32] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6380 has 1 databases (db0) with 3104103 keys, up 279 days 2 hours - replication_delay is 0
[16:56:32] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6381 has 1 databases (db0) with 3109889 keys, up 85 days 8 hours - replication_delay is 0
[16:56:42] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6381 has 1 databases (db0) with 7722305 keys, up 85 days 8 hours - replication_delay is 0
[16:56:52] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1015 is OK: OK: REDIS 2.8.17 on 10.64.48.103:6379 has 1 databases (db0) with 498436 keys, up 159 days 8 hours
[16:56:52] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2007 is OK: OK: REDIS 2.8.17 on 10.192.16.37:6379 has 1 databases (db0) with 500287 keys, up 76 days 16 hours - replication_delay is 0
[16:56:52] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2015 is OK: OK: REDIS 2.8.17 on 10.192.32.22:6379 has 1 databases (db0) with 498426 keys, up 76 days 18 hours - replication_delay is 0
[16:56:52] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6380 has 1 databases (db0) with 3103955 keys, up 85 days 8 hours - replication_delay is 0
[16:57:02] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1013 is OK: OK: REDIS 2.8.17 on 10.64.48.101:6379 has 1 databases (db0) with 516835 keys, up 159 days 8 hours
[16:57:02] <icinga-wm>	 RECOVERY - Redis status tcp_6378 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6378 has 1 databases (db0) with 3 keys, up 278 days 2 hours
[16:57:02] <icinga-wm>	 RECOVERY - Redis status tcp_6381 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6381 has 1 databases (db0) with 3107495 keys, up 278 days 2 hours
[16:57:02] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6379 has 1 databases (db0) with 3109531 keys, up 278 days 2 hours
[16:57:02] <icinga-wm>	 RECOVERY - Redis status tcp_6379 on mc1010 is OK: OK: REDIS 2.8.17 on 10.64.32.164:6379 has 1 databases (db0) with 520367 keys, up 159 days 8 hours
[16:57:03] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2006 is OK: OK: REDIS 2.8.17 on 10.192.0.39:6379 has 1 databases (db0) with 503450 keys, up 76 days 15 hours - replication_delay is 0
[16:57:03] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2010 is OK: OK: REDIS 2.8.17 on 10.192.16.40:6379 has 1 databases (db0) with 520370 keys, up 76 days 17 hours - replication_delay is 0
[16:57:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6378 has 1 databases (db0) with 4705607 keys, up 279 days 3 hours - replication_delay is 1
[16:57:04] <icinga-wm>	 RECOVERY - Redis replication status tcp_6381 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6381 has 1 databases (db0) with 3107516 keys, up 279 days 2 hours - replication_delay is 0
[16:57:05] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb1008 is OK: OK: REDIS 2.8.17 on 10.64.32.19:6378 has 1 databases (db0) with 3 keys, up 279 days 2 hours - replication_delay is 8
[16:57:05] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb2001 is OK: OK: REDIS 2.8.17 on 10.192.0.119:6378 has 1 databases (db0) with 15 keys, up 85 days 8 hours - replication_delay is 0
[16:57:12] <icinga-wm>	 RECOVERY - Redis status tcp_6378 on rdb1007 is OK: OK: REDIS 2.8.17 on 10.64.32.18:6378 has 1 databases (db0) with 3 keys, up 278 days 2 hours
[16:57:12] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6379 has 1 databases (db0) with 7813015 keys, up 85 days 8 hours - replication_delay is 0
[16:57:12] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on mc2008 is OK: OK: REDIS 2.8.17 on 10.192.16.38:6379 has 1 databases (db0) with 437176 keys, up 76 days 16 hours - replication_delay is 0
[16:57:12] <icinga-wm>	 RECOVERY - Redis status tcp_6381 on rdb1005 is OK: OK: REDIS 2.8.17 on 10.64.0.24:6381 has 1 databases (db0) with 3102673 keys, up 278 days 2 hours
[16:57:12] <icinga-wm>	 RECOVERY - Redis replication status tcp_6379 on rdb1004 is OK: OK: REDIS 2.8.17 on 10.64.16.183:6379 has 1 databases (db0) with 7812946 keys, up 279 days 3 hours - replication_delay is 0
[16:57:12] <icinga-wm>	 RECOVERY - Redis replication status tcp_6380 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6380 has 1 databases (db0) with 7814359 keys, up 85 days 8 hours - replication_delay is 0
[16:57:13] <icinga-wm>	 RECOVERY - Redis replication status tcp_6378 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6378 has 1 databases (db0) with 4705607 keys, up 85 days 8 hours - replication_delay is 3
[16:57:25] <thcipriani>	 Krenair: my patch is not critical to get out now. The functionality will only be used by the next version of scap coming Soon™ so doesn't have to be today ;\
[16:57:30] <thcipriani>	 er :\
[16:57:42] <Krenair>	 none of the stuff on there is critical
[16:58:19] <wikibugs>	 (PS1) RobH: lost a PDU tower in ulsfo 1.22 [dns] - https://gerrit.wikimedia.org/r/333931
[16:58:41] <Krenair>	 lots more redis recoveries than there were original alerts?
[17:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1700). Please do the needful.
[17:00:04] <jouncebot>	 ostriches, Krenair, and thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[17:00:07] <wikibugs>	 Operations, ops-eqiad, Discovery, Discovery-Search, Elasticsearch: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#2965751 (Gehel) Current elasticsearch nodes in eqiad are as follow:  * **A / A3**: elastic10(30|31|32|33|34|35) - //6 nodes// * **A / A6**: elasti...
[17:00:12] <icinga-wm>	 RECOVERY - Juniper alarms on asw-ulsfo.mgmt.ulsfo.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[17:03:10] <wikibugs>	 Operations, User-Elukey: Cronspam from mwlog* - https://phabricator.wikimedia.org/T156151#2965779 (fgiunchedi)
[17:03:33] <icinga-wm>	 RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK
[17:03:42] <icinga-wm>	 RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 78.63 ms
[17:03:52] <icinga-wm>	 RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK
[17:03:52] <icinga-wm>	 RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[17:04:02] <icinga-wm>	 RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 36 ESP OK
[17:04:02] <icinga-wm>	 RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK
[17:04:02] <icinga-wm>	 RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 36 ESP OK
[17:04:02] <icinga-wm>	 RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[17:04:02] <icinga-wm>	 RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[17:05:32] <icinga-wm>	 RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.72 ms
[17:08:11] <godog>	 I'm looking at puppet swat patches btw
[17:09:15] <mutante>	 has the "installer hangs at 21% during 'Configuring apt' - Retrieving file 4 of 9" issue  .. and it feels too familiar
[17:09:22] <icinga-wm>	 PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last)
[17:10:00] <wikibugs>	 (PS5) Filippo Giunchedi: docroots: Swap wikidata for wikidata.org [puppet] - https://gerrit.wikimedia.org/r/330709 (owner: Chad)
[17:11:12] <icinga-wm>	 RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[17:11:24] <wikibugs>	 Operations, ops-codfw, Discovery, Discovery-Search, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965796 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2036.codfw.wmnet'] ```  and were **ALL** successful.
[17:13:02] <icinga-wm>	 PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:14:22] <icinga-wm>	 RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 431 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[17:14:42] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS on elastic2036 is OK: SSL OK - Certificate elastic2036.codfw.wmnet valid until 2022-01-23 17:13:25 +0000 (expires in 1824 days)
[17:16:19] <godog>	 ostriches: 👍
[17:16:48] <wikibugs>	 (PS1) Addshore: DNM "Copy InterwikiSorting settings from wmgWikibaseClientSettings"" [mediawiki-config] - https://gerrit.wikimedia.org/r/333936
[17:17:02] <wikibugs>	 (CR) Addshore: [C: -2] DNM "Copy InterwikiSorting settings from wmgWikibaseClientSettings"" [mediawiki-config] - https://gerrit.wikimedia.org/r/333936 (owner: Addshore)
[17:17:27] <ostriches>	 godog: Yay thx!
[17:17:32] <wikibugs>	 (PS2) Addshore: DNM!!! Copy InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/333936
[17:17:48] <wikibugs>	 (PS2) Addshore: Rm InterwikiSorting settings from wmgWikibaseClientSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/333885 (https://phabricator.wikimedia.org/T155995)
[17:17:57] <wikibugs>	 (PS3) Addshore: Enable InterwikiSorting on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/333603 (https://phabricator.wikimedia.org/T155995)
[17:19:47] <godog>	 Krenair: looking at yours now
[17:21:09] <wikibugs>	 (PS4) Filippo Giunchedi: ssh: Don't add IPv6 address as an alias in exported resource if it's undefined [puppet] - https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[17:21:14] <wikibugs>	 Operations, ops-ulsfo: ulsfo pdu 1.22 replacement - https://phabricator.wikimedia.org/T151263#2965818 (RobH)
[17:21:17] <wikibugs>	 Operations, ops-ulsfo, netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965816 (RobH) Resolved>Open I'm reopening this.  LVS4002 had its power supply fail again, the exact same PSU slot that died before, PSU2.  I had taken another power supply out of cp4012 an...
[17:26:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] ssh: Don't add IPv6 address as an alias in exported resource if it's undefined [puppet] - https://gerrit.wikimedia.org/r/333472 (https://phabricator.wikimedia.org/T72792) (owner: Alex Monk)
[17:28:18] <wikibugs>	 Operations, ops-eqiad, ops-ulsfo: ship R620 power supplies to ulsfo - https://phabricator.wikimedia.org/T156154#2965844 (RobH)
[17:29:11] <wikibugs>	 Operations, ops-ulsfo, netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965860 (RobH) So when we get the replacement power supplies mentioned on T156154, we should move the power ports used by lvs4002 with another system.  Then if the other system has a psu failure, w...
[17:30:31] <wikibugs>	 (PS1) Yuvipanda: tools: Get rid of hacky k8s deployment bash scripts [puppet] - https://gerrit.wikimedia.org/r/333943
[17:30:54] <wikibugs>	 (CR) Gehel: [C: 1] "LGTM. Thanks for taking care of our tech debt!" [puppet] - https://gerrit.wikimedia.org/r/329328 (owner: Tim Landscheidt)
[17:31:16] <wikibugs>	 (PS2) Gehel: Remove gehel from elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/333240 (https://phabricator.wikimedia.org/T142836) (owner: Muehlenhoff)
[17:31:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] tools: Get rid of hacky k8s deployment bash scripts [puppet] - https://gerrit.wikimedia.org/r/333943 (owner: Yuvipanda)
[17:32:02] <wikibugs>	 (PS2) Yuvipanda: tools: Get rid of hacky k8s deployment bash scripts [puppet] - https://gerrit.wikimedia.org/r/333943
[17:32:40] <godog>	 Krenair: I was looking at https://puppet-compiler.wmflabs.org/5170/mw1161.eqiad.wmnet/ did you look into why mw1161 has no diff?
[17:32:54] <wikibugs>	 (CR) Gehel: [C: 2] Remove gehel from elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/333240 (https://phabricator.wikimedia.org/T142836) (owner: Muehlenhoff)
[17:33:42] <wikibugs>	 (CR) Gehel: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: Muehlenhoff)
[17:33:50] <addshore>	 could someone from ops / with access please submit https://gerrit.wikimedia.org/r/#/c/324689/ (apparently I can't) and it has been sitting there for nearly 2 months now! :)
[17:34:29] <Krenair>	 godog, I think because it's a jobrunner
[17:34:32] <yuvipanda>	 addshore: done
[17:34:36] <Krenair>	 not quite sure
[17:34:38] <addshore>	 yuvipanda: cheers!
[17:34:50] <wikibugs>	 (PS4) Gehel: Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: Muehlenhoff)
[17:34:53] <wikibugs>	 (CR) Yuvipanda: [C: 2] tools: Get rid of hacky k8s deployment bash scripts [puppet] - https://gerrit.wikimedia.org/r/333943 (owner: Yuvipanda)
[17:35:01] <wikibugs>	 (PS3) Yuvipanda: tools: Get rid of hacky k8s deployment bash scripts [puppet] - https://gerrit.wikimedia.org/r/333943
[17:35:08] <wikibugs>	 (CR) Yuvipanda: [V: 2 C: 2] tools: Get rid of hacky k8s deployment bash scripts [puppet] - https://gerrit.wikimedia.org/r/333943 (owner: Yuvipanda)
[17:35:44] <wikibugs>	 Operations, ops-eqiad, ops-ulsfo: ship R620 power supplies to ulsfo - https://phabricator.wikimedia.org/T156154#2965882 (Cmjohnson) We do not have any decommissioned R620s in eqiad.
[17:36:01] <wikibugs>	 (PS1) Chad: Remove myself from elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/333946
[17:36:20] <ostriches>	 moritzm: Heh, reminded me of something I'd been meaning to do ^
[17:37:07] <wikibugs>	 (PS1) Yuvipanda: tools: Get rid of kubebuilder [puppet] - https://gerrit.wikimedia.org/r/333947
[17:37:55] <Krenair>	 godog, yeah looks like jobrunners don't get those apache configs
[17:38:25] <Krenair>	 krenair@mw1161:~$ ls -l /etc/apache2/sites-enabled/
[17:38:25] <Krenair>	 total 0
[17:38:25] <Krenair>	 lrwxrwxrwx 1 root root 42 Oct 14 08:18 00-dummy.conf -> /etc/apache2/sites-available/00-dummy.conf
[17:38:25] <Krenair>	 lrwxrwxrwx 1 root root 51 Oct 14 08:15 01-hhvm-jobrunner.conf -> /etc/apache2/sites-available/01-hhvm-jobrunner.conf
[17:38:25] <Krenair>	 lrwxrwxrwx 1 root root 47 Oct 14 08:18 50-hhvm-admin.conf -> /etc/apache2/sites-available/50-hhvm-admin.conf
[17:38:28] <Krenair>	 krenair@mw1161:~$
[17:38:45] <wikibugs>	 (CR) Gehel: [C: 2] Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: Muehlenhoff)
[17:38:47] <wikibugs>	 (PS1) Marostegui: Revert "site.pp: Enable RBR on db1072" [puppet] - https://gerrit.wikimedia.org/r/333948
[17:39:03] <godog>	 indeed, looks like it
[17:39:12] <wikibugs>	 (PS5) Gehel: Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: Muehlenhoff)
[17:39:48] <wikibugs>	 (PS2) Yuvipanda: tools: Get rid of kubebuilder [puppet] - https://gerrit.wikimedia.org/r/333947
[17:39:56] <wikibugs>	 (CR) Yuvipanda: [V: 2 C: 2] tools: Get rid of kubebuilder [puppet] - https://gerrit.wikimedia.org/r/333947 (owner: Yuvipanda)
[17:40:26] <wikibugs>	 (CR) Filippo Giunchedi: "Patch LGTM, I've added joe and elukey as they routinely work on apache for opinions too" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[17:40:32] <godog>	 Krenair: ^
[17:41:02] <icinga-wm>	 RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[17:41:07] <marostegui>	 godog: are you doing puppet swat now?
[17:41:09] <Krenair>	 ok
[17:41:26] <wikibugs>	 (PS2) Marostegui: Revert "site.pp: Enable RBR on db1072" [puppet] - https://gerrit.wikimedia.org/r/333948
[17:41:26] <godog>	 marostegui: yeah, one patch left to go but not intrusive, feel free to merge
[17:41:38] <marostegui>	 godog: ok thanks :)
[17:41:48] <marostegui>	 godog: i am not pushing just yet though
[17:41:51] <marostegui>	 so you can go ahead if you like
[17:42:30] <wikibugs>	 (PS5) Filippo Giunchedi: Include hhvm fatals and exceptions in scap canary checks [puppet] - https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: Thcipriani)
[17:42:41] <godog>	 marostegui: ok! waiting on jenkins
[17:42:47] <marostegui>	 :)
[17:44:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: 2] Include hhvm fatals and exceptions in scap canary checks [puppet] - https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: Thcipriani)
[17:44:53] <godog>	 thcipriani: ^
[17:45:12] <godog>	 marostegui: I'm done SWATting
[17:45:19] <marostegui>	 godog: thanks!
[17:45:21] <thcipriani>	 godog: thanks! I'll give it a go on tin here in a few to make sure it works :)
[17:45:55] <thcipriani>	 (works as expected, that is, won't break anything :))
[17:46:21] <godog>	 thcipriani: ok, let me know if things are borked, I'm going to dinner soonish but I'll be around later too
[17:46:29] <thcipriani>	 yup, will do
[17:48:03] <wikibugs>	 (PS6) Gehel: Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: Muehlenhoff)
[17:48:06] <wikibugs>	 (CR) Gehel: [V: 2 C: 2] Stick with node 4.6 on maps due to karthotherian not being ready for node 6 [puppet] - https://gerrit.wikimedia.org/r/332768 (https://phabricator.wikimedia.org/T149331) (owner: Muehlenhoff)
[17:51:13] <wikibugs>	 (PS2) Filippo Giunchedi: tlsproxy: add nginx_bootstrap define [puppet] - https://gerrit.wikimedia.org/r/333247
[17:51:15] <wikibugs>	 (PS9) Filippo Giunchedi: swift: terminate https with nginx [puppet] - https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455)
[17:51:20] <moritzm>	 ostriches: thanks, I'll merge that tomorrow morning
[17:52:09] <wikibugs>	 (CR) Filippo Giunchedi: "> Can we use something other than 443, so we don't run into the same" [puppet] - https://gerrit.wikimedia.org/r/333247 (owner: Filippo Giunchedi)
[17:54:03] <icinga-wm>	 ACKNOWLEDGEMENT - Restbase root url on restbase-dev1001 is CRITICAL: connect to address 10.64.0.35 and port 7231: Connection refused Filippo Giunchedi restbase deployment TODO
[17:54:03] <icinga-wm>	 ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.35, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc5e8601950: Failed to establish a new connection: [Errno 111] Connection refused,)) Filippo Giunchedi restbase deployment TODO
[17:54:03] <icinga-wm>	 ACKNOWLEDGEMENT - Restbase root url on restbase-dev1002 is CRITICAL: connect to address 10.64.32.112 and port 7231: Connection refused Filippo Giunchedi restbase deployment TODO
[17:54:03] <icinga-wm>	 ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa409335950: Failed to establish a new connection: [Errno 111] Connection refused,)) Filippo Giunchedi restbase deployment TODO
[17:54:03] <icinga-wm>	 ACKNOWLEDGEMENT - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused Filippo Giunchedi restbase deployment TODO
[17:54:03] <icinga-wm>	 ACKNOWLEDGEMENT - restbase endpoints health on restbase-dev1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.46, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f68ec14e950: Failed to establish a new connection: [Errno 111] Connection refused,)) Filippo Giunchedi restbase deployment TODO
[17:54:08] <godog>	 sorry about the spam
[17:54:29] <godog>	 mobrovac: ^ re: restbase on restbase-dev
[17:54:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:55:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[17:58:32] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/333950
[17:59:12] <wikibugs>	 Operations, ops-codfw, Discovery, Elasticsearch, and 2 others: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2965921 (Gehel)
[18:00:00] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/333950 (owner: Marostegui)
[18:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1800).
[18:00:43] <wikibugs>	 (PS1) Jcrespo: mariadb: Move db1072 back to a normal slave [puppet] - https://gerrit.wikimedia.org/r/333952 (https://phabricator.wikimedia.org/T155999)
[18:01:14] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/333950 (owner: Marostegui)
[18:01:24] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/333950 (owner: Marostegui)
[18:02:21] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T156006 (duration: 00m 49s)
[18:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:26] <stashbot>	 T156006: Move db1052 to row B3 - https://phabricator.wikimedia.org/T156006
[18:03:15] <wikibugs>	 (PS1) Jcrespo: MariaDB: Setting db1065 as the new master of sanitarium2 [puppet] - https://gerrit.wikimedia.org/r/333953 (https://phabricator.wikimedia.org/T155999)
[18:05:28] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Move db1072 back to a normal slave [puppet] - https://gerrit.wikimedia.org/r/333952 (https://phabricator.wikimedia.org/T155999) (owner: Jcrespo)
[18:05:39] <wikibugs>	 (CR) Jcrespo: [C: 2] MariaDB: Setting db1065 as the new master of sanitarium2 [puppet] - https://gerrit.wikimedia.org/r/333953 (https://phabricator.wikimedia.org/T155999) (owner: Jcrespo)
[18:07:08] <mutante>	 !log planet2001 - re-adding to puppet, revoke old cert, sign new cert, initial run
[18:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:02] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:10:19] <marostegui>	 !log restart mysql db1065 maintenance - https://phabricator.wikimedia.org/T155999)
[18:10:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:12] <wikibugs>	 Operations, ops-eqiad, ops-ulsfo: ship R620 power supplies to ulsfo - https://phabricator.wikimedia.org/T156154#2965941 (RobH) Open>Resolved Thanks for checking, I'll note on related tasks.
[18:11:17] <wikibugs>	 Operations, ops-ulsfo, netops: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965943 (RobH) I asked Chris if we had any decommissioned R620s in eqiad so we can steal power supplies, but we do not.  >>! In T156154#2965882, @Cmjohnson wrote: > We do not have any decommissione...
[18:17:08] <AndyRussG>	 twentyafterfour: hi! have you done the branch cut for mw train yet?
[18:17:37] <twentyafterfour>	 AndyRussG: I'm just starting it, should I hold off?
[18:18:13] <wikibugs>	 (PS2) Chad: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - https://gerrit.wikimedia.org/r/330712
[18:18:29] <AndyRussG>	 twentyafterfour: hmm mmmmmaybe, one sec... thanks!
[18:19:01] <logmsgbot>	 !log arlolra@tin Starting deploy [parsoid/deploy@c1a14c0]: Updating Parsoid to d000fdb4
[18:19:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:09] <AndyRussG>	 twentyafterfour: K just consulted, if u can wait 5 min for us to merge some stuff into the CentralNotice deploy branch? thx!!!!
[18:20:48] <wikibugs>	 (CR) Chad: [C: 2] Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - https://gerrit.wikimedia.org/r/330712 (owner: Chad)
[18:22:42] <wikibugs>	 (Merged) jenkins-bot: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - https://gerrit.wikimedia.org/r/330712 (owner: Chad)
[18:22:52] <wikibugs>	 (CR) jenkins-bot: Drop wikidata docroot, unused (uses wikidata.org now) [mediawiki-config] - https://gerrit.wikimedia.org/r/330712 (owner: Chad)
[18:23:16] <twentyafterfour>	 AndyRussG: no problem
[18:24:38] <logmsgbot>	 !log demon@tin Synchronized docroot: Removing old wikidata docroot (duration: 00m 46s)
[18:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:05] <wikibugs>	 Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011#2965980 (MoritzMuehlenhoff) These are fully rolled out:  audiofile automake-1.14 clamav cmake exim4 file javatools libxml2 python-django python2.7 unbound systemd
[18:25:42] <AndyRussG>	 twentyafterfour: great! Just waiting for Jenkins... https://gerrit.wikimedia.org/r/#/c/333955
[18:26:18] <AndyRussG>	 After that merges the submodule pointer for CentralNotice should update automatically
[18:31:49] <wikibugs>	 (PS1) Chad: beta: standardize deployment.wikimedia.beta.wmflabs.org docroot [puppet] - https://gerrit.wikimedia.org/r/333958
[18:32:52] <AndyRussG>	 twentyafterfour: merged, just checking that the submodule pointer in core is up to date
[18:33:47] <wikibugs>	 (CR) Chad: [C: 2] Remove extra layer of symlink indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/323999 (owner: Chad)
[18:35:22] <wikibugs>	 (Merged) jenkins-bot: Remove extra layer of symlink indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/323999 (owner: Chad)
[18:35:36] <wikibugs>	 (CR) jenkins-bot: Remove extra layer of symlink indirection [mediawiki-config] - https://gerrit.wikimedia.org/r/323999 (owner: Chad)
[18:36:15] <wikibugs>	 (PS1) Chad: Remove labs docroot, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/333960
[18:37:02] <icinga-wm>	 RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[18:37:47] <logmsgbot>	 !log demon@tin Synchronized docroot: tidying up mobileportal docroot stuff (duration: 00m 41s)
[18:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:03] <AndyRussG>	 twentyafterfour: aarg now I'm confused, I don't think we have to do any updating of submodule pointers in core master, because the submodules are only in the core deploy branches, right?
[18:38:16] <AndyRussG>	 CentralNotice's deploy branch is up-to-date now....
[18:38:38] <ostriches>	 Master doesn't have submodules :)
[18:38:41] <AndyRussG>	 So the CN commit that we'd like to put on the train is 24e8419d587681ee26e420ee6ba9313ea32a3ed1
[18:38:44] <AndyRussG>	 right....
[18:40:22] <AndyRussG>	 ostriches: remind me how the branch cut gets the right submodule pointers for extensions (if u'r not busy...)
[18:40:25] * AndyRussG jostles brain
[18:40:29] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@c1a14c0]: Updating Parsoid to d000fdb4 (duration: 21m 28s)
[18:40:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:06] <ostriches>	 AndyRussG: So for non-special extensions (ie: 95% of them), we create a new branch from master (wmf/blablahblah), then add that branch to the new branch we've made for core
[18:41:19] <logmsgbot>	 !log arlolra@tin Starting deploy [parsoid/deploy@c1a14c0]: Retry updating Parsoid to d000fdb4
[18:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:28] <ostriches>	 For the "special" extensions, we don't do the branching, we just add the branch/tag you already have defined as your submodule
[18:41:40] <ostriches>	 in CN's case, it should just pull in whatever's in that wmf_deploy or w/e branch
[18:42:06] <AndyRussG>	 ostriches: ah right, it's all coming back to me now
[18:42:08] <AndyRussG>	 thx!!!!
[18:42:23] <ostriches>	 https://phabricator.wikimedia.org/diffusion/MREL/browse/master/make-wmf-branch/config.json;HEAD$171
[18:43:32] <AndyRussG>	 twentyafterfour: so I think we're good to go :)
[18:45:13] <AndyRussG>	 ostriches: interesting!
[18:45:33] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@c1a14c0]: Retry updating Parsoid to d000fdb4 (duration: 04m 14s)
[18:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:42] <icinga-wm>	 PROBLEM - parsoid on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:51:10] <wikibugs>	 (PS1) Chad: Foundation docroot: removing some unused/ancient logos [mediawiki-config] - https://gerrit.wikimedia.org/r/333962
[18:51:12] <icinga-wm>	 PROBLEM - salt-minion processes on planet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:52:42] <icinga-wm>	 RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 4.098 second response time
[18:53:10] <mobrovac>	 godog: np, i will deploy rb there soon-ish :)
[18:55:27] <wikibugs>	 (Abandoned) Dduvall: Check .scap-master-ready file before syncing scap masters [puppet] - https://gerrit.wikimedia.org/r/267934 (owner: Dduvall)
[18:58:06] <arlolra>	 !log Updated Parsoid to version d000fdb4 (T58846, T154804, T152633)
[18:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:13] <stashbot>	 T152633: TypeError: Cannot read property 'length' of undefined - https://phabricator.wikimedia.org/T152633
[18:58:13] <stashbot>	 T58846: Review failing sanitizer bugs - https://phabricator.wikimedia.org/T58846
[18:58:13] <stashbot>	 T154804: TypeError in parsoid gallery module - https://phabricator.wikimedia.org/T154804
[18:58:21] <twentyafterfour>	 AndyRussG: thanks
[18:58:22] <icinga-wm>	 PROBLEM - Check systemd state on db2060 is CRITICAL: CRITICAL - Failed to get D-Bus connection: Connection refused: unexpected
[18:59:07] <icinga-wm>	 PROBLEM - MariaDB disk space on db2060 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[18:59:07] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on db2060 is CRITICAL: CRITICAL slave_io_state could not connect
[18:59:07] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on db2060 is CRITICAL: CRITICAL slave_sql_state could not connect
[18:59:16] <icinga-wm>	 PROBLEM - mysqld processes on db2060 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[18:59:16] <icinga-wm>	 PROBLEM - Disk space on db2060 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
[18:59:25] * volans looking
[18:59:28] <marostegui>	 ^ checking
[18:59:30] <jynus>	 did it crash?
[18:59:34] <volans>	 oh you're still here
[18:59:50] <jynus>	 yes, we were having fun at -databases
[19:00:04] <jouncebot>	 Deploy window Changed: No SWAT window at this time on Tuesdays going forward (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T1900)
[19:00:31] <volans>	 eheheh
[19:00:43] <marostegui>	 [612121.400194] sd 0:1:0:0: rejecting I/O to offline device
[19:00:43] <marostegui>	 [612121.425964] sd 0:1:0:0: rejecting I/O to offline device
[19:00:43] <marostegui>	 db2060 login:
[19:00:50] <jynus>	  /srv is not accessible
[19:00:56] <jynus>	 so probably RAID went down
[19:01:17] <marostegui>	 that host had issues before if i recall correctly
[19:01:21] <jynus>	 is it a master or a regular slave?
[19:01:22] <marostegui>	 https://phabricator.wikimedia.org/T154031
[19:01:34] <jynus>	 if it is a slave, let's create a ticket and fix it tomorrow
[19:01:36] <marostegui>	 api slave
[19:01:39] <marostegui>	 ok
[19:01:42] <marostegui>	 will take care of that
[19:02:02] <jynus>	 "Firmware update complete."
[19:02:04] <jynus>	 yay!
[19:02:07] <marostegui>	 \o/
[19:02:09] <marostegui>	 lovely
[19:02:11] <jynus>	 no excuses from the vendor
[19:02:46] <wikibugs>	 (03CR) 10Chad: [C: 032] Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad)
[19:03:42] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:03:58] <wikibugs>	 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2966089 (10jcrespo)
[19:05:12] <icinga-wm>	 PROBLEM - MD RAID on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:05:32] <icinga-wm>	 RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[19:05:42] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag could not connect
[19:05:48] <marostegui>	 I will silence this host
[19:06:06] <wikibugs>	 (03Merged) 10jenkins-bot: Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad)
[19:07:12] <icinga-wm>	 RECOVERY - MD RAID on ruthenium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:07:18] <logmsgbot>	 !log demon@tin Synchronized docroot/foundation/logos: rm some old junk logos (duration: 00m 42s)
[19:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:56] <jynus>	 !log change replication master of db1095 to db1052
[19:07:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:09] <wikibugs>	 (03CR) 10jenkins-bot: Foundation docroot: removing some unused/ancient logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333962 (owner: 10Chad)
[19:08:42] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:10:42] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:11:12] <icinga-wm>	 PROBLEM - MD RAID on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:11:17] <wikibugs>	 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2966096 (10Dzahn) @akosiaris Thank you! I have reinstalled planet2001 using install2001 and it worked fine. I will do some more tests for eqiad soon.
[19:12:32] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[19:15:42] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:15:58] <volans>	 oom_killer in action on ruthenium
[19:16:13] <icinga-wm>	 PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:19:02] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[19:19:02] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:19:12] <icinga-wm>	 RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[19:19:12] <icinga-wm>	 PROBLEM - configured eth on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:19:12] <icinga-wm>	 PROBLEM - DPKG on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:21:12] <icinga-wm>	 PROBLEM - salt-minion processes on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:21:44] <volans>	 subbu: any special activity on ruthenium ^^^ ? It's swapping heavily and there are tons of /srv/visualdiff/node_modules/phantomjs/lib/phantom/bin/phantomjs processes
[19:22:02] <icinga-wm>	 RECOVERY - salt-minion processes on ruthenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:22:12] <icinga-wm>	 PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:32] <icinga-wm>	 PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused
[19:25:42] <icinga-wm>	 RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 6.221 second response time
[19:26:01] <mobrovac>	 volans: arlolra is not running tests, we don't know if TimStarling or subbu have started any
[19:26:32] <volans>	 mobrovac: any easy way to check?
[19:27:12] <icinga-wm>	 RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[19:27:33] <mobrovac>	 volans: not that i know of :/
[19:27:36] <mobrovac>	 arlolra: ^ ?
[19:28:11] <mobrovac>	 volans: can't even ssh in there now
[19:28:18] <volans>	 I'm in it
[19:28:20] <mobrovac>	 so it must be really busy 
[19:28:32] <icinga-wm>	 PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused
[19:28:38] <volans>	 yes, memory full, swap full
[19:29:32] <icinga-wm>	 PROBLEM - dhclient process on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:29:33] <arlolra>	 the multiple phantomjs processes are probably a good indication that one of them was running a test
[19:29:46] <arlolra>	 i think it's fine to stop it, if you're in there
[19:29:58] <jynus>	 !log change replication master of db1095 to db1065
[19:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:05] <volans>	 arlolra: ok
[19:30:20] <arlolra>	 they probably need to clean up the result of visualdiff runs
[19:30:22] <icinga-wm>	 RECOVERY - dhclient process on ruthenium is OK: PROCS OK: 0 processes with command name dhclient
[19:30:38] <arlolra>	 old runs
[19:31:06] <volans>	 arlolra: you mean sending a SIGTERM to all phantomjs processes? they are not children of a common process
[19:31:37] <volans>	 and I probably need to restart parsoid, its children died and it was not able to restart them due to lack of available memory
[19:32:12] <icinga-wm>	 PROBLEM - salt-minion processes on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:33:02] <icinga-wm>	 RECOVERY - salt-minion processes on ruthenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:33:17] <arlolra>	 hmm, they should have been spawned by testreduce
[19:33:48] <arlolra>	 but, sure, send a signal to them all if you have to
[19:33:59] <volans>	 arlolra: there is a /usr/bin/nodejs client-cluster.js -c 8 /etc/testreduce/parsoid-rt-client.config.js process with some children
[19:34:13] <volans>	 but the "node /srv/visualdiff/node_modules/phantomjs/bin/phantomjs" are not children of it
[19:34:22] <icinga-wm>	 PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:35:32] <icinga-wm>	 RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.110 second response time
[19:35:32] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[19:35:33] <icinga-wm>	 RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 50 minutes ago with 0 failures
[19:35:52] <volans>	 !log killed 822 "/srv/visualdiff/node_modules/phantomjs/lib/phantom/bin/phantomjs" processes on ruthenium. RAM and swap full, host unresponsive
[19:35:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:02] <icinga-wm>	 RECOVERY - configured eth on ruthenium is OK: OK - interfaces up
[19:36:02] <icinga-wm>	 RECOVERY - MD RAID on ruthenium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[19:36:02] <icinga-wm>	 RECOVERY - DPKG on ruthenium is OK: All packages OK
[19:36:12] <icinga-wm>	 RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[19:36:46] <arlolra>	 i see
[19:37:09] <twentyafterfour>	 !log branching 1.29.0-wmf.9 refs T154683
[19:37:09] <volans>	 it recovered immediately
[19:37:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:12] <stashbot>	 T154683: MW-1.29.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T154683
[19:37:47] <arlolra>	 it would be this one /usr/bin/nodejs client-cluster.js -c 4 /etc/testreduce/parsoid-vd-client.config.js
[19:37:58] <arlolra>	 -vd
[19:38:38] <arlolra>	 https://www.mediawiki.org/wiki/Parsoid/Visual_Diffs_Testing
[19:38:56] <arlolra>	 says we want sudo service parsoid-vd-client stop
[19:39:11] <arlolra>	 and sudo service parsoid-vd stop
[19:39:43] <volans>	 !log sudo service parsoid-vd stop on ruthenium
[19:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:57] <volans>	 arlolra: done, and actually it was restarting swaping the subprocesses
[19:40:14] <volans>	 s/swaping/spawning/
[19:40:31] <wikibugs>	 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966240 (10jcrespo)
[19:40:57] <volans>	 so, something new was merged in the testreduce that is exploding?
[19:41:40] <arlolra>	 pid 1661 looks like it shouldn't be there anymore
[19:42:22] <icinga-wm>	 PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:42:32] <volans>	 yeah I was checking the systemd unit, because it's back running
[19:43:37] <arlolra>	 testreduce is a general thing to run tests of a large set of pages.  we use it for parsoid roundtrip testing and, separately, for visualdiff'ing ... the latter produces a lot of large images on disk
[19:44:00] <wikibugs>	 06Operations, 10DBA, 10MediaWiki-Change-tagging: db1072 change_tag schema and dataset is not consistent - https://phabricator.wikimedia.org/T156166#2966280 (10jcrespo) Adding @TTO and @Cenarium because they may know the actual right people to add to this ticket (probably not them) for the mediawiki bug side...
[19:44:04] <arlolra>	 that would be https://github.com/wikimedia/integration-visualdiff
[19:44:34] <volans>	 so the client is stopped, the server is still running "/usr/bin/nodejs server.js --config /etc/testreduce/parsoid-vd.settings.js", but that seems to be ok, I don't see any more spawning of processes
[19:44:45] <arlolra>	 TimStarling or subbu were probably running it for https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy
[19:45:44] <volans>	 it was killing the server, so if it's something that we run often we might have some regression, otherwise maybe it was just too aggressive
[19:45:55] * volans brb
[19:48:10] <arlolra>	 volans: they've respawned!  as long as there are jobs queued with the server this'll probably continue
[19:49:00] <volans>	 arlolra: should I stop the server too?
[19:49:09] <volans>	 parsoid-vd I mean
[19:50:10] <arlolra>	 yes
[19:50:33] <arlolra>	 let's just stop it all and i'll let them know
[19:50:51] <wikibugs>	 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#2966308 (10EBernhardson)
[19:51:08] <wikibugs>	 (03PS1) 10EBernhardson: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578)
[19:51:54] <volans>	 !log ruthenium: stopped parsoid-vd and parsoid-vd-client to avoid uncontrolled spawning of phantomjs childs
[19:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:11] <wikibugs>	 (03CR) 10EBernhardson: "The log4j2 properties file was tested in vagrant against a 5.x instance. It looks to do as necessary, but i've never worked with log4j2 be" [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson)
[19:52:28] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Set binlog_format to STATEMENT for db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333970 (https://phabricator.wikimedia.org/T156008)
[19:52:43] <wikibugs>	 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#2966323 (10EBernhardson)
[19:53:16] <volans>	 arlolra: done, notifying them, thanks for the help!
[19:53:57] <arlolra>	 np, thank you, glad it's under control
[19:54:12] <volans>	 I'll keep an eye on it for a bit 
[20:00:05] <jouncebot>	 twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170124T2000).
[20:00:54] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Set binlog_format to STATEMENT for db1052 [puppet] - 10https://gerrit.wikimedia.org/r/333970 (https://phabricator.wikimedia.org/T156008) (owner: 10Jcrespo)
[20:01:27] <wikibugs>	 (03PS1) 10Chad: Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973
[20:04:52] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Service[apparmor]
[20:05:03] <wikibugs>	 (03PS1) 10Chad: Swap wmfwiki docroot to wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/333974
[20:05:28] <wikibugs>	 (03CR) 10BearND: "Generally, this looks good to me from a regex perspective, just a minor nit inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema)
[20:05:32] <wikibugs>	 (03CR) 10Chad: [C: 04-1] "Also depends on finishing cleaning up existing docroot/foundation" [puppet] - 10https://gerrit.wikimedia.org/r/333974 (owner: 10Chad)
[20:06:31] <wikibugs>	 (03CR) 10Chad: [C: 032] Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 (owner: 10Chad)
[20:08:02] <wikibugs>	 (03Merged) 10jenkins-bot: Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 (owner: 10Chad)
[20:08:16] <wikibugs>	 (03CR) 10jenkins-bot: Creating wikimediafoundation.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333973 (owner: 10Chad)
[20:09:40] <logmsgbot>	 !log demon@tin Synchronized docroot: Adding new wikimediafoundation.org docroot (duration: 01m 05s)
[20:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:12] <icinga-wm>	 PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:10:22] <icinga-wm>	 RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[20:10:37] <ostriches>	 twentyafterfour: I'll stop with my random docroot fixes, forgot it's train time :)
[20:11:04] <twentyafterfour>	 oh
[20:11:06] <twentyafterfour>	 yeah
[20:11:14] <twentyafterfour>	 I just ran `scap prep`
[20:11:39] <wikibugs>	 (03CR) 10Chad: [C: 031] "Furthermore, this is already a symlink to wikimedia.org, so we're just removing a layer of indirection :)" [puppet] - 10https://gerrit.wikimedia.org/r/333958 (owner: 10Chad)
[20:17:08] <wikibugs>	 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2966457 (10Dzahn) @bbogaert    ``` -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256  I can confirm that Riccard (volans) should have access to the Yubikey laptop referenced in Phab Ticket T123818, Zen Desk #9727.  - -- Da...
[20:17:31] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999)
[20:17:52] <twentyafterfour>	 ostriches: all patches failed :-/
[20:18:50] <ostriches>	 No surprise :(
[20:18:58] <ostriches>	 Need some help?
[20:19:22] <wikibugs>	 06Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 06Reading-Web-Backlog, 06Services (watching): Confirm attribution needs - https://phabricator.wikimedia.org/T150875#2966460 (10ZhouZ) Just as an updated reminder to this task.    Our Terms of Use allows for attribution to text contr...
[20:19:30] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999)
[20:19:32] <icinga-wm>	 PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:23:43] <ostriches>	 godog: I got a very small beta-only docroot thing. Should I put it down for thurs' puppetswat? https://gerrit.wikimedia.org/r/#/c/333958/
[20:23:50] <ostriches>	 (or is it small enough we can jfdi?)
[20:24:07] <twentyafterfour>	 ostriches: I think I've got it
[20:24:26] <ostriches>	 👍
[20:25:06] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999)
[20:26:05] <wikibugs>	 (03PS2) 10Dzahn: delete dumps.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940)
[20:29:12] <icinga-wm>	 PROBLEM - MD RAID on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:29:42] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:29:42] <icinga-wm>	 PROBLEM - parsoid on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:29:42] <icinga-wm>	 PROBLEM - Check systemd state on ruthenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:31:22] <icinga-wm>	 PROBLEM - SSH on ruthenium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:31:54] <volans>	 damn... probably puppet restarted them
[20:32:10] <wikibugs>	 06Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2966486 (10bbogaert) @Dzahn Ricard has the laptop.  ``` byronicle:~ bbogaert$ gpg --verify confirm-volans.sig gpg: Signature made Tue Jan 24 12:14:38 2017 PST using RSA key ID F5F6A067 gpg: Good signature from "Daniel Za...
[20:32:12] <icinga-wm>	 RECOVERY - SSH on ruthenium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[20:32:52] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[20:33:32] <icinga-wm>	 RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[20:33:32] <icinga-wm>	 RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.408 second response time
[20:33:33] <icinga-wm>	 RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures
[20:34:02] <icinga-wm>	 RECOVERY - MD RAID on ruthenium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[20:34:49] <wikibugs>	 (03PS1) 10Rush: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978
[20:35:35] <wikibugs>	 (03PS2) 10Rush: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978
[20:36:37] <godog>	 ostriches: yeah if it is beta-only we can jfdi
[20:37:12] <icinga-wm>	 RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[20:38:01] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Revert "wmf_sink:  Remove all ldap handling" [puppet] - 10https://gerrit.wikimedia.org/r/333660 (owner: 10Andrew Bogott)
[20:38:31] <wikibugs>	 (03CR) 10Yuvipanda: [C: 031] "presidented seal of approval" [puppet] - 10https://gerrit.wikimedia.org/r/333978 (owner: 10Rush)
[20:38:49] <wikibugs>	 (03PS3) 10BryanDavis: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 (owner: 10Rush)
[20:39:29] <mutante>	 yuvipanda: http://knowyourmeme.com/memes/seal-of-approval ?
[20:39:40] <wikibugs>	 (03PS4) 10BryanDavis: tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush)
[20:39:53] <ostriches>	 godog: Sweet. It's https://gerrit.wikimedia.org/r/#/c/333958/ :)
[20:40:09] <yuvipanda>	 mutante: more like https://www.theguardian.com/us-news/2016/dec/19/unpresidented-trump-word-definition
[20:40:09] <wikibugs>	 (03CR) 10BryanDavis: [C: 031] "done messing with commit message" [puppet] - 10https://gerrit.wikimedia.org/r/333978 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush)
[20:40:36] <wikibugs>	 (03CR) 10Rush: [V: 032 C: 032] tools: specify ipaddress_eth0 for HBA [puppet] - 10https://gerrit.wikimedia.org/r/333978 (https://phabricator.wikimedia.org/T156168) (owner: 10Rush)
[20:40:47] <mutante>	 yuvipanda: oh wow, word of the year even
[20:40:57] <wikibugs>	 (03PS2) 10Filippo Giunchedi: beta: standardize deployment.wikimedia.beta.wmflabs.org docroot [puppet] - 10https://gerrit.wikimedia.org/r/333958 (owner: 10Chad)
[20:47:32] <icinga-wm>	 RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[20:48:05] <wikibugs>	 06Operations, 06Parsing-Team: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#2966544 (10Volans)
[20:49:08] <wikibugs>	 (03PS1) 10Andrew Bogott: Move labtestweb openstack::version to newton [puppet] - 10https://gerrit.wikimedia.org/r/333979
[20:49:18] <volans>	 !log disabled puppet on ruthenium to avoid the restart of parsoid-vd and parsoid-vd-client processes T156177
[20:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:22] <stashbot>	 T156177: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177
[20:49:23] <logmsgbot>	 !log twentyafterfour@tin Started scap: test wikis to 1.29.0-wmf.9 refs T155525
[20:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:27] <stashbot>	 T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525
[20:50:08] <wikibugs>	 (03Abandoned) 10Gilles: Fix mechanism to disable default nginx configuration [puppet/nginx] - 10https://gerrit.wikimedia.org/r/333909 (https://phabricator.wikimedia.org/T154270) (owner: 10Gilles)
[20:51:21] <wikibugs>	 (03PS2) 10Andrew Bogott: Move labtestweb openstack::version to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/333979
[20:51:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] beta: standardize deployment.wikimedia.beta.wmflabs.org docroot [puppet] - 10https://gerrit.wikimedia.org/r/333958 (owner: 10Chad)
[20:52:10] <godog>	 ostriches: ^
[20:53:30] <ostriches>	 godog: cool thanks. I'll verify in beta
[20:53:45] <godog>	 ostriches: np
[20:54:10] <ostriches>	 This adventure is nearing completion :)
[20:54:58] <wikibugs>	 (03PS1) 10Yuvipanda: tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980
[20:55:00] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Move labtestweb openstack::version to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/333979 (owner: 10Andrew Bogott)
[20:55:02] <mutante>	 https://www.quora.com/What-are-the-best-Phabricator-macros-memes
[20:55:06] <wikibugs>	 (03PS3) 10Andrew Bogott: Move labtestweb openstack::version to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/333979
[20:55:44] <mutante>	 https://www.quora.com/What-are-the-best-Phabricator-Pokemon-for-use-in-code-reviews?redirected_qid=1319946
[20:56:05] <yannf>	 brion, did you see my last comment? https://phabricator.wikimedia.org/T155750
[20:56:15] <yannf>	 should I open a new report?
[20:56:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 031] tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 (owner: 10Yuvipanda)
[20:56:32] <wikibugs>	 (03PS2) 10Yuvipanda: tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980
[20:56:33] <p858snake|>	 mutante: all the copyrights \o/
[20:56:39] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 (owner: 10Yuvipanda)
[20:57:06] <mutante>	 p858snake|: well, i just traced back the origin of "seal of approval" to a Flickr account, and Flickr should be fine to import to commons right :p
[20:57:30] <p858snake|>	 if the licensing for the upload on flickr allows it
[20:57:58] <mutante>	 "Fixes for latent bugs which don't manifest in impactful ways should be accepted with Metapod or Kakuna."
[20:58:02] <wikibugs>	 (03PS3) 10Yuvipanda: tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980
[20:58:07] <wikibugs>	 (03CR) 10Yuvipanda: [V: 032 C: 032] tools: use new kubectl location for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/333980 (owner: 10Yuvipanda)
[20:58:08] <p858snake|>	 flickr uploads aren't CC-<BLAH> by default iirc
[21:01:35] <mutante>	 p858snake|: right.. and sad as it is, this one has "All rights reserved" on it
[21:01:36] <ostriches>	 godog: Force-ran puppet on the beta apaches, picked up the change, everything working just fine :D
[21:02:32] <icinga-wm>	 PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 52 seconds ago with 6 failures. Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/openstack_auth/backend.py],File[/etc/openstack-dashboard/keystone_policy.json],File[/usr/share/openstack-dashboard/openstack_dashboard/local/enabled/_1925_puppet_prefix_panel.py],File[/usr/share/openstack-dashboard/openstack_das
[21:02:56] <godog>	 ostriches: nice \o/
[21:03:02] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[21:03:03] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[21:03:19] <jynus>	 is there any mediawiki deployment in progress?
[21:03:23] <mutante>	 p858snake|: of course it's on reddit, deviantart, twitter, imgur and > 3,800 other pages anyways 
[21:03:38] <jynus>	 ostriches^ ?
[21:03:51] <ostriches>	 twentyafterfour is conducting the train
[21:04:00] <jynus>	 sorry
[21:04:05] <jynus>	 still on it, I assume
[21:04:23] <ostriches>	 Probably, I'll let him give a more exact status :)
[21:04:28] <jynus>	 no need
[21:04:46] <jynus>	 will go away, twentyafterfour ping me when done (I may be away)
[21:05:24] <jynus>	 I am not in a hurry, I just do not want to collide
[21:07:38] <Platonides>	 /13/8
[21:07:46] <jynus>	 oh, it is a 2 hour window
[21:07:56] <jynus>	 that is my mistake
[21:16:04] <wikibugs>	 (03PS3) 10Brian Wolff: Expand Content-Security-Policy on upload test to fr. [puppet] - 10https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618)
[21:20:46] <ostriches>	 hhvm on mw1290 is unhappy
[21:20:55] <ostriches>	 Syntax Error: Couldn't find trailer dictionary
[21:21:09] <ostriches>	 Syntax Error: Couldn't read xref table
[21:21:40] <ostriches>	 Eh, info-level, but seems specific to 1290
[21:22:01] <logmsgbot>	 !log twentyafterfour@tin Finished scap: test wikis to 1.29.0-wmf.9 refs T155525 (duration: 32m 37s)
[21:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:08] <stashbot>	 T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525
[21:22:59] <jynus>	 should I restart hhvm there?
[21:23:14] <ostriches>	 I dunno, I can't find it
[21:23:18] <ostriches>	 Might've been transient
[21:25:15] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon:  Forward some custom files from liberty [puppet] - 10https://gerrit.wikimedia.org/r/333983
[21:25:52] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[21:26:07] <ostriches>	 jynus: Seems to have just been a brief thing at 21:22, stopped completely. Transient :)
[21:26:44] <wikibugs>	 (03CR) 10Chad: [C: 032] Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 (owner: 10Chad)
[21:27:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Horizon:  Forward some custom files from liberty [puppet] - 10https://gerrit.wikimedia.org/r/333983 (owner: 10Andrew Bogott)
[21:28:27] <wikibugs>	 (03Merged) 10jenkins-bot: Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 (owner: 10Chad)
[21:28:52] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[21:28:58] <wikibugs>	 (03CR) 10jenkins-bot: Remove labs docroot, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333960 (owner: 10Chad)
[21:30:55] <wikibugs>	 (03PS1) 10Chad: Add .bash_profile to my homedir so my .bashrc works [puppet] - 10https://gerrit.wikimedia.org/r/333984
[21:31:00] <logmsgbot>	 !log demon@tin Synchronized docroot: Drop labs docroot, unused in prod (duration: 00m 44s)
[21:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:19] <wikibugs>	 (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985
[21:32:21] <wikibugs>	 (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 (owner: 1020after4)
[21:32:37] <wikibugs>	 (03PS1) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986
[21:32:46] <twentyafterfour>	 jynus: almost done here
[21:33:10] <wikibugs>	 (03PS3) 10Dzahn: delete dumps.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940)
[21:33:50] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 (owner: 1020after4)
[21:34:01] <wikibugs>	 (03CR) 10jenkins-bot: group0 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333985 (owner: 1020after4)
[21:34:44] <logmsgbot>	 !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.9 refs T155525
[21:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:48] <stashbot>	 T155525: MW-1.29.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T155525
[21:34:58] <wikibugs>	 (03PS2) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986
[21:35:11] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon:  Add mitaka version of the puppetpanel. [puppet] - 10https://gerrit.wikimedia.org/r/333987
[21:35:42] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "key deleted from private repo" [puppet] - 10https://gerrit.wikimedia.org/r/333833 (https://phabricator.wikimedia.org/T154940) (owner: 10Dzahn)
[21:37:01] <wikibugs>	 06Operations, 07Puppet, 10Horizon, 06Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2966807 (10Andrew) - I will double-check the caching, although I'm pretty sure I verified that the cache was working previously.  - I'm currently experimenting with the next rev of Hor...
[21:37:33] <wikibugs>	 (03PS2) 10Dzahn: add netmon1002 to site [puppet] - 10https://gerrit.wikimedia.org/r/333780
[21:37:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Horizon:  Add mitaka version of the puppetpanel. [puppet] - 10https://gerrit.wikimedia.org/r/333987 (owner: 10Andrew Bogott)
[21:38:03] <wikibugs>	 (03PS2) 10Andrew Bogott: Horizon:  Add mitaka version of the puppetpanel. [puppet] - 10https://gerrit.wikimedia.org/r/333987
[21:38:10] <twentyafterfour>	 jynus: all done
[21:41:40] <jynus>	 twentyafterfour, thanks!
[21:42:25] <wikibugs>	 (03PS4) 10Dzahn: openstack: instancersync not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/332954
[21:42:32] <icinga-wm>	 RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[21:43:48] <twentyafterfour>	 !log Finished group0 to wmf/1.29.0-wmf.9 (refs T15525) Changelog: https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.9/Changelog
[21:43:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:52] <stashbot>	 T15525: Category Sorting Incorrectly - https://phabricator.wikimedia.org/T15525
[21:44:04] <wikibugs>	 (03PS3) 10Chad: dumps: Add a favicon (using the wmf one) [puppet] - 10https://gerrit.wikimedia.org/r/333080
[21:44:10] <wikibugs>	 (03CR) 10Dzahn: [C: 032] openstack: instancersync not in autoload module layout [puppet] - 10https://gerrit.wikimedia.org/r/332954 (owner: 10Dzahn)
[21:44:39] <wikibugs>	 (03PS3) 10Dzahn: openstack: designate/glance/keystone not in autoload module [puppet] - 10https://gerrit.wikimedia.org/r/332955
[21:45:26] <twentyafterfour>	 ugh I've been ref'ing the wrong tasks :-/
[21:45:26] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999)
[21:46:59] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo)
[21:47:12] <icinga-wm>	 PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[21:48:47] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo)
[21:48:58] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: repool db1065 as dump/vslow & clean up s1 comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333976 (https://phabricator.wikimedia.org/T155999) (owner: 10Jcrespo)
[21:49:00] <chasemp>	 ^ andrewbogott labtestweb is puking on itself a bit, is that you?
[21:49:12] <andrewbogott>	 definitely me
[21:49:40] <wikibugs>	 (03PS3) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986
[21:50:12] <icinga-wm>	 RECOVERY - DPKG on labtestweb2001 is OK: All packages OK
[21:50:33] <wikibugs>	 (03CR) 10Dzahn: [C: 032] openstack: designate/glance/keystone not in autoload module [puppet] - 10https://gerrit.wikimedia.org/r/332955 (owner: 10Dzahn)
[21:50:46] <chasemp>	 andrewbogott: kk
[21:51:06] <andrewbogott>	 it should clear in a minute or two, everything looks fine locally
[21:51:54] <wikibugs>	 (03CR) 10Rush: [C: 032] labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 (owner: 10Rush)
[21:52:01] <wikibugs>	 (03PS4) 10Rush: labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986
[21:52:05] <wikibugs>	 (03CR) 10Rush: [V: 032 C: 032] labstore: change drbd link over to /30 192.168 [puppet] - 10https://gerrit.wikimedia.org/r/333986 (owner: 10Rush)
[21:52:52] <wikibugs>	 (03CR) 10Dzahn: [C: 032] dumps: Add a favicon (using the wmf one) [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad)
[21:52:58] <wikibugs>	 (03PS4) 10Dzahn: dumps: Add a favicon (using the wmf one) [puppet] - 10https://gerrit.wikimedia.org/r/333080 (owner: 10Chad)
[21:53:10] <chasemp>	 andrewbogott: no worries then just wasn't sure
[21:55:08] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1065 as dump/vslow & clean up s1 comments (duration: 00m 43s)
[21:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:12] <ostriches>	 mutante: Thx, I see our new favicon now :)
[21:56:15] <ostriches>	 no more 404
[21:56:59] <jynus>	 Database::ping, that is new to me
[21:57:15] <ostriches>	 Oldddddd function in MW :)
[21:57:35] <ostriches>	 Falls back on mysqli_ping (or similar, depending on php extension you're using)
[21:57:42] <mutante>	 ostriches: :) i noticed the 404s as well. nice!
[22:07:30] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005)
[22:08:12] <icinga-wm>	 PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:08:33] <ostriches>	 mutante: Trivial thing for my homedir, if you've got a sec... https://gerrit.wikimedia.org/r/#/c/333984/
[22:09:18] <wikibugs>	 (03PS2) 10Dzahn: Add .bash_profile to my homedir so my .bashrc works [puppet] - 10https://gerrit.wikimedia.org/r/333984 (owner: 10Chad)
[22:09:46] <wikibugs>	 (03CR) 10Dzahn: [V: 032 C: 032] Add .bash_profile to my homedir so my .bashrc works [puppet] - 10https://gerrit.wikimedia.org/r/333984 (owner: 10Chad)
[22:10:57] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages andrew bogott Upstream packages seem broken... work in progress.
[22:12:32] <icinga-wm>	 PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/demon/.bash_profile]
[22:13:58] <Pchelolo>	 !log update RESTBase to 69065e2: staging
[22:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:32] <icinga-wm>	 PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/usr/lib/python2.7/dist-packages/openstack_auth/plugin/wmtotp.py],File[/usr/lib/python2.7/dist-packages/openstack_auth/backend.py],File[/usr/lib/python2.7/dist-packages/openstack_auth/forms.py],Package[openstack-dashboard]
[22:16:57] <bd808>	 greg-g: I just took a window from 18:00Z-19:00Z tomorrow for a Striker deploy
[22:19:26] <Pchelolo>	 !log update RESTBase to 69065e2: canary on restbase1007
[22:19:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:10] <Pchelolo>	 !log update RESTBase to 69065e2
[22:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:05] <wikibugs>	 (03PS2) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971)
[22:31:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo)
[22:32:29] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo)
[22:32:39] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db1066 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333991 (https://phabricator.wikimedia.org/T156005) (owner: 10Jcrespo)
[22:33:21] <greg-g>	 bd808: neat :)
[22:40:32] <icinga-wm>	 RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[22:41:04] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 for reimage (duration: 00m 55s)
[22:41:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:23] <ostriches>	 I guess that hhvm syntax thingie is wider than I thought....
[22:48:24] <ostriches>	 https://logstash.wikimedia.org/goto/daf20a1752e93bcb1186bd08916a01ec
[22:49:58] <ostriches>	 Hmm, that error isn't hhvm, it's something with pdfs.
[22:51:15] <ostriches>	 Definitely picked up in last few hours https://logstash.wikimedia.org/goto/6ff9b22efdd3ad556969e8806efb090a
[22:57:42] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.199 second response time
[22:58:42] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.390 second response time
[23:04:02] <jynus>	 !log reimage db1066
[23:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:47] <wikibugs>	 (03Restored) 10Thcipriani: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani)
[23:08:28] <wikibugs>	 (03PS2) 10Thcipriani: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947
[23:09:11] <wikibugs>	 (03CR) 10Dzahn: "when i touched the deployment keys in private repo to change the passphrases, the file names disappeared from the comment column in ssh-ad" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (owner: 10Thcipriani)
[23:09:53] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.29.0-wmf.9/includes/specials/SpecialSearch.php: Update special:search security patch to not fatal (duration: 00m 44s)
[23:09:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:19] <wikibugs>	 (03PS3) 10Dzahn: Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 (https://phabricator.wikimedia.org/T154943) (owner: 10Thcipriani)
[23:10:55] <mutante>	 jouncebot: next
[23:10:55] <jouncebot>	 In 0 hour(s) and 49 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170125T0000)
[23:14:51] <wikibugs>	 (03CR) 10Paladox: [C: 031] Fix failing keyholder arming check [puppet] - 10https://gerrit.wikimedia.org/r/312947 (https://phabricator.wikimedia.org/T154943) (owner: 10Thcipriani)
[23:15:01] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "already cherry-picked on beta and tested on tin" [puppet] - 10https://gerrit.wikimedia.org/r/312947 (https://phabricator.wikimedia.org/T154943) (owner: 10Thcipriani)
[23:26:44] <jynus>	 !log restarting db1052 for kernel upgrade
[23:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:42] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.218 second response time
[23:30:42] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.303 second response time
[23:45:16] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2967328 (10Dzahn) I used these 2 boxes to test install from install1001 (instead of carbon).  The installer started fine on 1003, then the install just fails at grub install for unknown an...
[23:46:17] <wikibugs>	 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2967330 (10Dzahn) I also tested with prometheus1003 if the installer starts. It does.. (fails later at grub install but not related to this here).
[23:49:15] <mutante>	 !log analytics1015 (unused spare system) - use for test OS install
[23:49:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:57] <mutante>	 !log carbon stopping DHCP
[23:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:56] <mutante>	 !log carbon - stopping puppet, stopping atftpd
[23:50:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:02] <icinga-wm>	 PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100%
[23:51:35] <icinga-wm>	 ACKNOWLEDGEMENT - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstall
[23:53:42] <icinga-wm>	 RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[23:54:32] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 2 minutes ago with 21 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[23:55:42] <icinga-wm>	 PROBLEM - puppet last run on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:02] <icinga-wm>	 PROBLEM - dhclient process on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:02] <icinga-wm>	 PROBLEM - configured eth on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:12] <icinga-wm>	 PROBLEM - DPKG on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:17] <wikibugs>	 (03CR) 10Gergő Tisza: "Yet another configuration setting is a pretty ugly solution, I don't have any better one though. (Ideally we would just detect that $authM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333653 (https://phabricator.wikimedia.org/T154064) (owner: 10Niharika29)
[23:56:22] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:22] <icinga-wm>	 PROBLEM - salt-minion processes on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:22] <icinga-wm>	 PROBLEM - MD RAID on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:32] <icinga-wm>	 PROBLEM - Check size of conntrack table on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:56:32] <icinga-wm>	 PROBLEM - Disk space on analytics1015 is CRITICAL: Return code of 255 is out of bounds
[23:58:16] <wikibugs>	 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2967350 (10Dzahn)
[23:58:20] <wikibugs>	 06Operations, 10netops: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2967348 (10Dzahn) 05Open>03Resolved finally tested with analytics1015 (unused spare system), installed trusty image from install1001. services on carbon were down too.  resolving now
[23:58:42] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.589 second response time
[23:59:42] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.895 second response time