[00:00:23] <wikibugs__>	 Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3135842 (faidon) I haven't heard back, but I noticed [[ https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1205416 | PR1205416 ]] now says: >...
[00:02:19] <icinga-wm>	 PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:06:39] <icinga-wm>	 RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[00:10:59] <icinga-wm>	 RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[00:14:26] <wikibugs__>	 Operations, Wikimedia-Shop, Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2190627 (Dzahn) @Robh I wonder if there is anything left to do on this ticket nowadays.
[00:18:27] <wikibugs__>	 Operations, Packaging: Upgrade php5-json .deb to at least 1.3.8 - https://phabricator.wikimedia.org/T160101#3135905 (Dzahn) p:Triage>Normal
[00:20:31] <wikibugs__>	 Operations, MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3135906 (Dzahn) p:Triage>Low
[00:21:42] <wikibugs__>	 Operations, Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3135921 (Dzahn) p:Triage>Normal
[00:23:31] <wikibugs>	 Operations, hardware-requests, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3135929 (Dzahn) p:Triage>Normal
[00:23:47] <wikibugs>	 (PS1) Reedy: Remove $wgProxyList [mediawiki-config] - https://gerrit.wikimedia.org/r/345074
[00:25:40] <wikibugs>	 Operations, MediaWiki-General-or-Unknown, Traffic, HTTPS: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253#3135933 (Krinkle)
[00:27:54] <Reedy>	 jouncebot: next
[00:27:55] <jouncebot>	 In 12 hour(s) and 32 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300)
[00:29:13] <wikibugs>	 Operations, Operations-Software-Development: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#3135943 (Dzahn) p:Triage>Low
[00:30:12] <wikibugs__>	 (PS1) Dzahn: decom ms-fe100[1-4], remove from DHCP and puppet [puppet] - https://gerrit.wikimedia.org/r/345075 (https://phabricator.wikimedia.org/T160986)
[00:30:40] <wikibugs>	 (CR) Tim Starling: [C: 1] Remove $wgProxyList [mediawiki-config] - https://gerrit.wikimedia.org/r/345074 (owner: Reedy)
[00:31:00] <wikibugs__>	 (PS2) Reedy: Remove $wgProxyList [mediawiki-config] - https://gerrit.wikimedia.org/r/345074
[00:31:05] <wikibugs>	 (CR) Reedy: [C: 2] Remove $wgProxyList [mediawiki-config] - https://gerrit.wikimedia.org/r/345074 (owner: Reedy)
[00:31:19] <icinga-wm>	 RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:32:32] <wikibugs>	 (PS1) Dzahn: remove production IPs for ms-fe100[1-4] [dns] - https://gerrit.wikimedia.org/r/345076 (https://phabricator.wikimedia.org/T160986)
[00:32:35] <wikibugs__>	 (Merged) jenkins-bot: Remove $wgProxyList [mediawiki-config] - https://gerrit.wikimedia.org/r/345074 (owner: Reedy)
[00:32:48] <wikibugs>	 (CR) jenkins-bot: Remove $wgProxyList [mediawiki-config] - https://gerrit.wikimedia.org/r/345074 (owner: Reedy)
[00:33:23] <wikibugs__>	 Operations, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3135949 (Dzahn) a:Dzahn
[00:34:05] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/CommonSettings.php: Remove $wgProxyList (duration: 00m 43s)
[00:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:23] <wikibugs>	 Operations, Monitoring: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528#3135950 (Dzahn) p:Triage>Normal
[00:35:16] <wikibugs>	 Operations, Wikimedia-Language-setup, Wikimedia-Site-requests: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3135952 (Dzahn) p:Triage>Normal
[00:35:55] <wikibugs__>	 (PS1) Dzahn: add new language "dty" (Doteli) [dns] - https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529)
[00:36:11] <logmsgbot>	 !log reedy@tin Synchronized private: Remove mwblocker.log (duration: 00m 44s)
[00:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:56] <wikibugs__>	 Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3135961 (Dzahn)
[00:39:43] <wikibugs__>	 (PS2) Dzahn: add new language "dty" (Doteli) [dns] - https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529)
[00:42:01] <wikibugs__>	 Operations: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096#3135963 (Dzahn) p:Triage>Normal
[00:42:34] <wikibugs__>	 (CR) Smalyshev: varnish: move applayer info back to hiera [WIP, 4/4] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) (owner: BBlack)
[00:43:16] <wikibugs>	 (CR) Smalyshev: varnish: move applayer info back to hiera [WIP, 4/4] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) (owner: BBlack)
[00:45:39] <wikibugs>	 Operations, DNS, Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3129255 (Dzahn) It looks like the mediawiki.org zone in DNS already has a TXT record for Google verification:                  600 IN TXT  "google-site-verific...
[00:46:26] <wikibugs__>	 Operations, DNS, Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3135970 (Dzahn) p:Triage>Normal
[00:47:00] <wikibugs__>	 Operations, Services, hardware-requests: Eqiad: (3) hardware access request for RESTBase Staging - https://phabricator.wikimedia.org/T161534#3135972 (Dzahn) p:Triage>Normal
[00:51:21] <wikibugs>	 Operations, Wikimedia-Shop, Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#3135984 (RobH) Open>Resolved a:RobH
[01:06:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 74643.647788 Seconds
[01:06:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 74643.671581 Seconds
[01:06:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 74648.216951 Seconds
[01:09:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:09:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:09:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:12:19] <icinga-wm>	 PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:16:19] <icinga-wm>	 PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:29:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 76023.58268 Seconds
[01:29:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 76023.619729 Seconds
[01:29:40] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 76028.24587 Seconds
[01:29:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 77793.299967 Seconds
[01:30:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 77814.316889 Seconds
[01:30:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 77814.510575 Seconds
[01:41:19] <icinga-wm>	 RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[01:44:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:44:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:45:19] <icinga-wm>	 RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:46:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:47:19] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:47:19] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:47:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 77103.67835 Seconds
[01:47:39] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 77103.68931 Seconds
[01:49:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 78993.227471 Seconds
[01:50:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 79014.172381 Seconds
[01:50:20] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 79014.178217 Seconds
[01:52:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 7.931702 Seconds
[01:52:39] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 7.933534 Seconds
[01:52:40] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 12.570128 Seconds
[01:52:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 20.274013 Seconds
[01:53:19] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 41.021349 Seconds
[01:53:19] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 41.022853 Seconds
[01:55:39] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:56:59] <icinga-wm>	 PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:06:21] <wikibugs__>	 Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic, Patch-For-Review: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3136099 (Krinkle)
[02:22:39] <icinga-wm>	 PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:23:39] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[02:24:59] <icinga-wm>	 RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:34:25] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 12m 37s)
[02:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:54] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Mar 28 02:39:53 UTC 2017 (duration 5m 28s)
[02:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:11] <wikibugs__>	 (CR) Dzahn: [C: -1] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[02:51:39] <wikibugs>	 (PS1) Dzahn: yubiauth: convert to profile/role structure (WIP) [puppet] - https://gerrit.wikimedia.org/r/345085
[02:51:40] <icinga-wm>	 RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[02:52:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] yubiauth: convert to profile/role structure (WIP) [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[02:54:21] <wikibugs>	 (PS2) Dzahn: yubiauth: convert to profile/role structure (WIP) [puppet] - https://gerrit.wikimedia.org/r/345085
[03:09:02] <wikibugs__>	 (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5930/" [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[03:16:09] <icinga-wm>	 PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:17:09] <icinga-wm>	 RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures
[03:25:52] <wikibugs>	 (PS1) Dzahn: remove parsoid-tests.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/345086
[03:26:39] <icinga-wm>	 PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:28:15] <wikibugs>	 (PS2) Dzahn: remove parsoid-tests.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/345086
[03:37:19] <icinga-wm>	 PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:39] <icinga-wm>	 RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[03:58:09] <icinga-wm>	 PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:58:59] <icinga-wm>	 RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 32 minutes ago with 0 failures
[04:05:19] <icinga-wm>	 RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:13:19] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1957.90 Read Requests/Sec=5616.90 Write Requests/Sec=4.70 KBytes Read/Sec=22506.00 KBytes_Written/Sec=2155.20
[04:25:19] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.40 Read Requests/Sec=0.80 Write Requests/Sec=11.90 KBytes Read/Sec=5.60 KBytes_Written/Sec=176.40
[04:34:19] <icinga-wm>	 PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:19] <icinga-wm>	 RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[05:26:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[05:26:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[05:29:59] <icinga-wm>	 PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.286 second response time
[05:30:59] <icinga-wm>	 PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:34:59] <icinga-wm>	 RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.691 second response time
[05:52:07] <marostegui>	 !log Run pt-table-checksum on es2 - T161510
[05:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:14] <stashbot>	 T161510: Run pt-table-checksum on es2 - https://phabricator.wikimedia.org/T161510
[05:55:49] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:55:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:59:59] <icinga-wm>	 RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:01:06] <marostegui>	 !log Deploy schema change on s2.enwiktionary.templatelinks - on codfw master, this will generate lag on codfw slaves (which have been silenced) - T154097
[06:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:13] <stashbot>	 T154097: Remove partitions from enwiktionary.templatelinks in s2 - https://phabricator.wikimedia.org/T154097
[06:03:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[06:03:49] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[06:11:53] <marostegui>	 !log Keep converting unique keys into PK on db1089 - T17441
[06:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:59] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[06:12:37] <wikibugs__>	 (PS1) Urbanecm: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551)
[06:15:19] <icinga-wm>	 PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:09] <icinga-wm>	 RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
[06:16:15] <wikibugs__>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3136180 (Marostegui) Open>Resolved a:Marostegui We thankfully saved the data before reimaging/rebooting it, it is more about the ser...
[06:19:58] <wikibugs>	 Operations: add support to offboard-user to support mailman list removal - https://phabricator.wikimedia.org/T161566#3136184 (MoritzMuehlenhoff) That's mostly a duplicate of T161004
[06:20:39] <icinga-wm>	 PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:25:53] <wikibugs__>	 Operations, Ops-Access-Requests, Performance-Team, Patch-For-Review: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3126819 (MoritzMuehlenhoff) Sounds fine to me, I don't think this needs new meeting approval, since it was just a regression. But let's upd...
[06:26:24] <wikibugs>	 (CR) Muehlenhoff: [C: 1] Add admin group perf-roots to role xhgui. [puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle)
[06:28:11] <wikibugs__>	 (PS1) Muehlenhoff: Fix typo [puppet] - https://gerrit.wikimedia.org/r/345096
[06:28:29] <icinga-wm>	 PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:30:16] <wikibugs>	 (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: Dzahn)
[06:36:49] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:36:49] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:37:49] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:39:09] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[06:39:49] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:40:01] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Fix typo [puppet] - https://gerrit.wikimedia.org/r/345096 (owner: Muehlenhoff)
[06:42:49] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0]
[06:42:59] <icinga-wm>	 PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:43] <wikibugs__>	 Operations, LDAP-Access-Requests, WMDE-Analytics-Engineering, Wikidata, User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3136188 (MoritzMuehlenhoff) Open>Resolved @GoranSMilovanovic : I've added you to the wmde group.
[06:47:40] <icinga-wm>	 RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:50:09] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:54:09] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:54:46] <wikibugs__>	 (PS1) Giuseppe Lavagetto: restbase: convert to use the discovery url for API [puppet] - https://gerrit.wikimedia.org/r/345098
[06:54:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: service::configuration: use discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345099
[06:54:49] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:54:50] <wikibugs__>	 (PS1) Giuseppe Lavagetto: parsoid::testing: use discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345100
[06:54:52] <wikibugs>	 (PS1) Giuseppe Lavagetto: role::mail::mx: switch do discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345101
[06:57:29] <icinga-wm>	 RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:58:09] <wikibugs__>	 (PS2) Giuseppe Lavagetto: restbase: convert to use the discovery url for API [puppet] - https://gerrit.wikimedia.org/r/345098
[06:59:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: 2 C: 2] restbase: convert to use the discovery url for API [puppet] - https://gerrit.wikimedia.org/r/345098 (owner: Giuseppe Lavagetto)
[07:00:26] <wikibugs__>	 Operations, Office-IT, LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3136208 (MoritzMuehlenhoff)
[07:00:28] <wikibugs>	 Operations, Office-IT, LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3136209 (MoritzMuehlenhoff)
[07:00:30] <wikibugs__>	 Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3136207 (MoritzMuehlenhoff)
[07:05:09] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:07:49] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:11:16] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/345102
[07:11:19] <wikibugs__>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/345102
[07:17:17] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/345102 (owner: Marostegui)
[07:18:13] <moritzm>	 !log installing eject security updates on trusty hosts
[07:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:33] <wikibugs__>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/345102 (owner: Marostegui)
[07:18:42] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - https://gerrit.wikimedia.org/r/345102 (owner: Marostegui)
[07:19:41] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T17441 (duration: 00m 43s)
[07:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:48] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[07:37:24] <hashar>	 good morning
[07:39:49] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:40:49] <icinga-wm>	 RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK
[07:41:53] <wikibugs__>	 (CR) Hashar: "Per conversation with Moritz, lets hold until the last Precise instance is removed from labs." [puppet] - https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: Hashar)
[07:55:39] <icinga-wm>	 PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:58:12] <wikibugs__>	 (PS2) Giuseppe Lavagetto: service::configuration: use discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345099
[08:07:47] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 2] service::configuration: use discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345099 (owner: Giuseppe Lavagetto)
[08:07:49] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:18:16] <wikibugs>	 Operations, Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3136346 (elukey) We found a weird regression only on analytics1044, causing sporadic job failures:  ``` 2017-03-27 14:06:02...
[08:21:57] <wikibugs__>	 (PS2) Giuseppe Lavagetto: parsoid::testing: use discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345100
[08:24:39] <icinga-wm>	 RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:25:36] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] parsoid::testing: use discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345100 (owner: Giuseppe Lavagetto)
[08:29:05] <akosiaris>	 !log enable IGMP snooping on all VLANs on asw2-d-eqiad. T133387
[08:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:13] <stashbot>	 T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387
[08:29:53] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441)
[08:35:53] <wikibugs>	 Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136379 (akosiaris) Done. I've deleted the `vlan default` entry as well as the manually added (by me) `private1-d-eqiad` and added the `all` VLAN.  ``` show...
[08:38:48] <wikibugs__>	 (PS3) Alexandros Kosiaris: logstash: Filter ORES logstash messages and set port [puppet] - https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010)
[08:38:55] <wikibugs__>	 (PS4) Alexandros Kosiaris: logstash: Filter ORES logstash messages and set port [puppet] - https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010)
[08:41:44] <wikibugs__>	 (CR) Alexandros Kosiaris: [C: 2] logstash: Filter ORES logstash messages and set port [puppet] - https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) (owner: Alexandros Kosiaris)
[08:43:06] <wikibugs__>	 (CR) Alexandros Kosiaris: [C: 1] role::mail::mx: switch do discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345101 (owner: Giuseppe Lavagetto)
[08:43:47] <wikibugs__>	 (PS2) Giuseppe Lavagetto: role::mail::mx: switch do discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345101
[08:44:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::mail::mx: switch do discovery host for MW API [puppet] - https://gerrit.wikimedia.org/r/345101 (owner: Giuseppe Lavagetto)
[08:45:11] <akosiaris>	 _joe_: merged yours as well
[08:45:27] <_joe_>	 akosiaris: yeah thanks, I was confused :P
[08:46:01] * volans should modify the puppet-merge script so that when commits from multiple committers are merged it announces it here :D
[08:46:07] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[08:47:19] <wikibugs__>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[08:47:27] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[08:47:42] <_joe_>	 volans: stop.
[08:48:15] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 - T17441 (duration: 00m 44s)
[08:48:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:21] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[08:48:49] <marostegui>	 !log Convert wikidatawiki UNIQUE keys into PK on db1092 - T17441
[08:48:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:45] <wikibugs__>	 Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136403 (akosiaris) stalled>Open
[08:56:09] <icinga-wm>	 PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:57:09] <icinga-wm>	 RECOVERY - DPKG on mw1261 is OK: All packages OK
[09:07:40] <wikibugs__>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] url_downloader: convert to profile/role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn)
[09:09:49] <icinga-wm>	 RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:11:28] <wikibugs>	 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3136433 (10fgiunchedi) @jcrespo the release isn't out yet, though we can test what's in git now on a sample of servers, do you have some we could use?
[09:13:22] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3136437 (10fgiunchedi)
[09:13:24] <wikibugs__>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3136438 (10fgiunchedi)
[09:13:27] <wikibugs>	 06Operations, 13Patch-For-Review, 15User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3136435 (10fgiunchedi) 05Open>03Resolved >>! In T123728#3135227, @Krinkle wrote: > Thanks. We'll also need to devise a way to merge...
[09:20:55] <wikibugs>	 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3136452 (10jcrespo) I can give it a look.
[09:21:39] <wikibugs>	 (03PS4) 10Volans: utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: 10Giuseppe Lavagetto)
[09:21:49] <wikibugs__>	 06Operations, 10ops-codfw, 15User-fgiunchedi: Degraded RAID on ms-be2005 - https://phabricator.wikimedia.org/T161358#3136467 (10fgiunchedi) 05Open>03Resolved Disk rebuilding
[09:23:47] <wikibugs>	 (03PS1) 10DCausse: [DNM] new discovery service for CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345115
[09:25:24] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] [DNM] new discovery service for CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345115 (owner: 10DCausse)
[09:31:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> The IRC alerting worked, but the team mailing list never received" [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles)
[09:32:38] <wikibugs__>	 (03PS2) 10Filippo Giunchedi: Increase Thumbor original file size limit to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/344361 (https://phabricator.wikimedia.org/T151456) (owner: 10Gilles)
[09:32:42] <wikibugs>	 06Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360#3129683 (10ema) > upload.wikimedia.org resolves to 194.168.4.100 (cache1.service.virginmedia.net.)  That's not right. Perhaps this was a temporary issue with [[ http://community.virginmedia.com/t5/Switched...
[09:33:49] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 032] Increase Thumbor original file size limit to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/344361 (https://phabricator.wikimedia.org/T151456) (owner: 10Gilles)
[09:35:32] <wikibugs__>	 (03CR) 10Volans: [C: 031] "I would have taken a different approach, but overall the logic looks ok." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: 10Giuseppe Lavagetto)
[09:37:19] <icinga-wm>	 PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:42:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, just a note about restricting access to memcached to thumbor servers only" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles)
[09:44:37] <wikibugs__>	 (03PS2) 10Filippo Giunchedi: Set nginx request time as a header passed to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/344968 (https://phabricator.wikimedia.org/T161535) (owner: 10Gilles)
[09:46:18] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 032] Set nginx request time as a header passed to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/344968 (https://phabricator.wikimedia.org/T161535) (owner: 10Gilles)
[09:48:29] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:49:08] <godog>	 that's me ^
[09:52:09] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:29] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757)
[10:00:19] <icinga-wm>	 PROBLEM - thumbor@8829 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8829 is failed
[10:04:19] <icinga-wm>	 PROBLEM - thumbor@8832 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8832 is failed
[10:05:19] <icinga-wm>	 RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[10:07:00] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: 10Giuseppe Lavagetto)
[10:10:09] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational
[10:10:19] <icinga-wm>	 RECOVERY - thumbor@8832 service on thumbor1002 is OK: OK - thumbor@8832 is active
[10:10:29] <moritzm>	 !log upgraded mw1262 to HHVM 3.18
[10:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:49] <wikibugs>	 (03PS3) 10Elukey: Move hue.w.o's backend to thorium [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527)
[10:12:19] <icinga-wm>	 RECOVERY - thumbor@8829 service on thumbor1001 is OK: OK - thumbor@8829 is active
[10:12:29] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational
[10:14:15] <wikibugs>	 (03CR) 10Elukey: [C: 032] Move hue.w.o's backend to thorium [puppet] - 10https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey)
[10:14:59] <elukey>	 !log Switching hue.w.o's backend (cache misc) from analytics1027 to thorium - T159527
[10:15:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:05] <stashbot>	 T159527: Move away Hue and Camus (and other crons) from analytics1027 - https://phabricator.wikimedia.org/T159527
[10:15:29] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:19:29] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational
[10:26:19] <icinga-wm>	 PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:26:44] <eddiegp>	  /clear
[10:27:32] <marostegui>	 !log Convert dewiki UNIQUE keys into PK on db1092 - https://phabricator.wikimedia.org/T17441
[10:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:25] <wikibugs>	 06Operations, 10ops-eqiad, 10Analytics, 06DC-Ops: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (10elukey)
[10:34:06] <wikibugs__>	 (03PS1) 10Elukey: Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597)
[10:35:46] <Guest122358>	 !
[10:36:46] <ema>	 !log upgrading twisted to 16.2.0 on lvs3003 and lvs3004 (esams secondaries) T160433
[10:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:52] <stashbot>	 T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433
[10:39:37] <ema>	 !log upgrading twisted to 16.2.0 on lvs3001 and lvs3002 (esams primaries) T160433
[10:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: thumbor: rlimit-fsize firejail to 4GB-1 bytes [puppet] - 10https://gerrit.wikimedia.org/r/345118
[10:40:27] <godog>	 moritzm gilles ^ "fun"
[10:41:05] <moritzm>	 godog: ah, did you report this upstream?
[10:41:24] <godog>	 yeah doing so as we speak
[10:41:40] <gilles>	 good thing we don't need more than 4GB :P
[10:41:50] <wikibugs__>	 (03CR) 10Gilles: [C: 031] thumbor: rlimit-fsize firejail to 4GB-1 bytes [puppet] - 10https://gerrit.wikimedia.org/r/345118 (owner: 10Filippo Giunchedi)
[10:43:42] <moritzm>	 fortunately Linux 4.11 introduced the first support for new page table handling, which will allow addressing 4 petabytes on amd64 soon
[10:45:04] <jynus>	 per process?
[10:45:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] thumbor: rlimit-fsize firejail to 4GB-1 bytes [puppet] - 10https://gerrit.wikimedia.org/r/345118 (owner: 10Filippo Giunchedi)
[10:45:28] <moritzm>	 per server
[10:49:19] <wikibugs__>	 (03CR) 10Filippo Giunchedi: "Upstream issue https://github.com/netblue30/firejail/issues/1168" [puppet] - 10https://gerrit.wikimedia.org/r/345118 (owner: 10Filippo Giunchedi)
[10:50:16] <wikibugs__>	 (03CR) 10Gilles: "performance-team@lists.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles)
[10:52:41] <wikibugs>	 (03CR) 10Gilles: Enable memcache-based Thumbor broken thumbnail throttling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles)
[10:54:39] <wikibugs>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3136586 (10elukey) The traffic is definitely decreased a lot from last week, but I am still seeing some 503s (way more than before). I a...
[10:55:19] <icinga-wm>	 RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:56:21] <wikibugs>	 06Operations, 07HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3136588 (10MoritzMuehlenhoff)
[11:10:16] <wikibugs__>	 (03PS1) 10Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[11:10:18] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor original file download limit should be 4GB - https://phabricator.wikimedia.org/T151456#3136630 (10Gilles) 05Open>03Resolved
[11:13:09] <icinga-wm>	 PROBLEM - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[11:13:10] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T161600
[11:13:14] <wikibugs>	 06Operations, 10ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136638 (10ops-monitoring-bot)
[11:13:30] <wikibugs__>	 06Operations, 07HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3136642 (10hashar)
[11:13:31] <wikibugs>	 (03PS2) 10Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[11:14:03] <wikibugs__>	 06Operations, 07HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3136588 (10hashar)
[11:14:14] <wikibugs__>	 06Operations, 06Performance-Team, 10Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3136652 (10Gilles) p:05Low>03Triage
[11:25:53] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[11:26:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[11:27:08] <wikibugs__>	 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3136674 (10Aklapper)
[11:29:52] <wikibugs__>	 (03PS6) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[11:33:13] <wikibugs>	 (03PS7) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[11:35:03] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[11:42:30] <wikibugs>	 (03PS1) 10Elukey: Fix Hue apache config [puppet] - 10https://gerrit.wikimedia.org/r/345129 (https://phabricator.wikimedia.org/T159527)
[11:43:49] <wikibugs>	 (03CR) 10Elukey: [C: 032] Fix Hue apache config [puppet] - 10https://gerrit.wikimedia.org/r/345129 (https://phabricator.wikimedia.org/T159527) (owner: 10Elukey)
[11:46:39] <hashar>	 jouncebot: next
[11:46:39] <jouncebot>	 In 1 hour(s) and 13 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300)
[11:50:41] <wikibugs>	 (03PS3) 10Gehel: maps - keep planet sync logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/344974 (https://phabricator.wikimedia.org/T161542)
[11:51:50] <wikibugs>	 (03CR) 10Gehel: [C: 032] maps - keep planet sync logs for 30 days [puppet] - 10https://gerrit.wikimedia.org/r/344974 (https://phabricator.wikimedia.org/T161542) (owner: 10Gehel)
[11:52:26] <wikibugs__>	 (03CR) 10Gilles: "OK, what I'm seeing is that initially there are 80 https connections (expected, 2 per thumbor process) and then gradually they die until t" [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles)
[12:02:48] <wikibugs__>	 (03CR) 10Gilles: "I'll keep the discussion going on the task, where it'll be more readable." [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles)
[12:05:21] <wikibugs__>	 (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: 10Dzahn)
[12:17:20] <wikibugs__>	 (03PS1) 10Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/345130
[12:29:35] <wikibugs__>	 (03PS2) 10Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/345130
[12:35:43] <wikibugs>	 (03PS3) 10Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/345130
[12:41:29] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:41:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:42:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:45:01] <wikibugs>	 (03PS4) 10Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/345130
[12:48:18] <elukey>	 moritzm: are you working on mw1261?
[12:48:23] <wikibugs__>	 (03PS5) 10Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/345130
[12:48:48] <moritzm>	 not at the moment, in an interview
[12:49:01] <moritzm>	 can you create a stacktrace and depool?
[12:51:46] <wikibugs>	 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3136754 (10dr0ptp4kt) Oh! In that case it is *probably* just actions done with the noc@ account to delegate "full" access to abaso@wikimedia.org to https://media...
[12:52:04] <wikibugs>	 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136756 (10faidon) We've lived with this bug in codfw for so long, I'd say to let it be as-is until we're done with the switchover and postpone that for May on...
[12:52:31] <hashar>	 jouncebot: next
[12:52:31] <jouncebot>	 In 0 hour(s) and 7 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300)
[12:56:36] <wikibugs>	 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136757 (10akosiaris) Agreed
[12:57:02] <wikibugs__>	 (03PS1) 10Gehel: elasticsearch - add dummy certificates for testing [labs/private] - 10https://gerrit.wikimedia.org/r/345136
[12:58:29] <moritzm>	 !log depooled mw1261
[12:58:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:56] <wikibugs__>	 (03CR) 10Gehel: [C: 032] elasticsearch - add dummy certificates for testing [labs/private] - 10https://gerrit.wikimedia.org/r/345136 (owner: 10Gehel)
[12:59:00] <wikibugs>	 (03CR) 10Gehel: [V: 032 C: 032] elasticsearch - add dummy certificates for testing [labs/private] - 10https://gerrit.wikimedia.org/r/345136 (owner: 10Gehel)
[13:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300).
[13:00:04] <jouncebot>	 RoanKattouw: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:10] <hashar>	 o/
[13:00:41] <RoanKattouw>	 Hello
[13:01:01] <hashar>	 good morning
[13:01:17] <hashar>	 wanna handle the deployment of your patch?
[13:01:30] <RoanKattouw>	 Sure
[13:01:54] <wikibugs>	 (03CR) 10Hashar: [C: 031] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope)
[13:02:21] <wikibugs__>	 (03CR) 10Catrope: [C: 032] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope)
[13:02:39] <wikibugs__>	 (03PS5) 10Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437
[13:02:45] <wikibugs>	 (03CR) 10Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope)
[13:02:50] <wikibugs__>	 (03CR) 10Catrope: [C: 032] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope)
[13:03:23] <RoanKattouw>	 Wait, wikibugs reports Gerrit changes now, not grrrit-wm?
[13:04:13] <wikibugs__>	 (03Merged) 10jenkins-bot: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope)
[13:04:26] <wikibugs>	 (03CR) 10jenkins-bot: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343437 (owner: 10Catrope)
[13:07:54] <RoanKattouw>	 Alright, pulled it to mwdebug1002
[13:09:00] <RoanKattouw>	 Hmm, why do I have my language set to Frisian on Polish Wikipedia...
[13:09:48] <Trizek>	 hu?
[13:10:07] <RoanKattouw>	 I probably had a reason to set it to that years ago...
[13:13:01] <wikibugs__>	 (03CR) 10Volans: [C: 032] Minor fixes for library integrations [switchdc] - 10https://gerrit.wikimedia.org/r/344951 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[13:14:22] <wikibugs__>	 (03PS1) 10Gehel: elasticsearch - moved dummy certificates to the correct location [labs/private] - 10https://gerrit.wikimedia.org/r/345137
[13:14:33] <wikibugs>	 (03CR) 10Gehel: [V: 032 C: 032] elasticsearch - moved dummy certificates to the correct location [labs/private] - 10https://gerrit.wikimedia.org/r/345137 (owner: 10Gehel)
[13:15:01] <RoanKattouw>	 Trizek: Unfortunately the beta feature name + description aren't translated into Portuguese
[13:15:30] <wikibugs__>	 (03PS2) 10Gehel: [mwgrep] enable more accurate regex timeout [puppet] - 10https://gerrit.wikimedia.org/r/344925 (owner: 10DCausse)
[13:16:51] <RoanKattouw>	 But I'm sure someone can fix that once they see it
[13:16:56] <RoanKattouw>	 It is translated into Polish though
[13:17:31] <wikibugs>	 (03CR) 10Gehel: [C: 032] [mwgrep] enable more accurate regex timeout [puppet] - 10https://gerrit.wikimedia.org/r/344925 (owner: 10DCausse)
[13:18:09] <RoanKattouw>	 Same for the ORES filters themselves, translated into Polish but not Portuguese
[13:18:48] <wikibugs>	 (03PS3) 10Volans: Add task to update Tendril [switchdc] - 10https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178)
[13:20:16] <RoanKattouw>	 OK, let's take this live
[13:21:53] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345139
[13:21:57] <wikibugs__>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345139
[13:22:30] <logmsgbot>	 !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable RCFilters beta feature on plwiki and ptwiki T158336 (duration: 00m 43s)
[13:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:37] <stashbot>	 T158336: Contact group 1 wikis concerning Filters for recent changes - https://phabricator.wikimedia.org/T158336
[13:23:49] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:26:58] <wikibugs>	 (03PS9) 10BBlack: varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404)
[13:27:25] <RoanKattouw>	 Oh and now I need to run the script for the preference migration
[13:29:42] <RoanKattouw>	 !log Ran initUserPreference.php -s ores-enabled -t rcenhancedfilters   and -s ores-enabled -t oresHighlight  on plwiki and ptwiki
[13:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:30] <wikibugs>	 (03CR) 10Gehel: [C: 031] "LGTM" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/344704 (owner: 10EBernhardson)
[13:30:59] <wikibugs__>	 (03PS1) 10Volans: Reorder tasks [switchdc] - 10https://gerrit.wikimedia.org/r/345141 (https://phabricator.wikimedia.org/T160178)
[13:34:29] <Trizek>	 RoanKattouw: I'm on the translation for the Beta feature title and description
[13:37:12] <RoanKattouw>	 Cool thanks
[13:38:29] <icinga-wm>	 PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:38:35] <wikibugs__>	 (03CR) 10Volans: [C: 032] Reorder tasks [switchdc] - 10https://gerrit.wikimedia.org/r/345141 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[13:38:39] <icinga-wm>	 PROBLEM - HHVM processes on mw1261 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[13:39:15] <volans>	 moritzm: are you working on mw1261?
[13:39:48] <wikibugs__>	 (03PS4) 10Volans: Add task to update Tendril [switchdc] - 10https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178)
[13:40:05] <elukey>	 volans: it is depooled afaics
[13:40:25] <elukey>	 sorry moritzm, didn't see your message :(
[13:42:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 72434 bytes in 0.268 second response time
[13:43:19] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.050 second response time
[13:43:29] <icinga-wm>	 RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational
[13:43:35] <wikibugs__>	 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136845 (10Marostegui)
[13:43:39] <icinga-wm>	 RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 6 processes with command name hhvm
[13:43:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.587 second response time
[13:44:28] <elukey>	 !log started hhvm on mw1261 (still depooled) - no hhvm process running
[13:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:37] <wikibugs>	 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136638 (10Marostegui) @Cmjohnson feel free to replace the disk when you can. Thanks!
[13:44:47] <wikibugs__>	 (03CR) 10Gehel: [WIP] Upgrade logstash to 5.x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: 10EBernhardson)
[13:46:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] Make HostPathAutomounter work for files with . in them [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/343797 (owner: 10Yuvipanda)
[13:48:53] <wikibugs__>	 (03PS4) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson)
[13:49:04] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson)
[13:49:21] <wikibugs>	 (03CR) 10Gehel: Allow search clusters to reindex from eachother (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson)
[13:50:16] <marostegui>	 is the swat done?
[13:50:29] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:51:21] <marostegui>	 hashar RoanKattouw are you guys done with swat?
[13:51:33] <RoanKattouw>	 marostegui: Yes, sorry
[13:51:34] <moritzm>	 elukey: I caught a backtrace, it also crashed yesterday after a similar run time (4-5 hours)
[13:51:52] <marostegui>	 RoanKattouw: cool thank you!! I will deploy db-eqiad.php then :)
[13:51:58] <moritzm>	 SAL contains a few mentions of that crash with 3.12 as well, but it seems to happen more frequently now
[13:52:11] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345139 (owner: 10Marostegui)
[13:52:49] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[13:53:29] <wikibugs__>	 (03PS5) 10Gehel: Allow search clusters to reindex from eachother [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson)
[13:53:36] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345139 (owner: 10Marostegui)
[13:53:49] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345139 (owner: 10Marostegui)
[13:54:45] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 - T17441 (duration: 00m 43s)
[13:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:52] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[13:55:29] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:56:59] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:05] <cmjohnson1>	 ottomata: ping 
[13:57:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:29] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:49] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.027 second response time
[13:59:59] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 72440 bytes in 0.082 second response time
[14:00:19] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.029 second response time
[14:01:06] <elukey>	 cmjohnson1: o/ - if you need anybody from analytics 
[14:01:19] <cmjohnson1>	 elukey: I need to know what disk to swap out
[14:01:33] <cmjohnson1>	 been waiting long enough
[14:01:35] <elukey>	 on an1028?
[14:01:40] <cmjohnson1>	 they've ^
[14:01:41] <cmjohnson1>	 yes
[14:01:53] <elukey>	 ah okok
[14:02:16] <elukey>	 I can try to dump some data to it to make it blip.. is it ok?
[14:02:19] <wikibugs__>	 (03PS1) 10Andrew Bogott: Nova:  Add labvirt1002 back to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/345145 (https://phabricator.wikimedia.org/T159721)
[14:02:38] <wikibugs>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10fgiunchedi) > Point Swift imagescalers to the active MediaWiki > If this uses imagescaler-r...
[14:02:49] <icinga-wm>	 PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:03:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "I've just started a build with this patch included in packager02.packaging.eqiad.wmflabs." [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/343797 (owner: 10Yuvipanda)
[14:03:49] <cmjohnson1>	 elukey okay
[14:04:18] <wikibugs>	 (03CR) 10Gehel: [C: 031] "This will require a cluster restart. We have a cluster restart coming up for kernel upgrade, I'll bundle all that together." [puppet] - 10https://gerrit.wikimedia.org/r/344517 (owner: 10EBernhardson)
[14:04:35] <cmjohnson1>	 elukey: i am looking at it
[14:04:51] <cmjohnson1>	 pretty confident i know which disk but would like confirmation
[14:05:55] <elukey>	 cmjohnson1: read only disk atm..
[14:06:24] <cmjohnson1>	 I am going to pull the one I think it is...only 1 blinking and it matches up to /dev/sdi
[14:06:27] <cmjohnson1>	 can you monitor
[14:07:22] <elukey>	 sure
[14:07:32] <wikibugs__>	 (03CR) 10Andrew Bogott: [C: 032] Nova:  Add labvirt1002 back to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/345145 (https://phabricator.wikimedia.org/T159721) (owner: 10Andrew Bogott)
[14:07:44] <cmjohnson1>	 elukey: pulled it out
[14:08:20] <wikibugs__>	 (03CR) 10Filippo Giunchedi: "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: 10Ema)
[14:09:08] <elukey>	 cmjohnson1: all good for the moment
[14:10:12] <cmjohnson1>	 elukey: disk replaced
[14:11:03] <elukey>	 cmjohnson1: I can see "Firmware state: Unconfigured(good), Spun Up", going to set it up and will let you know. thanks!
[14:11:24] <cmjohnson1>	 cool..lmk if you need anything else
[14:14:49] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 04-1] tlsproxy: simplify prometheus metrics gathering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: 10Ema)
[14:15:26] <wikibugs>	 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 06DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3136925 (10Cmjohnson) Swapped the disk out with a spare on-site. The server is still under warranty so requested a new disk to be sent t...
[14:16:25] <wikibugs>	 (03PS9) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613)
[14:24:45] <wikibugs__>	 (03PS2) 10Gilles: Improve Thumbor nginx timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/344999 (https://phabricator.wikimedia.org/T150746)
[14:24:45] <wikibugs>	 06Operations, 10hardware-requests: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3136941 (10fgiunchedi)
[14:25:09] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[14:26:16] <elukey>	 cmjohnson1: everything looks good! 
[14:26:19] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:28:39] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:29:24] <wikibugs>	 06Operations, 06Performance-Team, 15User-fgiunchedi: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193#3136955 (10Peter) Great, thank you @fgiunchedi !!!
[14:29:48] <wikibugs>	 (03PS1) 10Cmjohnson: Changing ms-be1033 ip to match row/rack change from C to Row B [dns] - 10https://gerrit.wikimedia.org/r/345151
[14:30:12] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Changing ms-be1033 ip to match row/rack change from C to Row B [dns] - 10https://gerrit.wikimedia.org/r/345151 (owner: 10Cmjohnson)
[14:31:09] <wikibugs__>	 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3136969 (10Cmjohnson) @fgiunchedi Moved ms-be1033 to row B. updated dns
[14:31:11] <wikibugs>	 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 06DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3136970 (10elukey) 05Open>03Resolved a:03elukey
[14:31:29] <wikibugs__>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441)
[14:32:49] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[14:34:18] <wikibugs__>	 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3136976 (10Cmjohnson) The disk is internal and the server will need to be powered off to replace the disk. Please coordinate scheduled downtime  w/cmjohnson to replace
[14:36:04] <wikibugs__>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui)
[14:36:19] <icinga-wm>	 PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:37:15] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui)
[14:37:24] <wikibugs__>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui)
[14:37:52] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@bfbaa17]: Increase log level for processinng failures
[14:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:04] <elukey>	 !log ran restart-hhvm on mw1242, hhvm threads stuck (dump debug in /tmp/hhvm.9008.bt.) - HHVM 3.12
[14:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:13] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 - T17441 (duration: 00m 43s)
[14:38:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:18] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[14:38:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.071 second response time
[14:38:59] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@bfbaa17]: Increase log level for processinng failures (duration: 01m 07s)
[14:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:29] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.040 second response time
[14:39:29] <icinga-wm>	 RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 72423 bytes in 0.085 second response time
[14:40:06] <marostegui>	 !log  Convert dewiki UNIQUE keys into PK on db1091 (commonswiki) - T17441
[14:40:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:18] <elukey>	 weird, mw2256 is set to pooled=inactive, but I am pretty sure it wasn't the last time that I worked on it (a couple of days ago)
[14:40:48] <elukey>	 alarm started 13 days ago, at this point my pebcak
[14:40:58] <wikibugs__>	 (03PS3) 10Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[14:41:11] <elukey>	 mmmm even if in the SAL I can see  elukey@puppetmaster1001: conftool action : set/pooled=active; selector: name=mw2256.codfw.wmnet
[14:41:24] <wikibugs>	 (03CR) 10Ema: tlsproxy: simplify prometheus metrics gathering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: 10Ema)
[14:41:45] <elukey>	 sigh set pooled=yes, not active
[14:41:48] * elukey cries in a corner
[14:41:56] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet
[14:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:40] <wikibugs__>	 (03PS1) 10Cmjohnson: Adding mgmt dns entries for T159886 and T159887, frdb1002 and frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345155
[14:45:17] <wikibugs__>	 (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for T159886 and T159887, frdb1002 and frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345155 (owner: 10Cmjohnson)
[14:45:50] <wikibugs__>	 (03PS1) 10Elukey: Prepare mw2090->mw2096 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345156 (https://phabricator.wikimedia.org/T161488)
[14:48:49] <icinga-wm>	 RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[14:49:59] <icinga-wm>	 PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - /root/bindtest/srv/testsnap is not accessible: Permission denied
[14:52:24] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 031] tlsproxy: simplify prometheus metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: 10Ema)
[14:54:19] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[14:56:01] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 032] Improve Thumbor nginx timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/344999 (https://phabricator.wikimedia.org/T150746) (owner: 10Gilles)
[14:58:19] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:59:49] <icinga-wm>	 PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 107, down: 1, dormant: 0, excluded: 2, unused: 0; ge-2/0/5: down - silicon
[15:04:59] <icinga-wm>	 RECOVERY - Disk space on labstore1004 is OK: DISK OK
[15:06:19] <icinga-wm>	 RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[15:10:14] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: service::node: abstract config for scap3, allow use of confd in configuration [puppet] - 10https://gerrit.wikimedia.org/r/345158
[15:11:34] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3137110 (10Gilles) 05Open>03Resolved Alright, the timeouts are solved now. No more 504s. Now the only errors in nginx's logs are when a thumbor instance dies, which res...
[15:11:59] <icinga-wm>	 PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - /root/bindtest/srv/testsnap is not accessible: Permission denied
[15:13:39] <ottomata>	 cmjohnson1:  thanks for an28 help
[15:13:49] <ottomata>	 what's the word on https://phabricator.wikimedia.org/T155065#3110178 ? 
[15:13:55] <ottomata>	 delivery date was friday, wonder if they showed up
[15:17:05] <wikibugs__>	 (03PS3) 10Giuseppe Lavagetto: service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac)
[15:18:43] <wikibugs__>	 (03PS2) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/344615
[15:20:22] <wikibugs__>	 (03PS1) 10Gilles: Disable storing Thumbor thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/345161
[15:20:25] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082218 (10Cmjohnson) plugged frdb1002 into pfw-2-eqiad port 2. It is in the opposite switch than frdb1001 and neither is double connected.  ilom is setup
[15:22:25] <wikibugs__>	 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137200 (10Papaul) @fgiunchedi system is up
[15:23:34] <wikibugs__>	 (03CR) 10Muehlenhoff: [C: 032] Adapt debdeploy grain to rename of nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/344615 (owner: 10Muehlenhoff)
[15:23:34] <wikibugs>	 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3137201 (10Cmjohnson) frdev1001 is plugged into pfw1 port 5
[15:23:51] <icinga-wm>	 RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 109, down: 0, dormant: 0, excluded: 2, unused: 0
[15:24:02] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac)
[15:24:30] <wikibugs__>	 (03PS4) 10Giuseppe Lavagetto: service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac)
[15:25:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac)
[15:27:19] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[15:29:37] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2256 is OK: OK
[15:33:07] <wikibugs__>	 (03PS1) 10Cmjohnson: Adding production dns entries for frdev1001 and frdb1002 T159887 and T159886 [dns] - 10https://gerrit.wikimedia.org/r/345163
[15:33:43] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Adding production dns entries for frdev1001 and frdb1002 T159887 and T159886 [dns] - 10https://gerrit.wikimedia.org/r/345163 (owner: 10Cmjohnson)
[15:40:01] <wikibugs__>	 (03CR) 10Volans: [C: 032] Add task to update Tendril [switchdc] - 10https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans)
[15:41:06] <wikibugs__>	 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3137300 (10Cmjohnson) Added to racktables
[15:41:10] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3137301 (10Cmjohnson) Added to racktables
[15:41:22] <wikibugs__>	 (03CR) 10Jcrespo: "<3 the name" [dns] - 10https://gerrit.wikimedia.org/r/345163 (owner: 10Cmjohnson)
[15:42:07] <icinga-wm>	 RECOVERY - Disk space on labstore1004 is OK: DISK OK
[15:42:41] <wikibugs>	 (03PS1) 10Cmjohnson: Fixing the name for frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345165
[15:43:15] <wikibugs__>	 (03CR) 10Cmjohnson: [C: 032] Fixing the name for frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345165 (owner: 10Cmjohnson)
[15:43:56] <wikibugs>	 (03CR) 10Andrew Bogott: "It seems reasonable to completely remove the old role config since that role doesn't exist anymore." [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff)
[15:44:19] <godog>	 moritzm gilles looks like upstream has fixed the issue already \o/ https://github.com/netblue30/firejail/commit/671ba2b8ef43edd74b32267f22f053cb510b2bde
[15:49:07] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/1: down - Core: cr2-eqiad:xe-3/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave]
[15:49:28] <wikibugs__>	 (03PS1) 10Rush: labstore: nfs-manage-binds improvements [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883)
[15:52:26] <wikibugs__>	 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3137355 (10Cmjohnson) The S/N finally shows up..i submitted a case for this    Your case was successfully submitted. Please note your Case ID: 5318424916 for future reference.
[15:53:03] <wikibugs>	 06Operations, 10ops-eqiad, 10hardware-requests: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3137357 (10Cmjohnson) p:05Normal>03Low
[15:55:31] <moritzm>	 godog: yeah, they usually have really good turnarounds in fixing bugs
[15:56:24] <wikibugs__>	 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3137367 (10Cmjohnson) Disk has been ordered through Dell Create Service Request: Service Tag 3JG3K02  Confirmed: Request 946035459 was successfully submitted.
[15:56:45] <gehel>	 !log banning elastic2021 to run same tests as elastic2020 - T149006
[15:56:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:52] <stashbot>	 T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[15:57:34] <wikibugs>	 (03CR) 10Muehlenhoff: "This configures the salt grain which allows debdeploy (which is based on salt ATM) to address systems per role (since it doesn't have a vi" [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff)
[15:57:50] <wikibugs__>	 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3137369 (10Marostegui) Thanks!
[15:57:56] <wikibugs>	 (03PS3) 10Eevans: Mandatory Cassandra client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342904 (https://phabricator.wikimedia.org/T111113)
[15:58:22] <wikibugs>	 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3137372 (10Cmjohnson) @elukey I have the thermal paste....want to plan for this on Thursday morning (my morning)?
[15:59:39] <wikibugs>	 (03CR) 10Andrew Bogott: "> In general, if a role is renamed, the Hiera entry needs to be moved along." [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff)
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1600).
[16:00:04] <jouncebot>	 urandom: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[16:00:16] * urandom is available
[16:01:23] <moritzm>	 godog: are you handling Eric's cassandra patch from puppet swat?
[16:01:34] <wikibugs__>	 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3137378 (10elukey) >>! In T132256#3137372, @Cmjohnson wrote: > @elukey I have the thermal paste....want to plan for this on Thursday morni...
[16:02:34] <wikibugs>	 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3137379 (10Cmjohnson) To clarify labstore1001 and 2 arrays are going to rack B5 labstore1002 and all arrays are going to B6  Each h...
[16:03:06] <wikibugs__>	 (03PS1) 10EBernhardson: Prevent wikidata dumps from taking all memory on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577)
[16:03:37] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 626656
[16:04:20] <godog>	 moritzm urandom yeah sorry I got distracted
[16:04:27] <wikibugs>	 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137401 (10RobH) So I can login to the mgmt of the system (WMF6406) via serial and see it is rebuilding the raid.  @Papaul, did the system have enough drive trays, or did you have to steal f...
[16:04:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Mandatory Cassandra client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342904 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans)
[16:05:54] <paravoid>	 s/win 23
[16:05:56] <godog>	 urandom: I reckon _joe_ and mobrovac were also restarting / messing with restbase, might conflict ?
[16:06:05] <godog>	 haven't submitted yet
[16:06:15] <urandom>	 godog: shouldn't
[16:06:34] <urandom>	 but we could coordinate
[16:06:42] <urandom>	 godog: it's going to take hours to restart Cassandra
[16:07:02] <urandom>	 mobrovac: are you going to be doing some restbase restarts?
[16:07:04] <wikibugs__>	 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3137404 (10madhuvishy) @Cmjohnson - One edit   > labstore1002 and all arrays are going to B6 I assume that's really all other* arra...
[16:07:43] <godog>	 _joe_: ^ too
[16:08:18] <wikibugs__>	 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3137405 (10Cmjohnson) @madhuvishy that is correct ... labstore1001 and 2 arrays in B5 and labstore1002 and 2 arrays in B6
[16:10:27] <icinga-wm>	 PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:10:54] <_joe_>	 godog: I am restarting rb, yes
[16:10:56] <wikibugs>	 (03PS1) 10EBernhardson: Stop copying cirrus UserTesting logs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/345171
[16:11:00] <wikibugs__>	 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3137408 (10fgiunchedi) @Cmjohnson thanks!  I've fixed the raid on the machines (cfr https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0_Gen...
[16:11:41] <godog>	 ok, just to make sure, I'll merge the patch now but it isn't going to affect anything until cassandra is restarted
[16:11:50] <wikibugs__>	 (03CR) 10EBernhardson: "It looks like we probably want ensure=>absent, as done in this patch, rather than removing entirely. Not 100% sure though" [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson)
[16:13:17] <urandom>	 godog: wfm
[16:13:28] <wikibugs>	 (03CR) 10EBernhardson: "Logs themselves are no longer generated, since Ia6804d12" [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson)
[16:15:03] <_joe_>	 godog: done now
[16:15:21] <urandom>	 godog: awesome; thanks
[16:15:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Disable storing Thumbor thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/345161 (owner: 10Gilles)
[16:15:24] <urandom>	 perfect timing too!
[16:15:29] <godog>	 np, thanks _joe_ urandom !
[16:16:00] <godog>	 this concludes puppetswat
[16:16:14] <urandom>	 :)
[16:18:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Disable storing Thumbor thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/345161 (owner: 10Gilles)
[16:18:59] <elukey>	 godog: no gif to celebrate?
[16:19:01] <elukey>	 :D
[16:19:09] <urandom>	 !log T111113: Restarting Cassandra on restbase2001 to apply mandatory client encryption (canary)
[16:19:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:15] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[16:19:31] <godog>	 elukey: https://i.imgur.com/fzv0cTM.gif
[16:19:57] <urandom>	 ha
[16:20:00] <mobrovac>	 urandom: _joe_ wants to
[16:20:13] <urandom>	 mobrovac: i think he already did
[16:20:22] <godog>	 elukey: I have more when no patches are scheduled though :D
[16:20:27] <urandom>	 mobrovac: 12:15 < _joe_> godog: done now
[16:21:05] <mobrovac>	 k
[16:21:09] * mobrovac hides
[16:21:57] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[16:22:12] <urandom>	 ^^^ there are going to be some of these, i guess
[16:22:36] <urandom>	 icinga seems to catch about 1/3 of them now on a rolling restart
[16:22:57] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 167 days)
[16:23:06] <urandom>	 *just* catches them
[16:23:12] <wikibugs>	 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3137505 (10Jgreen)
[16:26:27] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused
[16:27:27] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.033 second response time on 10.192.16.163 port 9042
[16:28:07] <wikibugs>	 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137524 (10RobH)
[16:30:57] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[16:31:31] * urandom .oO(...or maybe more than 1/3)
[16:31:50] <wikibugs>	 (03PS1) 10RobH: adding temp host graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/345177
[16:31:57] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2017-09-12 15:13:30 +0000 (expires in 167 days)
[16:32:32] <wikibugs>	 (03CR) 10RobH: [C: 032] adding temp host graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/345177 (owner: 10RobH)
[16:33:17] <wikibugs>	 06Operations, 10ops-codfw, 13Patch-For-Review: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137568 (10RobH)
[16:37:23] <wikibugs__>	 (03CR) 10Madhuvishy: "small nits, otherwise +1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) (owner: 10Rush)
[16:38:27] <icinga-wm>	 RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[16:39:20] <urandom>	 !log T111113: Restarting remaining Cassandra instances, rack 'b', codfw (restbase20{02,07,10})
[16:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:26] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[16:43:10] <wikibugs__>	 (03CR) 10Nuria: [C: 031] Stop copying cirrus UserTesting logs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson)
[16:44:17] <wikibugs>	 (03PS1) 10Daniel Kinzler: Allow only properties on Special:EntitiesWithoutLabel and Special:EntitiesWithoutDescription. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887)
[16:47:01] <wikibugs>	 06Operations, 10Citoid, 10Graphoid, 10VisualEditor, and 4 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#3137659 (10mobrovac) 05Open>03Resolved All of the services that do not need the proxy, don't use it. Moreover, with the switch to Scap3 config...
[16:47:02] <urandom>	 _joe_: https://www.mediawiki.org/w/index.php?title=Extension:EventBus&diff=next&oldid=2027568
[16:49:09] <wikibugs__>	 06Operations, 10ops-codfw, 13Patch-For-Review: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137671 (10RobH) Ok, the networking on this is being shitty.  @Papaul: Can you just plug in a usb stick into this system, I'll format it and copy the coal data over.  O...
[16:49:17] <wikibugs__>	 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137672 (10RobH)
[16:49:27] <icinga-wm>	 PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1700). Please do the needful.
[17:00:13] <subbu>	 no parsoid deploy today
[17:00:37] <wikibugs__>	 (03PS1) 10Muehlenhoff: Uninstall eject on jessie onwards [puppet] - 10https://gerrit.wikimedia.org/r/345183
[17:00:39] <wikibugs__>	 (03PS1) 10Jdlrobson: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471)
[17:03:48] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[17:03:54] <wikibugs__>	 (03CR) 10Filippo Giunchedi: Enable memcache-based Thumbor broken thumbnail throttling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles)
[17:04:25] <wikibugs>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3137747 (10Papaul) @Gehel Been on the phone with HP for about 45 minutes.  went over all the logs files they requested and can't find any poten...
[17:04:48] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2017-09-12 15:35:55 +0000 (expires in 167 days)
[17:06:35] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2021.codfw.wmnet
[17:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:55] <gehel>	 papaul: elastic2021 is drained, I'm starting some load on it...
[17:07:06] <godog>	 !log swift codfw-prod: bump ms-be2028 ms-be2039 object weight to 3000 - T158337
[17:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:11] <stashbot>	 T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337
[17:07:15] <papaul>	 gehel: thanks
[17:13:39] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 4321
[17:17:28] <icinga-wm>	 RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[17:18:20] <wikibugs__>	 (03PS2) 10Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/345158
[17:22:20] <wikibugs>	 06Operations, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3137827 (10MaxSem) The only valid use for labs is WMF projects, and those don't support JS in IE under 9 (9 is soon going to...
[17:22:39] <thcipriani>	 !log starting branch cut for 1.29.0-wmf.18
[17:22:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:34] <wikibugs__>	 06Operations, 10hardware-requests, 15User-fgiunchedi: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3137843 (10fgiunchedi) a:03RobH
[17:34:43] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:34:53] <icinga-wm>	 PROBLEM - DPKG on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:35:13] <icinga-wm>	 PROBLEM - Disk space on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:35:43] <icinga-wm>	 PROBLEM - MD RAID on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:36:04] <icinga-wm>	 PROBLEM - configured eth on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:36:23] <icinga-wm>	 PROBLEM - dhclient process on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:36:33] <icinga-wm>	 PROBLEM - puppet last run on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:36:53] <icinga-wm>	 PROBLEM - salt-minion processes on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:38:13] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:43:23] <icinga-wm>	 PROBLEM - HP RAID on ms-be1039 is CRITICAL: Return code of 255 is out of bounds
[17:43:37] <godog>	 that's me ^
[17:44:20] <wikibugs__>	 (PS1) Dduvall: k8s: Accept any given api server authorization mode [puppet] - https://gerrit.wikimedia.org/r/345187
[17:44:33] <icinga-wm>	 PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:45:46] <wikibugs>	 (PS1) Jcrespo: [WIP]Quick & dirty script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509)
[17:46:02] <wikibugs__>	 (CR) Jcrespo: [C: -1] [WIP]Quick & dirty script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: Jcrespo)
[17:46:19] <wikibugs__>	 (PS3) Dzahn: Add admin group perf-roots to role xhgui. [puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle)
[17:47:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP]Quick & dirty script to check data differences between tables [puppet] - https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: Jcrespo)
[17:49:23] <wikibugs>	 (CR) Dzahn: [C: 2] Add admin group perf-roots to role xhgui. [puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle)
[17:50:50] <wikibugs>	 Operations, Office-IT, LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3137904 (bbogaert) Hi @MoritzMuehlenhoff,  Is it possible for us to modify the replication? We have an ou for ex-employees.   Thanks, Byron
[17:51:30] <wikibugs__>	 (CR) Dzahn: "oops, no-op on tungsten. wrong location, it's "xhgui::app"" [puppet] - https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: Krinkle)
[17:53:19] <urandom>	 !log T111113: Restarting Cassandra instances, codfw row 'c'
[17:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:27] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[17:55:10] <wikibugs>	 (PS1) Dzahn: admin: fix admin group for xhgui::app role, adjust description [puppet] - https://gerrit.wikimedia.org/r/345191 (https://phabricator.wikimedia.org/T161261)
[17:58:52] <wikibugs__>	 (CR) Dzahn: [C: 2] "this finally adds krinkle and gilles to tungsten:  http://puppet-compiler.wmflabs.org/5940/tungsten.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/345191 (https://phabricator.wikimedia.org/T161261) (owner: Dzahn)
[17:59:53] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused
[18:00:53] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.033 second response time on 10.192.32.135 port 9042
[18:01:18] <wikibugs__>	 Operations, Ops-Access-Requests, Performance-Team, Patch-For-Review: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3137958 (Dzahn) Open>Resolved Alright, thanks for your comments Moritz.  Done and adjusted the group description as well.   @Krink...
[18:01:37] <wikibugs>	 Operations, Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3137961 (bbogaert) Hi @MoritzMuehlenhoff,  I see the value in making a more generic ou address more use cases, but I would rather have an ou that more aligns more with their purpose. These pe...
[18:01:57] <wikibugs>	 (PS1) Dduvall: [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864)
[18:02:04] <wikibugs>	 Operations, Ops-Access-Requests, Performance-Team: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3137964 (Dzahn)
[18:03:23] <wikibugs__>	 (CR) jerkins-bot: [V: -1] [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: Dduvall)
[18:06:01] <wikibugs__>	 (PS2) Dduvall: [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864)
[18:07:16] <wikibugs__>	 (PS3) Dzahn: admin: create shell account for Paul Norman [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274)
[18:09:23] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused
[18:09:54] <wikibugs__>	 (CR) Dzahn: [C: 2] admin: create shell account for Paul Norman [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: Dzahn)
[18:10:23] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.033 second response time on 10.192.32.137 port 9042
[18:11:08] <wikibugs>	 (CR) Dduvall: "This patchset should probably be split. I just wanted to get my experimental work up for review before pausing the k8s parts in favor of f" [puppet] - https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: Dduvall)
[18:11:10] <wikibugs__>	 (PS3) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158
[18:11:12] <wikibugs>	 (PS1) Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193
[18:11:14] <wikibugs__>	 (PS1) Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - https://gerrit.wikimedia.org/r/345194
[18:11:49] <wikibugs__>	 Operations, hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3137986 (Ottomata)
[18:12:09] <wikibugs>	 Operations, hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3137986 (Ottomata)
[18:12:12] <wikibugs__>	 Operations, DC-Ops, Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3138004 (Cmjohnson) labstore1001 and 2 arrays are in B5 connected to ge-5/0/4 labstore1002 and 2 arrays are in B7 connected to ge...
[18:12:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] parsoid: add ability to use confd to configure active/passive [puppet] - https://gerrit.wikimedia.org/r/345194 (owner: Giuseppe Lavagetto)
[18:12:30] <wikibugs__>	 (CR) Dzahn: "user has been created on bast1001 (and will be soon on bast2001, bast3001, bast4001). You should already be able to SSH there. Now we can " [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: Dzahn)
[18:12:33] <icinga-wm>	 RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[18:13:14] <wikibugs>	 Operations, hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138006 (Ottomata)
[18:13:29] <wikibugs>	 Operations, hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138006 (Ottomata)
[18:13:45] <wikibugs__>	 Operations, hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3137986 (Ottomata)
[18:13:54] <wikibugs>	 Operations, hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3138024 (Ottomata)
[18:14:04] <wikibugs__>	 Operations, Analytics, Analytics-Cluster, hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3138026 (Ottomata)
[18:14:11] <wikibugs>	 Operations, Analytics, Analytics-Cluster, hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138028 (Ottomata)
[18:16:01] <wikibugs__>	 (PS1) Dzahn: admin: add pnorman to maps/kartotherian/tilerator-admins [puppet] - https://gerrit.wikimedia.org/r/345196
[18:18:03] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@1689d86]: Rename event field in logs
[18:18:03] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.139 and port 9042: Connection refused
[18:18:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:30] <wikibugs__>	 (CR) Dzahn: [C: 2] "as approved on ticket and ops meeting" [puppet] - https://gerrit.wikimedia.org/r/345196 (owner: Dzahn)
[18:18:55] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@1689d86]: Rename event field in logs (duration: 00m 52s)
[18:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:03] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.033 second response time on 10.192.32.139 port 9042
[18:24:47] <wikibugs>	 (PS4) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158
[18:24:50] <wikibugs__>	 (PS2) Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193
[18:24:53] <wikibugs>	 (PS2) Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - https://gerrit.wikimedia.org/r/345194
[18:26:19] <wikibugs__>	 (CR) jerkins-bot: [V: -1] parsoid: add ability to use confd to configure active/passive [puppet] - https://gerrit.wikimedia.org/r/345194 (owner: Giuseppe Lavagetto)
[18:29:56] <wikibugs>	 Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3138068 (Dzahn) Hi @Pnorman,  so after [[ https://gerrit.wikimedia.org/r/#/c/345066/ | this ]] merge your shell user has been created on the [[ https://wikitech...
[18:31:23] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.145 and port 9042: Connection refused
[18:32:24] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is OK: TCP OK - 0.033 second response time on 10.192.32.145 port 9042
[18:38:12] <wikibugs__>	 Operations, DC-Ops, Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3138091 (chasemp) labs-support vlan please :)
[18:39:03] <icinga-wm>	 PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:44:19] <urandom>	 !log T111113: Restarting Cassandra instances, codfw row 'c' {{done}}
[18:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:26] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[18:45:39] <urandom>	 !log T111113: Restarting Cassandra instances, codfw row 'd'
[18:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:29] <paladox>	 I've managed to add support for eddsa in gerrit -> https://gerrit-review.googlesource.com/#/c/100998/ :)
[18:52:03] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused
[18:53:11] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.033 second response time on 10.192.48.47 port 9042
[18:55:37] <wikibugs__>	 (PS5) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158
[18:55:39] <wikibugs>	 (PS3) Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193
[18:55:41] <wikibugs__>	 (PS3) Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - https://gerrit.wikimedia.org/r/345194
[18:56:41] <wikibugs__>	 (CR) jerkins-bot: [V: -1] parsoid: add ability to use confd to configure active/passive [puppet] - https://gerrit.wikimedia.org/r/345194 (owner: Giuseppe Lavagetto)
[18:56:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 59286.021947 Seconds
[18:56:44] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 59286.026878 Seconds
[18:56:54] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 59288.314324 Seconds
[18:59:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[18:59:43] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[18:59:53] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[19:00:04] <jouncebot>	 thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1900). Please do the needful.
[19:00:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] "PCC says the change is good, but I won't merge it tonight." [puppet] - https://gerrit.wikimedia.org/r/345158 (owner: Giuseppe Lavagetto)
[19:00:26] * thcipriani does
[19:03:39] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: -1] "https://puppet-compiler.wmflabs.org/5944/wtp1001.eqiad.wmnet/ things are more intertwined than I expected, I need to check this." [puppet] - https://gerrit.wikimedia.org/r/345193 (owner: Giuseppe Lavagetto)
[19:08:03] <icinga-wm>	 RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[19:08:23] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused
[19:09:23] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is OK: TCP OK - 0.033 second response time on 10.192.48.51 port 9042
[19:17:13] <logmsgbot>	 !log thcipriani@tin Started scap: testwiki to php-1.29.0-wmf.18 and rebuild l10n cache
[19:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:11] <wikibugs__>	 (CR) Dzahn: [C: 1] "nothing to eject anyways" [puppet] - https://gerrit.wikimedia.org/r/345183 (owner: Muehlenhoff)
[19:30:13] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200)
[19:31:13] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[19:32:03] <icinga-wm>	 PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:33:13] <urandom>	 !log T111113: Restarting Cassandra instances, codfw row 'd' {{done}}
[19:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:19] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[19:34:23] <wikibugs__>	 Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3138227 (Dzahn) Open>Resolved
[19:35:13] <icinga-wm>	 PROBLEM - Disk space on mwdebug1002 is CRITICAL: DISK CRITICAL - free space: / 1798 MB (3% inode=72%)
[19:37:33] <mobrovac>	 !log restbase deploying d477f495
[19:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:00] <wikibugs__>	 (PS2) Dzahn: decom ms-fe100[1-4], remove from DHCP and puppet [puppet] - https://gerrit.wikimedia.org/r/345075 (https://phabricator.wikimedia.org/T160986)
[19:46:57] <wikibugs>	 (CR) Dzahn: [C: 2] decom ms-fe100[1-4], remove from DHCP and puppet [puppet] - https://gerrit.wikimedia.org/r/345075 (https://phabricator.wikimedia.org/T160986) (owner: Dzahn)
[19:54:43] <icinga-wm>	 PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 1421 MB (3% inode=72%)
[19:55:06] <wikibugs__>	 Operations, DNS, Discovery, Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3138272 (grin) >>! In T161256#3137827, @MaxSem wrote: > The only valid use for labs is WMF projects,   This is not true in...
[19:57:33] <logmsgbot>	 !log thcipriani@tin Finished scap: testwiki to php-1.29.0-wmf.18 and rebuild l10n cache (duration: 40m 19s)
[19:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:10] <icinga-wm>	 RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[20:06:56] <mutante>	 !log ms-fe100[1-4] - disable/stop puppet, stop salt minion, decom (T160986)
[20:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:02] <stashbot>	 T160986: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986
[20:07:32] <wikibugs>	 Operations, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138301 (Dzahn)
[20:08:46] <wikibugs__>	 (PS1) Thcipriani: Group0 to 1.29.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/345219
[20:09:18] <wikibugs__>	 (CR) Thcipriani: [C: 2] Group0 to 1.29.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/345219 (owner: Thcipriani)
[20:10:10] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:10:20] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:10:20] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:10:36] <wikibugs>	 (Merged) jenkins-bot: Group0 to 1.29.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/345219 (owner: Thcipriani)
[20:11:41] <wikibugs>	 (CR) jenkins-bot: Group0 to 1.29.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/345219 (owner: Thcipriani)
[20:13:12] <mutante>	 !log mw1261 HHVM crash as predicted by Moritz - ran sudo hhvm-dump-debug. Backtrace saved as /tmp/hhvm.79460.bt.
[20:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:45] <logmsgbot>	 !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.18
[20:14:47] <mutante>	 !log mw1261 runs with HHVM 3.18 - which seems to have a bug leading to a deadlock every 4-5 hours
[20:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:46] <mutante>	 !log mw1261 - depooled
[20:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:57] <icinga-wm>	 ACKNOWLEDGEMENT - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn runs HHVM 3.18 - Moritz debugging
[20:16:57] <icinga-wm>	 ACKNOWLEDGEMENT - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn runs HHVM 3.18 - Moritz debugging
[20:16:57] <icinga-wm>	 ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn runs HHVM 3.18 - Moritz debugging
[20:17:55] <wikibugs>	 Operations, DNS, Discovery, Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3126688 (Peachey88) >>! In T161256#3138272, @grin wrote: > I would expect some background check from you before answering....
[20:18:58] <mutante>	 !log mwdebug1001 - was low on disk space, 'apt-get clean' - freed about 4GB
[20:19:00] <icinga-wm>	 RECOVERY - Disk space on mwdebug1001 is OK: DISK OK
[20:19:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:41] <mutante>	 !log mwdebug1002 - same, was low on disk space, 'apt-get clean' freed > 3GB
[20:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:20] <icinga-wm>	 RECOVERY - Disk space on mwdebug1002 is OK: DISK OK
[20:21:06] <mutante>	 !log copper - puppet errors due to Failed resource /var/lib/docker/devicemapper ??
[20:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:59] <urandom>	 !log T111113: Restarting Cassandra instances, eqiad row 'a'
[20:22:04] <mutante>	 !log mc1019 - puppet fail due to Failed resource /etc/redis/replica since 4 days
[20:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:06] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[20:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:41] <wikibugs__>	 Operations, Pybal, Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#3138340 (ema)
[20:24:55] <mutante>	 !log ms-fe1001 thru ms-fe1004 - scheduled last downtime for host and services in icinga - shutdown -h now, turn them off, revoke puppet certs, salt-keys...  (T160986)
[20:25:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:02] <stashbot>	 T160986: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986
[20:30:40] <icinga-wm>	 PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:34:59] <wikibugs__>	 Operations, Pybal, Patch-For-Review: Configure pybal ulimits higher - https://phabricator.wikimedia.org/T110091#3138358 (ema) Open>Resolved Closing, LimitNOFILE is set to infinity in the [[https://github.com/wikimedia/PyBal/blob/master/debian/pybal.service#L5| systemd unit file]].
[20:37:03] <wikibugs>	 Operations, Pybal, Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (ema) >>! In T114104#1685872, @mark wrote: > It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirabl...
[20:43:21] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.116 and port 9042: Connection refused
[20:44:22] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042
[20:45:42] <wikibugs>	 (PS1) Andrew Bogott: Keystone 2fa:  Use the wikitech API rather than checking the db directly. [puppet] - https://gerrit.wikimedia.org/r/345231
[20:55:51] <icinga-wm>	 PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.004 second response time
[20:58:01] <icinga-wm>	 PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time
[20:58:41] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[20:59:46] <chasemp>	 ^ madhuvishy can you look at showmount for tools checker?
[20:59:50] <chasemp>	 that's concerning
[21:01:21] <icinga-wm>	 PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time
[21:03:13] <wikibugs__>	 (PS2) Andrew Bogott: Keystone 2fa:  Use the wikitech API rather than checking the db directly. [puppet] - https://gerrit.wikimedia.org/r/345231
[21:03:31] <icinga-wm>	 PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.003 second response time
[21:03:45] <madhuvishy>	 chasemp: aah ['/sbin/showmount', '-e', 'labstore.svc.eqiad.wmnet'],
[21:04:04] <madhuvishy>	 labstore.svc must be down since we took apart the boxes
[21:04:13] <chasemp>	 that is checking the old setup
[21:04:15] <madhuvishy>	 also it's checking the wrong thing
[21:04:16] <madhuvishy>	 yes
[21:04:42] <chasemp>	 madhuvishy: well uh, want to correct that check to say "secondary_cluster_showmount" or something and make it hit the cluster ip?
[21:05:01] <madhuvishy>	 yeah
[21:05:18] <madhuvishy>	 nfs-tools-project.svc.eqiad.wmnet
[21:05:41] <icinga-wm>	 PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.004 second response time
[21:05:49] * madhuvishy goes to silence
[21:06:03] <chasemp>	 ah thanks
[21:07:22] <wikibugs>	 (CR) Andrew Bogott: "Tested on labtest, seems to work fine." [puppet] - https://gerrit.wikimedia.org/r/345231 (owner: Andrew Bogott)
[21:08:18] <urandom>	 !log T111113: Restarting Cassandra instances, eqiad row 'a' {{done}}
[21:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:24] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[21:08:29] <urandom>	 !log T111113: Restarting Cassandra instances, eqiad row 'b'
[21:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:23] <andrewbogott>	 !log upgraded nova-compute on labvirt1014 because it contains a long-awaited bugfix
[21:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:49] <chasemp>	 andrewbogott: slow restart fix?
[21:19:56] <andrewbogott>	 yep!
[21:20:06] <andrewbogott>	 Installing it on 1014 and on labtest, and we'll see if it blows up
[21:20:10] <andrewbogott>	 should be a very minor upgrade though
[21:20:52] <wikibugs__>	 (PS1) Madhuvishy: toolschecker: Update nfs showmount check to test secondary cluster [puppet] - https://gerrit.wikimedia.org/r/345247
[21:21:04] <chasemp>	 andrewbogott: seems like we should roll through labtest first?
[21:21:33] <andrewbogott>	 chasemp: yeah, doing both.  Since 1014 isn't actively scheduled it's another safe test case.
[21:21:39] <chasemp>	 fair point
[21:23:56] <wikibugs__>	 (CR) Rush: [C: 1] toolschecker: Update nfs showmount check to test secondary cluster [puppet] - https://gerrit.wikimedia.org/r/345247 (owner: Madhuvishy)
[21:25:16] <wikibugs>	 (CR) Madhuvishy: [C: 2] toolschecker: Update nfs showmount check to test secondary cluster [puppet] - https://gerrit.wikimedia.org/r/345247 (owner: Madhuvishy)
[21:26:11] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.203 and port 9042: Connection refused
[21:27:11] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is OK: TCP OK - 0.000 second response time on 10.64.32.203 port 9042
[21:33:52] <icinga-wm>	 RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.078 second response time
[21:37:11] <icinga-wm>	 PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:51:16] <wikibugs>	 Operations, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138577 (Dzahn)
[21:54:57] <wikibugs>	 (PS2) Dzahn: remove production IPs for ms-fe100[1-4] [dns] - https://gerrit.wikimedia.org/r/345076 (https://phabricator.wikimedia.org/T160986)
[21:55:06] <urandom>	 !log T111113: Restarting Cassandra instances, eqiad row 'b' {{done}}
[21:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:13] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[21:55:31] <urandom>	 !log T111113: Restarting Cassandra instances, eqiad row 'd'
[21:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:01] <wikibugs>	 Operations, hardware-requests, User-fgiunchedi: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3138585 (RobH) I've created sub-task T161634 in the private procurement space for quotes from Dell.  We're going to order via the system manufacturer, since these...
[21:56:20] <wikibugs__>	 (CR) Dzahn: [C: 2] remove production IPs for ms-fe100[1-4] [dns] - https://gerrit.wikimedia.org/r/345076 (https://phabricator.wikimedia.org/T160986) (owner: Dzahn)
[21:57:32] <wikibugs>	 Operations, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138589 (Dzahn)
[21:59:47] <wikibugs>	 Operations, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138592 (Dzahn) a:Dzahn>RobH @Robh all steps done up to switch ports, per checked boxes above, could you disable the ports? thanks
[22:05:11] <icinga-wm>	 RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[22:07:11] <wikibugs>	 Operations, hardware-requests, Patch-For-Review, User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138606 (RobH) a:RobH>Cmjohnson So, in trying to find these ports, they are not set with a description on the switch.  So @cmjohnson will have to manual...
[22:09:57] <wikibugs__>	 (PS2) Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883)
[22:11:13] <wikibugs__>	 (CR) jerkins-bot: [V: -1] WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) (owner: Rush)
[22:13:08] <wikibugs>	 (PS3) Dzahn: add new language "dty" (Doteli) [dns] - https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529)
[22:13:21] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused
[22:14:09] <wikibugs__>	 (03CR) 10Dzahn: [C: 032] add new language "dty" (Doteli) [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) (owner: 10Dzahn)
[22:14:21] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.136 port 9042
[22:15:36] <wikibugs__>	 (03CR) 10Dzahn: "http://www-01.sil.org/iso639-3/documentation.asp?id=dty" [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) (owner: 10Dzahn)
[22:19:56] <mutante>	 !log DNS - creating new language "dty" (T160865) - running "authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones" to trigger re-creation of zone files after change in langs.tmpl. (gerrit:345077) | https://www.ethnologue.com/language/dty
[22:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:02] <stashbot>	 T160865: Create Wikipedia Khowar - https://phabricator.wikimedia.org/T160865
[22:20:15] <mutante>	 meh, not that ticket
[22:21:01] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused
[22:21:58] <mutante>	 !log DNS - creating new language "dty" (T161529) - running "authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones" to trigger re-creation of zone files after change in langs.tmpl. (gerrit:345077) | https://www.ethnologue.com/language/dty
[22:22:01] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.138 port 9042
[22:22:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:04] <stashbot>	 T161529: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529
[22:24:26] <wikibugs__>	 (03PS3) 10Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883)
[22:25:17] <wikibugs>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (10Dzahn) dty has been created in DNS:  ``` ;; QUESTION SECTION: ;dty.wikipedia.org.  IN A  ;; ANSWER SECTION: dty.wikipedia.org. 600 IN A 208...
[22:26:34] <wikibugs>	 06Operations, 10Pybal, 10Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (10BBlack) I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!).,  What we shou...
[22:31:16] <wikibugs__>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3138679 (10Dzahn) created Wikidata item: https://www.wikidata.org/wiki/Q29048035
[22:44:42] <wikibugs>	 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3138702 (10Dzahn) left comment on https://blog.wikimedia.org/2015/11/04/doteli-wikipedia-makes-significant-progress/#comment-162812
[22:45:08] <urandom>	 !log T111113: Restarting Cassandra instances, eqiad row 'd' {{done}}
[22:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:15] <stashbot>	 T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113
[22:56:55] <wikibugs>	 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138723 (10pmiazga)
[22:58:49] <wikibugs>	 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3138751 (10tstarling) Aaron, have you seen [[https://wikitech.wikimedia.org/wiki/Conftool|the conft...
[23:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T2300). Please do the needful.
[23:00:05] <jouncebot>	 Krinkle and jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:53] <wikibugs>	 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138723 (10Dzahn) @pmiazga Hi, it seems you can ssh to bast1001.wikimedia.org. I see in logs "pam_unix(sshd:session): session opened for user pmiazga".  Which server are you trying to connect to? Could you...
[23:02:15] <wikibugs__>	 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138757 (10Dzahn) Your wikitech password is not related to production shell access in any way. You have 2 separate SSH keys, one for labs and one for production.
[23:03:28] <jdlrobson>	 \o
[23:03:29] <jdlrobson>	 who's deploying?
[23:03:35] <jdlrobson>	 i only have 45 mins :)
[23:03:45] <wikibugs>	 06Operations, 10Ops-Access-Requests: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3138763 (10bmansurov)
[23:03:57] <thcipriani>	 I can SWAT
[23:04:00] <Krinkle>	 o/
[23:05:10] <wikibugs>	 (03PS2) 10Thcipriani: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson)
[23:05:19] <wikibugs__>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson)
[23:08:20] <jdlrobson>	 fyi thcipriani this will take a few minutes for me to test, so if you can and want to queue up Krinkle's patches while i do that go ahead
[23:08:26] <wikibugs>	 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138792 (10pmiazga) ```raynor@DellE6540:~/.ssh » ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000 -vvv                                                            255 ↵ OpenSSH_7.4p1, OpenSSL 1.0.2k  2...
[23:08:37] <jdlrobson>	 (while i test on mwdebug)
[23:08:45] <thcipriani>	 jdlrobson: sure, thanks for the heads-up
[23:08:51] <wikibugs__>	 (03Merged) 10jenkins-bot: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson)
[23:08:59] <wikibugs>	 (03CR) 10jenkins-bot: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson)
[23:09:24] <wikibugs>	 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138796 (10pmiazga) and my ssh config:  ``` Host *     UseRoaming no  Host SmartServer  hostname 192.168.202.2  port 22  user pi  # WikiMedia Host gerrit.wikimedia.org  IdentityFile /home/raynor/.ssh/id_rsa...
[23:09:36] <thcipriani>	 jdlrobson: your patch is on mwdebug1002, check please
[23:12:04] <wikibugs>	 06Operations, 10Ops-Access-Requests: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3138763 (10Dzahn) @bmansurov Would you mind telling me the secret string of your committed identity? (in private email to dzahn@)
[23:14:00] <jdlrobson>	 thcipriani: go for it
[23:14:14] <thcipriani>	 jdlrobson: alright, going live.
[23:15:13] * jdlrobson crosses fingers for no cached html issues
[23:16:31] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:345184|Enable header version 2 on all wikis]] T160471 (duration: 00m 45s)
[23:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:37] <stashbot>	 T160471: Deploy new header to all wikis - https://phabricator.wikimedia.org/T160471
[23:16:55] <thcipriani>	 ^ jdlrobson live everywhere
[23:17:11] <jdlrobson>	 thcipriani: yay thanks doing my second round of testing
[23:17:37] <wikibugs__>	 06Operations, 10Ops-Access-Requests: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3138827 (10bmansurov) @Dzahn thanks for the prompt reply. I've emailed you.
[23:18:05] <thcipriani>	 Krinkle: you patch is live on mwdebug1002, check please
[23:18:07] <thcipriani>	 *your
[23:18:10] <Krinkle>	 thcipriani: Okay!
[23:18:24] <wikibugs__>	 06Operations, 10Ops-Access-Requests: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3138828 (10JoeWalsh)
[23:19:41] <Krinkle>	 thcipriani: verified.
[23:19:44] <wikibugs>	 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138842 (10Dzahn) Hi @pmiazga,  the admin group you are currently in is called "deployment". This gives you access to deployment hosts, tin and mira, mediawiki maintenance hosts / appservers.  And it's inte...
[23:19:51] <thcipriani>	 Krinkle: ok, going live.
[23:20:26] <jdlrobson>	 much appreciated!!!!!!
[23:20:29] <jdlrobson>	 thcipriani: all good here :)
[23:20:43] <thcipriani>	 jdlrobson: cool, thanks for the followup check
[23:22:05] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.29.0-wmf.17/extensions/NavigationTiming/modules/ext.navigationTiming.js: SWAT: [[gerrit:345078|ext.NavigationTiming: Restore unsampled Save Timing]] T161368 (duration: 00m 45s)
[23:22:10] <thcipriani>	 ^ Krinkle live everywhere
[23:22:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:11] <stashbot>	 T161368: Frontend Save Timing broken (<1 metrics per minute) - https://phabricator.wikimedia.org/T161368
[23:24:05] <Krinkle>	 thcipriani: OK. I see traffic recovering in Graphite, so all good.
[23:24:20] <Krinkle>	 https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=SaveTiming&from=now-15m&to=now&
[23:25:04] <thcipriani>	 cool. sounds good.
[23:26:54] <wikibugs>	 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) - https://phabricator.wikimedia.org/T161658#3138858 (10Dzahn)
[23:27:31] <icinga-wm>	 PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:28:34] <wikibugs>	 (03CR) 10Mobrovac: [C: 04-1] "Needs to be rebased and the use_proxy variable must be taken into account too." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345158 (owner: 10Giuseppe Lavagetto)
[23:28:39] <wikibugs__>	 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) - https://phabricator.wikimedia.org/T161658#3138723 (10Dzahn) You can also just recycle this ticket and turn it into a [[ https://wikitech.wikimedia.org/wiki/Production_shell_access#Additional_permissions_for...
[23:29:11] <wikibugs__>	 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 05codfw-rollout: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#3138873 (10Eevans)
[23:40:25] <wikibugs>	 (03PS1) 10Dzahn: admin: update SSH key for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/345267 (https://phabricator.wikimedia.org/T161660)
[23:42:47] <wikibugs>	 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3138921 (10Krinkle) >>! In T123728#3135277, @Ottomata wrote: >>>! In T123728#3135227, @Krinkle wrote: >> Both Statsv and Graphite have no concept of time for incoming data, everything is "no...
[23:45:12] <wikibugs>	 (03PS2) 10Krinkle: Reapply "labtest hiera: use labtestwikitech, not wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/331636 (https://phabricator.wikimedia.org/T145808) (owner: 10Alex Monk)
[23:55:48] <paladox>	 mutante remember the gaga thing you were talking about, well go to 1:15 on https://www.youtube.com/watch?v=px593VRs7vk
[23:55:50] <paladox>	 woops
[23:55:54] <paladox>	 wrong place
[23:56:26] <mutante>	 haha, ok
[23:56:31] <icinga-wm>	 RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures