[00:00:23] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3135842 (10faidon) I haven't heard back, but I noticed [[ https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1205416 | PR1205416 ]] now says: >... [00:02:19] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:06:39] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:10:59] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [00:14:26] 06Operations, 10Wikimedia-Shop, 07Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2190627 (10Dzahn) @Robh I wonder if there is anything left to do on this ticket nowadays. [00:18:27] 06Operations, 10Packaging: Upgrade php5-json .deb to at least 1.3.8 - https://phabricator.wikimedia.org/T160101#3135905 (10Dzahn) p:05Triage>03Normal [00:20:31] 06Operations, 10MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3135906 (10Dzahn) p:05Triage>03Low [00:21:42] 06Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3135921 (10Dzahn) p:05Triage>03Normal [00:23:31] 06Operations, 10hardware-requests, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3135929 (10Dzahn) p:05Triage>03Normal [00:23:47] (03PS1) 10Reedy: Remove $wgProxyList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345074 [00:25:40] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 07HTTPS: Protocol-relative URLs are poorly supported or unsupported by a number of HTTP clients - https://phabricator.wikimedia.org/T54253#3135933 (10Krinkle) [00:27:54] 
jouncebot: next [00:27:55] In 12 hour(s) and 32 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300) [00:29:13] 06Operations, 06Operations-Software-Development: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#3135943 (10Dzahn) p:05Triage>03Low [00:30:12] (03PS1) 10Dzahn: decom ms-fe100[1-4], remove from DHCP and puppet [puppet] - 10https://gerrit.wikimedia.org/r/345075 (https://phabricator.wikimedia.org/T160986) [00:30:40] (03CR) 10Tim Starling: [C: 031] Remove $wgProxyList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345074 (owner: 10Reedy) [00:31:00] (03PS2) 10Reedy: Remove $wgProxyList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345074 [00:31:05] (03CR) 10Reedy: [C: 032] Remove $wgProxyList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345074 (owner: 10Reedy) [00:31:19] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:32:32] (03PS1) 10Dzahn: remove production IPs for ms-fe100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/345076 (https://phabricator.wikimedia.org/T160986) [00:32:35] (03Merged) 10jenkins-bot: Remove $wgProxyList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345074 (owner: 10Reedy) [00:32:48] (03CR) 10jenkins-bot: Remove $wgProxyList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345074 (owner: 10Reedy) [00:33:23] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3135949 (10Dzahn) a:03Dzahn [00:34:05] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Remove $wgProxyList (duration: 00m 43s) [00:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:23] 06Operations, 10Monitoring: incident 20170323-wikibase did not trigger Icinga 
paging - https://phabricator.wikimedia.org/T161528#3135950 (10Dzahn) p:05Triage>03Normal [00:35:16] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3135952 (10Dzahn) p:05Triage>03Normal [00:35:55] (03PS1) 10Dzahn: add new language "dty" (Doteli) [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) [00:36:11] !log reedy@tin Synchronized private: Remove mwblocker.log (duration: 00m 44s) [00:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:56] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3135961 (10Dzahn) [00:39:43] (03PS2) 10Dzahn: add new language "dty" (Doteli) [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) [00:42:01] 06Operations: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096#3135963 (10Dzahn) p:05Triage>03Normal [00:42:34] (03CR) 10Smalyshev: varnish: move applayer info back to hiera [WIP, 4/4] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [00:43:16] (03CR) 10Smalyshev: varnish: move applayer info back to hiera [WIP, 4/4] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/339671 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [00:45:39] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3129255 (10Dzahn) It looks like the mediawiki.org zone in DNS already has a TXT record for Google verification: 600 IN TXT "google-site-verific... 
[00:46:26] 06Operations, 10DNS, 10Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3135970 (10Dzahn) p:05Triage>03Normal [00:47:00] 06Operations, 06Services, 10hardware-requests: Eqiad: (3) hardware access request for RESTBase Staging - https://phabricator.wikimedia.org/T161534#3135972 (10Dzahn) p:05Triage>03Normal [00:51:21] 06Operations, 10Wikimedia-Shop, 07Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#3135984 (10RobH) 05Open>03Resolved a:03RobH [01:06:39] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 74643.647788 Seconds [01:06:39] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 74643.671581 Seconds [01:06:39] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 74648.216951 Seconds [01:09:39] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:09:39] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:09:39] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [01:12:19] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:16:19] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:29:39] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 76023.58268 Seconds [01:29:39] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 76023.619729 Seconds [01:29:40] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 76028.24587 Seconds [01:29:59] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 77793.299967 Seconds [01:30:19] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 77814.316889 Seconds [01:30:19] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 77814.510575 Seconds [01:41:19] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [01:44:39] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [01:44:39] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:45:19] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [01:46:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [01:47:19] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:47:19] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:47:39] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 77103.67835 Seconds [01:47:39] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 77103.68931 Seconds [01:49:59] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 78993.227471 Seconds [01:50:19] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 79014.172381 Seconds [01:50:20] PROBLEM - 
Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 79014.178217 Seconds [01:52:39] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 7.931702 Seconds [01:52:39] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 7.933534 Seconds [01:52:40] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 12.570128 Seconds [01:52:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 20.274013 Seconds [01:53:19] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 41.021349 Seconds [01:53:19] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 41.022853 Seconds [01:55:39] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:56:59] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:06:21] 06Operations, 10MediaWiki-Vagrant, 06Release-Engineering-Team, 07Epic, 13Patch-For-Review: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3136099 (10Krinkle) [02:22:39] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:23:39] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [02:24:59] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [02:34:25] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 12m 37s) [02:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Mar 28 02:39:53 UTC 2017 (duration 5m 28s) [02:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:11] (03CR) 10Dzahn: [C: 04-1] mediawiki::maintenance: convert to profile/role (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/342777 (owner: 10Dzahn) [02:51:39] (03PS1) 10Dzahn: yubiauth: convert to profile/role structure (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/345085 [02:51:40] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [02:52:48] (03CR) 10jerkins-bot: [V: 04-1] yubiauth: convert to profile/role structure (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/345085 (owner: 10Dzahn) [02:54:21] (03PS2) 10Dzahn: yubiauth: convert to profile/role structure (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/345085 [03:09:02] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5930/" [puppet] - 10https://gerrit.wikimedia.org/r/345085 (owner: 10Dzahn) [03:16:09] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:17:09] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [03:25:52] (03PS1) 10Dzahn: remove parsoid-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/345086 [03:26:39] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:28:15] (03PS2) 10Dzahn: remove parsoid-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/345086 [03:37:19] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:55:39] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [03:58:09] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:58:59] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 32 minutes ago with 0 failures [04:05:19] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [04:13:19] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1957.90 Read Requests/Sec=5616.90 Write Requests/Sec=4.70 KBytes Read/Sec=22506.00 KBytes_Written/Sec=2155.20 [04:25:19] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.40 Read Requests/Sec=0.80 Write Requests/Sec=11.90 KBytes Read/Sec=5.60 KBytes_Written/Sec=176.40 [04:34:19] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:02:19] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [05:26:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [05:26:49] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [05:29:59] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.286 second response time [05:30:59] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:34:59] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.691 second response time [05:52:07] !log Run pt-table-checksum on es2 - T161510 [05:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:14] T161510: Run pt-table-checksum on es2 - https://phabricator.wikimedia.org/T161510 [05:55:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:55:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:59:59] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:01:06] !log Deploy schema change on s2.enwiktionary.templatelinks - on codfw master, this will generate lag on codfw slaves (which have been silenced) - T154097 [06:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:13] T154097: Remove partitions from enwiktionary.templatelinks in s2 - https://phabricator.wikimedia.org/T154097 [06:03:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [06:03:49] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [06:11:53] !log Keep converting unique keys into PK on db1089 - T17441 [06:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:59] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - 
https://phabricator.wikimedia.org/T17441 [06:12:37] (03PS1) 10Urbanecm: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) [06:15:19] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:16:09] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures [06:16:15] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3136180 (10Marostegui) 05Open>03Resolved a:03Marostegui We thankfully saved the data before reimaging/rebooting it, it is more about the ser... [06:19:58] 06Operations: add support to offboard-user to support mailman list removal - https://phabricator.wikimedia.org/T161566#3136184 (10MoritzMuehlenhoff) That's mostly a duplicate of T161004 [06:20:39] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:25:53] 06Operations, 10Ops-Access-Requests, 06Performance-Team, 13Patch-For-Review: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3126819 (10MoritzMuehlenhoff) Sound fine to me, I don't think this needs new meeting approval, since it was just a regression. But let's upd... [06:26:24] (03CR) 10Muehlenhoff: [C: 031] Add admin group perf-roots to role xhgui. [puppet] - 10https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: 10Krinkle) [06:28:11] (03PS1) 10Muehlenhoff: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/345096 [06:28:29] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:30:16] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: 10Dzahn) [06:36:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [06:36:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:37:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [06:39:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [06:39:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:40:01] (03CR) 10Muehlenhoff: [C: 032] Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/345096 (owner: 10Muehlenhoff) [06:42:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [06:42:59] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:43] 06Operations, 10LDAP-Access-Requests, 06WMDE-Analytics-Engineering, 10Wikidata, 15User-Addshore: Add goransm to ldap/wmde group - https://phabricator.wikimedia.org/T160924#3136188 (10MoritzMuehlenhoff) 05Open>03Resolved @GoranSMilovanovic : I've added you to the wmde group. 
[06:47:40] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:50:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:54:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:54:46] (03PS1) 10Giuseppe Lavagetto: restbase: convert to use the discovery url for API [puppet] - 10https://gerrit.wikimedia.org/r/345098 [06:54:48] (03PS1) 10Giuseppe Lavagetto: service::configuration: use discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345099 [06:54:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:54:50] (03PS1) 10Giuseppe Lavagetto: parsoid::testing: use discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345100 [06:54:52] (03PS1) 10Giuseppe Lavagetto: role::mail::mx: switch do discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345101 [06:57:29] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:58:09] (03PS2) 10Giuseppe Lavagetto: restbase: convert to use the discovery url for API [puppet] - 10https://gerrit.wikimedia.org/r/345098 [06:59:00] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] restbase: convert to use the discovery url for API [puppet] - 10https://gerrit.wikimedia.org/r/345098 (owner: 10Giuseppe Lavagetto) [07:00:26] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3136208 (10MoritzMuehlenhoff) [07:00:28] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3136209 (10MoritzMuehlenhoff) [07:00:30] 06Operations: Enhance account handling (meta bug) - 
https://phabricator.wikimedia.org/T142815#3136207 (10MoritzMuehlenhoff) [07:05:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:07:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:11:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345102 [07:11:19] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345102 [07:17:17] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345102 (owner: 10Marostegui) [07:18:13] !log installing eject security updates on trusty hosts [07:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:33] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345102 (owner: 10Marostegui) [07:18:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345102 (owner: 10Marostegui) [07:19:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T17441 (duration: 00m 43s) [07:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:48] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [07:37:24] good morning [07:39:49] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:40:49] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK [07:41:53] (03CR) 10Hashar: "Per conversation with Moritz, lets hold until the last Precise instance is removed from labs." 
[puppet] - 10https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [07:55:39] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:58:12] (03PS2) 10Giuseppe Lavagetto: service::configuration: use discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345099 [08:07:47] (03CR) 10Giuseppe Lavagetto: [C: 032] service::configuration: use discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345099 (owner: 10Giuseppe Lavagetto) [08:07:49] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:18:16] 06Operations, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3136346 (10elukey) We found a weird regression only on analytics1044, causing sporadic job failures: ``` 2017-03-27 14:06:02... [08:21:57] (03PS2) 10Giuseppe Lavagetto: parsoid::testing: use discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345100 [08:24:39] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:25:36] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid::testing: use discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345100 (owner: 10Giuseppe Lavagetto) [08:29:05] !log enable IGMP snooping on all VLANs on asw2-d-eqiad. 
T133387 [08:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:13] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [08:29:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) [08:35:53] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136379 (10akosiaris) Done. I 've deleted and `vlan default` entry as well as the manually added (by me) `private1-d-eqiad` and added the `all` VLAN. ``` show... [08:38:48] (03PS3) 10Alexandros Kosiaris: logstash: Filter ORES logstash messages and set port [puppet] - 10https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) [08:38:55] (03PS4) 10Alexandros Kosiaris: logstash: Filter ORES logstash messages and set port [puppet] - 10https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) [08:41:44] (03CR) 10Alexandros Kosiaris: [C: 032] logstash: Filter ORES logstash messages and set port [puppet] - 10https://gerrit.wikimedia.org/r/344407 (https://phabricator.wikimedia.org/T149010) (owner: 10Alexandros Kosiaris) [08:43:06] (03CR) 10Alexandros Kosiaris: [C: 031] role::mail::mx: switch do discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345101 (owner: 10Giuseppe Lavagetto) [08:43:47] (03PS2) 10Giuseppe Lavagetto: role::mail::mx: switch do discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345101 [08:44:10] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::mail::mx: switch do discovery host for MW API [puppet] - 10https://gerrit.wikimedia.org/r/345101 (owner: 10Giuseppe Lavagetto) [08:45:11] _joe_: merged yours as well [08:45:27] <_joe_> akosiaris: yeah thanks, I was confused :P [08:46:01] * volans should modify the 
puppet-merge script so that when multiple committers commit are merged announces it here :D [08:46:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [08:47:19] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [08:47:27] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345109 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [08:47:42] <_joe_> volans: stop. [08:48:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 - T17441 (duration: 00m 44s) [08:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:21] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [08:48:49] !log Convert wikidatawiki UNIQUE keys into PK on db1092 - T17441 [08:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:45] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136403 (10akosiaris) 05stalled>03Open [08:56:09] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:57:09] RECOVERY - DPKG on mw1261 is OK: All packages OK [09:07:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] url_downloader: convert to profile/role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [09:09:49] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [09:11:28] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - 
https://phabricator.wikimedia.org/T161296#3136433 (10fgiunchedi) @jcrespo the release isn't out yet, though we can test what's in git now on a sample of servers, do you have some we could use? [09:13:22] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3136437 (10fgiunchedi) [09:13:24] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3136438 (10fgiunchedi) [09:13:27] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: replace fluorine with mwlog servers (was: Upgrade fluorine to trusty/jessie) - https://phabricator.wikimedia.org/T123728#3136435 (10fgiunchedi) 05Open>03Resolved >>! In T123728#3135227, @Krinkle wrote: > Thanks. We'll also need to devise a way to merge... [09:20:55] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3136452 (10jcrespo) I can give it a look. 
[09:21:39] (03PS4) 10Volans: utils: add create_ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: 10Giuseppe Lavagetto) [09:21:49] 06Operations, 10ops-codfw, 15User-fgiunchedi: Degraded RAID on ms-be2005 - https://phabricator.wikimedia.org/T161358#3136467 (10fgiunchedi) 05Open>03Resolved Disk rebuilding [09:23:47] (03PS1) 10DCausse: [DNM] new discovery service for CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345115 [09:25:24] (03CR) 10jerkins-bot: [V: 04-1] [DNM] new discovery service for CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345115 (owner: 10DCausse) [09:31:34] (03CR) 10Filippo Giunchedi: "> The IRC alerting worked, but the team mailing list never received" [puppet] - 10https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: 10Gilles) [09:32:38] (03PS2) 10Filippo Giunchedi: Increase Thumbor original file size limit to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/344361 (https://phabricator.wikimedia.org/T151456) (owner: 10Gilles) [09:32:42] 06Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360#3129683 (10ema) > upload.wikimedia.org resolves to 194.168.4.100 (cache1.service.virginmedia.net.) That's not right. Perhaps this was a temporary issue with [[ http://community.virginmedia.com/t5/Switched... [09:33:49] (03CR) 10Filippo Giunchedi: [C: 032] Increase Thumbor original file size limit to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/344361 (https://phabricator.wikimedia.org/T151456) (owner: 10Gilles) [09:35:32] (03CR) 10Volans: [C: 031] "I would have take a different approach, but overall the logic looks ok." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: 10Giuseppe Lavagetto) [09:37:19] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[09:42:16] (CR) Filippo Giunchedi: "LGTM overall, just a note about restricting access to memcached to thumbor servers only" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: Gilles)
[09:44:37] (PS2) Filippo Giunchedi: Set nginx request time as a header passed to Thumbor [puppet] - https://gerrit.wikimedia.org/r/344968 (https://phabricator.wikimedia.org/T161535) (owner: Gilles)
[09:46:18] (CR) Filippo Giunchedi: [C: 2] Set nginx request time as a header passed to Thumbor [puppet] - https://gerrit.wikimedia.org/r/344968 (https://phabricator.wikimedia.org/T161535) (owner: Gilles)
[09:48:29] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:49:08] that's me ^
[09:52:09] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:29] (PS5) Giuseppe Lavagetto: utils: add create_ecdsa_cert [puppet] - https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757)
[10:00:19] PROBLEM - thumbor@8829 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8829 is failed
[10:04:19] PROBLEM - thumbor@8832 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8832 is failed
[10:05:19] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[10:07:00] (CR) Giuseppe Lavagetto: [C: 2] utils: add create_ecdsa_cert [puppet] - https://gerrit.wikimedia.org/r/340107 (https://phabricator.wikimedia.org/T158757) (owner: Giuseppe Lavagetto)
[10:10:09] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational
[10:10:19] RECOVERY - thumbor@8832 service on thumbor1002 is OK: OK - thumbor@8832 is active
[10:10:29] !log upgraded mw1262 to HHVM 3.18
[10:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:49] (PS3) Elukey: Move hue.w.o's backend to thorium [puppet] - https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527)
[10:12:19] RECOVERY - thumbor@8829 service on thumbor1001 is OK: OK - thumbor@8829 is active
[10:12:29] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational
[10:14:15] (CR) Elukey: [C: 2] Move hue.w.o's backend to thorium [puppet] - https://gerrit.wikimedia.org/r/344916 (https://phabricator.wikimedia.org/T159527) (owner: Elukey)
[10:14:59] !log Switching hue.w.o's backend (cache misc) from analytics1027 to thorium - T159527
[10:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:05] T159527: Move away Hue and Camus (and other crons) from analytics1027 - https://phabricator.wikimedia.org/T159527
[10:15:29] PROBLEM - Check systemd state on
thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:19:29] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational
[10:26:19] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:26:44] /clear
[10:27:32] !log Convert dewiki UNIQUE keys into PK on db1092 - https://phabricator.wikimedia.org/T17441
[10:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:25] Operations, ops-eqiad, Analytics, DC-Ops: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (elukey)
[10:34:06] (PS1) Elukey: Prepare analytics1027 for decommission [puppet] - https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597)
[10:35:46] !
[10:36:46] !log upgrading twisted to 16.2.0 on lvs3003 and lvs3004 (esams secondaries) T160433
[10:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:52] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433
[10:39:37] !log upgrading twisted to 16.2.0 on lvs3001 and lvs3002 (esams primaries) T160433
[10:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:58] (PS1) Filippo Giunchedi: thumbor: rlimit-fsize firejail to 4GB-1 bytes [puppet] - https://gerrit.wikimedia.org/r/345118
[10:40:27] moritzm gilles ^ "fun"
[10:41:05] godog: ah, did you report this upstream?
[10:41:24] yeah doing so as we speak
[10:41:40] good thing we don't need more than 4GB :P
[10:41:50] (CR) Gilles: [C: 1] thumbor: rlimit-fsize firejail to 4GB-1 bytes [puppet] - https://gerrit.wikimedia.org/r/345118 (owner: Filippo Giunchedi)
[10:43:42] fortunately Linux 4.11 introduced the first support for new page table handling, which will allow to address 4 peta byte on amd64 soon
[10:45:04] per process?
[10:45:14] (CR) Filippo Giunchedi: [C: 2] thumbor: rlimit-fsize firejail to 4GB-1 bytes [puppet] - https://gerrit.wikimedia.org/r/345118 (owner: Filippo Giunchedi)
[10:45:28] per server
[10:49:19] (CR) Filippo Giunchedi: "Upstream issue https://github.com/netblue30/firejail/issues/1168" [puppet] - https://gerrit.wikimedia.org/r/345118 (owner: Filippo Giunchedi)
[10:50:16] (CR) Gilles: "performance-team@lists.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/342431 (https://phabricator.wikimedia.org/T156245) (owner: Gilles)
[10:52:41] (CR) Gilles: Enable memcache-based Thumbor broken thumbnail throttling (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: Gilles)
[10:54:39] Operations, Analytics-Kanban, Traffic, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3136586 (elukey) The traffic is definitely decreased a lot from last week, but I am still seeing some 503s (way more than before). I a...
[10:55:19] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:56:21] Operations, HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3136588 (MoritzMuehlenhoff)
[11:10:16] (PS1) Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[11:10:18] Operations, Performance-Team, Thumbor, Patch-For-Review: Thumbor original file download limit should be 4GB - https://phabricator.wikimedia.org/T151456#3136630 (Gilles) Open>Resolved
[11:13:09] PROBLEM - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[11:13:10] ACKNOWLEDGEMENT - MegaRAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T161600
[11:13:14] Operations, ops-eqiad: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136638 (ops-monitoring-bot)
[11:13:30] Operations, HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3136642 (hashar)
[11:13:31] (PS2) Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[11:14:03] Operations, HHVM: Monitor/address HHVM bytecode cache depletion on mediawiki app servers - https://phabricator.wikimedia.org/T161598#3136588 (hashar)
[11:14:14] Operations, Performance-Team, Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3136652 (Gilles) p:Low>Triage
[11:25:53] (PS5) Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[11:26:45] (CR) jerkins-bot: [V: -1]
mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo)
[11:27:08] Operations, Mail, Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3136674 (Aklapper)
[11:29:52] (PS6) Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[11:33:13] (PS7) Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[11:35:03] (CR) Jcrespo: [C: 2] mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: Jcrespo)
[11:42:30] (PS1) Elukey: Fix Hue apache config [puppet] - https://gerrit.wikimedia.org/r/345129 (https://phabricator.wikimedia.org/T159527)
[11:43:49] (CR) Elukey: [C: 2] Fix Hue apache config [puppet] - https://gerrit.wikimedia.org/r/345129 (https://phabricator.wikimedia.org/T159527) (owner: Elukey)
[11:46:39] jouncebot: next
[11:46:39] In 1 hour(s) and 13 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300)
[11:50:41] (PS3) Gehel: maps - keep planet sync logs for 30 days [puppet] - https://gerrit.wikimedia.org/r/344974 (https://phabricator.wikimedia.org/T161542)
[11:51:50] (CR) Gehel: [C: 2] maps - keep planet sync logs for 30 days [puppet] - https://gerrit.wikimedia.org/r/344974 (https://phabricator.wikimedia.org/T161542) (owner: Gehel)
[11:52:26] (CR) Gilles: "OK, what I'm seeing is that initially there are 80 https connections (expected, 2 per thumbor process) and then gradually they die until t" [puppet] -
https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: Gilles)
[12:02:48] (CR) Gilles: "I'll keep the discussion going on the task, where it'll be more readable." [puppet] - https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: Gilles)
[12:05:21] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: Dzahn)
[12:17:20] (PS1) Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/345130
[12:29:35] (PS2) Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/345130
[12:35:43] (PS3) Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/345130
[12:41:29] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:41:59] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:42:09] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:45:01] (PS4) Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/345130
[12:48:18] moritzm: are you working on mw1261?
[12:48:23] (PS5) Gehel: [WIP] elasticsearch - move to ecdsa certificates and tlsproxy module [puppet] - https://gerrit.wikimedia.org/r/345130
[12:48:48] not at the moment, in an interview
[12:49:01] can you create a stacktrace and depool?
[12:51:46] Operations, DNS, Traffic: mediawiki.org DNS TXT entry for Google Search Console site verification - https://phabricator.wikimedia.org/T161343#3136754 (dr0ptp4kt) Oh! In that case it is *probably* just actions done with the noc@ account to delegate "full" access to abaso@wikimedia.org to https://media...
[12:52:04] Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136756 (faidon) We've lived with this bug in codfw for so long, I'd say to let it be as-is until we're done with the switchover and postpone that for May on...
[12:52:31] jouncebot: next
[12:52:31] In 0 hour(s) and 7 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300)
[12:56:36] Operations, netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3136757 (akosiaris) Agreed
[12:57:02] (PS1) Gehel: elasticsearch - add dummy certificates for testing [labs/private] - https://gerrit.wikimedia.org/r/345136
[12:58:29] !log depooled mw1261
[12:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:56] (CR) Gehel: [C: 2] elasticsearch - add dummy certificates for testing [labs/private] - https://gerrit.wikimedia.org/r/345136 (owner: Gehel)
[12:59:00] (CR) Gehel: [V: 2 C: 2] elasticsearch - add dummy certificates for testing [labs/private] - https://gerrit.wikimedia.org/r/345136 (owner: Gehel)
[13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1300).
[13:00:04] RoanKattouw: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:10] o/
[13:00:41] Hello
[13:01:01] good morning
[13:01:17] wanna handle the deployment of your patch?
[13:01:30] Sure
[13:01:54] (CR) Hashar: [C: 1] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437 (owner: Catrope)
[13:02:21] (CR) Catrope: [C: 2] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437 (owner: Catrope)
[13:02:39] (PS5) Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437
[13:02:45] (CR) Catrope: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437 (owner: Catrope)
[13:02:50] (CR) Catrope: [C: 2] Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437 (owner: Catrope)
[13:03:23] Wait, wikibugs reports Gerrit changes now, not grrrit-wm?
[13:04:13] (Merged) jenkins-bot: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437 (owner: Catrope)
[13:04:26] (CR) jenkins-bot: Enable RCFilters beta feature on plwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/343437 (owner: Catrope)
[13:07:54] Alright, pulled it to mwdebug1002
[13:09:00] Hmm, why do I have my language set to Frisian on Polish Wikipedia...
[13:09:48] hu?
[13:10:07] I probably had a reason to set it to that years ago...
[13:13:01] (CR) Volans: [C: 2] Minor fixes for library integrations [switchdc] - https://gerrit.wikimedia.org/r/344951 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:14:22] (PS1) Gehel: elasticsearch - moved dummy certificates to the correct location [labs/private] - https://gerrit.wikimedia.org/r/345137
[13:14:33] (CR) Gehel: [V: 2 C: 2] elasticsearch - moved dummy certificates to the correct location [labs/private] - https://gerrit.wikimedia.org/r/345137 (owner: Gehel)
[13:15:01] Trizek: Unfortunately the beta feature name + description aren't translated into Portuguese
[13:15:30] (PS2) Gehel: [mwgrep] enable more accurate regex timeout [puppet] - https://gerrit.wikimedia.org/r/344925 (owner: DCausse)
[13:16:51] But I'm sure someone can fix that once they see it
[13:16:56] It is translated into Polish though
[13:17:31] (CR) Gehel: [C: 2] [mwgrep] enable more accurate regex timeout [puppet] - https://gerrit.wikimedia.org/r/344925 (owner: DCausse)
[13:18:09] Same for the ORES filters themselves, translated into Polish but not Portuguese
[13:18:48] (PS3) Volans: Add task to update Tendril [switchdc] - https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178)
[13:20:16] OK, let's take this live
[13:21:53] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/345139
[13:21:57] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/345139
[13:22:30] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable RCFilters beta feature on plwiki and ptwiki T158336 (duration: 00m 43s)
[13:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:37] T158336: Contact group 1 wikis concerning Filters for recent changes - https://phabricator.wikimedia.org/T158336
[13:23:49] PROBLEM - puppet last run on ms-fe3001 is CRITICAL:
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:26:58] (PS9) BBlack: varnish: refactor all clusters for active/active [puppet] - https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404)
[13:27:25] Oh and now I need to run the script for the preference migration
[13:29:42] !log Ran initUserPreference.php -s ores-enabled -t rcenhancedfilters and -s ores-enabled -t oresHighlight on plwiki and ptwiki
[13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:30] (CR) Gehel: [C: 1] "LGTM" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/344704 (owner: EBernhardson)
[13:30:59] (PS1) Volans: Reorder tasks [switchdc] - https://gerrit.wikimedia.org/r/345141 (https://phabricator.wikimedia.org/T160178)
[13:34:29] RoanKattouw: I'm on the translation for the Beta feature title and description
[13:37:12] Cool thanks
[13:38:29] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:38:35] (CR) Volans: [C: 2] Reorder tasks [switchdc] - https://gerrit.wikimedia.org/r/345141 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[13:38:39] PROBLEM - HHVM processes on mw1261 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[13:39:15] moritzm: are you working on mw1261?
[13:39:48] (PS4) Volans: Add task to update Tendril [switchdc] - https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178)
[13:40:05] volans: it is depooled afaics
[13:40:25] sorry moritzm, didn't see your message :(
[13:42:59] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 72434 bytes in 0.268 second response time
[13:43:19] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.050 second response time
[13:43:29] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational
[13:43:35] Operations, ops-eqiad, DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136845 (Marostegui)
[13:43:39] RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 6 processes with command name hhvm
[13:43:49] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.587 second response time
[13:44:28] !log started hhvm on mw1261 (still depooled) - no hhvm process running
[13:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:37] Operations, ops-eqiad, DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136638 (Marostegui) @Cmjohnson feel free to replace the disk when you can. Thanks!
[13:44:47] (CR) Gehel: [WIP] Upgrade logstash to 5.x (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) (owner: EBernhardson)
[13:46:05] (CR) Alexandros Kosiaris: [C: 2] Make HostPathAutomounter work for files with .
in them [debs/kubernetes] - https://gerrit.wikimedia.org/r/343797 (owner: Yuvipanda)
[13:48:53] (PS4) Gehel: Allow search clusters to reindex from eachother [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson)
[13:49:04] (CR) jerkins-bot: [V: -1] Allow search clusters to reindex from eachother [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson)
[13:49:21] (CR) Gehel: Allow search clusters to reindex from eachother (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson)
[13:50:16] is the swat done?
[13:50:29] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:51:21] hashar RoanKattouw are you guys done with swat?
[13:51:33] marostegui: Yes, sorry
[13:51:34] elukey: I caught a backtrace, it also crashed yesterday after a similar run time (4-5 hours)
[13:51:52] RoanKattouw: cool thank you!!
I will deploy db-eqiad.php then :)
[13:51:58] SAL contains a few mentions of that crash with 3.12 as well, but it seems to happen more frequently now
[13:52:11] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/345139 (owner: Marostegui)
[13:52:49] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[13:53:29] (PS5) Gehel: Allow search clusters to reindex from eachother [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson)
[13:53:36] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/345139 (owner: Marostegui)
[13:53:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - https://gerrit.wikimedia.org/r/345139 (owner: Marostegui)
[13:54:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1092 - T17441 (duration: 00m 43s)
[13:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:52] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[13:55:29] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 425 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:56:59] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:05] ottomata: ping
[13:57:09] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:29] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:49] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.027 second response time
[13:59:59] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 72440 bytes in 0.082 second response time
[14:00:19] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.029 second response time
[14:01:06] cmjohnson1: o/ - if you need anybody from analytics
[14:01:19] elukey: I need to know what disk to swap out
[14:01:33] been waiting long enough
[14:01:35] on an1028?
[14:01:40] they've ^
[14:01:41] yes
[14:01:53] ah okok
[14:02:16] I can try to dump some data to it to make it blip.. is it ok?
[14:02:19] (PS1) Andrew Bogott: Nova: Add labvirt1002 back to the scheduler pool [puppet] - https://gerrit.wikimedia.org/r/345145 (https://phabricator.wikimedia.org/T159721)
[14:02:38] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Patch-For-Review, Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (fgiunchedi) > Point Swift imagescalers to the active MediaWiki > If this uses imagescaler-r...
[14:02:49] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:03:27] (CR) Alexandros Kosiaris: "I 've just started a build with this patch included in packager02.packaging.eqiad.wmflabs." [debs/kubernetes] - https://gerrit.wikimedia.org/r/343797 (owner: Yuvipanda)
[14:03:49] elukey okay
[14:04:18] (CR) Gehel: [C: 1] "This will require a cluster restart. We have a cluster restart coming up for kernel upgrade, I'll bundle all that together." [puppet] - https://gerrit.wikimedia.org/r/344517 (owner: EBernhardson)
[14:04:35] elukey: i am looking at it
[14:04:51] pretty confident i know which disk but would like confirmation
[14:05:55] cmjohnson1: read only disk atm..
[14:06:24] I am going to pull the one I think it is...only 1 blinking and it matches up to /dev/sdi
[14:06:27] can you monitor
[14:07:22] sure
[14:07:32] (CR) Andrew Bogott: [C: 2] Nova: Add labvirt1002 back to the scheduler pool [puppet] - https://gerrit.wikimedia.org/r/345145 (https://phabricator.wikimedia.org/T159721) (owner: Andrew Bogott)
[14:07:44] elukey: pulled it out
[14:08:20] (CR) Filippo Giunchedi: "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: Ema)
[14:09:08] cmjohnson1: all good for the moment
[14:10:12] elukey: disk replaced
[14:11:03] cmjohnson1: I can see "Firmware state: Unconfigured(good), Spun Up", going to set it up and will let you know. thanks!
[14:11:24] cool..lmk if you need anything else
[14:14:49] (CR) Filippo Giunchedi: [C: -1] tlsproxy: simplify prometheus metrics gathering (1 comment) [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: Ema)
[14:15:26] Operations, ops-eqiad, Analytics, Analytics-Cluster, DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3136925 (Cmjohnson) Swapped the disk out with a spare on-site. The server is still under warranty so requested a new disk to be sent t...
[14:16:25] (PS9) Gehel: maps - cleartables osm replication [puppet] - https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613)
[14:24:45] (PS2) Gilles: Improve Thumbor nginx timeout settings [puppet] - https://gerrit.wikimedia.org/r/344999 (https://phabricator.wikimedia.org/T150746)
[14:24:45] Operations, hardware-requests: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3136941 (fgiunchedi)
[14:25:09] RECOVERY - Hadoop DataNode on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[14:26:16] cmjohnson1: everything looks good!
[14:26:19] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:28:39] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:29:24] Operations, Performance-Team, User-fgiunchedi: Upgrade to Grafana 4.2.0 - https://phabricator.wikimedia.org/T161193#3136955 (Peter) Great, thank you @fgiunchedi !!!
[14:29:48] (PS1) Cmjohnson: Changing ms-be1033 ip to match row/rack change from C to Row B [dns] - https://gerrit.wikimedia.org/r/345151
[14:30:12] (CR) Cmjohnson: [C: 2] Changing ms-be1033 ip to match row/rack change from C to Row B [dns] - https://gerrit.wikimedia.org/r/345151 (owner: Cmjohnson)
[14:31:09] Operations, ops-eqiad, Patch-For-Review, User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3136969 (Cmjohnson) @fgiunchedi Moved ms-be1033 to row B.
updated dns
[14:31:11] Operations, ops-eqiad, Analytics, Analytics-Cluster, DC-Ops: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632#3136970 (elukey) Open>Resolved a:elukey
[14:31:29] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441)
[14:32:49] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[14:34:18] Operations, ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3136976 (Cmjohnson) The disk is internal and the server will need to be powered off to replace the disk. Please coordinate scheduled downtime w/cmjohnson to replace
[14:36:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[14:36:19] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:37:15] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[14:37:24] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/345152 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[14:37:52] !log ppchelko@tin Started deploy [changeprop/deploy@bfbaa17]: Increase log level for processing failures
[14:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:04] !log ran restart-hhvm on mw1242, hhvm threads stuck (dump debug in /tmp/hhvm.9008.bt.)
- HHVM 3.12
[14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1091 - T17441 (duration: 00m 43s)
[14:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:18] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[14:38:39] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.071 second response time
[14:38:59] !log ppchelko@tin Finished deploy [changeprop/deploy@bfbaa17]: Increase log level for processing failures (duration: 01m 07s)
[14:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:29] RECOVERY - Nginx local proxy to apache on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.040 second response time
[14:39:29] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 72423 bytes in 0.085 second response time
[14:40:06] !log Convert dewiki UNIQUE keys into PK on db1091 (commonswiki) - T17441
[14:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:18] weird, mw2256 is set to pooled=inactive, but I am pretty sure it wasn't the last time that I worked on it (a couple of days ago)
[14:40:48] alarm started 13 days ago, at this point my pebcak
[14:40:58] (PS3) Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[14:41:11] mmmm even if in the SAL I can see elukey@puppetmaster1001: conftool action : set/pooled=active; selector: name=mw2256.codfw.wmnet
[14:41:24] (CR) Ema: tlsproxy: simplify prometheus metrics gathering (1 comment) [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: Ema)
[14:41:45] sigh set pooled=yes, not active
[14:41:48] *
elukey cries in a corner [14:41:56] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2256.codfw.wmnet [14:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:40] (03PS1) 10Cmjohnson: Adding mgmt dns entries for T159886 and T159887, frdb1002 and frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345155 [14:45:17] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for T159886 and T159887, frdb1002 and frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345155 (owner: 10Cmjohnson) [14:45:50] (03PS1) 10Elukey: Prepare mw2090->mw2096 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345156 (https://phabricator.wikimedia.org/T161488) [14:48:49] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:49:59] PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - /root/bindtest/srv/testsnap is not accessible: Permission denied [14:52:24] (03CR) 10Filippo Giunchedi: [C: 031] tlsproxy: simplify prometheus metrics gathering [puppet] - 10https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: 10Ema) [14:54:19] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:56:01] (03CR) 10Filippo Giunchedi: [C: 032] Improve Thumbor nginx timeout settings [puppet] - 10https://gerrit.wikimedia.org/r/344999 (https://phabricator.wikimedia.org/T150746) (owner: 10Gilles) [14:58:19] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:59:49] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 107, down: 1, dormant: 0, excluded: 2, unused: 0; ge-2/0/5: down - silicon [15:04:59] RECOVERY - Disk space on labstore1004 is OK: DISK OK [15:06:19] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:10:14] (03PS1) 10Giuseppe Lavagetto: service::node: abstract config for scap3, allow use of confd in configuration [puppet] - 10https://gerrit.wikimedia.org/r/345158 [15:11:34] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Nginx timeouts on Thumbor - https://phabricator.wikimedia.org/T150746#3137110 (10Gilles) 05Open>03Resolved Alright, the timeouts are solved now. No more 504s. Now the only errors in nginx's logs are when a thumbor instance dies, which res... [15:11:59] PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - /root/bindtest/srv/testsnap is not accessible: Permission denied [15:13:39] cmjohnson1: thanks for an28 help [15:13:49] what's the word on https://phabricator.wikimedia.org/T155065#3110178 ? [15:13:55] delivery date was friday, wonder if they showed up [15:17:05] (03PS3) 10Giuseppe Lavagetto: service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [15:18:43] (03PS2) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/344615 [15:20:22] (03PS1) 10Gilles: Disable storing Thumbor thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/345161 [15:20:25] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3082218 (10Cmjohnson) plugged frrdb1002 into pfw-2-eqiad port 2. It is in opposite switch then frdb1001 and neither is double connected. 
ilom is setup [15:22:25] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137200 (10Papaul) @fgiunchedi system is up [15:23:34] (03CR) 10Muehlenhoff: [C: 032] Adapt debdeploy grain to rename of nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/344615 (owner: 10Muehlenhoff) [15:23:34] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3137201 (10Cmjohnson) frdev1001 is plugged into pfw1 port 5 [15:23:51] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 109, down: 0, dormant: 0, excluded: 2, unused: 0 [15:24:02] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [15:24:30] (03PS4) 10Giuseppe Lavagetto: service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [15:25:16] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node: Do not use the proxy by default [puppet] - 10https://gerrit.wikimedia.org/r/344996 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [15:27:19] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:29:37] RECOVERY - mediawiki-installation DSH group on mw2256 is OK: OK [15:33:07] (03PS1) 10Cmjohnson: Adding production dns entries for frdev1001 and frdb1002 T159887 and T159886 [dns] - 10https://gerrit.wikimedia.org/r/345163 [15:33:43] (03CR) 10Cmjohnson: [C: 032] Adding production dns entries for frdev1001 and frdb1002 T159887 and T159886 [dns] - 10https://gerrit.wikimedia.org/r/345163 (owner: 10Cmjohnson) [15:40:01] (03CR) 10Volans: [C: 032] Add task to update Tendril [switchdc] - 
10https://gerrit.wikimedia.org/r/345045 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [15:41:06] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3137300 (10Cmjohnson) Added to racktables [15:41:10] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frdb1002 - https://phabricator.wikimedia.org/T159886#3137301 (10Cmjohnson) Added to racktables [15:41:22] (03CR) 10Jcrespo: "<3 the name" [dns] - 10https://gerrit.wikimedia.org/r/345163 (owner: 10Cmjohnson) [15:42:07] RECOVERY - Disk space on labstore1004 is OK: DISK OK [15:42:41] (03PS1) 10Cmjohnson: Fixing the name for frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345165 [15:43:15] (03CR) 10Cmjohnson: [C: 032] Fixing the name for frdev1001 [dns] - 10https://gerrit.wikimedia.org/r/345165 (owner: 10Cmjohnson) [15:43:56] (03CR) 10Andrew Bogott: "It seems reasonable to completely remove the old role config since that role doesn't exist anymore." [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff) [15:44:19] moritzm gilles looks like upstream has fixed the issue already \o/ https://github.com/netblue30/firejail/commit/671ba2b8ef43edd74b32267f22f053cb510b2bde [15:49:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/1: down - Core: cr2-eqiad:xe-3/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave] [15:49:28] (03PS1) 10Rush: labstore: nfs-manage-binds improvements [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) [15:52:26] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3137355 (10Cmjohnson) The S/N finally shows up..i submitted a case for this Your case was successfully submitted. Please note your Case ID: 5318424916 for future reference. 
[15:53:03] 06Operations, 10ops-eqiad, 10hardware-requests: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3137357 (10Cmjohnson) p:05Normal>03Low [15:55:31] godog: yeah, they usually have really good turnarounds in fixing bugs [15:56:24] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3137367 (10Cmjohnson) Disk has been ordered through Dell Create Service Request: Service Tag 3JG3K02 Confirmed: Request 946035459 was successfully submitted. [15:56:45] !log banning elastic2021 to run same tests as elastic2020 - T149006 [15:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:52] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [15:57:34] (03CR) 10Muehlenhoff: "This configures the salt grain which allows debdeploy (which is based on salt ATM) to address systems per role (since it doesn't have a vi" [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff) [15:57:50] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3137369 (10Marostegui) Thanks! [15:57:56] (03PS3) 10Eevans: Mandatory Cassandra client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342904 (https://phabricator.wikimedia.org/T111113) [15:58:22] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3137372 (10Cmjohnson) @elukey I have the thermal paste....want to plan for this on Thursday morning (my morning)? [15:59:39] (03CR) 10Andrew Bogott: "> In general, if a role is renamed, the Hiera entry needs to be moved along." [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff) [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. 
Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1600). [16:00:04] urandom: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:16] * urandom is available [16:01:23] godog: are you handling Eric's cassandra patch from puppet swat? [16:01:34] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3137378 (10elukey) >>! In T132256#3137372, @Cmjohnson wrote: > @elukey I have the thermal paste....want to plan for this on Thursday morni... [16:02:34] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3137379 (10Cmjohnson) To clarify labstore1001 and 2 arrays are going to rack B5 labstore1002 and all arrays are going to B6 Each h... [16:03:06] (03PS1) 10EBernhardson: Prevent wikidata dumps from taking all memory on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577) [16:03:37] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 626656 [16:04:20] moritzm urandom yeah sorry I got distracted [16:04:27] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137401 (10RobH) So I can login to the mgmt of the system (WMF6406) via serial and see it is rebuilding the raid. @Papaul, did the system have enough drive trays, or did you have to steal f... 
[16:04:56] (03CR) 10Filippo Giunchedi: [C: 032] Mandatory Cassandra client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342904 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [16:05:54] s/win 23 [16:05:56] urandom: I reckon _joe_ and mobrovac were also restarting / messing with restbase, might conflict ? [16:06:05] haven't submitted yet [16:06:15] godog: shouldn't [16:06:34] but we could coordinate [16:06:42] godog: it's going to take hours to restart Cassandra [16:07:02] mobrovac: are you going to be doing some restbase restarts? [16:07:04] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3137404 (10madhuvishy) @Cmjohnson - One edit > labstore1002 and all arrays are going to B6 I assume that's really all other* arra... [16:07:43] _joe_: ^ too [16:08:18] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3137405 (10Cmjohnson) @madhuvishy that is correct ... labstore1001 and 2 arrays in B5 and labstore1002 and 2 arrays in B6 [16:10:27] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:10:54] <_joe_> godog: I am restarting rb, yes [16:10:56] (03PS1) 10EBernhardson: Stop copying cirrus UserTesting logs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/345171 [16:11:00] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3137408 (10fgiunchedi) @Cmjohnson thanks! I've fixed the raid on the machines (cfr https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0_Gen... 
[16:11:41] ok, just to make sure, I'll merge the patch now but it isn't going to affect anything until cassandra is restarted [16:11:50] (03CR) 10EBernhardson: "It looks like we probably want ensure=>absent, as done in this patch, rather than removing entirely. Not 100% sure though" [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson) [16:13:17] godog: wfm [16:13:28] (03CR) 10EBernhardson: "Logs themselves are no longer generated, since Ia6804d12" [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson) [16:15:03] <_joe_> godog: done now [16:15:21] godog: awesome; thanks [16:15:22] (03PS2) 10Filippo Giunchedi: Disable storing Thumbor thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/345161 (owner: 10Gilles) [16:15:24] perfect timing too! [16:15:29] np, thanks _joe_ urandom ! [16:16:00] this concludes puppetswat [16:16:14] :) [16:18:28] (03CR) 10Filippo Giunchedi: [C: 032] Disable storing Thumbor thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/345161 (owner: 10Gilles) [16:18:59] godog: no gif to celebrate? 
[16:19:01] :D [16:19:09] !log T111113: Restarting Cassandra on restbase2001 to apply mandatory client encryption (canary) [16:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:15] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:19:31] elukey: https://i.imgur.com/fzv0cTM.gif [16:19:57] ha [16:20:00] urandom: _joe_ wants to [16:20:13] mobrovac: i think he already did [16:20:22] elukey: I have more when no patches are scheduled though :D [16:20:27] mobrovac: 12:15 < _joe_> godog: done now [16:21:05] k [16:21:09] * mobrovac hides [16:21:57] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:22:12] ^^^ there are going to be some of these, i guess [16:22:36] icinga seems to catch about 1/3 of them now on a rolling restart [16:22:57] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2017-09-12 15:13:25 +0000 (expires in 167 days) [16:23:06] *just* catches them [16:23:12] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: rack and cable frdev1001 - https://phabricator.wikimedia.org/T159887#3137505 (10Jgreen) [16:26:27] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [16:27:27] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.033 second response time on 10.192.16.163 port 9042 [16:28:07] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137524 (10RobH) [16:30:57] PROBLEM - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:31:31] * urandom .oO(...or maybe more than 1/3) [16:31:50] (03PS1) 10RobH: adding temp host graphite2003 [dns] - 
10https://gerrit.wikimedia.org/r/345177 [16:31:57] RECOVERY - cassandra-c SSL 10.192.16.164:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-c valid until 2017-09-12 15:13:30 +0000 (expires in 167 days) [16:32:32] (03CR) 10RobH: [C: 032] adding temp host graphite2003 [dns] - 10https://gerrit.wikimedia.org/r/345177 (owner: 10RobH) [16:33:17] 06Operations, 10ops-codfw, 13Patch-For-Review: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137568 (10RobH) [16:37:23] (03CR) 10Madhuvishy: "small nits, otherwise +1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) (owner: 10Rush) [16:38:27] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:39:20] !log T111113: Restarting remaining Cassandra instances, rack 'b', codfw (restbase20{02,07,10}) [16:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:26] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:43:10] (03CR) 10Nuria: [C: 031] Stop copying cirrus UserTesting logs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson) [16:44:17] (03PS1) 10Daniel Kinzler: Allow only properties on Special:EntitiesWithoutLabel and Special:EntitiesWithoutDescription. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887) [16:47:01] 06Operations, 10Citoid, 10Graphoid, 10VisualEditor, and 4 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#3137659 (10mobrovac) 05Open>03Resolved All of the services that do not need the proxy, don't use it. Moreover, with the switch to Scap3 config... 
[16:47:02] _joe_: https://www.mediawiki.org/w/index.php?title=Extension:EventBus&diff=next&oldid=2027568 [16:49:09] 06Operations, 10ops-codfw, 13Patch-For-Review: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137671 (10RobH) Ok, the networking on this is being shitty. @Papaul: Can you just plug in a usb stick into this system, I'll format it and copy the coal data over. O... [16:49:17] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3137672 (10RobH) [16:49:27] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1700). Please do the needful. [17:00:13] no parsoid deploy today [17:00:37] (03PS1) 10Muehlenhoff: Uninstall eject on jessie onwards [puppet] - 10https://gerrit.wikimedia.org/r/345183 [17:00:39] (03PS1) 10Jdlrobson: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) [17:03:48] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:03:54] (03CR) 10Filippo Giunchedi: Enable memcache-based Thumbor broken thumbnail throttling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [17:04:25] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3137747 (10Papaul) @Gehel Been on the phone with HP for about 45 minutes. went over all the logs files they requested and can't find any poten... 
[17:04:48] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2017-09-12 15:35:55 +0000 (expires in 167 days) [17:06:35] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2021.codfw.wmnet [17:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:55] papaul: elastic2021 is drained, I'm starting some load on it... [17:07:06] !log swift codfw-prod: bump ms-be2028 ms-be2039 object weight to 3000 - T158337 [17:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:11] T158337: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337 [17:07:15] gehel: thanks [17:13:39] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 4321 [17:17:28] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:18:20] (03PS2) 10Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/345158 [17:22:20] 06Operations, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3137827 (10MaxSem) The only valid use for labs is WMF projects, and those don't support JS in IE under 9 (9 iS soon going to... 
[17:22:39] !log starting branch cut for 1.29.0-wmf.18 [17:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:34] 06Operations, 10hardware-requests, 15User-fgiunchedi: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3137843 (10fgiunchedi) a:03RobH [17:34:43] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:34:53] PROBLEM - DPKG on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:35:13] PROBLEM - Disk space on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:35:43] PROBLEM - MD RAID on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:36:04] PROBLEM - configured eth on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:36:23] PROBLEM - dhclient process on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:36:33] PROBLEM - puppet last run on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:36:53] PROBLEM - salt-minion processes on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:38:13] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:43:23] PROBLEM - HP RAID on ms-be1039 is CRITICAL: Return code of 255 is out of bounds [17:43:37] that's me ^ [17:44:20] (03PS1) 10Dduvall: k8s: Accept any given api server authorization mode [puppet] - 10https://gerrit.wikimedia.org/r/345187 [17:44:33] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:45:46] (03PS1) 10Jcrespo: [WIP]Quick & dirty script to check data differences between tables [puppet] - 10https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) [17:46:02] (03CR) 10Jcrespo: [C: 04-1] [WIP]Quick & dirty script to check data differences between tables [puppet] - 10https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: 10Jcrespo) [17:46:19] (03PS3) 10Dzahn: Add admin group perf-roots to role xhgui. [puppet] - 10https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: 10Krinkle) [17:47:25] (03CR) 10jerkins-bot: [V: 04-1] [WIP]Quick & dirty script to check data differences between tables [puppet] - 10https://gerrit.wikimedia.org/r/345188 (https://phabricator.wikimedia.org/T160509) (owner: 10Jcrespo) [17:49:23] (03CR) 10Dzahn: [C: 032] Add admin group perf-roots to role xhgui. [puppet] - 10https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: 10Krinkle) [17:50:50] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3137904 (10bbogaert) Hi @MoritzMuehlenhoff, Is it possible for us to modify the replication? We have an ou for ex-employees. Thanks, Byron [17:51:30] (03CR) 10Dzahn: "oops, no-op on tungsten. 
wrong location, it's "xhgui::app"" [puppet] - 10https://gerrit.wikimedia.org/r/344531 (https://phabricator.wikimedia.org/T161261) (owner: 10Krinkle) [17:53:19] !log T111113: Restarting Cassandra instances, codfw row 'c' [17:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:27] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [17:55:10] (03PS1) 10Dzahn: admin: fix admin group for xhgui::app role, adjust description [puppet] - 10https://gerrit.wikimedia.org/r/345191 (https://phabricator.wikimedia.org/T161261) [17:58:52] (03CR) 10Dzahn: [C: 032] "this finally adds krinkle and gilles to tungsten: http://puppet-compiler.wmflabs.org/5940/tungsten.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/345191 (https://phabricator.wikimedia.org/T161261) (owner: 10Dzahn) [17:59:53] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused [18:00:53] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.033 second response time on 10.192.32.135 port 9042 [18:01:18] 06Operations, 10Ops-Access-Requests, 06Performance-Team, 13Patch-For-Review: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3137958 (10Dzahn) 05Open>03Resolved Alright, thanks for your comments Moritz. Done and adjusted the group description as well. @Krink... [18:01:37] 06Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#3137961 (10bbogaert) Hi @MoritzMuehlenhoff, I see the value in making a more generic ou address more use cases, but I would rather have an ou that more aligns more with their purpose. These pe... 
[18:01:57] (03PS1) 10Dduvall: [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) [18:02:04] 06Operations, 10Ops-Access-Requests, 06Performance-Team: Restore perf-roots access to xhgui (tungsten) - https://phabricator.wikimedia.org/T161261#3137964 (10Dzahn) [18:03:23] (03CR) 10jerkins-bot: [V: 04-1] [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: 10Dduvall) [18:06:01] (03PS2) 10Dduvall: [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) [18:07:16] (03PS3) 10Dzahn: admin: create shell account for Paul Norman [puppet] - 10https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) [18:09:23] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [18:09:54] (03CR) 10Dzahn: [C: 032] admin: create shell account for Paul Norman [puppet] - 10https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: 10Dzahn) [18:10:23] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.033 second response time on 10.192.32.137 port 9042 [18:11:08] (03CR) 10Dduvall: "This patchset should probably be split. 
I just wanted to get my experimental work up for review before pausing the k8s parts in favor of f" [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: 10Dduvall) [18:11:10] (03PS3) 10Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/345158 [18:11:12] (03PS1) 10Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - 10https://gerrit.wikimedia.org/r/345193 [18:11:14] (03PS1) 10Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 [18:11:49] 06Operations, 10hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3137986 (10Ottomata) [18:12:09] 06Operations, 10hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3137986 (10Ottomata) [18:12:12] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3138004 (10Cmjohnson) labstore1001 and 2 arrays are in B5 connected to ge-5/0/4 labstore1002 and 2 arrays are in B7 connected to ge... [18:12:21] (03CR) 10jerkins-bot: [V: 04-1] parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 (owner: 10Giuseppe Lavagetto) [18:12:30] (03CR) 10Dzahn: "user has been created on bast1001 (and will be soon on bast2001, bast3001, bast4001). You should already be able to SSH there. 
Now we can " [puppet] - 10https://gerrit.wikimedia.org/r/345066 (https://phabricator.wikimedia.org/T161274) (owner: 10Dzahn) [18:12:33] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:13:14] 06Operations, 10hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138006 (10Ottomata) [18:13:29] 06Operations, 10hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138006 (10Ottomata) [18:13:45] 06Operations, 10hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3137986 (10Ottomata) [18:13:54] 06Operations, 10hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3138024 (10Ottomata) [18:14:04] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3138026 (10Ottomata) [18:14:11] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138028 (10Ottomata) [18:16:01] (03PS1) 10Dzahn: admin: add pnorman to maps/kartotherian/tilerator-admins [puppet] - 10https://gerrit.wikimedia.org/r/345196 [18:18:03] !log ppchelko@tin Started deploy [changeprop/deploy@1689d86]: Rename event field in logs [18:18:03] PROBLEM - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.139 and port 9042: Connection refused [18:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:30] (03CR) 10Dzahn: [C: 032] "as approved on ticket and ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/345196 (owner: 10Dzahn) [18:18:55] !log ppchelko@tin Finished deploy [changeprop/deploy@1689d86]: Rename event field in logs (duration: 00m 52s) [18:19:01] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:03] RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.033 second response time on 10.192.32.139 port 9042 [18:24:47] (03PS4) 10Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/345158 [18:24:50] (03PS2) 10Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - 10https://gerrit.wikimedia.org/r/345193 [18:24:53] (03PS2) 10Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 [18:26:19] (03CR) 10jerkins-bot: [V: 04-1] parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 (owner: 10Giuseppe Lavagetto) [18:29:56] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3138068 (10Dzahn) Hi @Pnorman, so after [[ https://gerrit.wikimedia.org/r/#/c/345066/ | this ]] merge your shell user has been created on the [[ https://wikitech... [18:31:23] PROBLEM - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.145 and port 9042: Connection refused [18:32:24] RECOVERY - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is OK: TCP OK - 0.033 second response time on 10.192.32.145 port 9042 [18:38:12] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3138091 (10chasemp) labs-support vlan please :) [18:39:03] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:44:19] !log T111113: Restarting Cassandra instances, codfw row 'c' {{done}} [18:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:26] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [18:45:39] !log T111113: Restarting Cassandra instances, codfw row 'd' [18:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:29] I've managed to add support for eddsa in gerrit -> https://gerrit-review.googlesource.com/#/c/100998/ :) [18:52:03] PROBLEM - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused [18:53:11] RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.033 second response time on 10.192.48.47 port 9042 [18:55:37] (03PS5) 10Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - 10https://gerrit.wikimedia.org/r/345158 [18:55:39] (03PS3) 10Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - 10https://gerrit.wikimedia.org/r/345193 [18:55:41] (03PS3) 10Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 [18:56:41] (03CR) 10jerkins-bot: [V: 04-1] parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 (owner: 10Giuseppe Lavagetto) [18:56:43] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 59286.021947 Seconds [18:56:44] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 59286.026878 Seconds [18:56:54] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 59288.314324 Seconds [18:59:43] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [18:59:43] RECOVERY - 
Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [18:59:53] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T1900). Please do the needful. [19:00:24] (03CR) 10Giuseppe Lavagetto: [C: 031] "PCC says the change is good, but I won't merge it tonight." [puppet] - 10https://gerrit.wikimedia.org/r/345158 (owner: 10Giuseppe Lavagetto) [19:00:26] * thcipriani does [19:03:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "https://puppet-compiler.wmflabs.org/5944/wtp1001.eqiad.wmnet/ things are more intertwined than I expected, I need to check this." [puppet] - 10https://gerrit.wikimedia.org/r/345193 (owner: 10Giuseppe Lavagetto) [19:08:03] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:08:23] PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused [19:09:23] RECOVERY - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is OK: TCP OK - 0.033 second response time on 10.192.48.51 port 9042 [19:17:13] !log thcipriani@tin Started scap: testwiki to php-1.29.0-wmf.18 and rebuild l10n cache [19:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:11] (03CR) 10Dzahn: [C: 031] "nothing to eject anyways" [puppet] - 10https://gerrit.wikimedia.org/r/345183 (owner: 10Muehlenhoff) [19:30:13] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200) [19:31:13] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:32:03] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:33:13] !log T111113: Restarting Cassandra instances, codfw row 'd' {{done}} [19:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:19] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [19:34:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to maps servers for pnorman - https://phabricator.wikimedia.org/T161274#3138227 (10Dzahn) 05Open>03Resolved [19:35:13] PROBLEM - Disk space on mwdebug1002 is CRITICAL: DISK CRITICAL - free space: / 1798 MB (3% inode=72%) [19:37:33] !log restbase deploying d477f495 [19:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:00] (03PS2) 10Dzahn: decom ms-fe100[1-4], remove from DHCP and puppet [puppet] - 10https://gerrit.wikimedia.org/r/345075 (https://phabricator.wikimedia.org/T160986) [19:46:57] (03CR) 10Dzahn: [C: 032] decom ms-fe100[1-4], remove from DHCP and puppet [puppet] - 10https://gerrit.wikimedia.org/r/345075 (https://phabricator.wikimedia.org/T160986) (owner: 10Dzahn) [19:54:43] PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 1421 MB (3% inode=72%) [19:55:06] 06Operations, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3138272 (10grin) >>! In T161256#3137827, @MaxSem wrote: > The only valid use for labs is WMF projects, This is not true in... 
[19:57:33] !log thcipriani@tin Finished scap: testwiki to php-1.29.0-wmf.18 and rebuild l10n cache (duration: 40m 19s) [19:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:10] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:06:56] !log ms-fe100[1-4] - disable/stop puppet, stop salt minion, decom (T160986) [20:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:02] T160986: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986 [20:07:32] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138301 (10Dzahn) [20:08:46] (03PS1) 10Thcipriani: Group0 to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345219 [20:09:18] (03CR) 10Thcipriani: [C: 032] Group0 to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345219 (owner: 10Thcipriani) [20:10:10] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:20] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:20] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] (03Merged) 10jenkins-bot: Group0 to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345219 (owner: 10Thcipriani) [20:11:41] (03CR) 10jenkins-bot: Group0 to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345219 (owner: 10Thcipriani) [20:13:12] !log mw1261 HHVM crash as predicted by Moritz - ran sudo hhvm-dump-debug. Backtrace saved as /tmp/hhvm.79460.bt. 
[20:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:45] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.18 [20:14:47] !log mw1261 runs with HHVM 3.18 - which seems to have a bug leading to a deadlock every 4-5 hours [20:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:46] !log mw1261 - depooled [20:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:57] ACKNOWLEDGEMENT - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn runs HHVM 3.18 - Moritz debugging [20:16:57] ACKNOWLEDGEMENT - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn runs HHVM 3.18 - Moritz debugging [20:16:57] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn runs HHVM 3.18 - Moritz debugging [20:17:55] 06Operations, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3126688 (10Peachey88) >>! In T161256#3138272, @grin wrote: > I would expect some background check from you before answering.... [20:18:58] !log mwdebug1001 - was low on disk space, 'apt-get clean' - freed about 4GB [20:19:00] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK [20:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:41] !log mwdebug1002 - same, was low on disk space, 'apt-get clean' freed > 3GB [20:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:20] RECOVERY - Disk space on mwdebug1002 is OK: DISK OK [20:21:06] !log copper - puppet errors due to Failed resource /var/lib/docker/devicemapper ?? 
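Editor's note on the mwdebug disk alerts above: the "DISK CRITICAL - free space: / 1798 MB (3% ...)" messages report free megabytes and free percentage for a mount point. A minimal sketch of that kind of check follows. The thresholds, the function names, and the simplified output format (no inode figure) are assumptions for illustration, not the actual Icinga plugin configuration used here.

```python
import shutil


def classify(free_mb, free_pct, warn_pct=6.0, crit_pct=3.0):
    """Map free space to an alert line. Thresholds are illustrative assumptions."""
    if free_pct <= crit_pct:
        state = "CRITICAL"
    elif free_pct <= warn_pct:
        state = "WARNING"
    else:
        state = "OK"
    return f"DISK {state} - free space: {free_mb} MB ({free_pct:.0f}%)"


def disk_status(path="/", warn_pct=6.0, crit_pct=3.0):
    """Measure free space on the filesystem holding `path` and classify it."""
    usage = shutil.disk_usage(path)
    free_mb = usage.free // (1024 * 1024)
    free_pct = 100.0 * usage.free / usage.total
    return classify(free_mb, free_pct, warn_pct, crit_pct)
```

With these assumed thresholds, the mwdebug1002 figures (1798 MB, 3%) would classify as CRITICAL, which matches the alert in the log; freeing a few GB with 'apt-get clean' raises the free percentage above the threshold again.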
[20:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:59] !log T111113: Restarting Cassandra instances, eqiad row 'a' [20:22:04] !log mc1019 - puppet fail due to Failed resource /etc/redis/replica since 4 days [20:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:06] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [20:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:41] 06Operations, 10Pybal, 10Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#3138340 (10ema) [20:24:55] !log ms-fe1001 thru msfe1004 - scheduled last downtime for host and services in icinga - shutdown -h now, turn them off, revoke puppet certs, salt-keys... (T160986) [20:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:02] T160986: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986 [20:30:40] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:34:59] 06Operations, 10Pybal, 13Patch-For-Review: Configure pybal ulimits higher - https://phabricator.wikimedia.org/T110091#3138358 (10ema) 05Open>03Resolved Closing, LimitNOFILE is set to infinity in the [[https://github.com/wikimedia/PyBal/blob/master/debian/pybal.service#L5| systemd unit file]]. [20:37:03] 06Operations, 10Pybal, 10Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (10ema) >>! In T114104#1685872, @mark wrote: > It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirabl... 
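Editor's note on the recurring "cassandra-* CQL ... port 9042" alerts: each PROBLEM/RECOVERY pair during the rolling restarts reflects a plain TCP connect to the CQL port, which is refused while the instance is down and succeeds once it is back. A rough sketch of such a probe is below. The function name and the message wording are modeled on the log lines and are assumptions, not the real monitoring plugin.

```python
import socket
import time


def check_tcp(host, port, timeout=10.0):
    """Try a TCP connect and return (state, message) in the style of the alerts."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return ("OK", f"TCP OK - {elapsed:.3f} second response time "
                          f"on {host} port {port}")
    except socket.timeout:
        # No answer at all within the timeout (compare the mw1261 alerts).
        return ("CRITICAL", f"CRITICAL - Socket timeout after {timeout:g} seconds")
    except OSError as exc:
        # Covers "Connection refused" while the service is restarting.
        reason = exc.strerror or str(exc)
        return ("CRITICAL", f"connect to address {host} and port {port}: {reason}")
```

A call such as `check_tcp("10.192.32.139", 9042)` would only be meaningful from a host that can reach that network; the address is taken from the log purely as an example.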
[20:43:21] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.116 and port 9042: Connection refused [20:44:22] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042 [20:45:42] (03PS1) 10Andrew Bogott: Keystone 2fa: Use the wikitech API rather than checking the db directly. [puppet] - 10https://gerrit.wikimedia.org/r/345231 [20:55:51] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.004 second response time [20:58:01] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time [20:58:41] RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:59:46] ^ madhuvishy can you look at showmount for tools checker? [20:59:50] that's concerning [21:01:21] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.002 second response time [21:03:13] (03PS2) 10Andrew Bogott: Keystone 2fa: Use the wikitech API rather than checking the db directly. 
[puppet] - 10https://gerrit.wikimedia.org/r/345231 [21:03:31] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.003 second response time [21:03:45] chasemp: aah ['/sbin/showmount', '-e', 'labstore.svc.eqiad.wmnet'], [21:04:04] labstore.svc must be down since we took apart the boxes [21:04:13] that is checking the old setup [21:04:15] also it's checking the wrong thing [21:04:16] yes [21:04:42] madhuvishy: well uh, want to correct that check to say "secondary_cluster_showmount" or something and make it hit the cluster ip? [21:05:01] yeah [21:05:18] nfs-tools-project.svc.eqiad.wmnet [21:05:41] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.004 second response time [21:05:49] * madhuvishy goes to silence [21:06:03] ah thanks [21:07:22] (03CR) 10Andrew Bogott: "Tested on labtest, seems to work fine." [puppet] - 10https://gerrit.wikimedia.org/r/345231 (owner: 10Andrew Bogott) [21:08:18] !log T111113: Restarting Cassandra instances, eqiad row 'a' {{done}} [21:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:24] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [21:08:29] !log T111113: Restarting Cassandra instances, eqiad row 'b' [21:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:23] !log upgraded nova-compute on labvirt1014 because it contains a long-awaited bugfix [21:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:49] andrewbogott: slow restart fix? [21:19:56] yep! 
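Editor's note on the showmount discussion above: the failing check ran `['/sbin/showmount', '-e', 'labstore.svc.eqiad.wmnet']` against a service name that no longer existed, and the agreed fix was to point it at the new cluster address, nfs-tools-project.svc.eqiad.wmnet. The sketch below shows one way such a check could be parameterized by host. It is not the toolschecker code; the function name, the injectable `run` argument, and the timeout are assumptions for illustration.

```python
import subprocess

SHOWMOUNT = "/sbin/showmount"  # path as quoted in the log


def showmount_ok(host, run=subprocess.run, timeout=5):
    """Return True if `showmount -e <host>` exits with status 0.

    `run` is injectable so the check can be exercised without an NFS server.
    """
    try:
        result = run(
            [SHOWMOUNT, "-e", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, or the server did not answer in time.
        return False
    return result.returncode == 0
```

Keeping the hostname as a parameter means a change like the one discussed here only touches the caller, for example `showmount_ok("nfs-tools-project.svc.eqiad.wmnet")`.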
[21:20:06] Installing it on 1014 and on labtest, and we'll see if it blows up [21:20:10] should be a very minor upgrade though [21:20:52] (03PS1) 10Madhuvishy: toolschecker: Update nfs showmount check to test secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/345247 [21:21:04] andrewbogott: seems like we should roll through labtest first? [21:21:33] chasemp: yeah, doing both. Since 1014 isn't actively scheduled it's another safe test case. [21:21:39] fair point [21:23:56] (03CR) 10Rush: [C: 031] toolschecker: Update nfs showmount check to test secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/345247 (owner: 10Madhuvishy) [21:25:16] (03CR) 10Madhuvishy: [C: 032] toolschecker: Update nfs showmount check to test secondary cluster [puppet] - 10https://gerrit.wikimedia.org/r/345247 (owner: 10Madhuvishy) [21:26:11] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.203 and port 9042: Connection refused [21:27:11] RECOVERY - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is OK: TCP OK - 0.000 second response time on 10.64.32.203 port 9042 [21:33:52] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.078 second response time [21:37:11] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:51:16] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138577 (10Dzahn) [21:54:57] (03PS2) 10Dzahn: remove production IPs for ms-fe100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/345076 (https://phabricator.wikimedia.org/T160986) [21:55:06] !log T111113: Restarting Cassandra instances, eqiad row 'b' {{done}} [21:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:13] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [21:55:31] !log T111113: Restarting Cassandra instances, eqiad row 'd' [21:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:01] 06Operations, 10hardware-requests, 15User-fgiunchedi: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3138585 (10RobH) I've created sub-task T161634 in the private procurement space for quotes from Dell. We're going to order via the system manufacturer, since these... [21:56:20] (03CR) 10Dzahn: [C: 032] remove production IPs for ms-fe100[1-4] [dns] - 10https://gerrit.wikimedia.org/r/345076 (https://phabricator.wikimedia.org/T160986) (owner: 10Dzahn) [21:57:32] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138589 (10Dzahn) [21:59:47] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138592 (10Dzahn) a:05Dzahn>03RobH @Robh all steps done up to switch ports, per checked boxes above, could you disable the ports? 
thanks [22:05:11] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [22:07:11] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3138606 (10RobH) a:05RobH>03Cmjohnson So, in trying to find these ports, they are not set with a description on the switch. So @cmjohnson will have to manual... [22:09:57] (03PS2) 10Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) [22:11:13] (03CR) 10jerkins-bot: [V: 04-1] WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) (owner: 10Rush) [22:13:08] (03PS3) 10Dzahn: add new language "dty" (Doteli) [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) [22:13:21] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused [22:14:09] (03CR) 10Dzahn: [C: 032] add new language "dty" (Doteli) [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) (owner: 10Dzahn) [22:14:21] RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.136 port 9042 [22:15:36] (03CR) 10Dzahn: "http://www-01.sil.org/iso639-3/documentation.asp?id=dty" [dns] - 10https://gerrit.wikimedia.org/r/345077 (https://phabricator.wikimedia.org/T161529) (owner: 10Dzahn) [22:19:56] !log DNS - creating new language "dty" (T160865) - running "authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones" to trigger re-creation of zone files after change in langs.tmpl. 
(gerrit:345077) | https://www.ethnologue.com/language/dty [22:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:02] T160865: Create Wikipedia Khowar - https://phabricator.wikimedia.org/T160865 [22:20:15] meh, not that ticket [22:21:01] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused [22:21:58] !log DNS - creating new language "dty" (T161529) - running "authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsd checkconf && gdnsd reload-zones" to trigger re-creation of zone files after change in langs.tmpl. (gerrit:345077) | https://www.ethnologue.com/language/dty [22:22:01] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.138 port 9042 [22:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:04] T161529: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529 [22:24:26] (03PS3) 10Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) [22:25:17] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (10Dzahn) dty has been created in DNS: ``` ;; QUESTION SECTION: ;dty.wikipedia.org. IN A ;; ANSWER SECTION: dty.wikipedia.org. 600 IN A 208... [22:26:34] 06Operations, 10Pybal, 10Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (10BBlack) I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!)., What we shou... 
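Editor's note on the "meh, not that ticket" exchange above: the first !log entry cited T160865, the channel bot expanded it to "Create Wikipedia Khowar", and the entry was then re-logged with T161529. A small sketch of how task references in a log message could be pulled out and turned into links for a quick sanity check follows. The regular expression and helper names are assumptions; this is not the bot's actual implementation.

```python
import re

# A Phabricator task reference as it appears in the log: "T" followed by digits.
TASK_RE = re.compile(r"\bT(\d+)\b")


def task_ids(message):
    """Return task IDs mentioned in `message`, in order, without duplicates."""
    seen = []
    for match in TASK_RE.finditer(message):
        tid = "T" + match.group(1)
        if tid not in seen:
            seen.append(tid)
    return seen


def task_url(tid):
    """Build the task URL in the form the log itself uses."""
    return f"https://phabricator.wikimedia.org/{tid}"
```

Checking the expanded title against the intended task before relying on the entry is what caught the wrong number in this case.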
[22:31:16] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3138679 (10Dzahn) created Wikidata item: https://www.wikidata.org/wiki/Q29048035 [22:44:42] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3138702 (10Dzahn) left comment on https://blog.wikimedia.org/2015/11/04/doteli-wikipedia-makes-significant-progress/#comment-162812 [22:45:08] !log T111113: Restarting Cassandra instances, eqiad row 'd' {{done]} [22:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:15] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [22:56:55] 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138723 (10pmiazga) [22:58:49] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3138751 (10tstarling) Aaron, have you seen [[https://wikitech.wikimedia.org/wiki/Conftool|the conft... [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170328T2300). Please do the needful. [23:00:05] Krinkle and jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:53] 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138723 (10Dzahn) @pmiazga Hi, it seems you can ssh to bast1001.wikimedia.org. I see in logs "pam_unix(sshd:session): session opened for user pmiazga". 
Which server are you trying to connect to? Could you... [23:02:15] 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138757 (10Dzahn) Your wikitech password is not related to production shell access in any way. You have 2 seperate SSH keys, one for labs and one for production. [23:03:28] \o [23:03:29] who's deploying? [23:03:35] i only have 45 mins :) [23:03:45] 06Operations, 10Ops-Access-Requests: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3138763 (10bmansurov) [23:03:57] I can SWAT [23:04:00] o/ [23:05:10] (03PS2) 10Thcipriani: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson) [23:05:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson) [23:08:20] fyi thcipriani this will take a few minutes for me to test, so if you can and want to queue up Krinkle's patches while i do that go ahead [23:08:26] 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138792 (10pmiazga) ```raynor@DellE6540:~/.ssh » ssh -N notebook1001.eqiad.wmnet -L 8000:127.0.0.1:8000 -vvv 255 ↵ OpenSSH_7.4p1, OpenSSL 1.0.2k 2... 
[23:08:37] (while i test on mwdebug) [23:08:45] jdlrobson: sure, thanks for the heads-up [23:08:51] (03Merged) 10jenkins-bot: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson) [23:08:59] (03CR) 10jenkins-bot: Enable header version 2 on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345184 (https://phabricator.wikimedia.org/T160471) (owner: 10Jdlrobson) [23:09:24] 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138796 (10pmiazga) and my ssh config: ``` Host * UseRoaming no Host SmartServer hostname 192.168.202.2 port 22 user pi # WikiMedia Host gerrit.wikimedia.org IdentityFile /home/raynor/.ssh/id_rsa... [23:09:36] jdlrobson: your patch is on mwdebug1002, check please [23:12:04] 06Operations, 10Ops-Access-Requests: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3138763 (10Dzahn) @bmansurov Would you mind telling me the secret string of your committed identity? (in private email to dzahn@) [23:14:00] thcipriani: go for it [23:14:14] jdlrobson: alright, going live. [23:15:13] * jdlrobson crosses fingers for no cached html issues [23:16:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:345184|Enable header version 2 on all wikis]] T160471 (duration: 00m 45s) [23:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:37] T160471: Deploy new header to all wikis - https://phabricator.wikimedia.org/T160471 [23:16:55] ^ jdlrobson live everywhere [23:17:11] thcipriani: yay thanks doing my second round of testing [23:17:37] 06Operations, 10Ops-Access-Requests: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3138827 (10bmansurov) @Dzahn thanks for the prompt reply. I've emailed you. 
[23:18:05] Krinkle: you patch is live on mwdebug1002, check please [23:18:07] *your [23:18:10] thcipriani: Okay! [23:18:24] 06Operations, 10Ops-Access-Requests: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3138828 (10JoeWalsh) [23:19:41] thcipriani: verified. [23:19:44] 06Operations, 10Ops-Access-Requests: Production shell access - https://phabricator.wikimedia.org/T161658#3138842 (10Dzahn) Hi @pmiazga, the admin group you are currently in is called "deployment". This gives you access to deployment hosts, tin and mira, mediawiki maintenance hosts / appservers. And it's inte... [23:19:51] Krinkle: ok, going live. [23:20:26] much appreciated!!!!!! [23:20:29] thcipriani: all good here :) [23:20:43] jdlrobson: cool, thanks for the followup check [23:22:05] !log thcipriani@tin Synchronized php-1.29.0-wmf.17/extensions/NavigationTiming/modules/ext.navigationTiming.js: SWAT: [[gerrit:345078|ext.NavigationTiming: Restore unsampled Save Timing]] T161368 (duration: 00m 45s) [23:22:10] ^ Krinkle live everywhere [23:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:11] T161368: Frontend Save Timing broken (<1 metrics per minute) - https://phabricator.wikimedia.org/T161368 [23:24:05] thcipriani: OK. I see traffic recovering in Graphite, so all good. [23:24:20] https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=SaveTiming&from=now-15m&to=now& [23:25:04] cool. sounds good. [23:26:54] 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) - https://phabricator.wikimedia.org/T161658#3138858 (10Dzahn) [23:27:31] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:28:34] (03CR) 10Mobrovac: [C: 04-1] "Needs to be rebased and the use_proxy variable must be taken into account too." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345158 (owner: 10Giuseppe Lavagetto) [23:28:39] 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) - https://phabricator.wikimedia.org/T161658#3138723 (10Dzahn) You can also just recycle this ticket and turn it into a [[ https://wikitech.wikimedia.org/wiki/Production_shell_access#Additional_permissions_for... [23:29:11] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 05codfw-rollout: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#3138873 (10Eevans) [23:40:25] (03PS1) 10Dzahn: admin: update SSH key for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/345267 (https://phabricator.wikimedia.org/T161660) [23:42:47] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3138921 (10Krinkle) >>! In T123728#3135277, @Ottomata wrote: >>>! In T123728#3135227, @Krinkle wrote: >> Both Statsv and Graphite have no concept of time for incoming data, everything is "no... [23:45:12] (03PS2) 10Krinkle: Reapply "labtest hiera: use labtestwikitech, not wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/331636 (https://phabricator.wikimedia.org/T145808) (owner: 10Alex Monk) [23:55:48] mutante remember the gaga thing you were talking about, well go to 1:15 on https://www.youtube.com/watch?v=px593VRs7vk [23:55:50] woops [23:55:54] wrong place [23:56:26] haha, ok [23:56:31] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures