[00:27:11] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:49:53] Operations, Ops-Access-Requests, Patch-For-Review: Granting wmde group access to grafana-admin.wikimedia.org - https://phabricator.wikimedia.org/T161484#3132935 (Dzahn) It has been said in ops meeting that NDA is needed for grafana-admin. It looks like we'll have to start that process with all the gr...
[00:56:11] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[00:57:13] !log Removing upload.wikimedia.org/index.html ("swift delete root index.html") from both eqiad/codfw
[00:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:02] (PS7) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[00:58:38] Operations, Ops-Access-Requests: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3138828 (Dzahn) @Fjalapeno Could you approve this request? @JoeWalsh ok, looks good. I see L3 is already signed by you and since it's extending existing access there is not that much else...
[01:01:13] Operations, Ops-Access-Requests, Patch-For-Review: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3139099 (Dzahn) @bmansurov Thanks, i received your mail but could not generate that hash yet. I replied for details.
[01:07:35] Operations, ops-codfw, hardware-requests, Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3139110 (Dzahn) a:Papaul>None @papaul all wipes done and servers are shut down? If yea, please assign to Rob for switch ports. @Robh there is https://gerrit.wi...
[01:07:58] Operations, ops-codfw, hardware-requests, Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3139112 (Dzahn) a:Papaul
[01:16:21] !log rsyncing librenms/torrus/smokeping app data from netmon1001 to gerrit2001. adding alias "syncit" to do it all at once (T125020)
[01:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:27] T125020: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020
[01:24:28] Operations, Patch-For-Review, User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3139119 (Dzahn) @fgiunchedi To rsync app data tomorrow you can simply type `syncit` on netmon1001 now. The long form is: ``` rsync -avp /var/lib/librenms/ rsync://gerrit2001.wik...
[01:29:51] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 82874.310199 Seconds
[01:29:52] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82874.314241 Seconds
[01:29:52] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82874.32967 Seconds
[01:30:11] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 82217.537851 Seconds
[01:30:31] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82242.865275 Seconds
[01:30:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 82242.876652 Seconds
[01:31:11] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:33:35] (CR) VolkerE: [C: -1] errorpages: Restyle 404.php to be like other error pages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:34:11] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 82457.566895 Seconds
[01:35:06] (PS8) Krinkle: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114)
[01:35:12] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[01:35:43] (CR) VolkerE: [C: 1] errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:39:08] (CR) Krinkle: [C: 2] errorpages: Restyle 404.php to be like other error pages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:40:16] (Merged) jenkins-bot: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:40:25] (CR) jenkins-bot: errorpages: Restyle 404.php to be like other error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/343819 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[01:41:53] !log krinkle@tin Synchronized errorpages/404.php: Match 404.html and default.html - Id58e25afbe (duration: 00m 44s)
[01:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:44] (PS1) Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114)
[01:52:30] (PS2) Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114)
[01:53:11] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:55:01] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 17.140982 Seconds
[01:55:01] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 17.150795 Seconds
[01:55:01] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 17.183476 Seconds
[01:55:31] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 40.608706 Seconds
[01:55:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 40.821484 Seconds
[01:59:11] Operations, Traffic: Configure varnish to use "Unconfigured domain" page for 404 Not Served (instead of generic error) - https://phabricator.wikimedia.org/T112316#3139133 (Krinkle)
[02:35:39] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 13m 41s)
[02:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:53:21] (CR) Dzahn: url_downloader: convert to profile/role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/344729 (owner: Dzahn)
[02:53:39] (PS5) Dzahn: url_downloader: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344729
[02:55:14] (CR) Dzahn: url_downloader: convert to profile/role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344729 (owner: Dzahn)
[02:55:26] (PS6) Dzahn: url_downloader: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344729
[03:08:44] (PS1) Andrew Bogott: Nova: Remove wikistatus callbacks and support code. [puppet] - https://gerrit.wikimedia.org/r/345275
[03:09:24] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 14m 55s)
[03:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:23] (CR) Andrew Bogott: "This will need some tests with the puppet compiler." [puppet] - https://gerrit.wikimedia.org/r/345275 (owner: Andrew Bogott)
[03:15:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 29 03:15:16 UTC 2017 (duration 5m 53s)
[03:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:00:01] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:14:41] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:18:45] (CR) Aude: "If this is what we want, then ok, but I think the problem with the query is the 'all' option." [mediawiki-config] - https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887) (owner: Daniel Kinzler)
[04:27:41] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:29:01] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[04:38:18] (CR) Aude: "see also my comment in the ticket" [mediawiki-config] - https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887) (owner: Daniel Kinzler)
[04:42:41] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:46:27] (PS2) BryanDavis: Nova: Remove wikistatus callbacks and support code. [puppet] - https://gerrit.wikimedia.org/r/345275 (https://phabricator.wikimedia.org/T161662) (owner: Andrew Bogott)
[04:50:01] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.423 second response time
[04:55:02] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.866 second response time
[04:55:41] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[05:13:08] Operations, MediaWiki-Configuration, MediaWiki-Platform-Team, Performance-Team, and 7 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3139228 (tstarling)
[05:23:41] PROBLEM - puppet last run on poolcounter1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:51:41] RECOVERY - puppet last run on poolcounter1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:01:45] !log Keep converting UNIQUE keys to PK on s4 - db1091 - T17441
[06:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:55] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[06:12:11] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:15:26] Operations, Pybal, Traffic: pybal doesn't fully manage LVS table leaving stale services (on IP change) - https://phabricator.wikimedia.org/T114104#1684739 (Joe) The real solution for this is to dedicate real developer time to pybal to move it to use a FSM and a netlink-based python ipvs client. All...
[06:25:57] Puppet, Beta-Cluster-Infrastructure, Labs, Release-Engineering-Team, User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139285 (Joe)
[06:27:06] Puppet, Beta-Cluster-Infrastructure, Labs, Release-Engineering-Team, User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139297 (Joe)
[06:27:21] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1150 MB (2% inode=90%)
[06:32:01] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[06:40:11] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:40:51] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[06:41:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[06:49:51] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:50:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:59:01] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[07:13:17] Operations, Performance-Team, Thumbor: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3139356 (Gilles)
[07:14:11] Operations, Performance-Team, Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3139359 (Peter) a:Peter>None
[07:15:25] (PS6) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158
[07:15:37] (CR) Giuseppe Lavagetto: [C: 1] service::node: refactor configuration, allow use of confd for scap3 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/345158 (owner: Giuseppe Lavagetto)
[07:17:09] (PS2) Muehlenhoff: Uninstall eject on jessie onwards [puppet] - https://gerrit.wikimedia.org/r/345183
[07:19:29] (CR) Muehlenhoff: [C: 2] Uninstall eject on jessie onwards [puppet] - https://gerrit.wikimedia.org/r/345183 (owner: Muehlenhoff)
[07:26:08] (PS1) Gilles: Upgrade to 0.1.37 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/345283 (https://phabricator.wikimedia.org/T151553)
[07:32:21] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/2/0: down - Transit: NTT (service ID 253066) {#11376} [10Gbps]
[07:33:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[07:34:21] (CR) VolkerE: errorpages: Restyle 503/php-fatal error to match Varnish error (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[07:42:12] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 32 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:50:43] (PS1) Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/345285 (https://phabricator.wikimedia.org/T17441)
[07:52:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[07:52:21] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:52:30] (PS4) Ema: varnish: swap around backend ttl cap and keep values [1/2] [puppet] - https://gerrit.wikimedia.org/r/343844
[07:52:53] (CR) Ema: [V: 2 C: 2] varnish: swap around backend ttl cap and keep values [1/2] [puppet] - https://gerrit.wikimedia.org/r/343844 (owner: Ema)
[07:53:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/345285 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[07:54:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/345285 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[07:54:34] (CR) jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/345285 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[07:55:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 - T17441 (duration: 00m 54s)
[07:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:49] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[07:56:11] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:57:13] !log Convert s6 UNIQUE keys into PK on db1093 - T17441
[07:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:20] Operations, HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3139527 (MoritzMuehlenhoff)
[08:11:39] moritzm: sigh --^
[08:11:52] I hoped that the problem would have gone away
[08:11:56] it's a source of neverending fun :-)
[08:12:17] <_joe_> moritzm: told you upgrading HHVM is always the merrier experience
[08:12:48] <_joe_> we should seriously consider exploring a php7 migration
[08:13:18] it's much worse with 3.18, I'm sure the traces we've logged in SAL are just the tip of the iceberg, but with 3.18 it deadlocks after at most 6 hours...
[08:13:47] _joe_: yeah, but that's for later. I'll report this to upstream in a bit
[08:15:39] <_joe_> moritzm: 3.3 -> 3.6 was much more awful than this, trust me
[08:16:04] <_joe_> 3.6 => 3.12 was way less painful tbh
[08:25:11] (CR) Volans: "Minor comments inline, looks good otherwise. Better to check the compiler to ensure it's a noop for now (not using discovery) and then the" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/345158 (owner: Giuseppe Lavagetto)
[08:25:11] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:29:27] !log upgrading ssl cert rendering.svc.codfw.wmnet to include the new discovery endpoints
[08:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:21] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1653 MB (3% inode=90%)
[08:34:35] subbu, mobrovac: ruthenium space ^^^^ there are 32GB of /srv/visualdiff/pngs
[08:39:24] !log apt.w.o: set digest-algo to sha256 in gpg.conf T132325
[08:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:30] T132325: Weak digest algorithm (SHA1) used to sign InRelease on apt.wikimedia.org - https://phabricator.wikimedia.org/T132325
[08:40:13] Operations: Weak digest algorithm (SHA1) used to sign InRelease on apt.wikimedia.org - https://phabricator.wikimedia.org/T132325#3139569 (ema) Support for signatures using SHA1 has been disabled altogether starting with apt 1.4~beta1: `W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/jessie-wikim...
[08:45:16] (CR) Ema: [V: 2 C: 2] tlsproxy: simplify prometheus metrics gathering [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: Ema)
[08:45:23] (PS4) Ema: tlsproxy: simplify prometheus metrics gathering [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101)
[08:45:28] (CR) Ema: [V: 2 C: 2] tlsproxy: simplify prometheus metrics gathering [puppet] - https://gerrit.wikimedia.org/r/345123 (https://phabricator.wikimedia.org/T161101) (owner: Ema)
[08:49:01] (PS1) Filippo Giunchedi: swift: add ms-be1028 -> ms-be1039 [puppet] - https://gerrit.wikimedia.org/r/345290 (https://phabricator.wikimedia.org/T160640)
[08:54:37] !log upgrading twisted to 16.2.0 on lvs200[456] (codfw secondaries) T160433
[08:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:46] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433
[09:07:44] (PS1) Elukey: Update rendering.svc.codfw.crt to include discovery endpoints [puppet] - https://gerrit.wikimedia.org/r/345291
[09:08:11] Operations, HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3139596 (MoritzMuehlenhoff) Reported upstream at https://github.com/facebook/hhvm/issues/7756
[09:12:04] (CR) Hoo man: "What is supposed to happen once we hit that limit? Are the processes going to die?" [puppet] - https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577) (owner: EBernhardson)
[09:31:30] (CR) Volans: [C: 1] "LGTM, it adds the imagescaler-r{o,w}.discovery.wmnet to the SAN keeping the rest as is." [puppet] - https://gerrit.wikimedia.org/r/345291 (owner: Elukey)
[09:32:29] Puppet, Continuous-Integration-Infrastructure: Need a better way of testing puppet patches for contint/integration stuff - https://phabricator.wikimedia.org/T126370#3139613 (hashar) There is a related task to add Puppet environments to #beta-cluster : {T161675}
[09:37:59] (CR) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 (4 comments) [puppet] - https://gerrit.wikimedia.org/r/345158 (owner: Giuseppe Lavagetto)
[09:39:01] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:39:32] !log upgrading twisted to 16.2.0 on lvs200[123] (codfw primaries) T160433
[09:39:37] (PS7) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158
[09:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:39] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433
[09:39:39] (PS4) Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193
[09:41:03] Puppet, Beta-Cluster-Infrastructure, Labs, Release-Engineering-Team, User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139285 (hashar) > We might want to have all nodes in this environment derive from a base node that includes all the labs b...
[09:45:49] (CR) Elukey: [C: 2] Update rendering.svc.codfw.crt to include discovery endpoints [puppet] - https://gerrit.wikimedia.org/r/345291 (owner: Elukey)
[09:46:01] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:50:02] Puppet, Beta-Cluster-Infrastructure, Labs, Release-Engineering-Team, User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139659 (Joe) @hashar the point is to have something that resembles production, including the role-based hiera lookup. It...
[09:57:01] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[10:04:31] (CR) Giuseppe Lavagetto: [C: 2] service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158 (owner: Giuseppe Lavagetto)
[10:04:35] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/5948/" [puppet] - https://gerrit.wikimedia.org/r/345158 (owner: Giuseppe Lavagetto)
[10:04:43] (PS8) Giuseppe Lavagetto: service::node: refactor configuration, allow use of confd for scap3 [puppet] - https://gerrit.wikimedia.org/r/345158
[10:07:50] Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3139712 (MoritzMuehlenhoff) Unfortunately the copy or copysrc commands in reprepro don't support copying between components, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=496347#44, so copying betwee...
[10:07:54] (PS1) Volans: Fix database selection in tendril task [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178)
[10:11:13] (CR) Jcrespo: "I do not like normally out-of-db changes (they make STATEMENT based replication filters break, and eventually, replication to break), but " [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[10:11:37] !log upgrading ssl cert api.svc.codfw.wmnet to include the new discovery endpoints
[10:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:01] !log emptying /srv/log/parsoid/main.log.1 (3.2G!) on ruthenium to reclaim some disk space
[10:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:17] (PS2) Muehlenhoff: Change email address for moushira [puppet] - https://gerrit.wikimedia.org/r/342812
[10:13:14] Operations, DNS, Discovery, Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3139736 (grin) >>! In T161256#3138330, @Peachey88 wrote: >>>! In T161256#3138272, @grin wrote: >> I would expect some back...
[10:13:21] RECOVERY - Disk space on ruthenium is OK: DISK OK
[10:15:36] (CR) Muehlenhoff: [C: 2] Change email address for moushira [puppet] - https://gerrit.wikimedia.org/r/342812 (owner: Muehlenhoff)
[10:15:58] (PS2) Volans: Fix database selection in tendril task [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178)
[10:18:11] (PS3) Volans: Fix database selection in tendril task [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178)
[10:18:19] (CR) Jcrespo: [C: 1] Fix database selection in tendril task [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[10:18:36] (CR) Volans: "> I do not like normally out-of-db changes (they make STATEMENT based" [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[10:19:09] (CR) Jcrespo: [C: 1] Fix database selection in tendril task [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[10:19:34] thanks jaime! also super quick :)
[10:20:22] (CR) Volans: [C: 2] Fix database selection in tendril task [switchdc] - https://gerrit.wikimedia.org/r/345298 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[10:24:01] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:26:50] Operations, Patch-For-Review, User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#1972424 (ops-monitoring-bot) Script wmf_auto_reimage was launched by filippo on neodymium.eqiad.wmnet for hosts: ``` ['netmon1001.wikimedia.org'] ``` The log can be found in `/var/l...
[10:27:01] !log reimage netmon1001 with jessie
[10:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:15] (PS1) Elukey: Update api.svc.codfw.crt to include discovery endpoints [puppet] - https://gerrit.wikimedia.org/r/345300
[10:36:12] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:36:43] Operations, MediaWiki-Configuration, MediaWiki-Platform-Team, Performance-Team, and 7 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3139759 (Joe) >>! In T156924#3139186, @tstarling wrote: >>>! In T156924#3126417, @aaron wr...
[10:39:39] (PS5) Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193
[10:41:31] (CR) Elukey: [C: 2] "Checked SANs and the only diff is api-r{wo}.discovery.wmnet. Checked also if the key in the private repo matches with this cert." [puppet] - https://gerrit.wikimedia.org/r/345300 (owner: Elukey)
[10:48:40] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/345301
[10:48:54] (CR) Marostegui: [C: -1] "Wait for the last alter to finish" [mediawiki-config] - https://gerrit.wikimedia.org/r/345301 (owner: Marostegui)
[10:52:01] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[10:53:01] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[10:55:43] (PS1) Hoo man: Use Zend php to create the Wikidata entity dumps [puppet] - https://gerrit.wikimedia.org/r/345303 (https://phabricator.wikimedia.org/T161577)
[10:55:45] Operations, Patch-For-Review, User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3139802 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['netmon1001.wikimedia.org'] ``` and were **ALL** successful.
[10:55:56] (PS7) Gilles: Enable memcache-based Thumbor broken thumbnail throttling [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065)
[10:56:15] (CR) Gilles: Enable memcache-based Thumbor broken thumbnail throttling (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: Gilles)
[10:57:16] (CR) jerkins-bot: [V: -1] Enable memcache-based Thumbor broken thumbnail throttling [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: Gilles)
[10:57:59] (PS8) Gilles: Enable memcache-based Thumbor broken thumbnail throttling [puppet] - https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065)
[10:58:04] (CR) ArielGlenn: [C: 2] Use Zend php to create the Wikidata entity dumps [puppet] - https://gerrit.wikimedia.org/r/345303 (https://phabricator.wikimedia.org/T161577) (owner: Hoo man)
[10:58:36] (PS6) Giuseppe Lavagetto: parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193
[10:58:48] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/345301 (owner: Marostegui)
[11:00:12] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/345301 (owner: Marostegui)
[11:00:21] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/345301 (owner: Marostegui)
[11:01:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 - T17441 (duration: 00m 44s)
[11:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:11] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[11:01:13] !log Linux 4.9 uploaded for jessie-wikimedia (along with new meta package linux-meta-4.9 and updated firmware)
[11:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:53] (PS1) Volans: Fix database selection for mysql lib [switchdc] - https://gerrit.wikimedia.org/r/345304 (https://phabricator.wikimedia.org/T160178)
[11:02:28] (CR) Volans: [C: 2] Fix database selection for mysql lib [switchdc] - https://gerrit.wikimedia.org/r/345304 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[11:03:18] !log upgrading ssl cert appservers.svc.codfw.wmnet to include the new discovery endpoints
[11:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:05] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[11:05:05] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:08:49] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/345306 (https://phabricator.wikimedia.org/T17441)
[11:09:00] (CR) Giuseppe Lavagetto: [C: 2] parsoid: make config management independent of service::node [puppet] - https://gerrit.wikimedia.org/r/345193 (owner: Giuseppe Lavagetto)
[11:11:48] (PS1) Volans: Fix query for tendril [switchdc] - https://gerrit.wikimedia.org/r/345308 (https://phabricator.wikimedia.org/T160178)
[11:12:54] (CR) Volans: [C: 2] Fix query for tendril [switchdc] - https://gerrit.wikimedia.org/r/345308 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[11:13:59] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/345306 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[11:14:29] (PS1) Elukey: Update appservers.svc.codfw.crt to include discovery endpoints [puppet] - https://gerrit.wikimedia.org/r/345309
[11:14:38] <_joe_> puppet on ruthenium is me, will fix later [11:15:13] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345306 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [11:16:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 - T17441 (duration: 00m 44s) [11:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [11:17:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345306 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [11:18:16] (03PS1) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [11:18:24] (03PS1) 10Muehlenhoff: Remove use of experimental apt source for servers which were used to test Linux 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/345311 [11:18:36] (03PS2) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [11:18:39] (03PS2) 10Elukey: Update appservers.svc.codfw.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345309 [11:20:25] (03PS2) 10Muehlenhoff: Remove use of experimental apt source for servers which were used to test Linux 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/345311 [11:23:09] (03CR) 10Muehlenhoff: [C: 032] Remove use of experimental apt source for servers which were used to test Linux 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/345311 (owner: 10Muehlenhoff) [11:23:11] (03CR) 10Elukey: [C: 032] Update appservers.svc.codfw.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345309 (owner: 10Elukey) [11:24:36] (03CR) 10Elukey: "rebase" 
[puppet] - 10https://gerrit.wikimedia.org/r/345309 (owner: 10Elukey) [11:24:43] (03PS3) 10Elukey: Update appservers.svc.codfw.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345309 [11:24:45] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time [11:24:59] moritzm: --^ [11:25:00] (03PS3) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [11:25:32] it's depooled, should recover soon [11:25:45] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.461 second response time [11:25:57] super [11:26:21] (03CR) 10Elukey: [C: 032] Update appservers.svc.codfw.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345309 (owner: 10Elukey) [11:30:30] !log Started a Wikidata JSON dump run on snapshot1007 using Zend (due to T161695). 
[11:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:35] T161695: Wikidata dump maintenance scripts cause HHVM to leak memory heavily - https://phabricator.wikimedia.org/T161695 [11:33:00] 06Operations, 05Prometheus-metrics-monitoring: prometheus-hhvm-exporter slightly spammy in syslog - https://phabricator.wikimedia.org/T161699#3139950 (10MoritzMuehlenhoff) [11:33:05] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:35:34] 06Operations, 10hardware-requests: Site: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3139970 (10akosiaris) [11:37:24] (03PS10) 10BBlack: varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [11:45:48] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3139989 (10akosiaris) [11:47:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 - T17441 (duration: 00m 44s) [11:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:13] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [11:51:21] (03PS4) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [11:52:16] 06Operations, 10hardware-requests: COFW: (2) hardware access request for ganeti - https://phabricator.wikimedia.org/T161701#3139994 (10akosiaris) [11:52:19] 06Operations, 10hardware-requests: EQIAD: (4) hardware access request for ganeti - https://phabricator.wikimedia.org/T161702#3140005 (10akosiaris) [11:54:09] (03PS1) 10Muehlenhoff: Install jessie systems with Linux 4.9 by default [puppet] - 10https://gerrit.wikimedia.org/r/345314 
(https://phabricator.wikimedia.org/T154934) [11:55:11] (03PS1) 10Gilles: Use proper proxy_next_upstream configuration for Thumbor's nginx [puppet] - 10https://gerrit.wikimedia.org/r/345315 (https://phabricator.wikimedia.org/T161613) [11:56:05] (03PS1) 10Giuseppe Lavagetto: parsoid: make config management independent of service::node (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/345316 [11:57:50] 06Operations, 06Performance-Team: Add performance-team contact group to private.git - https://phabricator.wikimedia.org/T161703#3140025 (10Gilles) [11:58:55] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:59:08] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid: make config management independent of service::node (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/345316 (owner: 10Giuseppe Lavagetto) [12:01:56] (03CR) 10Alexandros Kosiaris: "@yuvipanda," [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/343797 (owner: 10Yuvipanda) [12:14:39] 06Operations, 05Prometheus-metrics-monitoring: prometheus-hhvm-exporter slightly spammy in syslog - https://phabricator.wikimedia.org/T161699#3140054 (10fgiunchedi) [12:24:50] (03PS4) 10Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 [12:26:55] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:33:55] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:56] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:56] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:56] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:34:05] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:05] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:05] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:05] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:05] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:06] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:06] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:07] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:34:15] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:15] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:15] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:15] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:15] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:16] I guess backups still running [12:34:18] I will silence it [12:34:25] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:25] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:35] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:34:45] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:35:05] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:05] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:35:05] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:35:06] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:06] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:06] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:15] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:15] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:35:25] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:35:35] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:35:45] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:46] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:35:46] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:35:46] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:55] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [12:35:55] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [12:35:55] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK 
slave_sql_state Slave_SQL_Running: Yes [12:35:55] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [12:35:55] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:35:56] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [12:38:34] (03PS11) 10BBlack: varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) [12:39:56] 06Operations, 06Analytics-Kanban, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3140077 (10elukey) @Addshore: I am going to close this task but we might want to open another one to... [12:40:21] 06Operations, 06Analytics-Kanban, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113734 (10elukey) a:03elukey [12:40:52] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3140081 (10elukey) a:03elukey [12:41:09] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017036 (10elukey) a:05elukey>03None [12:41:37] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3140084 (10elukey) [12:43:46] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:48:46] (03PS5) 10Giuseppe Lavagetto: parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 [12:49:18] jouncebot: next [12:49:18] In 0 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T1300) [12:49:22] Perfect timing [12:51:12] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid: add ability to use confd to configure active/passive [puppet] - 10https://gerrit.wikimedia.org/r/345194 (owner: 10Giuseppe Lavagetto) [12:51:27] hashar: looks like there is no patches for swat today [12:51:37] unless Reedy has something in plan... [12:51:45] zeljkof: One very small patch :) [12:51:55] Reedy: :) [12:51:59] deploying yourself? [12:52:00] (03PS1) 10Reedy: Swap to use wfLoadExtension for VisualEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345320 (https://phabricator.wikimedia.org/T140852) [12:52:05] Yeah, I might aswell [12:52:19] cool [12:52:33] <_joe_> !log depooling wtp1001 to test puppet/confd transfer of responsibilities [12:52:35] (03PS2) 10Ema: bgp: log with util.log instead of printing to stdout [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/344659 [12:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:45] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:53:14] (03CR) 10Ema: bgp: log with util.log instead of printing to stdout (032 comments) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/344659 (owner: 10Ema) [12:53:17] !log reimage analytics1045 to Debian Jessie [12:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:35] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:58:11] (03PS1) 10Giuseppe Lavagetto: service::node::scap3: fix confd declaration [puppet] - 10https://gerrit.wikimedia.org/r/345322 [12:58:30] (03PS1) 10Reedy: Update Bytemark wikimedia mirror hostname [puppet] - 10https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) [12:58:32] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node::scap3: fix confd declaration [puppet] - 10https://gerrit.wikimedia.org/r/345322 (owner: 10Giuseppe Lavagetto) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T1300). Please do the needful. [13:02:54] (03PS1) 10Reedy: Add Bytemark to public_mirrors.html list [puppet] - 10https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) [13:03:03] jouncebot: now [13:03:03] For the next 0 hour(s) and 56 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T1300) [13:03:05] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:03:08] (03PS1) 10Giuseppe Lavagetto: service::node:config::scap3: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/345326 [13:03:27] (03CR) 10Reedy: [C: 032] Swap to use wfLoadExtension for VisualEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345320 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [13:04:05] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3140160 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['ana... 
[13:04:17] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node:config::scap3: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/345326 (owner: 10Giuseppe Lavagetto) [13:04:34] (03PS3) 10Ema: bgp: log with util.log instead of printing to stdout [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/344659 [13:06:36] (03Merged) 10jenkins-bot: Swap to use wfLoadExtension for VisualEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345320 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [13:07:26] (03CR) 10jenkins-bot: Swap to use wfLoadExtension for VisualEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345320 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [13:08:05] !log reedy@tin Synchronized wmf-config/CommonSettings.php: use wfLoadExtension for VisualEditor (duration: 00m 44s) [13:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:45] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:09:46] 06Operations, 10MediaWiki-Configuration, 10MediaWiki-Platform-Team, 06Performance-Team, and 7 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3140180 (10jcrespo) I need wmfDataCenter on etcd if it is going to disappear from puppet (wh... [13:11:46] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:14:25] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 648433 [13:15:53] (03PS4) 10Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) [13:17:02] (03CR) 10Jforrester: "Oops, I thought we'd done this ages ago." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/345320 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [13:19:40] (03PS1) 10Giuseppe Lavagetto: service::node::config::scap3: further tweaks [puppet] - 10https://gerrit.wikimedia.org/r/345331 [13:20:35] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:20:51] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3140220 (10Dzahn) [13:21:54] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#2237817 (10Dzahn) netmon1001 has been reinstalled (thanks godog!) current count:** 0** :) goal has been reached [13:22:14] <_joe_> mutante: \o/ [13:22:19] <_joe_> kudos for the great job [13:22:29] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node::config::scap3: further tweaks [puppet] - 10https://gerrit.wikimedia.org/r/345331 (owner: 10Giuseppe Lavagetto) [13:22:34] \o/ \o/ [13:23:01] _joe_: :)) [13:23:11] thanks godog for the last one [13:23:23] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3140230 (10Dzahn) 05Open>03Resolved [13:23:35] aye, almost fully back up [13:23:39] <_joe_> win 27 [13:23:53] :) can't wait to remove precise code snippets [13:24:02] \o/ for 0 precise [13:24:12] (03PS13) 10Elukey: Complete the role memcached refactor in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [13:24:49] (03PS2) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::controller role [puppet] - 10https://gerrit.wikimedia.org/r/344614 [13:28:35] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3140234 (10ops-monitoring-bot) Completed 
auto-reimage of hosts: ``` ['analytics1045.eqiad.wmnet'] ``` and were **ALL** suc... [13:32:15] 06Operations, 10Continuous-Integration-Infrastructure: (Nodepool) CI is really slow tonight - https://phabricator.wikimedia.org/T155444#3140242 (10hashar) [13:32:45] (03PS1) 10Giuseppe Lavagetto: parsoid: fix key lookup in etcd [puppet] - 10https://gerrit.wikimedia.org/r/345335 [13:33:16] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] parsoid: fix key lookup in etcd [puppet] - 10https://gerrit.wikimedia.org/r/345335 (owner: 10Giuseppe Lavagetto) [13:33:32] (03PS3) 10Muehlenhoff: Adapt debdeploy grain to rename of nova::controller role [puppet] - 10https://gerrit.wikimedia.org/r/344614 [13:33:43] <_joe_> sorry moritzm [13:34:09] nah, all fine [13:34:15] nice work mutante and godog! \o/ [13:35:46] \o/ thanks elukey [13:38:11] (03PS1) 10Filippo Giunchedi: netmon: post jessie reimage fixes [puppet] - 10https://gerrit.wikimedia.org/r/345337 (https://phabricator.wikimedia.org/T125020) [13:38:45] (03CR) 10Muehlenhoff: [C: 032] Adapt debdeploy grain to rename of nova::controller role [puppet] - 10https://gerrit.wikimedia.org/r/344614 (owner: 10Muehlenhoff) [13:40:42] (03PS2) 10Filippo Giunchedi: netmon: post jessie reimage fixes [puppet] - 10https://gerrit.wikimedia.org/r/345337 (https://phabricator.wikimedia.org/T125020) [13:40:44] (03PS2) 10Filippo Giunchedi: swift: add ms-be1028 -> ms-be1039 [puppet] - 10https://gerrit.wikimedia.org/r/345290 (https://phabricator.wikimedia.org/T160640) [13:41:29] (03CR) 10Elukey: "Looks good from https://puppet-compiler.wmflabs.org/5957/" [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [13:42:38] (03PS1) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 [13:45:00] (03CR) 10Filippo Giunchedi: [C: 032] netmon: post jessie reimage fixes [puppet] - 10https://gerrit.wikimedia.org/r/345337 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [13:45:05] (03PS3) 10Filippo 
Giunchedi: netmon: post jessie reimage fixes [puppet] - 10https://gerrit.wikimedia.org/r/345337 (https://phabricator.wikimedia.org/T125020) [13:45:10] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] netmon: post jessie reimage fixes [puppet] - 10https://gerrit.wikimedia.org/r/345337 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [13:49:00] !log upgrading ssl cert rendering.svc.eqiad.wmnet to include the new discovery endpoints [13:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:51] (03PS1) 10Ema: New release: 1.14 [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/345340 (https://phabricator.wikimedia.org/T82747) [13:54:19] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345341 [13:54:22] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345341 [13:55:37] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3140291 (10EddieGP) [13:56:21] (03PS3) 10Filippo Giunchedi: swift: add ms-be1028 -> ms-be1039 [puppet] - 10https://gerrit.wikimedia.org/r/345290 (https://phabricator.wikimedia.org/T160640) [13:57:22] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345341 (owner: 10Marostegui) [13:58:53] (03PS1) 10Steinsplitter: Set wmgULSAnonCanChangeLanguage true for commonswiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/345342 [13:58:54] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [13:58:54] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [13:59:09] (03CR) 10Filippo Giunchedi: [C: 032] swift: add ms-be1028 -> ms-be1039 [puppet] - 10https://gerrit.wikimedia.org/r/345290 (https://phabricator.wikimedia.org/T160640) (owner: 10Filippo Giunchedi) [13:59:14] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [13:59:14] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [13:59:14] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [13:59:34] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: Test Get pagecounts returned the unexpected status 404 (expecting: 200) [13:59:56] elukey: ^ known? 
[14:00:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345341 (owner: 10Marostegui) [14:00:13] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345341 (owner: 10Marostegui) [14:00:34] PROBLEM - confd service on wtp1002 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [14:01:00] godog: ouch this one is probably my team working on the new keyspace [14:01:07] (03PS1) 10Giuseppe Lavagetto: service::node::config::scap3: exec reload as the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/345344 [14:01:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 - T17441 (duration: 00m 44s) [14:01:09] joal --^ [14:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:14] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [14:01:39] (03PS1) 10Elukey: Update rendering.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345345 [14:02:07] (03PS2) 10Steinsplitter: Set wmgULSAnonCanChangeLanguage true for commonswiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345342 [14:02:26] (03PS2) 10Giuseppe Lavagetto: service::node::config::scap3: exec reload as the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/345344 [14:02:42] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node::config::scap3: exec reload as the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/345344 (owner: 10Giuseppe Lavagetto) [14:02:51] (03PS3) 10Steinsplitter: Set wmgULSAnonCanChangeLanguage true for commonswiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/345342 [14:03:20] (03PS1) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:03:31] godog: keyspace has been truncated during some work, we are working on it [14:03:34] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [14:03:54] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [14:03:54] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [14:04:06] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3140321 (10Cmjohnson) I mentioned this to Chase in PM on IRC...there is not a labs-support vlan in row B currently. [14:04:14] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [14:04:14] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [14:04:14] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [14:04:50] (03PS4) 10Steinsplitter: Set wmgULSAnonCanChangeLanguage true for commonswiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/345342 (https://phabricator.wikimedia.org/T161517) [14:04:54] (03CR) 10Elukey: [C: 032] Update rendering.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345345 (owner: 10Elukey) [14:05:00] (03PS2) 10Elukey: Update rendering.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345345 [14:05:34] RECOVERY - confd service on wtp1002 is OK: OK - confd is active [14:05:38] <_joe_> oh jeez I forgot another thing [14:06:07] (03CR) 10Elukey: [V: 032 C: 032] Update rendering.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345345 (owner: 10Elukey) [14:06:35] (03PS2) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:06:47] (03PS1) 10Volans: Avoid naming clash with conftool [switchdc] - 10https://gerrit.wikimedia.org/r/345349 (https://phabricator.wikimedia.org/T160178) [14:07:44] (03PS1) 10Giuseppe Lavagetto: service::node::config::scap3: fix path of executable [puppet] - 10https://gerrit.wikimedia.org/r/345350 [14:08:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] service::node::config::scap3: fix path of executable [puppet] - 10https://gerrit.wikimedia.org/r/345350 (owner: 10Giuseppe Lavagetto) [14:09:04] (03CR) 10jerkins-bot: [V: 04-1] [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [14:10:03] (03PS1) 10Alexandros Kosiaris: servermon: Specify USE_X_FORWARDED_HOST and ALLOWED_HOSTS [puppet] - 10https://gerrit.wikimedia.org/r/345351 [14:10:34] (03PS3) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:11:42] (03CR) 10Gehel: "Looking at T161577, it seems that we have a better solution 
with https://gerrit.wikimedia.org/r/#/c/345303/. Should we drop this? It still" [puppet] - 10https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577) (owner: 10EBernhardson) [14:12:01] (03CR) 10Volans: [C: 032] Avoid naming clash with conftool [switchdc] - 10https://gerrit.wikimedia.org/r/345349 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:13:30] (03CR) 10BBlack: [C: 031] "PS11 compiler outputs (12 hosts: all 4x clusters, 3 hosts each: eqiad, codfw, and 1x cache-only dc): https://puppet-compiler.wmflabs.org/5" [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [14:14:43] !log disabling puppet on labs hosts for a staged rollout of https://gerrit.wikimedia.org/r/#/c/345275/ [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:16] (03PS3) 10Andrew Bogott: Nova: Remove wikistatus callbacks and support code. [puppet] - 10https://gerrit.wikimedia.org/r/345275 (https://phabricator.wikimedia.org/T161662) [14:16:14] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Specify USE_X_FORWARDED_HOST and ALLOWED_HOSTS [puppet] - 10https://gerrit.wikimedia.org/r/345351 (owner: 10Alexandros Kosiaris) [14:16:22] (03PS2) 10Alexandros Kosiaris: servermon: Specify USE_X_FORWARDED_HOST and ALLOWED_HOSTS [puppet] - 10https://gerrit.wikimedia.org/r/345351 [14:16:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] servermon: Specify USE_X_FORWARDED_HOST and ALLOWED_HOSTS [puppet] - 10https://gerrit.wikimedia.org/r/345351 (owner: 10Alexandros Kosiaris) [14:17:54] (03CR) 10Andrew Bogott: [C: 032] Nova: Remove wikistatus callbacks and support code. [puppet] - 10https://gerrit.wikimedia.org/r/345275 (https://phabricator.wikimedia.org/T161662) (owner: 10Andrew Bogott) [14:18:02] (03PS4) 10Andrew Bogott: Nova: Remove wikistatus callbacks and support code. 
[puppet] - 10https://gerrit.wikimedia.org/r/345275 (https://phabricator.wikimedia.org/T161662) [14:18:48] akosiaris: , yt? [14:18:56] ottomata: yup [14:19:22] i'm trying to build a backport package that has some tests where it tries to create a directory in in /dev/shm [14:19:32] from what I can tell, on most of our jessie boxes [14:19:40] /run/shm is a symlink to /dev/shm [14:19:49] but, in the chroot from pdebbuild [14:20:01] /run/shm is the actual tmpfs mount [14:20:05] DEB_BUILD_OPTIONS=nocheck [14:20:08] (03PS1) 10Volans: Fix import path for mediawiki [switchdc] - 10https://gerrit.wikimedia.org/r/345355 (https://phabricator.wikimedia.org/T160178) [14:20:11] and /dev/hsm is a normal directory [14:20:15] akosiaris: that just skips tests? [14:20:18] yes [14:20:27] ha, ok [14:20:29] will try [14:20:34] :) [14:20:39] but, just curious, why would the chroot mount shm stuff differently? [14:21:19] why pdebuilder would do things differently that systemd ? [14:21:26] pbuilder* [14:21:32] can't say I know [14:21:59] ha, ok. its the same on trusty too [14:22:05] just not in the chroot [14:22:14] ooo i think DEB_BUILD_OPTIONS=nocheck worked! [14:22:15] thank you! [14:22:22] (03PS1) 10Dzahn: DHCP: remove backup4001 (does not exist, precise) [puppet] - 10https://gerrit.wikimedia.org/r/345356 [14:22:23] yw [14:22:37] (03CR) 10Volans: [C: 032] Fix import path for mediawiki [switchdc] - 10https://gerrit.wikimedia.org/r/345355 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:22:52] mutante: what do you mean "does not exist"? [14:23:20] that's the fr box, no ? [14:23:29] (03CR) 10Faidon Liambotis: [C: 04-2] "backup4001 does exist; not sure if it's still precise." [puppet] - 10https://gerrit.wikimedia.org/r/345356 (owner: 10Dzahn) [14:23:32] yes it is [14:23:38] * akosiaris proud he remembered it this time around [14:24:12] paravoid: i could not find it because it's called "frbackup4001" in other places [14:24:28] ah this has happened ? nice! 
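Note on the 14:19-14:22 exchange about /run/shm and /dev/shm: the difference described there (a symlink on the host, a real tmpfs mount or a plain directory inside the pbuilder chroot) can be checked directly. The following is a minimal sketch, not something that was run in the channel; the helper name `classify_path` is hypothetical and only Python standard-library calls are used.

```python
import os


def classify_path(path):
    """Classify a path as 'symlink', 'missing', 'mount', 'directory' or 'other'.

    The symlink test comes first on purpose: os.path.exists() and
    os.path.isdir() follow symlinks, so checking them first would hide
    the symlink-vs-real-directory difference discussed in the log.
    """
    if os.path.islink(path):
        return "symlink"
    if not os.path.exists(path):
        return "missing"
    if os.path.ismount(path):
        return "mount"
    if os.path.isdir(path):
        return "directory"
    return "other"


if __name__ == "__main__":
    # The two paths mentioned in the conversation; results depend on the host or chroot.
    for candidate in ("/run/shm", "/dev/shm"):
        print(candidate, classify_path(candidate))
```

Running it once on the host and once inside the chroot would show whether the two environments really differ in the way described. It does not explain why pbuilder sets things up differently, which the log leaves open.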
[14:24:35] way more helpful [14:24:59] we're actually decom'ing that fully [14:25:04] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:25:28] see https://phabricator.wikimedia.org/T158676#3108240 and below [14:25:33] (03Abandoned) 10Ema: New release: 1.14 [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/345340 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema) [14:25:47] aha, and it's broken. ok, thanks [14:26:38] being replaced by frbackup2001, which would make this less of a snowflake [14:26:44] :-) [14:27:04] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): User[rancid] [14:27:24] PROBLEM - Check systemd state on netmon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:29:04] mobrovac: o/ - you there? [14:31:04] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:31:12] !log upgrading ssl cert api.svc.eqiad.wmnet to include the new discovery endpoints [14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:25] (03CR) 10Gehel: "Yes, wee want to `ensure => absent` to remove the cron, but in a second step, we also want to cleanup this now dead code (unless we plan t" [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson) [14:32:33] !log installing apparmor security updates on trusty [14:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:01] elukey: what's uuup? 
[14:33:03] 06Operations: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984#3140401 (10MoritzMuehlenhoff) I've reviewed the suggested kernel hardening options against the choices used in the stretch 4.9 kernel (and also also our jessie backport). I... [14:34:10] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [14:34:35] (03PS1) 10Dzahn: osm: remove "if os_version('ubuntu <= precise')"-stanza [puppet] - 10https://gerrit.wikimedia.org/r/345358 [14:34:39] (03PS1) 10Alexandros Kosiaris: servermon: Allow access to static assets [puppet] - 10https://gerrit.wikimedia.org/r/345359 [14:34:59] (03PS1) 10Ottomata: Add debian [debs/python-mmh3] (debian) - 10https://gerrit.wikimedia.org/r/345360 [14:35:10] (03CR) 10Ottomata: [V: 032 C: 032] Add debian [debs/python-mmh3] (debian) - 10https://gerrit.wikimedia.org/r/345360 (owner: 10Ottomata) [14:35:23] 06Operations, 10Ops-Access-Requests: Requesting access to deploy hosts for musikanimal - https://phabricator.wikimedia.org/T161181#3140408 (10MusikAnimal) Thank you! :) [14:35:54] mobrovac: I am upgrading the nginx tlsproxy ssl certs for api.svc.eqiad.wmnet (to include the new discovery SANs). Would you mind to keep an eye on metrics for the next 10/15 mins ? 
[14:36:15] sure elukey [14:36:19] (03PS1) 10Dzahn: url_downloader: remove precise squid config [puppet] - 10https://gerrit.wikimedia.org/r/345361 [14:36:46] (03PS2) 10Alexandros Kosiaris: servermon: Allow access to static assets [puppet] - 10https://gerrit.wikimedia.org/r/345359 [14:36:49] (03PS1) 10Gehel: Cirrus / Analytics - remove deprecated rsync job [puppet] - 10https://gerrit.wikimedia.org/r/345362 [14:36:50] mobrovac: thanks :) [14:37:31] (03PS2) 10Dzahn: url_downloader: remove precise squid config [puppet] - 10https://gerrit.wikimedia.org/r/345361 [14:37:33] <_joe_> /win 26 [14:40:20] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:17] (03PS1) 10Elukey: Update api.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345365 [14:43:27] (03CR) 10Elukey: [V: 032 C: 032] Update api.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345365 (owner: 10Elukey) [14:43:48] (03PS1) 10Dzahn: check_puppetrun: remove 'require json' (precise) [puppet] - 10https://gerrit.wikimedia.org/r/345366 [14:44:49] (03PS1) 10Cmjohnson: Updating dhcpd/mac address for ms-be1031-34 [puppet] - 10https://gerrit.wikimedia.org/r/345367 [14:45:08] (03PS2) 10Cmjohnson: Updating dhcpd/mac address for ms-be1031-34 [puppet] - 10https://gerrit.wikimedia.org/r/345367 [14:45:36] mobrovac: rolling out the ssl cert now [14:45:46] (03PS1) 10Filippo Giunchedi: rancid: back up /var/lib/rancid [puppet] - 10https://gerrit.wikimedia.org/r/345368 (https://phabricator.wikimedia.org/T125020) [14:45:48] (03PS1) 10Filippo Giunchedi: rancid: create 'configs' directory [puppet] - 10https://gerrit.wikimedia.org/r/345369 [14:45:52] kk elukey, monitoring [14:46:00] (03CR) 10Jforrester: [C: 04-1] "This needs sign-off from Performance (and possible Ops) before merge. Will ping them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/345342 (https://phabricator.wikimedia.org/T161517) (owner: 10Steinsplitter) [14:46:02] 06Operations, 10ChangeProp, 06Services (later): Add storage to Change-Prop for deduplication - https://phabricator.wikimedia.org/T157089#2995197 (10Pchelolo) We might need to add storage to #changeprop not only for deduplication, but also for automatic page blacklisting, see T161710 for details. [14:48:03] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Updating dhcpd/mac address for ms-be1031-34 [puppet] - 10https://gerrit.wikimedia.org/r/345367 (owner: 10Cmjohnson) [14:48:32] (03PS1) 10Ema: Release version 1.14 [debs/pybal] - 10https://gerrit.wikimedia.org/r/345370 [14:49:00] (03PS3) 10Alexandros Kosiaris: servermon: Allow access to static assets [puppet] - 10https://gerrit.wikimedia.org/r/345359 [14:49:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] servermon: Allow access to static assets [puppet] - 10https://gerrit.wikimedia.org/r/345359 (owner: 10Alexandros Kosiaris) [14:51:58] (03PS1) 10Dzahn: aptly: remove special case to remove multiarch support on precise [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) [14:52:05] (03CR) 10Alexandros Kosiaris: [C: 032] osm: remove "if os_version('ubuntu <= precise')"-stanza [puppet] - 10https://gerrit.wikimedia.org/r/345358 (owner: 10Dzahn) [14:52:11] (03PS2) 10Alexandros Kosiaris: osm: remove "if os_version('ubuntu <= precise')"-stanza [puppet] - 10https://gerrit.wikimedia.org/r/345358 (owner: 10Dzahn) [14:52:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm: remove "if os_version('ubuntu <= precise')"-stanza [puppet] - 10https://gerrit.wikimedia.org/r/345358 (owner: 10Dzahn) [14:52:33] (03PS1) 10Marostegui: mysql-predump.erb: Reduce the number of jobs [puppet] - 10https://gerrit.wikimedia.org/r/345372 [14:52:54] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: remove precise squid config [puppet] - 
10https://gerrit.wikimedia.org/r/345361 (owner: 10Dzahn) [14:52:58] (03PS3) 10Alexandros Kosiaris: url_downloader: remove precise squid config [puppet] - 10https://gerrit.wikimedia.org/r/345361 (owner: 10Dzahn) [14:53:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] url_downloader: remove precise squid config [puppet] - 10https://gerrit.wikimedia.org/r/345361 (owner: 10Dzahn) [14:54:10] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [14:55:35] (03PS1) 10Muehlenhoff: Drop precise from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/345375 [14:55:49] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/5961/" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [14:57:31] mobrovac: all done! Will proceed with appservers in a bit but it shouldn't be a problem like the apis [14:57:49] <_joe_> elukey: "a problem" how? [14:57:53] elukey: all good on our side, thnx! [14:58:15] (03PS2) 10Dzahn: DHCP: fix host name backup4001 -> frbackup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 [14:58:18] <_joe_> mobrovac: fyi, now that elukey has finished with API, I will switch parsoid in codfw to the discovery api as well [14:58:30] kk _joe_ [14:58:38] (03CR) 10Muehlenhoff: [C: 032] Drop precise from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/345375 (owner: 10Muehlenhoff) [14:58:52] _joe_ I wanted to say that it should be a noop unlike the apis :) [14:59:00] <_joe_> mobrovac: also, I want badly to refactor all the services classes [14:59:09] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:36] (03CR) 10Alexandros Kosiaris: "Added a -1 on the first strings.Replace call and rebuilding. 
If all goes well, I 'll upload the change" [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/343797 (owner: 10Yuvipanda) [14:59:36] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3140487 (10Gehel) The same kind of tests as we did on elastic2020 are running on elastic2021 at the moment. This should help validate that ther... [14:59:44] (03CR) 10Dzahn: "@paravoid amended to just fix the host name. frbackup4001 is in DNS, backup4001 is not" [puppet] - 10https://gerrit.wikimedia.org/r/345356 (owner: 10Dzahn) [14:59:54] <_joe_> mobrovac: the added bonus of all the work I've done is that now we can have the services config depend on etcd [15:00:00] _joe_: yeah, if we switched rb and graphoid to full scap3 deploys, we could simplify them a lot [15:00:22] <_joe_> mobrovac: it's ok for now tbh [15:01:35] mutante: that's fine [15:01:44] alright [15:02:05] akosiaris: I was planning on a reboot of netmon1001 to verify we're good and have linux 4.9 running, good to reboot? 
[15:02:19] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:02:25] (03PS3) 10Dzahn: DHCP: fix host name backup4001 -> frbackup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 [15:02:58] <_joe_> grr another typo [15:03:53] (03PS4) 10Dzahn: DHCP: fix host name backup4001 -> frbackup4001 [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) [15:04:39] (03PS1) 10Giuseppe Lavagetto: parsoid: fix template for confd [puppet] - 10https://gerrit.wikimedia.org/r/345376 [15:04:54] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] parsoid: fix template for confd [puppet] - 10https://gerrit.wikimedia.org/r/345376 (owner: 10Giuseppe Lavagetto) [15:05:22] (03CR) 10Dzahn: [C: 031] rancid: back up /var/lib/rancid [puppet] - 10https://gerrit.wikimedia.org/r/345368 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [15:05:28] (03PS1) 10MarkTraceur: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) [15:06:09] (03PS2) 10Dzahn: rancid: back up /var/lib/rancid [puppet] - 10https://gerrit.wikimedia.org/r/345368 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [15:07:30] <_joe_> elukey: you're still not done with api, I assume? [15:07:46] (03CR) 10Dzahn: [C: 032] rancid: back up /var/lib/rancid [puppet] - 10https://gerrit.wikimedia.org/r/345368 (https://phabricator.wikimedia.org/T125020) (owner: 10Filippo Giunchedi) [15:08:06] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3140522 (10Papaul) @Gehel Thanks. Once that done i will also update the task on the troubleshooting steps of eastic2020. [15:08:07] godog: fine by me [15:08:13] (03CR) 10Jgreen: [C: 04-1] "We should remove the DHCP entry altogether. 
The box that was backup4001 failed and is, afaik, powered down and will be decomissioned. The " [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) (owner: 10Dzahn) [15:08:16] <_joe_> elukey: trying to call the host I obtain [15:08:17] <_joe_> * subjectAltName does not match api-rw.discovery.wmnet [15:08:39] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:08:52] _joe_: calling codfw or eqiad? [15:09:10] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3140525 (10Papaul) [15:09:19] <_joe_> volans: eqiad [15:09:26] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3066883 (10Papaul) a:05Papaul>03RobH [15:09:44] <_joe_> I tried https://api-rw.discovery.wmnet [15:09:47] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3140532 (10Papaul) p:05Triage>03Normal [15:10:29] _joe_ the apis should all be fine [15:10:34] paravoid: Jeff says -1, remove it all, heh:) [15:10:37] I mean, puppet ran in there [15:10:40] <_joe_> elukey: they're not [15:10:41] I can re-check [15:10:47] _joe_: api-rw.discovery.wmnet is in the SAN [15:10:57] <_joe_> try curl https://api-rw.discovery.wmnet [15:11:00] 06Operations, 10ops-codfw: wtp2019 has faulty memory - https://phabricator.wikimedia.org/T146009#3140535 (10Papaul) 05Open>03Resolved [15:11:35] elukey: did you reload? [15:11:42] or is puppet doing it? [15:12:08] puppet should have done it [15:12:15] <_joe_> puppet won't reload nginx afair? 
[15:12:36] the certs are in place AFAICT with cumin [15:13:02] (03PS2) 10MarkTraceur: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) [15:13:31] I remember to see tlsproxy output while running puppet, I was almost sure it was a reload [15:13:34] but I might be wrong [15:13:41] openssl s_client worked [15:14:00] <_joe_> well, what command exactly? [15:14:14] 06Operations, 10ops-codfw, 10DBA, 10procurement: Adquire temporary box for x1 failover (spare available?) - https://phabricator.wikimedia.org/T161712#3140543 (10jcrespo) [15:14:34] 06Operations, 10ops-codfw, 10DBA, 10procurement: Adquire temporary box for x1 failover on codfw (spare available?) - https://phabricator.wikimedia.org/T161712#3140558 (10jcrespo) [15:14:45] echo y | openssl s_client -connect api-rw.discovery.wmnet:443 - but I guess it doesn't do the check that I expected [15:14:58] <_joe_> no it doesn't :P [15:15:23] _joe_ I can try to systemctl reload nginx on the apis [15:15:33] in codfw [15:15:42] <_joe_> please do [15:16:50] 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3140566 (10BBlack) Adding Traffic and myself and @ema to this. I don't think we've been aware of the uselang hack... 
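Note on the 15:08-15:14 exchange: as observed there, a plain `openssl s_client -connect host:443` completes the handshake without comparing the requested name against the certificate, while curl does compare it and reports "subjectAltName does not match". The comparison itself can be sketched as below. This is a simplified illustration, assuming only dNSName entries and a wildcard that stands for exactly one left-most label; the function names and the SAN lists in the usage note are hypothetical, not the contents of the real certificates.

```python
def hostname_matches(hostname, pattern):
    """Return True if hostname matches a single dNSName pattern.

    Simplified rules (an assumption of this sketch): comparison is
    case-insensitive, and '*' is honoured only when it is the entire
    left-most label, where it stands for exactly one non-empty label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[0] != "" and host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


def covered_by_sans(hostname, san_dns_names):
    """Return True if any SAN entry in san_dns_names covers hostname."""
    return any(hostname_matches(hostname, name) for name in san_dns_names)
```

For example, `covered_by_sans("api-rw.discovery.wmnet", ["api.svc.eqiad.wmnet", "api-rw.discovery.wmnet"])` is true, while the same name checked against a list containing only `"appservers.svc.eqiad.wmnet"` is false. That is the kind of mismatch a client would report if the name resolved to hosts serving a different certificate. For a live check, Python's `ssl.create_default_context()` enables hostname checking by default, so a handshake made through it fails on a mismatch instead of passing silently.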
[15:17:19] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 653791 [15:17:36] (03CR) 10jerkins-bot: [V: 04-1] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [15:18:50] <_joe_> oh shit, I wait [15:18:56] <_joe_> I found the issue [15:19:02] _joe_ something is strange, nginx doesn't reload [15:19:05] :( [15:19:12] <_joe_> and it's in dns, not in puppet [15:19:32] 06Operations, 10ops-codfw, 10DBA, 10procurement: Aquire temporary box for x1 failover on codfw (spare available?) - https://phabricator.wikimedia.org/T161712#3140573 (10Reedy) [15:19:58] (03PS3) 10MarkTraceur: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) [15:20:07] 06Operations, 10ops-codfw, 10DBA, 10procurement: Aquire temporary box for x1 failover on codfw (spare available?) - https://phabricator.wikimedia.org/T161712#3140576 (10jcrespo) I think we could use temporarily es2002, but asking first in case there is a more suitable machine available. [15:20:30] <_joe_> I'm fixing it [15:21:10] 06Operations, 10ops-codfw, 10DBA, 10procurement: Acquire temporary box for x1 failover on codfw (spare available?) - https://phabricator.wikimedia.org/T161712#3140583 (10jcrespo) [15:21:15] ah it points to appservers! [15:24:21] (03PS1) 10Giuseppe Lavagetto: Fix discovery api-rw and api-ro geomaps [dns] - 10https://gerrit.wikimedia.org/r/345379 [15:24:27] <_joe_> bblack: ^^ [15:25:43] ah that makes sense [15:25:48] <_joe_> yeah [15:25:56] <_joe_> everything makes sense now, doesn't it? 
[15:27:13] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:27:15] 06Operations, 10hardware-requests: codfw/eqiad:(4+4) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3140600 (10akosiaris) I 'll be updating the task based on internal emails, quite well summed up in T157222#3072376. [15:27:22] (03CR) 10BBlack: [C: 031] Fix discovery api-rw and api-ro geomaps [dns] - 10https://gerrit.wikimedia.org/r/345379 (owner: 10Giuseppe Lavagetto) [15:28:07] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix discovery api-rw and api-ro geomaps [dns] - 10https://gerrit.wikimedia.org/r/345379 (owner: 10Giuseppe Lavagetto) [15:29:05] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3140605 (10akosiaris) [15:29:53] RECOVERY - salt-minion processes on ms-be1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:30:13] RECOVERY - configured eth on ms-be1039 is OK: OK - interfaces up [15:30:23] RECOVERY - Disk space on ms-be1039 is OK: DISK OK [15:30:23] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational [15:30:33] RECOVERY - dhclient process on ms-be1039 is OK: PROCS OK: 0 processes with command name dhclient [15:30:33] RECOVERY - MD RAID on ms-be1039 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:31:34] RECOVERY - DPKG on ms-be1039 is OK: All packages OK [15:31:43] 06Operations, 10ops-codfw, 10DBA, 10procurement: Acquire temporary box for x1 failover on codfw (spare available?) - https://phabricator.wikimedia.org/T161712#3140609 (10jcrespo) I think that is the right term, sorry, I had to look it up in the dictionary. English is not my strong point as a non-native-spe... 
[15:32:33] (03CR) 10EBernhardson: "i'm not particularly happy with the soultion i provided here, but it's the easiest solution for providing some level of isolation between " [puppet] - 10https://gerrit.wikimedia.org/r/345170 (https://phabricator.wikimedia.org/T161577) (owner: 10EBernhardson) [15:33:15] 06Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3140611 (10akosiaris) [15:33:43] RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:34:43] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be1039 is OK: OK: synced at Wed 2017-03-29 15:34:38 UTC. [15:37:03] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:38:59] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3140621 (10Cmjohnson) Updated the mac address for 1031-34, console issue with 1036, cable was not in correct port. 1039, fat fingered the mgmt ip address during set... [15:39:22] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3140623 (10Cmjohnson) @marostegui disk replaced...rebuilding now nclosure Device ID: 32 Slot Number: 11 Drive's position: DiskGroup: 0, Span: 1, Arm: 5 Enclosure position: 1 Device Id: 11 WWN: 5000039788210... [15:39:53] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3136638 (10jcrespo) Thank you very much! 
:-) [15:40:01] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3140626 (10Marostegui) Thanks Chris, will monitor it and close the ticket when it's finished [15:40:49] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3140640 (10Cmjohnson) New disk controller arrived...spoke with @fgiunchedi and we'll take care of this in a couple of weeks when he gets back from vacation. [15:40:50] (03CR) 10Jcrespo: [C: 031] "I am ok with this. My question is if to deploy this or increase the timeout?" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [15:41:38] (03PS1) 10Andrew Bogott: nova scheduler: scheduler_host_subset_size = 2 [puppet] - 10https://gerrit.wikimedia.org/r/345381 (https://phabricator.wikimedia.org/T161006) [15:42:42] 06Operations, 10ops-codfw, 10DBA, 10procurement: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140657 (10RobH) [15:42:53] (03CR) 10Andrew Bogott: [C: 032] nova scheduler: scheduler_host_subset_size = 2 [puppet] - 10https://gerrit.wikimedia.org/r/345381 (https://phabricator.wikimedia.org/T161006) (owner: 10Andrew Bogott) [15:43:54] (03CR) 10Marostegui: "> I am ok with this. My question is if to deploy this or increase the" [puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [15:47:33] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 06Release-Engineering-Team, 15User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139285 (10bd808) > on the deployment-prep puppetmaster, define a disk-based hiera hierarchy to mimic 1:1 what we have in pro... 
[15:48:37] 06Operations, 10ops-codfw, 10DBA, 10procurement: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140666 (10jcrespo) [15:51:18] (03PS2) 10Ottomata: Stop copying cirrus UserTesting logs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson) [15:51:24] (03CR) 10Ottomata: [V: 032 C: 032] Stop copying cirrus UserTesting logs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/345171 (owner: 10EBernhardson) [15:52:56] 06Operations, 10ops-codfw, 10DBA, 10procurement: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140668 (10RobH) a:03faidon So, for this I have 3 spare machines in codfw. One of them is being used to restore graphite data (so short term... [15:53:26] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140685 (10RobH) [15:53:43] marostegui, jynus : we have scheduled the rename of tables on EL dbs today cc ottomata (happening in 2 hours) [15:54:16] nuria ok, thanks for the heads up! [15:55:21] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140695 (10faidon) That's totally fine, approved. [15:56:13] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:48] (03CR) 10Subramanya Sastry: [C: 031] remove parsoid-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/345086 (owner: 10Dzahn) [15:57:13] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 06Release-Engineering-Team, 15User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3140702 (10Joe) >>! 
In T161675#3140664, @bd808 wrote: >> on the deployment-prep puppetmaster, define a disk-based hiera hiera... [15:57:50] (03PS1) 10Daniel Kinzler: Try using redisLockManager for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 (https://phabricator.wikimedia.org/T159828) [15:59:02] (03PS2) 10Daniel Kinzler: Try using redisLockManager for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345387 (https://phabricator.wikimedia.org/T159828) [16:03:04] 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3133989 (10Krinkle) > Adding Traffic and myself and @ema to this. I don't think we've been aware of the uselang hac... [16:07:26] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 391688 [16:07:26] 06Operations: Point swiftrepl to swift HTTPS - https://phabricator.wikimedia.org/T161717#3140769 (10fgiunchedi) [16:09:04] 06Operations, 15User-fgiunchedi: Point swiftrepl to swift HTTPS - https://phabricator.wikimedia.org/T161717#3140785 (10fgiunchedi) [16:12:23] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3140803 (10RobH) a:05faidon>03RobH [16:14:26] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [16:14:33] 06Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3140815 (10RobH) I'll get quotes for this shortly, they'll be based on the identical specifications from task T145026, which was the kubernetes order for eqiad last September. (Thi... 
[16:24:16] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:24:40] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3140855 (10Dzahn) I got a reply and was now able to confirm the committed identity from https://www.mediawiki.org/wiki/User:Bmansurov_%28WMF%29 The SHA512 hash matched the secret t... [16:24:49] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3140857 (10Dzahn) a:03Dzahn [16:25:26] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:25:51] (03CR) 10Dzahn: [C: 032] "confirmed identity via committed identity on https://www.mediawiki.org/wiki/User:Bmansurov_%28WMF%29" [puppet] - 10https://gerrit.wikimedia.org/r/345267 (https://phabricator.wikimedia.org/T161660) (owner: 10Dzahn) [16:26:08] (03PS2) 10Dzahn: admin: update SSH key for bmansurov [puppet] - 10https://gerrit.wikimedia.org/r/345267 (https://phabricator.wikimedia.org/T161660) [16:28:51] (03Abandoned) 10Chad: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [16:31:11] <_joe_> !log rolling restart of parsoid in codfw [16:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:22] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3140899 (10Dzahn) Done. already replaced on bast1001. All other places will follow automatically within max 30 min (puppet). 
[16:34:21] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3140900 (10Dzahn) 05Open>03Resolved [16:34:51] (03Restored) 10Jcrespo: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [16:36:10] (03PS4) 10Thcipriani: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [16:36:29] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 605 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3538794 keys, up 6 days 21 minutes - replication_delay is 605 [16:36:44] 06Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3140905 (10RobH) I've requested quotes for both T142578 & T161700, as they are both identical hardware spec to the older order on T145026. Once I have the quote updates back from D... [16:36:48] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3139970 (10RobH) I've requested quotes for both T142578 & T161700, as they are both identical hardware spec to the older order on T145026. Once I have the quote updates back from De... 
[16:37:31] (03PS3) 10Jcrespo: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [16:37:36] (03CR) 10jerkins-bot: [V: 04-1] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [16:37:52] (03PS4) 10Jcrespo: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [16:38:54] (03PS3) 10Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) [16:39:48] 06Operations, 10Ops-Access-Requests: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3140924 (10Dzahn) @JoeWalsh could you add a reason why you need private data? [16:40:29] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3510790 keys, up 6 days 25 minutes - replication_delay is 0 [16:40:35] (03CR) 10Jcrespo: "role/mariadb is not the right place, but it is better than the current location, and much better than the mariadb module- these are eventl" [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [16:41:17] (03CR) 10Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:41:45] 06Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#2539946 (10RobH) [16:46:34] (03PS5) 10Thcipriani: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [16:49:30] AndyRussG: Regarding 
https://phabricator.wikimedia.org/T115642 (just saw it in the backlog) - may wanna update to be about ESLint instead (if not yet resolved) - see also T118941 / T161142. [16:49:30] T118941: Switch to eslint for our linting and our code styling - https://phabricator.wikimedia.org/T118941 [16:49:31] T161142: CentralNotice: for JS linting, switch from jshint+jscs to eslint - https://phabricator.wikimedia.org/T161142 [16:49:31] (03CR) 10VolkerE: "That's what I'd suggest, not pointing at something that isn't there. Let's remove the footer there." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:50:42] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3141009 (10Aklapper) Environmental discussion seems to be better at https://meta.wikimedia.org/wiki/Sustainability_Initiative (mentioned already in this task) and now https://wikimediafoundation.org/wiki/Reso... [16:51:16] <_joe_> !log actually performing the parsoid rolling restart in codfw [16:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:30] !log upgrading ssl cert appservers.svc.eqiad.wmnet to include the new discovery endpoints [16:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:45] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5962/" [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [16:53:01] !log Disable puppet on db1047 and dbstore1002 for maintenance - T160454 [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:07] T160454: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454 [16:53:08] from mail subject: Delivery Problem, Reason: Nobody Home [16:53:26] disabled puppet on all the mw appservers in eqiad FYI [16:53:29] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently 
enabled, last run 4 seconds ago with 0 failures [16:55:19] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T161600#3141035 (10Marostegui) 05Open>03Resolved a:03Marostegui Rebuilt ``` root@db1067:~# megacli -PDRbld -ShowProg -PhysDrv [32:11] -aALL ; megacli -ldinfo -l0 -a0 Device(Encl-32 Slot-11) is not in rebuil... [16:55:19] RECOVERY - MegaRAID on db1067 is OK: OK: optimal, 1 logical, 6 physical [16:55:52] !log Stop eventlog syncs to db1047 and dbstore1002 for maintenance - T160454 [16:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:48] (03PS1) 10Elukey: Update appservers.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345396 [16:59:54] mutante must be a spammer. Or someones on holiday [16:59:54] PROBLEM - swift-container-server on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:00:04] PROBLEM - swift-container-updater on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:00:24] PROBLEM - swift-object-auditor on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:00:30] (03CR) 10Elukey: [V: 032 C: 032] Update appservers.svc.eqiad.crt to include discovery endpoints [puppet] - 10https://gerrit.wikimedia.org/r/345396 (owner: 10Elukey) [17:00:34] PROBLEM - swift-object-replicator on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:00:44] PROBLEM - swift-object-server on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:01:04] PROBLEM - swift-object-updater on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:01:04] RECOVERY - 
swift-container-updater on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:01:24] RECOVERY - swift-object-auditor on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:01:44] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 34 seconds ago with 6 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Service[swift-account-replicator],Service[swift-account-reaper],Service[swift-account-auditor] [17:01:54] RECOVERY - swift-container-server on ms-be1034 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:02:04] RECOVERY - swift-object-updater on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:02:14] PROBLEM - swift-account-auditor on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:02:24] PROBLEM - swift-account-reaper on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:02:34] PROBLEM - swift-account-replicator on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:03:08] paladox: hehe, yes, spammer for sure. it was just a funnier subject :) [17:03:30] lol what subject is that? [17:03:31] mutante ^^ [17:04:14] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:17] slowly running puppet on all the eqiadappservers [17:04:45] anybody checking ms-be1034 ? [17:05:06] paladox: stuff like "delivery failure: nobody home" and "Your message to BREXIT awaits moderator approval" [17:05:07] ah maybe new? 
[17:05:11] uptime 52 min [17:05:30] oh [17:05:58] ah yes https://phabricator.wikimedia.org/T160640 [17:06:06] so probably downtime expired [17:06:12] godog: --^ [17:09:04] (03PS3) 10Dzahn: remove parsoid-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/345086 [17:11:23] !log restarting nginx on eqiad appservers to pick up the new certs [17:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:58] (03CR) 10Dzahn: [C: 032] remove parsoid-tests.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/345086 (owner: 10Dzahn) [17:12:01] (03PS1) 10Dzahn: admin: add joewalsh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/345398 (https://phabricator.wikimedia.org/T161663) [17:12:19] !log nuria@tin Started deploy [eventlogging/analytics@2874077]: (no justification provided) [17:12:23] !log nuria@tin Finished deploy [eventlogging/analytics@2874077]: (no justification provided) (duration: 00m 03s) [17:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:45] !log removing parsoid-tests.wikimedia.org from DNS - replaced by more specific parsoid-rt-tests and parsoid-vd-tests [17:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:15] (03PS2) 10Jforrester: Show 'Publish' not 'Save' on final Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337531 (https://phabricator.wikimedia.org/T131132) [17:14:17] (03PS1) 10Jforrester: Show 'Publish' not 'Save' on Wikipedias except de/en [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345399 (https://phabricator.wikimedia.org/T131132) [17:14:30] (03PS3) 10Jforrester: Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724 [17:15:25] (03PS12) 10BBlack: varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 
(https://phabricator.wikimedia.org/T134404) [17:16:21] all right https://appservers-rw.discovery.wmnet works now [17:17:33] elukey: yup, sorry [17:18:44] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:19:45] (03PS5) 10BryanDavis: logstash: Parse nginx access logs for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/299825 [17:25:34] !log puppet disabled on all cp* ahead of careful deploy for https://gerrit.wikimedia.org/r/#/c/339667/ [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:45] RECOVERY - swift-object-server on ms-be1034 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:30:56] (03CR) 10BBlack: [C: 032] varnish: refactor all clusters for active/active [puppet] - 10https://gerrit.wikimedia.org/r/339667 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [17:31:14] RECOVERY - swift-account-auditor on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:31:24] RECOVERY - swift-account-reaper on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:31:27] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#2961839 (10faidon) I think this task (matching eqiad's capacity) i... 
[17:31:34] RECOVERY - swift-object-replicator on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:31:34] RECOVERY - swift-account-replicator on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:31:43] !log remove ge-3/0/27 from interface-range labs-instance-ports (now for ms-be1031) [17:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:14] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:34:13] (03PS5) 10Jcrespo: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [17:36:51] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 06Release-Engineering-Team, 15User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3139285 (10thcipriani) > * on the deployment-prep puppetmaster, configure a 'staging' environment for puppet, with its own si... [17:37:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Update bmansurov's SSH key - https://phabricator.wikimedia.org/T161660#3141160 (10bmansurov) Thank you. I'm able to SSH now. 
[17:42:41] (03Abandoned) 10Ema: Release version 1.14 [debs/pybal] - 10https://gerrit.wikimedia.org/r/345370 (owner: 10Ema) [17:44:10] (03CR) 10Thcipriani: "Hrm puppet compiler isn't happy with the changes on tin, but I'm unclear why: https://puppet-compiler.wmflabs.org/5963/" [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [17:48:44] RECOVERY - puppet last run on ms-be1034 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:53:44] (03CR) 10Bartosz Dziewoński: [C: 04-1] "I think we should totally do this, but it first requires an operations/puppet patch to actually make these page views uncacheable – otherw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345342 (https://phabricator.wikimedia.org/T161517) (owner: 10Steinsplitter) [17:55:02] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 06Release-Engineering-Team, 15User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3141211 (10bd808) >>! In T161675#3141154, @thcipriani wrote: >>>! In T161675#3140664, @bd808 wrote: >> I think I would sugges... 
[17:55:44] !log ppchelko@tin Started deploy [changeprop/deploy@e4547cd]: Support regexed topics [17:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:39] !log ppchelko@tin Finished deploy [changeprop/deploy@e4547cd]: Support regexed topics (duration: 00m 55s) [17:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:00] (03PS4) 10Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T1800). Please do the needful. [18:00:04] RoanKattouw and James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:23] I'll SWAT today [18:00:39] Thanks, RoanKattouw. 
[18:00:56] (03CR) 10Catrope: [C: 032] Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724 (owner: 10Jforrester) [18:01:19] (03PS2) 10Catrope: Show 'Publish' not 'Save' on Wikipedias except de/en [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345399 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [18:01:40] (03PS1) 10Volans: Do not auto-ucfirst when the query is a regex [software/cumin] - 10https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) [18:02:07] (03Merged) 10jenkins-bot: Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724 (owner: 10Jforrester) [18:02:18] (03CR) 10jenkins-bot: Set wgOOUIEditPage false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344724 (owner: 10Jforrester) [18:03:41] (03PS3) 10Catrope: Show 'Publish' not 'Save' on Wikipedias except de/en [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345399 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [18:03:47] (03CR) 10Catrope: [C: 032] Show 'Publish' not 'Save' on Wikipedias except de/en [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345399 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [18:03:59] (03CR) 10Dzahn: "@Jgreen i was wondering, when you install servers in FR, do you use this install server or not anymore?" 
[puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) (owner: 10Dzahn) [18:04:19] (03CR) 10Dzahn: "well, PS1 was downvoted for removing it, PS2 was downvoted for not removing it :)" [puppet] - 10https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) (owner: 10Dzahn) [18:06:26] (03PS1) 10Andrew Bogott: bootstrap_vz: Increase dhcp retry loop by a lot [puppet] - 10https://gerrit.wikimedia.org/r/345403 [18:07:29] (03PS5) 10Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) [18:07:31] (03Merged) 10jenkins-bot: Show 'Publish' not 'Save' on Wikipedias except de/en [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345399 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [18:07:34] (03CR) 10Krinkle: "@Volker: Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:08:57] (03CR) 10jenkins-bot: Show 'Publish' not 'Save' on Wikipedias except de/en [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345399 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [18:09:20] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz: Increase dhcp retry loop by a lot [puppet] - 10https://gerrit.wikimedia.org/r/345403 (owner: 10Andrew Bogott) [18:09:35] (03PS1) 10Andrew Bogott: Nova fullstack test: Increaes timeouts [puppet] - 10https://gerrit.wikimedia.org/r/345405 [18:10:07] (03PS2) 10Andrew Bogott: Nova fullstack test: Increase timeouts [puppet] - 10https://gerrit.wikimedia.org/r/345405 [18:10:27] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Save->Publihs on Wikipedias except dewiki and enwiki (T131132); set wgOOUIEditPage false everywhere (duration: 00m 57s) [18:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:34] T131132: Re-label 
the "Save" button to be "Publish", to better indicate to users the outcomes of their action - https://phabricator.wikimedia.org/T131132 [18:10:37] (03CR) 10VolkerE: [C: 031] errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [18:11:20] Volker_E: Thanks, good idea with the 503 change. It's much better :) [18:12:31] (03CR) 10Andrew Bogott: [C: 032] Nova fullstack test: Increase timeouts [puppet] - 10https://gerrit.wikimedia.org/r/345405 (owner: 10Andrew Bogott) [18:14:53] (03PS6) 10Jcrespo: Mariadb: Move remaining non-module files to the module [puppet] - 10https://gerrit.wikimedia.org/r/344195 (owner: 10Chad) [18:16:04] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [18:16:44] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 06Release-Engineering-Team, 15User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3141291 (10Joe) >>! In T161675#3141154, @thcipriani wrote: > Will this repo //just// be a different `site.pp` for beta node d... [18:21:54] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:22:06] !log catrope@tin Synchronized php-1.29.0-wmf.18/includes/Title.php: T159319 (duration: 00m 46s) [18:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:51] !log catrope@tin Synchronized php-1.29.0-wmf.18/includes/page/WikiPage.php: T159319 (duration: 00m 44s) [18:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:54] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [18:35:40] 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) 
- https://phabricator.wikimedia.org/T161658#3141362 (10pmiazga) @Dzahn I'm following this article [[ https://wikitech.wikimedia.org/wiki/SWAP#Access | SWAP Access ]] to get access to Simple Wikimedia Analytic... [18:38:10] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:08] !log Started a Wikidata TTL dump run on snapshot1007 using Zend (due to T161695). [18:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:14] T161695: Wikidata dump maintenance scripts cause HHVM to leak memory heavily - https://phabricator.wikimedia.org/T161695 [18:43:40] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:45:00] !log varnish active/active deploy done ( https://gerrit.wikimedia.org/r/#/c/339667/ ) - all caches running the new code, puppet re-enabled, etc. [18:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:19] (03PS10) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [18:57:21] (03PS10) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [18:57:25] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T1900). Please do the needful. 
[19:00:05] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.229 second response time [19:00:12] * thcipriani does [19:02:10] (03CR) 10BBlack: "FYI, the part of this that was touching modules/role/manifests/cache/misc.pp will no longer rebase cleanly. That data is now in hieradata" [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey) [19:02:59] (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345409 [19:03:01] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345409 (owner: 10Thcipriani) [19:04:25] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345409 (owner: 10Thcipriani) [19:04:34] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345409 (owner: 10Thcipriani) [19:05:05] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.065 second response time [19:05:16] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.18 [19:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:15] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [19:07:05] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [19:07:05] PROBLEM - puppet last run on ms-fe1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:08:16] (03PS1) 10Thcipriani: Revert "group1 wikis to 1.29.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345410 [19:08:43] (03CR) 10Thcipriani: [C: 032] Revert "group1 wikis to 1.29.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345410 (owner: 10Thcipriani) [19:09:21] bblack: do you have 5 minutes? [19:10:19] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345410 (owner: 10Thcipriani) [19:10:31] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345410 (owner: 10Thcipriani) [19:10:58] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 back to 1.29.0-wmf.17 [19:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:16] Quiz? [19:19:58] (03PS11) 10EBernhardson: [WIP] Upgrade logstash to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344964 (https://phabricator.wikimedia.org/T154473) [19:20:00] (03PS11) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [19:33:26] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3141540 (10JoeWalsh) [19:35:05] RECOVERY - puppet last run on ms-fe1005 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:39:02] 06Operations, 06Analytics-Kanban, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3141583 (10Nuria) 05Open>03Resolved [19:39:15] 06Operations, 10Traffic, 07HTTPS, 15User-fgiunchedi: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3105285 (10aaron) SwiftFileBackend will need to force 
an https URL when it gets the storage_url back in the JSON auth response. [19:44:56] !log ppchelko@tin Started deploy [changeprop/deploy@1150cf5]: Config: Enabling regex-based topic subscription [19:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:42] !log ppchelko@tin Finished deploy [changeprop/deploy@1150cf5]: Config: Enabling regex-based topic subscription (duration: 01m 45s) [19:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:00] (03PS1) 10Dduvall: [DO NOT MERGE] ci-staging: Docker registry for container builds [puppet] - 10https://gerrit.wikimedia.org/r/345422 (https://phabricator.wikimedia.org/T161657) [19:47:15] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 257 bytes in 0.048 second response time [19:49:53] ^ madhuvishy more toolschecker madness? :) [19:50:17] hmmm [19:50:47] (03PS2) 10Dduvall: k8s: Accept any given api server authorization mode [puppet] - 10https://gerrit.wikimedia.org/r/345187 [19:50:49] (03PS3) 10Dduvall: [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) [19:52:40] chasemp: madhuvishy nope, that's me upgrading. [19:52:48] ah [19:53:01] yuvipanda: are you upgrading w/ bd808's patch? [19:53:03] yuvipanda: ah okay - i was just typing that it shouldn't be toolscher related [19:53:05] chasemp: the package had one more error unfortunately, taking down our master for 30s or so. [19:53:08] toolschecker* [19:53:14] chasemp: nope, just baseline. [19:53:15] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.269 second response time [19:53:21] yuvipanda: so package is still busted? [19:53:32] chasemp: yeah, but I fixed it manually. 
[19:53:43] chasemp: package is using user 'kube', puppet uses 'kubernetes' [19:53:47] ah [19:53:51] chasemp: other than that, it's all fine. [19:54:06] I think the package should just use 'kubernetes' [19:55:25] yuvipanda: where does the kube username come from? some upstream? [19:55:39] chasemp: don't think so [19:55:42] I'll check tho [19:57:30] chasemp: ah, debian's experimental kubernetes packages also use 'kube' and not 'kubernetes' [19:57:36] so I guess I'll change prod to say 'kube' [19:57:42] seems reasonable [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T2000). [20:01:49] \o/ ORES is happening [20:01:57] Amir1, you around? [20:02:08] (03PS2) 10Andrew Bogott: Nova: Remove unused rsync server [puppet] - 10https://gerrit.wikimedia.org/r/344691 [20:04:24] (03PS12) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [20:04:24] chasemp: madhuvishy haha, it's actually worse. so we have these systemd unit files from debs, and they say 'kube'. Then puppet overrides them to say 'kubernetes' [20:04:36] so it failed after first puppet run :D [20:04:43] :| [20:04:44] so, install package, fail, manual fix, puppet breaks? 
[20:04:46] (have fixed it manually again, pondering options) [20:05:04] yeah [20:05:42] chasemp: I think I'm going to rip the systemd files out of puppet (seems to not be doing many changes outside of the group / user change) [20:05:52] that's the right thing to do I think [20:05:54] and just use 'kube' everywhere [20:06:06] right, since we moved to packages that should be their domain [20:08:34] (03CR) 10Andrew Bogott: [C: 032] Nova: Remove unused rsync server [puppet] - 10https://gerrit.wikimedia.org/r/344691 (owner: 10Andrew Bogott) [20:08:44] !log ppchelko@tin Started deploy [changeprop/deploy@ef62908]: Fix metrics for regex topics [20:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:02] <_joe_> you can have systemd overrides in puppet [20:09:12] (03PS13) 10EBernhardson: [WIP] Update elk stack to 5.x [puppet] - 10https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [20:09:16] <_joe_> just to override settings in the base package-provided unit [20:09:40] !log ppchelko@tin Finished deploy [changeprop/deploy@ef62908]: Fix metrics for regex topics (duration: 00m 56s) [20:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:00] OK. Looks like I'm starting ORES. [20:12:10] Amir1, ping me when you get here [20:13:36] !log arlolra@tin Started deploy [parsoid/deploy@bc798dc]: Updating Parsoid to b1b27146 [20:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:34] OLDHASH for ORES is bc0bc74 [20:16:45] !log halfak@tin Started deploy [ores/deploy@554ea12]: T160638 [20:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:52] T160638: Deploy ORES late march - https://phabricator.wikimedia.org/T160638 [20:20:27] graphs look good and curl on the host is good. Moving on. 
[20:21:02] !log arlolra@tin Finished deploy [parsoid/deploy@bc798dc]: Updating Parsoid to b1b27146 (duration: 07m 26s) [20:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:36] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3141742 (10chasemp) [20:24:08] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3029754 (10chasemp) [20:25:35] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3141755 (10chasemp) [20:26:50] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3029672 (10chasemp) [20:28:37] (03PS1) 10Yuvipanda: k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 [20:28:54] (03CR) 10jerkins-bot: [V: 04-1] k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 (owner: 10Yuvipanda) [20:29:32] 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) - https://phabricator.wikimedia.org/T161658#3141772 (10Dzahn) @pmiazga Ok, thanks for pointing out that article and what it's for. I don't really know about SWAP Access yet. Let me add @madhuvishy , i think s... [20:29:38] (03PS2) 10Yuvipanda: k8s: Use packages everywhere [puppet] - 10https://gerrit.wikimedia.org/r/345441 [20:30:01] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3141783 (10chasemp) >>! In T158204#3116230, @RobH wrote: > Is there a specific cpu seed we have to stick to? 24 cores without HT is dual 12 core CPUs. Anything between... 
[20:30:09] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3141784 (10chasemp) 05stalled>03Open [20:31:25] !log Updated Parsoid to b1b27146 (T161558, T160207, T153798) [20:31:28] 06Operations, 10Ops-Access-Requests: Production shell access (request for notebook-roots for pmiazga?) - https://phabricator.wikimedia.org/T161658#3141787 (10madhuvishy) @pmiazga @Dzahn Notebook access piggy backs on analytics cluster access. Currently if you have access to researchers or analytics-privatedata... [20:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:33] T160207: Unnecessary when editing indent-pre with lines starting with '*' - https://phabricator.wikimedia.org/T160207 [20:31:33] T161558: Parsoid linter request failed with The "revision" parameter must be set - https://phabricator.wikimedia.org/T161558 [20:31:33] T153798: Lower Parsoid heap limit - https://phabricator.wikimedia.org/T153798 [20:34:57] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3141806 (10chasemp) [20:35:08] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3141807 (10chasemp) [20:35:25] !log halfak@tin Finished deploy [ores/deploy@554ea12]: T160638 (duration: 18m 40s) [20:35:30] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtest - https://phabricator.wikimedia.org/T154706#3141810 (10chasemp) [20:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:32] T160638: Deploy ORES late march - https://phabricator.wikimedia.org/T160638 [20:35:33] \o/ all looks good. 
[20:35:58] Operations, hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3141814 (chasemp) [20:39:45] Operations, hardware-requests: eqiad: (2) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#3141865 (chasemp) [20:39:47] (PS14) EBernhardson: [WIP] Update elk stack to 5.x [puppet] - https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [20:40:44] ORES is done FYI [20:45:01] Operations, Labs, hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3141878 (chasemp) [20:46:18] Operations, Ops-Access-Requests: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3141892 (Dzahn) [20:50:05] Operations, Ops-Access-Requests: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3141923 (Dzahn) @Madhuvishy Thank you! I added "(ask for the "researchers" (or "analytics-privatedata-users") group)" to that wiki page. @pmiazga I think the "resear... [20:50:35] Operations, Ops-Access-Requests: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3141938 (Dzahn) a:Dzahn [20:52:47] (PS6) Andrew Bogott: Keystonehooks: Exclude 'novaobserver' user from posix user group. [puppet] - https://gerrit.wikimedia.org/r/343074 (https://phabricator.wikimedia.org/T158650) [20:54:27] (CR) Andrew Bogott: [C: 2] Keystonehooks: Exclude 'novaobserver' user from posix user group.
[puppet] - https://gerrit.wikimedia.org/r/343074 (https://phabricator.wikimedia.org/T158650) (owner: Andrew Bogott) [20:55:38] (PS3) Andrew Bogott: Designate: Don't use keystone to resolve project id [puppet] - https://gerrit.wikimedia.org/r/343356 (https://phabricator.wikimedia.org/T158650) [20:56:00] Operations, hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3141980 (Reedy) [20:56:37] (PS1) Dzahn: admin: add pmiazga to researchers group [puppet] - https://gerrit.wikimedia.org/r/345469 (https://phabricator.wikimedia.org/T161658) [20:56:45] !log thcipriani@tin Synchronized php-1.29.0-wmf.18/extensions/ProofreadPage/includes/page/ProofreadPagePage.php: [[gerrit:345423|Makes sure to always return a Title in ProofreadPagePage::findIndexTitle]] T161734 (duration: 00m 46s) [20:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:52] T161734: fatal error: Argument 1 passed to ProofreadIndexPage::newFromTitle() must be an instance of Title, ProofreadIndexPage given - https://phabricator.wikimedia.org/T161734 [20:57:05] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:57:13] Operations, Labs, hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (chasemp) [21:00:14] (CR) Andrew Bogott: [C: 2] Designate: Don't use keystone to resolve project id [puppet] - https://gerrit.wikimedia.org/r/343356 (https://phabricator.wikimedia.org/T158650) (owner: Andrew Bogott) [21:00:35] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 593.00 seconds [21:01:01] (CR) Hashar: [C: 1] "Looks good to me. There are bits that just reimplementing the puppet class but that is not so important."
[puppet] - https://gerrit.wikimedia.org/r/345338 (owner: Gehel) [21:02:50] (CR) Hashar: [C: 1] "Forgot: the "rake spec" task ignore the postgresql module in /Rakefile:" [puppet] - https://gerrit.wikimedia.org/r/345338 (owner: Gehel) [21:07:13] Operations, Ops-Access-Requests, Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3142027 (Dzahn) p:Triage>Normal @pmiazga We should be good to go here. We'll just have to follow that 3-day waiting period from https:/... [21:16:14] Operations, DBA, Labs: eqiad: (2) hardware access request for labsdb1004 & 5 - https://phabricator.wikimedia.org/T161754#3142042 (chasemp) [21:16:44] Operations, DBA, Labs: eqiad: (2) hardware access request for labsdb1004 & 5 refresh - https://phabricator.wikimedia.org/T161754#3142056 (chasemp) [21:18:07] Operations, DBA, Labs: eqiad: (2) hardware access request for labsdb1006 & 7 refresh - https://phabricator.wikimedia.org/T161755#3142057 (chasemp) [21:18:23] Operations, Labs, hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3142069 (chasemp) [21:20:35] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:21:03] (PS6) Krinkle: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) [21:21:12] (CR) Krinkle: [C: 2] errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [21:22:14] (Merged) jenkins-bot: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [21:22:23]
(CR) jenkins-bot: errorpages: Restyle 503/php-fatal error to match Varnish error [mediawiki-config] - https://gerrit.wikimedia.org/r/345274 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle) [21:23:20] !log krinkle@tin Synchronized errorpages/: I15295835a1a (duration: 00m 44s) [21:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:59] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:27:55] Operations, Labs, hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (faidon) labnodepool wasn't actually on my list, since it was slated for Q4 of FY17-18 and I only focused on accelerating more immediate orders (Q1 mostly).... [21:29:38] (PS15) EBernhardson: [WIP] Update elk stack to 5.x [puppet] - https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [21:30:37] (CR) jerkins-bot: [V: -1] [WIP] Update elk stack to 5.x [puppet] - https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) (owner: EBernhardson) [21:31:05] Operations, Labs, hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3142153 (chasemp) Open>stalled I had noted on the procurement sheet to forward to Q4 of 16/17 if budget allowed due to the age from a previous convo. I'm go...
[21:35:27] (PS16) EBernhardson: [WIP] Update elk stack to 5.x [puppet] - https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) [21:40:17] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#3142206 (chasemp) [21:43:08] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#3142213 (chasemp) [21:45:17] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3142232 (chasemp) [21:48:45] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3142249 (chasemp) [21:50:44] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3142261 (chasemp) [21:52:37] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3142262 (chasemp) [21:52:53] Operations, Labs, hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142263 (chasemp) [21:53:50] (CR) Krinkle: [WIP] Update elk stack to 5.x (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344965 (https://phabricator.wikimedia.org/T154473) (owner: EBernhardson) [21:53:58] Operations, Labs, hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3142274 (chasemp) a:RobH [21:54:05] Operations, Labs, hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3142275 (chasemp) a:RobH
[21:54:22] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#3142276 (chasemp) a:RobH [21:54:44] Operations, Labs, hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3142277 (chasemp) a:RobH [21:54:52] Operations, hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3142278 (chasemp) a:RobH [21:55:12] Operations, DBA, Labs: eqiad: (2) hardware access request for labsdb1006 & 7 refresh - https://phabricator.wikimedia.org/T161755#3142279 (chasemp) a:RobH [21:55:20] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3142280 (chasemp) a:RobH [21:55:29] Operations, Labs, hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3142281 (chasemp) a:RobH [21:55:35] Operations, Labs, hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142282 (chasemp) a:RobH [21:56:01] jesus chase [21:56:10] you know im on vacation starting tomorrow right? ;D [21:56:35] :) by request I swear!
[21:56:46] more reasons to drink on vacation [21:56:57] yeah i know i was expecting a lot of tasks for new quotes [21:57:00] just teasing you ;D [21:57:51] robh: my theory of grouping is basically like specs and site and I may be overdoing that with separate tasks where it's just different disk and/or RAM or something but I think it's all covered and we can negotiate from there mate [21:58:08] yeah any different spec should likely have its own task [21:58:14] so even though its a lot of tasks, it seems right [21:59:38] cool [22:03:05] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:06:01] Operations, Labs, Wikimedia-Incident: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#3142322 (chasemp) [22:06:03] Operations, Labs, Patch-For-Review: Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#3142320 (chasemp) [22:20:39] Operations, Ops-Access-Requests, Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3142372 (madhuvishy) @dzahn @pmiazga One note - researchers only gives mysql access - not Hive/Hadoop. If you need access to Hive/Hadoop + Mysql... [22:31:05] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:42:55] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:49:20] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3142439 (DStrine) [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T2300). Please do the needful. [23:11:55] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [23:12:56] Operations, Ops-Access-Requests, Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3142466 (Dzahn) @pmiazga Do you think you need Hive/Hadoop? Details about the groups we are talking about are on https://wikitech.wikimedia.org/... [23:16:11] (PS1) Reedy: Don't use EP_NS in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/345474 (https://phabricator.wikimedia.org/T87911) [23:16:58] jouncebot: next [23:16:59] In 0 hour(s) and 43 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T0000) [23:20:29] (CR) Jforrester: [C: 1] "Let's SWAT this." [mediawiki-config] - https://gerrit.wikimedia.org/r/345474 (https://phabricator.wikimedia.org/T87911) (owner: Reedy) [23:20:40] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3142490 (Dzahn) @JoeWalsh thank you! we'll just need the manager approval now (@Fjalapeno ) and then we could merge this on Friday.
Cheers, Daniel [23:20:56] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3142494 (Dzahn) a:Dzahn [23:21:08] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3138828 (Dzahn) p:Triage>Normal [23:28:51] (PS5) Dzahn: DHCP: remove backup4001 [puppet] - https://gerrit.wikimedia.org/r/345356 (https://phabricator.wikimedia.org/T158220) [23:35:29] (CR) Dzahn: "@Akosiaris: Looks like you added the "TODO: Remove once we are free from precise". I am blindly following that." [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn) [23:38:05] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:39:49] jouncebot: now [23:39:49] For the next 0 hour(s) and 20 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170329T2300) [23:39:55] It's swat?
[23:40:30] (CR) Reedy: [C: 2] Don't use EP_NS in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/345474 (https://phabricator.wikimedia.org/T87911) (owner: Reedy) [23:42:47] (Merged) jenkins-bot: Don't use EP_NS in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/345474 (https://phabricator.wikimedia.org/T87911) (owner: Reedy) [23:42:57] (CR) jenkins-bot: Don't use EP_NS in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/345474 (https://phabricator.wikimedia.org/T87911) (owner: Reedy) [23:43:54] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Dont use EP_NS in CommonSettings (duration: 00m 44s) [23:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:19] !log reedy@tin Synchronized php-1.29.0-wmf.18/extensions/Quiz: Fix undefined variable stateObject T161735 (duration: 00m 49s) [23:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:26] T161735: Notice: Undefined variable: stateObject in /srv/mediawiki/php-1.29.0-wmf.18/extensions/Quiz/Quiz.class.php on line 405 - https://phabricator.wikimedia.org/T161735 [23:50:45] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1774 MB (3% inode=90%) [23:52:45] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1775 MB (3% inode=90%) [23:53:43] (CR) Andrew Bogott: "We'll be mopping up the last Precise instances on Friday, after which we can start purging precise code all over the place :)" [puppet] - https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: Dzahn)