[00:02:30] 06Operations, 10EventBus, 07Regression, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2478094 (10GWicke) [00:05:26] 06Operations, 10EventBus, 07Regression, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2478102 (10demon) Thanks for the quick diagnosis! It's not super urgent I don't think since we know what's going on, just... [00:07:08] gwicke: Thanks for figuring out what's up with there. [00:16:12] (03PS1) 10Dzahn: osmium: also copy /srv for migration [puppet] - 10https://gerrit.wikimedia.org/r/299908 (https://phabricator.wikimedia.org/T132530) [00:17:13] (03CR) 10Paladox: [C: 031] Minor tweaks to 2.12.2 package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (owner: 10Chad) [00:17:16] (03CR) 10Dzahn: [C: 032] osmium: also copy /srv for migration [puppet] - 10https://gerrit.wikimedia.org/r/299908 (https://phabricator.wikimedia.org/T132530) (owner: 10Dzahn) [00:18:15] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[00:26:22] (03PS1) 10Dzahn: osmium: copy data back from hafnium after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/299911 (https://phabricator.wikimedia.org/T132530) [00:26:37] Dereckson: If you ever wanted a hit list, BTW, https://www.irccloud.com/pastebin/aeyuR5NG/ … :-( [00:27:13] (03CR) 10Paladox: [C: 031] planet: add phabricator releng blog feed [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [00:27:14] ostriches: not a good idea the full scap for dblist changes [00:27:18] 25 error: Uncaught exception 'Exception' with message 'MWWikiversions::readDbListFile(): unable to read visualeditor-default.#012' in /srv/mediawiki/multi [00:27:21] version/MWWikiversions.php:79#012Stack trace:#012#0 /srv/mediawiki/wmf-config/CommonSettings.php(171): MWWikiversions::readDbListFile()#012#1 /srv/mediawiki/ph [00:27:24] p-1.28.0-wmf.10/LocalSettings.php(3): include()#012#2 /srv/mediawiki/php-1.28.0-wmf.10/includes/WebStart.php(124): include()#012#3 /srv/mediawiki/php-1.28.0-wm [00:27:27] f.10/index.php(40): include()#012#4 /srv/mediawiki/w/index.php(3): include()#012#5 {main} [00:27:56] (03PS1) 10Dzahn: osmium: delete temp. migration class [puppet] - 10https://gerrit.wikimedia.org/r/299912 [00:28:10] ostriches: they can be a tiny number of requests during the rsync operation it seems [00:28:18] Those commits need to be sync-file'd in the correct order [00:28:22] those files* [00:28:22] Bleh. [00:28:30] That shouldn't happen. 
[00:28:32] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2478206 (10GWicke) [00:28:47] Can't just sync-dir or scap [00:28:59] first sync the settings file that stops calling it, and then do the directory sync that removes the file [00:29:01] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2478208 (10BBlack) [00:29:03] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2478209 (10BBlack) [00:29:06] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2478207 (10BBlack) [00:29:24] Meh, full scap should be more atomic than that though [00:29:36] w/e, it'll go away [00:31:45] it's only as "atomic" as rsync and that turns out to not be very atomic really [00:32:25] we do delay-delete and some other flag I'm not remembering to try and make it a bit better than default [00:32:45] !log osmium - reboot into PXE, reinstall [00:32:49] but really we need to de-pool/re-pool as the updates hit [00:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:07] bd808: Yep. [00:33:26] MatmaRex: Also, note the difference between T140850 and T140852. 
[00:33:27] T140852: Load all Wikimedia-deployed extensions and skins via extension registration - https://phabricator.wikimedia.org/T140852 [00:33:27] T140850: Remove all PHP entry points from all Wikimedia-deployed extensions and skins - https://phabricator.wikimedia.org/T140850 [00:33:38] bd808: also https://phabricator.wikimedia.org/T47877 [00:34:19] Krinkle: T104352 [00:34:19] T104352: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 [00:34:25] bd808: ^ That's fixable in other ways though, but it'd also be interesting to do a switch over internally (e.g. gradually depool/update 50%, and then switch over and continue) [00:34:43] James_F: right, i trust you to get tasks right :) i was just pointing out that there is a lot of related stuff, that may need to be duped, or turned into a nice dependency tree [00:34:53] kubernetes has this built-in [00:35:04] bd808: Yeah, atomic-ish-sorta. [00:35:38] As a first step, we want to depool the scap proxies. [00:35:59] T125629 [00:35:59] T125629: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629 [00:36:13] Keeps proxies from flapping and lets them dedicate 100% to sync. [00:36:36] MatmaRex: There's about 200 tasks already in that tree. :-) [00:38:11] ostriches: aren't most of the fanout rsync servers jobrunners? [00:38:34] I dunno, haven't looked lately [00:39:04] sync-apache done, 46 errors, 31 for /w/index.php, 15 for /w/api.php [00:41:44] 06Operations, 10EventBus, 07Regression, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2478265 (10GWicke) Info from @ottomata per SMS: > Schemas are cloned by puppet at either /etc or /srv event-schemas > Pre... 
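The de-pool/update/re-pool idea raised in the discussion above (including the "gradually depool/update 50%" variant) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not how scap or conftool actually behave: `depool`, `sync`, and `repool` are hypothetical callables standing in for whatever the real tooling would do, and `fraction` is an assumed tuning parameter.

```python
from typing import Callable, Iterable, List


def batches(hosts: List[str], fraction: float) -> Iterable[List[str]]:
    """Yield consecutive groups of hosts, each at most `fraction` of the fleet."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, int(len(hosts) * fraction))
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rolling_update(
    hosts: List[str],
    depool: Callable[[str], None],   # hypothetical: take a host out of rotation
    sync: Callable[[str], None],     # hypothetical: copy the new code to the host
    repool: Callable[[str], None],   # hypothetical: put the host back in rotation
    fraction: float = 0.5,
) -> List[str]:
    """Update hosts batch by batch so that no pooled host serves a half-synced tree.

    If `sync` raises, the exception propagates and the current batch stays
    depooled, which is the safe state for hosts with an unknown file tree.
    """
    updated: List[str] = []
    for group in batches(hosts, fraction):
        for host in group:
            depool(host)
        for host in group:
            sync(host)
        for host in group:
            repool(host)
        updated.extend(group)
    return updated
```

With five hosts and `fraction=0.5`, the batches are two, two, and one host, so at least three hosts stay pooled at any time. The trade-off is capacity: the smaller the fraction, the less traffic shifts onto the remaining hosts, but the longer the whole deployment takes.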
[00:42:42] !log Temporarily reducing compaction throughput to 10MB/s on restbase1013-c.eqiad.wmnet : T134016 [00:42:43] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [00:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:44:14] the apache part in sync-apache is just a naming thing at this point i assume? [00:44:31] !log dereckson@tin Finished scap: wmf-config/ upgrade: Gerrit changes 296770, 296767, 296929, 296930, 292623, 292624 (duration: 45m 42s) [00:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:44:38] mutante: yes, it means "all the server not master not proxies" [00:44:51] aha [00:45:18] James_F: changes live in prod [00:45:59] Dereckson: Thank you. [00:46:38] Logs look good. [00:48:34] bd808: ostriches: scap proxies are regular apaches (1211, 1216) or api (1201, 1280) or jobrunner (2080) etc.. all mixed [00:49:18] *nod* I think R.eedy was trying to pick jobrunners at one point [00:49:47] mostly we were trying to get one in each row then though [00:49:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [00:50:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [00:51:23] bd808: yes, you can tell it's per row .. from the comments [00:51:33] 4 - "mw1280.eqiad.wmnet" # A7 eqiad [00:51:33] 5 - "mw1211.eqiad.wmnet" # B7 eqiad [00:51:33] 6 - "mw1216.eqiad.wmnet" # B8 eqiad [00:56:30] (03PS6) 10Dzahn: planet: add phabricator releng blog feed [puppet] - 10https://gerrit.wikimedia.org/r/299867 [00:56:36] (03CR) 10Dzahn: [C: 032] planet: add phabricator releng blog feed [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [00:56:50] ostriches: so, are you not deploying https://gerrit.wikimedia.org/r/#/c/283243/ ? 
(just wondering) [00:58:16] I was waiting for swat to end. [00:58:28] (03PS4) 10Chad: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [00:58:43] ostriches: green light so [00:58:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:00:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:01:05] 06Operations, 10Traffic, 07HTTPS, 07Tracking: Requests for resources through a non-canonical address over HTTPS redirect to the canonical address on HTTP (tracking) - https://phabricator.wikimedia.org/T38952#2478299 (10Danny_B) [01:01:09] 06Operations, 10Wikimedia-Apache-configuration, 07Verified: Non-canonical HTTPS URLs quietly redirect to HTTP - https://phabricator.wikimedia.org/T33369#2478303 (10Danny_B) [01:01:26] 06Operations, 10Traffic, 07HTTPS: Requests for resources through a non-canonical address over HTTPS redirect to the canonical address on HTTP - https://phabricator.wikimedia.org/T38952#2478305 (10Danny_B) [01:09:06] (03CR) 10Chad: [C: 032] Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [01:09:42] (03Merged) 10jenkins-bot: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [01:13:12] !log demon@tin Synchronized wmf-config/abusefilter.php: Disable abusefilter profiling on commonswiki (duration: 00m 26s) [01:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:13:20] (03PS10) 10MaxSem: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: 
10Gehel) [01:19:13] !log osmium - after reinstall with jessie, did not boot with 4.4 kernel, _does_ boot with 3.16.04.. still jessie just booted manually into the older kernel in grub [01:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:19:45] !log osmium - revoke old puppet cert, salt-key .. sign new ones [01:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:20:04] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1195.62 seconds [01:20:57] MatmaRex: Hehe https://logstash.wikimedia.org/goto/bf2a44a61d1387fc6d23a838dfdb988f [01:22:14] (03PS2) 10Dzahn: osmium: copy data back from hafnium after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/299911 (https://phabricator.wikimedia.org/T132530) [01:22:49] (03CR) 10Dzahn: [C: 032] osmium: copy data back from hafnium after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/299911 (https://phabricator.wikimedia.org/T132530) (owner: 10Dzahn) [01:23:44] ostriches: pretty graph [01:26:02] gehel: on the salt master, neodymium, there are a bunch of unaccepted salt keys for elasticsearch [01:26:49] gehel: if they are all new, we can salt-key -a them [01:29:14] !log rhodium, new puppetmaster, add to salt [01:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:33:04] !log labstore1005, accepting salt key (reinstall 2016-06-25) [01:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:34:33] (03PS2) 10Chad: Remove SiteConfiguration::isLocalVHost() from test class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299906 [01:35:20] !log hafnium stopping rsyncd, deleting configs [01:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:35:59] (03CR) 10Chad: [C: 032] Remove SiteConfiguration::isLocalVHost() from test class [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/299906 (owner: 10Chad) [01:36:39] (03Merged) 10jenkins-bot: Remove SiteConfiguration::isLocalVHost() from test class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299906 (owner: 10Chad) [01:38:07] Danny_B: "merging" tasks on phabricator means the content of one of the tasks is gone [01:38:55] 06Operations, 06Release-Engineering-Team, 10Wikimedia-Apache-configuration, 07HHVM: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#2478415 (10Danny_B) [01:39:04] eh, nevermind [01:39:18] 06Operations, 06Release-Engineering-Team, 10Wikimedia-Apache-configuration, 07HHVM: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#760100 (10Danny_B) [01:39:42] !log demon@tin Synchronized tests/SiteConfiguration.php: for completeness (duration: 00m 24s) [01:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:42:39] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.37 seconds [01:44:05] be back in a while [01:50:12] mutante: ? [02:12:19] (03PS3) 10Chad: Delist Special:CodeReview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298544 (https://phabricator.wikimedia.org/T116948) (owner: 10Awight) [02:15:07] (03CR) 10Chad: [C: 032] Delist Special:CodeReview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298544 (https://phabricator.wikimedia.org/T116948) (owner: 10Awight) [02:15:47] (03Merged) 10jenkins-bot: Delist Special:CodeReview [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298544 (https://phabricator.wikimedia.org/T116948) (owner: 10Awight) [02:24:49] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.001 second response time on port 9042 [02:26:40] da heck? 
[02:26:42] 02:26:31 sync-file failed: Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap' [02:26:50] mwdeploy owns it, and I was the one to do it.... [02:28:34] thur we go [02:28:52] !log demon@tin Synchronized wmf-config/CommonSettings.php: delist codereview (duration: 00m 27s) [02:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:32] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.10) (duration: 08m 41s) [02:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:54] (03CR) 10Ori.livneh: [C: 031] prometheus: monitor hosts in the current site [puppet] - 10https://gerrit.wikimedia.org/r/299540 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [02:41:04] Oh, i18n. That was it [02:53:53] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 07m 55s) [02:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:00:43] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jul 20 03:00:43 UTC 2016 (duration 6m 50s) [03:00:44] (03CR) 10Alex Monk: [C: 04-1] "Designate doesn't want to create a zone with a forward slash, will use hyphen" [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [03:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:12] (03PS4) 10Alex Monk: Delegate 208.80.155.128/25 (labs instances) PTR records to labs-ns* so they can be managed automatically [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) [03:50:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [03:56:41] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 236 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791212/#!map [04:06:39] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:12:40] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:35:43] !log osmium edit grub config to boot second entry (3.16), update-grub, reboot [04:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:38:41] !log osmium result: boots into 4.4 kernel which would not work before.. lol [04:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:39:30] menu entry 2 then i guess [04:45:14] !log osmium - rsyncing /home, /srv (except /srv/mediawiki created by puppet) back from temp backup on hafnium [04:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:49:17] (03PS2) 10Dzahn: osmium: delete temp. migration class [puppet] - 10https://gerrit.wikimedia.org/r/299912 [04:52:21] (03CR) 10Dzahn: [C: 032] "it has done what it was for" [puppet] - 10https://gerrit.wikimedia.org/r/299912 (owner: 10Dzahn) [04:55:26] !log osmium package chromium-browser is missing after upgrade, referred to by jsbench [04:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:02:03] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2478616 (10Dzahn) reinstalled with jessie, then after first boot: 18:11 < mutante> "mdadm: No devices listed in conf file were found. 18:14 < mutante> ALERT! /dev/d... 
[05:03:41] 06Operations, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, and 5 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2478618 (10KartikMistry) [05:08:05] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2478625 (10Dzahn) @ori @TimStarling I rsynced -avp the entire /home and entire /srv (minus /srv/mediawiki which got recreated by puppet and was quite large temp over t... [05:15:13] (03PS1) 10MaxSem: Remove wmgUseContributionReporting, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299941 [05:15:15] (03PS1) 10MaxSem: Labs: remove wgRCWatchCategoryMembership - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299942 [05:15:17] (03PS1) 10MaxSem: Remove wgMFEnableBetaDiff, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299943 [05:15:19] (03PS1) 10MaxSem: Labs: remove wmgMFUseCentralAuthToken - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299944 [05:15:21] (03PS1) 10MaxSem: Labs: remove wmgEnableGeoData - matches prod now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299945 [05:15:23] (03PS1) 10MaxSem: Labs: remove wmgGeoDataDebug - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299946 [05:15:25] (03PS1) 10MaxSem: Labs: remove wmgUseCodeEditorForCore - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299947 [05:15:36] hahaha, I spam the world! 
[05:27:03] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2478641 (10Dzahn) [05:29:21] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2478642 (10Dzahn) osmium has software RAID1 now (from mw-raid1 partman recipe) [05:33:17] (03CR) 10Nemo bis: planet: add phabricator releng blog feed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [05:36:50] MaxSem: You call that spam‽, Time to up your game >.> [06:14:54] <_joe_> !log updating parsoid on wtp100[12] [06:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:15:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:16:36] <_joe_> uhm [06:20:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [06:21:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:24:28] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2478798 (10elukey) @hashar: thanks for the info! I was wondering why apache2log... [06:26:17] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2478806 (10Joe) @elukey I didn't add it on purpose, as I was pretty sure it was... 
[06:27:21] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::webserver: add conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/298939 [06:31:04] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:34] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0 [06:32:05] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:05] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:34] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:54] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:15] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:16] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:21] (03CR) 10Bmansurov: [C: 031] Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [06:37:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:40:35] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::webserver: add conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/298939 [06:42:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [06:43:16] RECOVERY - IPv6 
ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:45:07] (03Abandoned) 10ArielGlenn: lock wikis for dump runs by date, permitting runs across multiple dates [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274997 (https://phabricator.wikimedia.org/T126341) (owner: 10ArielGlenn) [06:45:22] <_joe_> aqs again [06:45:55] (03PS4) 10ArielGlenn: extend dumps cron job to run partial dumps as well [puppet] - 10https://gerrit.wikimedia.org/r/299527 (https://phabricator.wikimedia.org/T126339) [06:46:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [06:51:36] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::webserver: add conftool scripts [puppet] - 10https://gerrit.wikimedia.org/r/298939 (owner: 10Giuseppe Lavagetto) [06:53:02] _joe_ it might be restbase throttling, will double check later, thanks! [06:53:21] <_joe_> elukey: and it spits out 503s? [06:53:23] <_joe_> that's bad [06:53:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:53:54] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:55:50] (03PS5) 10ArielGlenn: extend dumps cron job to run partial dumps as well [puppet] - 10https://gerrit.wikimedia.org/r/299527 (https://phabricator.wikimedia.org/T126339) [06:56:35] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:45] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:57:00] _joe_ I think so because I don't see 50x in my dashboard, I might be wrong.. 
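The "N% of data above the critical threshold" wording in the 5xx alerts above suggests a check that counts what share of recent datapoints exceed a limit. A minimal sketch of that kind of logic follows. It assumes, without confirmation from the log, that 250.0 is a warning threshold, 1000.0 the critical one, and 1% the tolerated share of datapoints; the function name and return shape are purely illustrative and do not describe the real Icinga plugin.

```python
from typing import Optional, Sequence, Tuple


def classify(
    values: Sequence[Optional[float]],
    warn: float = 250.0,    # assumed warning threshold (taken from the OK message)
    crit: float = 1000.0,   # assumed critical threshold (taken from the CRITICAL message)
    max_pct: float = 1.0,   # assumed tolerated percentage of datapoints over a threshold
) -> Tuple[str, float]:
    """Return (state, percentage) for a window of reqs/min datapoints.

    None entries model missing datapoints and are ignored.
    """
    points = [v for v in values if v is not None]
    if not points:
        return ("UNKNOWN", 0.0)

    def pct_above(threshold: float) -> float:
        over = sum(1 for v in points if v > threshold)
        return 100.0 * over / len(points)

    crit_pct = pct_above(crit)
    if crit_pct >= max_pct:
        return ("CRITICAL", crit_pct)
    warn_pct = pct_above(warn)
    if warn_pct >= max_pct:
        return ("WARNING", warn_pct)
    return ("OK", warn_pct)
```

Under these assumptions, a single spike over 1000 in a ten-point window is enough to report 10.00% and go CRITICAL, which would match how briefly the alerts above fire and recover.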
[06:57:04] (03CR) 10ArielGlenn: [C: 032] extend dumps cron job to run partial dumps as well [puppet] - 10https://gerrit.wikimedia.org/r/299527 (https://phabricator.wikimedia.org/T126339) (owner: 10ArielGlenn) [06:57:05] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:36] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:58:25] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:45] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:47] (03PS1) 10ArielGlenn: make sure directory creation uses the dump cron job startdate [puppet] - 10https://gerrit.wikimedia.org/r/299951 (https://phabricator.wikimedia.org/T126339) [07:06:10] (03CR) 10ArielGlenn: [C: 032] make sure directory creation uses the dump cron job startdate [puppet] - 10https://gerrit.wikimedia.org/r/299951 (https://phabricator.wikimedia.org/T126339) (owner: 10ArielGlenn) [07:06:34] (03PS2) 10Giuseppe Lavagetto: mediawiki::conftool: add mw-pool [puppet] - 10https://gerrit.wikimedia.org/r/298940 [07:07:15] (03PS3) 10Giuseppe Lavagetto: mediawiki::conftool: add mw-pool [puppet] - 
10https://gerrit.wikimedia.org/r/298940 [07:14:11] (03PS1) 10ArielGlenn: make sure directory creation uses the dump cron job startdate, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/299952 [07:15:04] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::conftool: add mw-pool [puppet] - 10https://gerrit.wikimedia.org/r/298940 (owner: 10Giuseppe Lavagetto) [07:15:39] (03PS2) 10ArielGlenn: make sure directory creation uses the dump cron job startdate, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/299952 [07:17:59] (03CR) 10ArielGlenn: [C: 032] make sure directory creation uses the dump cron job startdate, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/299952 (owner: 10ArielGlenn) [07:28:08] !log restarting eventbus on kafka100[12] (T140848) [07:28:09] T140848: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848 [07:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:34:23] (03CR) 10Jcrespo: "Why are you connecting to puppet's database instead of using puppet itself? Seems like a violation of a separation of concerns." [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [07:38:24] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on port 9042 [07:39:36] (03PS1) 10ArielGlenn: dumps: disable specific run settings only if they are enabled [dumps] - 10https://gerrit.wikimedia.org/r/299953 [07:40:48] (03CR) 10ArielGlenn: [C: 032] dumps: disable specific run settings only if they are enabled [dumps] - 10https://gerrit.wikimedia.org/r/299953 (owner: 10ArielGlenn) [08:20:25] 06Operations, 10EventBus, 07Regression, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." 
- https://phabricator.wikimedia.org/T140848#2477982 (10elukey) For the record I am adding in here what me and Marko did yesterday to manually force a SIGHUP to eventb... [08:27:43] 06Operations, 06Release-Engineering-Team, 10Wikimedia-Apache-configuration, 07HHVM: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#2479025 (10hashar) [08:32:05] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2479027 (10hashar) related: @Neil_P._Quinn_WMF has overhauled the wikitech page [[ htt... [08:33:39] 06Operations, 10EventBus, 07Regression, 15User-mobrovac, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2479029 (10mobrovac) 05Open>03Resolved a:03mobrovac After the restart, there are no more events b... [08:40:00] 06Operations, 10EventBus, 06Services, 15User-mobrovac: EventBus Proxy Service Doesn't Handle SIGHUP Correctly - https://phabricator.wikimedia.org/T140868#2479063 (10mobrovac) [08:47:23] off to the beach, back in two hours [08:48:42] 06Operations, 10EventBus, 06Services, 15User-mobrovac: EventBus Proxy Service Doesn't Handle SIGHUP Correctly - https://phabricator.wikimedia.org/T140868#2479113 (10mobrovac) [08:49:16] 06Operations, 10EventBus, 07Regression, 15User-mobrovac, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2477982 (10mobrovac) The follow-up ticket is {T140870}. 
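For context on the SIGHUP follow-up task above, a common pattern for reload-on-SIGHUP in a long-running service is sketched below. The log does not say what the EventBus service reloads or how it is implemented, so everything here is an assumption: `ReloadOnHup` and its `load` callback are hypothetical names, and the loaded state is modelled as a plain dictionary.

```python
import signal
from typing import Callable, Dict


class ReloadOnHup:
    """Reload some state (hypothetically: on-disk schemas) when SIGHUP arrives.

    The signal handler only sets a flag; the actual reload happens when the
    service's main loop calls maybe_reload(), so no I/O runs inside the handler.
    """

    def __init__(self, load: Callable[[], Dict[str, str]]) -> None:
        self._load = load
        self.current = load()      # initial load at startup
        self._pending = False

    def install(self) -> None:
        # Register the handler; must be called from the main thread (Unix only).
        signal.signal(signal.SIGHUP, self._on_hup)

    def _on_hup(self, signum, frame) -> None:
        self._pending = True

    def maybe_reload(self) -> bool:
        """Reload if a SIGHUP was seen since the last call; return True if reloaded."""
        if not self._pending:
            return False
        self._pending = False
        self.current = self._load()
        return True
```

A service built this way would call `install()` once at startup and `maybe_reload()` between requests or on each loop iteration. If a handler like this is missing or misbehaves, a full restart is the usual fallback, as in the `!log` entry for kafka100[12] earlier in the channel.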
[08:52:16] (03CR) 10Filippo Giunchedi: "LGTM, just some nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [09:02:59] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 1 failures [09:13:06] 06Operations, 06Discovery, 06Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2479153 (10Gehel) Connectivity issue continues after switching cable. We'll need to find another cause. @Cmjohnson sorry for the bother! [09:17:38] (03PS2) 10Filippo Giunchedi: puppetmaster: generate prometheus targets from ganglia [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) [09:17:40] (03PS2) 10Filippo Giunchedi: prometheus: monitor hosts in the current site [puppet] - 10https://gerrit.wikimedia.org/r/299540 (https://phabricator.wikimedia.org/T126785) [09:17:44] (03PS5) 10Elukey: Add more to stats:wmde config [puppet] - 10https://gerrit.wikimedia.org/r/298931 (owner: 10Addshore) [09:22:49] (03CR) 10Filippo Giunchedi: "thanks Ori for the review!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:23:23] (03CR) 10Elukey: [C: 032] "Added fake data to the labs private repo, pcc looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/298931 (owner: 10Addshore) [09:24:53] (03CR) 10Filippo Giunchedi: "Jaime: the script will get eventually used from within puppet with generate() but afaik there's no sane way to do the same with exported r" [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [09:25:11] (03PS3) 10Filippo Giunchedi: nutcracker: default verbosity to 4 [puppet] - 10https://gerrit.wikimedia.org/r/299146 (https://phabricator.wikimedia.org/T136078) [09:25:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] nutcracker: default verbosity to 4 [puppet] - 10https://gerrit.wikimedia.org/r/299146 (https://phabricator.wikimedia.org/T136078) (owner: 10Filippo Giunchedi) [09:26:29] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:31:13] 06Operations, 10EventBus, 06Services, 15User-mobrovac: EventBus Proxy Service Doesn't Handle SIGHUP Correctly - https://phabricator.wikimedia.org/T140868#2479198 (10Gehel) p:05Triage>03High Changing priority as high. This will bite us again. [09:32:57] (03PS1) 10Elukey: Add special Cassandra compaction configs for aqs100[456] [puppet] - 10https://gerrit.wikimedia.org/r/299956 [09:36:52] 06Operations, 13Patch-For-Review: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2479202 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I'm tentatively resolving this @Mattflaschen-WMF though the actual fix was defaulting to lower verbo... 
[09:37:37] (03CR) 10Elukey: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/3402/" [puppet] - 10https://gerrit.wikimedia.org/r/299956 (owner: 10Elukey) [09:40:04] (03CR) 10Addshore: [C: 031] Labs: remove wgRCWatchCategoryMembership - matches prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299942 (owner: 10MaxSem) [09:44:00] 06Operations, 10DBA, 06Revision-Scoring-As-A-Service, 07Blocked-on-schema-change, and 3 others: Remove oresc_rev index - https://phabricator.wikimedia.org/T140803#2479251 (10Gehel) p:05Triage>03Normal [09:44:40] (03CR) 10Joal: [C: 031] "LGTM ! Thanks Luca :)" [puppet] - 10https://gerrit.wikimedia.org/r/299956 (owner: 10Elukey) [09:48:36] (03CR) 10Elukey: [C: 032] "cassandra base::service has refresh => false, merging and then restarting cassandra instances on aqs100[456] later on." [puppet] - 10https://gerrit.wikimedia.org/r/299956 (owner: 10Elukey) [09:50:19] (03PS1) 10Filippo Giunchedi: prometheus: fix ferm::service and include node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/299958 [09:52:47] hi [09:52:50] is tools-static one of yours [09:52:52] ? [09:53:03] because it's been a bit erratic performance wise for a few days [09:57:27] (03PS1) 10Gehel: Add Bryan to labtest roots. [puppet] - 10https://gerrit.wikimedia.org/r/299959 (https://phabricator.wikimedia.org/T140830) [09:57:34] (03PS2) 10Filippo Giunchedi: prometheus: fix ferm::service and include node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/299958 [09:57:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] prometheus: fix ferm::service and include node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/299958 (owner: 10Filippo Giunchedi) [09:59:10] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to labtest root for bd808 - https://phabricator.wikimedia.org/T140830#2479283 (10Gehel) p:05Triage>03Normal Patch uploaded, to merge after approval. 
[09:59:52] (03PS1) 10Giuseppe Lavagetto: service::node: add git as deployment method [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) [09:59:58] <_joe_> mobrovac: ^^ [10:03:59] Good morning, jynus :) [10:04:05] Another 503 error on the same URL [10:04:15] https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930 [10:04:35] abian, at this point you should file a bug on phabricator [10:04:51] indicate the url and when those happened [10:05:10] and any other information that could be useful [10:05:30] I cannot figure out exactly when, really [10:05:43] Sometimes it happens, and sometimes it doesn't [10:06:09] put the most acurate time and that that you can say when they happened [10:06:12] do fill a bug and it will be triaged / figured out eventually :-] [10:06:21] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail [10:07:04] abian: there must be an error on the servers, so if we get a task one will be able to look at the log and see what is happening there [10:13:05] <_joe_> that page renders correctly for me atm [10:13:28] got me a 503 [10:13:36] 503 as well [10:13:50] I saw some memcache errors yesterday [10:16:02] on esams I've a 200 [10:16:13] _joe_: I have removed from the beta cluster puppet the cherry pick of https://gerrit.wikimedia.org/r/#/c/258979/ "mediawiki: add conftool-specifc credentials and scripts" abandonned/superseeded and caused a conflict [10:16:33] <_joe_> hashar: yeah, sorry, I forgot to remove it [10:16:44] no problems :) [10:16:46] "Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR" [10:17:04] (03PS2) 10Giuseppe Lavagetto: puppetmaster: declare NameVirtualHost where expected [puppet] - 10https://gerrit.wikimedia.org/r/299752 [10:18:12] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: declare NameVirtualHost where expected [puppet] - 
10https://gerrit.wikimedia.org/r/299752 (owner: 10Giuseppe Lavagetto) [10:19:21] mobrovac: do you mind if I remove from beta the cherry pick of "Parsoid: Move to service::node" https://gerrit.wikimedia.org/r/#/c/298436/ ? [10:19:41] mobrovac: it has a bunch of conflicts with recently merged parsoid patches [10:24:03] (03CR) 10MarcoAurelio: "@Jforrester: Yes, please see - it's called populateShortUrlTable.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298344 (https://phabricator.wikimedia.org/T138507) (owner: 10Jforrester) [10:28:06] (03PS3) 10Giuseppe Lavagetto: puppetmaster: declare NameVirtualHost where expected [puppet] - 10https://gerrit.wikimedia.org/r/299752 [10:28:14] <_joe_> hashar: yes, remove it [10:28:46] _joe_: done thx :) [10:29:51] (03CR) 10Hashar: "I have removed the cherry pick from the beta cluster. It conflicted with a few other Parsoid related changes." [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [10:30:03] <_joe_> hashar: what's the name of the parsoid machines in deployment-prep? [10:30:30] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:31:19] <_joe_> ok found them [10:33:51] 06Operations: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2479387 (10abian) [10:36:34] godog: thank you to have lowered the default nutcracker verbosity :-} [10:38:59] hashar: np! [10:44:35] 06Operations, 10MediaWiki-General-or-Unknown: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2479430 (10jcrespo) ``` { "_index": "logstash-2016.07.20", "_type": "mediawiki", "_id": "AVYHyECFhDxk9Z8yut3T", "_score": null, "_source": { "messa... 
[10:52:19] (03PS2) 10MarcoAurelio: Configuration changes for he.wikinews.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299446 (https://phabricator.wikimedia.org/T140544) [10:52:20] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: puppet fail [10:55:39] (03PS2) 10Giuseppe Lavagetto: service::node: add git as deployment method [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) [11:00:35] (03CR) 10Giuseppe Lavagetto: "quick check" [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [11:02:28] (03PS1) 10Jcrespo: Add fake prometheus mysql password [labs/private] - 10https://gerrit.wikimedia.org/r/299969 (https://phabricator.wikimedia.org/T128185) [11:03:20] (03CR) 10Jcrespo: [C: 032 V: 032] Add fake prometheus mysql password [labs/private] - 10https://gerrit.wikimedia.org/r/299969 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [11:04:21] jynus: https://gerrit.wikimedia.org/r/#/c/299827/ [11:04:22] hey [11:04:49] hey [11:07:36] I'm trying to run it on enwiki in beta cluster [11:07:53] great, test it extensively [11:08:27] sure [11:08:33] the important thing when dropping indexes is that they are no longer used [11:08:38] there are tools for that [11:09:28] (03CR) 10Filippo Giunchedi: [C: 04-1] puppetmaster: declare NameVirtualHost where expected (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299752 (owner: 10Giuseppe Lavagetto) [11:10:49] (03PS1) 10Filippo Giunchedi: site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646) [11:12:34] so it seems this guy, jaime already had done most of the work regarding mysql and prometheus [11:12:45] I only have to deploy it! [11:13:07] oh yeah? 
nice :D [11:13:21] (03PS3) 10Jcrespo: [WIP]New user for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) [11:13:56] (03PS4) 10Jcrespo: New user for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) [11:13:57] 06Operations, 10EventBus, 07Regression, 15User-mobrovac, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2479487 (10Ottomata) Aye yai yai, sorry yall! Thanks for responding. Will be back at work tomorrow an... [11:16:27] (03PS1) 10Filippo Giunchedi: prometheus: use DOMAIN_NETWORKS not INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/299971 [11:17:55] (03CR) 10Jcrespo: [C: 032] New user for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/280939 (https://phabricator.wikimedia.org/T128185) (owner: 10Jcrespo) [11:18:05] jynus: http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:RecentChanges&hidenondamaging=1 works fine [11:18:10] I'm testing other parts too [11:18:23] (the enwiki doesn't have index but other ones have) [11:18:40] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:18:57] Amir1, I prefer if you generate a summary and copy it on the ticket [11:19:06] jynus: I was going to say, LGTM but WITH MAX_USER_CONNECTIONS is missing [11:19:07] after you do all tests [11:19:20] okay [11:19:30] godog, how many connections does it use? [11:19:35] more than 1? 
[11:19:44] (03PS3) 10Yuvipanda: tools: Fix webservice toolschecker check [puppet] - 10https://gerrit.wikimedia.org/r/299831 [11:19:46] (03PS1) 10Yuvipanda: tools: Make toolschecker webservice actions non silent [puppet] - 10https://gerrit.wikimedia.org/r/299972 [11:19:46] or are you saying "just in case" [11:19:56] jynus: yeah just in case, say max 5 [11:20:08] it uses a connection each time metrics are requested iirc [11:20:10] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix webservice toolschecker check [puppet] - 10https://gerrit.wikimedia.org/r/299831 (owner: 10Yuvipanda) [11:20:14] I say on purpose that we can update the grants later [11:20:27] as we do not yet know all collectors we will enable [11:20:45] let's enable it now [11:20:49] and we will tune [11:20:58] it later [11:21:03] jynus ok to merge your change? [11:21:06] (on palladium that is) [11:21:06] in fact, by testing how many connections [11:21:07] it doesn't depend on the number of collectors, but sure that works too for now jynus, thanks! [11:21:20] Guest9334, yes, I was about to do it, whoever you are [11:21:25] :-) [11:21:37] :D [11:21:40] (03CR) 10jenkins-bot: [V: 04-1] tools: Make toolschecker webservice actions non silent [puppet] - 10https://gerrit.wikimedia.org/r/299972 (owner: 10Yuvipanda) [11:21:47] in fact I had the Merge these changes? (yes/no)? in front of me [11:22:28] godog, as we are not going to enable it all at once [11:22:45] I can monitor its behaviour on passive hosts without caps [11:22:55] e.g. 
check resources [11:23:08] and then we harden it [11:23:28] (03PS2) 10Yuvipanda: tools: Make toolschecker webservice actions non silent [puppet] - 10https://gerrit.wikimedia.org/r/299972 [11:24:32] jynus: yep that works too, a connection happens every time a client requests metrics via http btw [11:24:36] I have reached a philosophy of "merge fast" (fast != risky) if things are better than what we have now, and continue doing small tunings [11:25:10] there is also a lot of work pending related to account handling [11:25:23] so bit by bit [11:25:51] if this makes you feel better- that commit only updates a text file- it does not make it go live [11:26:29] hehe it does -- thanks! [11:27:04] puppet has prohibited touching mysql [11:27:29] jynus: I'm about to go for lunch, if you have time later we can chat about the db roles and puppet and how to generate the prometheus config [11:27:35] yes [11:27:38] definitelly [11:27:45] I need to get up to date [11:27:54] now that I finally was able to start working on this [11:28:14] so now you will have my 25% of attention all for you [11:28:32] hahaha nice, we can chat in chunks of 15 min [11:30:10] I will have to drop temporary account on db2070 [11:30:16] we will talk later [11:31:24] (03CR) 10Yuvipanda: [C: 032] tools: Make toolschecker webservice actions non silent [puppet] - 10https://gerrit.wikimedia.org/r/299972 (owner: 10Yuvipanda) [11:31:41] (03PS2) 10Yuvipanda: add-ldap-user: Don't use sillyshell, it's silly (and doesn't exist anymore) [puppet] - 10https://gerrit.wikimedia.org/r/299812 (https://phabricator.wikimedia.org/T86668) (owner: 10Chad) [11:31:48] (03CR) 10Yuvipanda: [C: 032 V: 032] add-ldap-user: Don't use sillyshell, it's silly (and doesn't exist anymore) [puppet] - 10https://gerrit.wikimedia.org/r/299812 (https://phabricator.wikimedia.org/T86668) (owner: 10Chad) [11:32:39] (03PS5) 10Yuvipanda: Include nova mysql password in novaenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/299590 
(https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [11:32:54] (03CR) 10Yuvipanda: [C: 032 V: 032] Include nova mysql password in novaenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/299590 (https://phabricator.wikimedia.org/T139272) (owner: 10Andrew Bogott) [11:47:59] 06Operations, 10DBA, 06Revision-Scoring-As-A-Service, 07Blocked-on-schema-change, and 3 others: Remove oresc_rev index - https://phabricator.wikimedia.org/T140803#2479599 (10Ladsgroup) Here's one of queries: ``` mysql> explain SELECT /* SpecialRecentChanges::doMainQuery Someone */ rc_id,rc_timestamp,rc_us... [11:48:26] jynus: https://phabricator.wikimedia.org/T140803#2479599 [11:54:14] (03PS3) 10Dereckson: Deploy the Kartographer extension to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298042 (https://phabricator.wikimedia.org/T139787) (owner: 10Tpt) [11:55:36] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2479613 (10abian) [11:57:16] (03CR) 10Dereckson: [C: 031] "PS3: added task reference" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298042 (https://phabricator.wikimedia.org/T139787) (owner: 10Tpt) [12:01:05] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (10elukey) Afaik we should restart from https://wikitech.wikimedia.org/wiki/Production_shell_access#Requesting_access [12:03:20] (03CR) 10BBlack: [C: 031] cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [12:03:36] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2479613 (10Peachey88) Why should the #operations team put energy into monitoring sites that the foundation doesn't operate? 
[12:05:13] (03PS1) 10Ladsgroup: Beta: Test OresEnabledNamespaces on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299976 [12:05:27] (03PS3) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [12:05:56] (03CR) 10jenkins-bot: [V: 04-1] Beta: Test OresEnabledNamespaces on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299976 (owner: 10Ladsgroup) [12:07:54] (03PS1) 10ArielGlenn: don't write run settings for createdir jobs [dumps] - 10https://gerrit.wikimedia.org/r/299989 [12:09:01] (03CR) 10ArielGlenn: [C: 032] don't write run settings for createdir jobs [dumps] - 10https://gerrit.wikimedia.org/r/299989 (owner: 10ArielGlenn) [12:09:53] (03PS2) 10Ladsgroup: Beta: Test OresEnabledNamespaces on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299976 [12:17:55] <_joe_> !log disabling puppet on all parsoid hosts for the transition to service-runner T90668 [12:17:56] T90668: Replace custom server.js with service-runner - https://phabricator.wikimedia.org/T90668 [12:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:03] !log oblivian@palladium conftool action : set/pooled=no; selector: name= [12:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:23] <_joe_> ouch [12:22:28] (03CR) 10Mobrovac: [C: 04-1] "Question / concern in-lined, otherwise - exactly what I had in mind too." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [12:23:23] !log oblivian@palladium conftool action : set/pooled=no; selector: name=wtp2001.codfw.wmnet [12:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:59] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2479695 (10Danny_B) [12:24:55] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2479698 (10Gehel) p:05Triage>03Low The main monitoring that we have is not public, so there is not much use in monitoring sites not managed by WMF. https://status.wikimedia.o... [12:25:55] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2465141 (10elukey) @Brentjoseph we'd also need manager's approval before proceeding, thanks! [12:26:13] <_joe_> !log transition ongoing on wtp2001 [12:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:26:37] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2465026 (10elukey) Thanks @Jksamra, now we'd need manager's approval to proceed! 
[12:27:26] (03PS3) 10Giuseppe Lavagetto: parsoid: move to role::parsoid for all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/299718 (https://phabricator.wikimedia.org/T90668) [12:27:36] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2463390 (10elukey) >>! In T140399#2472630, @Mpany wrote: > Do you need anything else? Yes we'd need m... [12:27:49] <_joe_> mobrovac: I'm merging ^^ [12:27:58] yup, ok [12:28:05] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] parsoid: move to role::parsoid for all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/299718 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [12:28:10] <_joe_> and applying on wtp2001 [12:29:33] (03Abandoned) 10Mobrovac: Parsoid: Move to service::node [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [12:29:43] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2479713 (10elukey) My bad, I believe that Daniel was referencing @Addshore's manager a... [12:29:45] k [12:29:57] PROBLEM - Parsoid on wtp2001 is CRITICAL: Connection refused [12:30:12] <_joe_> expected [12:31:04] _joe_: pulling from the target works today? 
[12:31:10] <_joe_> yes [12:31:19] <_joe_> the problem was with git-upload-pack [12:31:36] k [12:31:42] <_joe_> but honestly, whatever [12:32:08] <_joe_> wtp2001 done, mobrovac [12:32:21] <_joe_> I'd take a look at its status and then move on [12:32:27] (03PS1) 10Ladsgroup: Let bureaucrats in fawiki remove sysop user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299992 (https://phabricator.wikimedia.org/T140810) [12:32:31] i can see parsoid running [12:32:34] looking at the logs [12:32:36] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2479717 (10elukey) @Brentjoseph we'd need formal manager's approval statement in this task, thanks! [12:33:05] (03CR) 10jenkins-bot: [V: 04-1] Let bureaucrats in fawiki remove sysop user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299992 (https://phabricator.wikimedia.org/T140810) (owner: 10Ladsgroup) [12:34:20] _joe_: ok, looking good, let's proceed [12:34:26] <_joe_> ok [12:34:28] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10Dereckson) [12:34:45] RECOVERY - Parsoid on wtp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.081 second response time [12:34:56] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479759 (10Dereckson) [12:35:00] !log oblivian@palladium conftool action : set/pooled=yes; selector: name=wtp2001.codfw.wmnet [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:09] 
<_joe_> !log transitioning wtp2002-2010 [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:31] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2479761 (10Dereckson) 05Open>03Resolved a:03Dereckson Error disappears from the logs. Another issue has been spotted... [12:36:13] 06Operations, 10Deployment-Systems, 03Scap3, 15User-bd808: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2479780 (10Dereckson) a:05Dereckson>03bd808 [12:37:28] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2479786 (10Addshore) I'm not really sure what manager that would refer to as I do not... [12:39:41] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2479788 (10elukey) >>! In T140342#2479786, @Addshore wrote: > I'm not really sure what... 
[12:41:35] (03PS1) 10Yuvipanda: tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) [12:42:53] (03CR) 10jenkins-bot: [V: 04-1] tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) (owner: 10Yuvipanda) [12:43:34] (03PS2) 10Hashar: contint: APPEND unattended upgrade allowed-origins [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) [12:43:56] (03PS2) 10Yuvipanda: tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) [12:44:04] (03PS2) 10Ladsgroup: Let bureaucrats in fawiki remove sysop user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299992 (https://phabricator.wikimedia.org/T140810) [12:44:12] (03PS1) 10ArielGlenn: fix typo in runner settings reference [dumps] - 10https://gerrit.wikimedia.org/r/299995 [12:47:27] (03CR) 10jenkins-bot: [V: 04-1] tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) (owner: 10Yuvipanda) [12:47:40] (03PS1) 10Dereckson: Allow maintenante scripts to work on wikitech without private settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) [12:48:24] (03CR) 10Dereckson: [C: 04-1] "We should first evaluate the content of these settings and what happens when scripts run without them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [12:48:28] (03PS4) 10Yuvipanda: tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) [12:49:45] (03CR) 10jenkins-bot: [V: 04-1] tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) (owner: 10Yuvipanda) [12:52:06] (03PS1) 10BBlack: ssl_ciphersuite: drop non-FS AES256 options [puppet] - 10https://gerrit.wikimedia.org/r/299997 (https://phabricator.wikimedia.org/T118181) [12:54:55] (03CR) 10Hashar: [C: 031] "Cherry picked on CI puppetmaster. Confirmed to upgrade properly. I have added this change to the July 21st puppet swat" [puppet] - 10https://gerrit.wikimedia.org/r/298568 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [12:55:33] (03PS7) 10Hashar: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) [12:55:39] Amir1, you do not need to convince me for that to be done [12:55:46] I am already convinced [12:55:56] (03CR) 10Hashar: [C: 031] "Added to July 21st puppet swat" [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: 10Hashar) [12:56:15] jynus: thanks, I'm kinda lost for the next steps [12:56:29] the only thing you would need to convince me of is that this is an emergency causing immediate issues- if not, it will happen as soon as possible (but there are other schema changes pending, too) [12:56:57] so just wait and relax :-) [12:57:11] it's just a matter of storage for wikidata [12:57:27] continue investigating optimizations (I think there were some improvements pending [12:57:29] etc. 
[12:58:03] so, as I think it is not top priority, it will be sent to my queue and eventually be processed [12:58:18] no need for you to do anything else [12:58:35] okay [12:58:39] thanks for the confirmation :) [12:58:41] schema changes are dangerous, so they cannot just be applied without preparation [12:58:54] (03CR) 10Jhernandez: "Related docs:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [12:59:01] (03CR) 10Jhernandez: [C: 031] Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [12:59:16] (03PS5) 10Yuvipanda: tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) [13:05:12] (03PS1) 10Yuvipanda: grafana: Make role explicitly reference production secrets [puppet] - 10https://gerrit.wikimedia.org/r/299999 (https://phabricator.wikimedia.org/T120295) [13:05:33] (03PS6) 10Yuvipanda: tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) [13:05:43] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add a kubernetes diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/299993 (https://phabricator.wikimedia.org/T140887) (owner: 10Yuvipanda) [13:05:58] (03PS2) 10Yuvipanda: grafana: Make role explicitly reference production secrets [puppet] - 10https://gerrit.wikimedia.org/r/299999 (https://phabricator.wikimedia.org/T120295) [13:08:46] (03PS1) 10ArielGlenn: remove dump monitor from snapshot1004 [puppet] - 10https://gerrit.wikimedia.org/r/300002 [13:11:19] godog I'm refactoring bits of the grafana role now to make a labs specific one... 
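[editor's sketch] The Parsoid rollout logged from 12:17 onwards follows a canary-then-batches pattern: wtp2001 is depooled with conftool, the change is applied, the Icinga check goes CRITICAL and then recovers, the host is repooled, and only then does the work move on to wtp2002-2010. A minimal Python sketch of that control flow follows. It is an assumption-laden illustration, not the tooling used in the channel: `rolling_transition` and its four callbacks are hypothetical names, and the log only shows explicit pool/depool commands for wtp2001.

```python
def rolling_transition(batches, depool, apply_change, is_healthy, repool):
    """Depool, change, verify and repool hosts one batch at a time.

    `batches` is a list of host lists, e.g. a single canary host first.
    All four callbacks are hypothetical hooks supplied by the caller.
    A batch is repooled only if every host in it passes `is_healthy`;
    otherwise the run stops and the whole batch stays depooled.
    Returns (repooled_hosts, unhealthy_hosts).
    """
    repooled = []
    for batch in batches:
        for host in batch:
            depool(host)          # take the host out of rotation first
        for host in batch:
            apply_change(host)    # e.g. run the configuration change
        unhealthy = [host for host in batch if not is_healthy(host)]
        if unhealthy:
            # Stop here so later batches are never touched.
            return repooled, unhealthy
        for host in batch:
            repool(host)
        repooled.extend(batch)
    return repooled, []
```

Starting with a one-host batch keeps the blast radius small: if the canary fails its health check, nothing else has been depooled or changed.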
[13:12:24] (03CR) 10ArielGlenn: [C: 032] remove dump monitor from snapshot1004 [puppet] - 10https://gerrit.wikimedia.org/r/300002 (owner: 10ArielGlenn) [13:12:44] <_joe_> !log transitioning wtp2011-2020 [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:14] someone's unmerged stuff is on palladium [13:13:17] YuviPanda: nice! I'd be happy to review [13:13:17] yuvi? can I merge? [13:13:23] apergos whops, yes [13:13:26] k [13:13:40] done [13:13:42] godog ok! [13:13:46] (03PS1) 10Elukey: Add new user 'hjiang' for Helen Jiang [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) [13:15:29] !log cr2-eqiad: setting VRRP priority to 50 for all subnets, effectively switching the VRRP master to cr1-eqiad [13:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:27] 06Operations, 10Cassandra, 10Mobile-Content-Service, 06Services: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314#2479928 (10Eevans) [13:18:35] (03PS2) 10Filippo Giunchedi: prometheus: use DOMAIN_NETWORKS not INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/299971 [13:18:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] prometheus: use DOMAIN_NETWORKS not INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/299971 (owner: 10Filippo Giunchedi) [13:20:21] (03PS1) 10Yuvipanda: grafana: Mark role explicitly as production [puppet] - 10https://gerrit.wikimedia.org/r/300004 (https://phabricator.wikimedia.org/T120295) [13:20:23] (03PS1) 10Yuvipanda: grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) [13:20:50] godog ok, https://gerrit.wikimedia.org/r/#/c/299999/ https://gerrit.wikimedia.org/r/#/c/300004/ and https://gerrit.wikimedia.org/r/#/c/300005/ [13:20:52] should all be no-ops [13:20:53] PROBLEM - puppetmaster https on palladium is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [13:20:53] PROBLEM - Parsoid on wtp2014 is CRITICAL: Connection refused [13:21:01] joe ^ you? [13:21:04] PROBLEM - Parsoid on wtp2013 is CRITICAL: Connection refused [13:21:55] <_joe_> parsoid, yes [13:21:59] <_joe_> puppet, no [13:22:05] <_joe_> getting on palladium now [13:22:08] ok [13:22:24] YuviPanda: heheh almost got 300k [13:23:00] godog yeah :) [13:23:12] (03CR) 10jenkins-bot: [V: 04-1] grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:23:18] <_joe_> nothing getting to strontium either [13:23:22] <_joe_> as in no requests [13:23:55] no neon failures yet [13:24:33] <_joe_> ok strontium was marked as failed by proxybalancer [13:24:42] <_joe_> which killed palladium too [13:24:43] <_joe_> nice [13:24:44] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 378 bytes in 1.653 second response time [13:24:54] RECOVERY - Parsoid on wtp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.102 second response time [13:25:04] RECOVERY - Parsoid on wtp2013 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.085 second response time [13:25:16] killed due to overload? [13:25:29] <_joe_> bblack: classical thundering herd it seems [13:25:37] <_joe_> but I am looking at strontium logs now [13:26:30] <_joe_> but I can't really get what happened there [13:27:39] but if there was overload, wouldn't we get lots of puppet errors? [13:28:07] (e.g. 
last time the db from puppet failed, all hosts started complaining) [13:28:39] !log Restarting restbase1008-a.eqiad.wmnet to apply a (ephemeral) 7200000ms streaming timeout : T138314 [13:28:40] T138314: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314 [13:28:41] (03PS4) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [13:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:56] (03PS1) 10ArielGlenn: set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 [13:29:59] !log Performing rolling RESTBase restart to work-around Cassandra instance restart fallout : T138314 and T138314 [13:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:30:23] T138314: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314 [13:34:32] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2479963 (10abian) I'm not talking about a sophisticated system, but about a simple `curl --head` that gets the status code for every site. That's something I can run on Tool Labs... 
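[editor's sketch] The T140884 comment above proposes nothing more than a `curl --head` per site that records the status code. A rough Python equivalent using only the standard library might look like the following. The function names and the choice to treat 5xx and unreachable sites as failures are assumptions made for illustration; unlike plain `curl --head`, `urllib` follows redirects, so the code reported is the final one.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10.0):
    """Send a HEAD request and return the HTTP status code.

    Returns None when no HTTP response arrives at all
    (DNS failure, refused connection, timeout).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx answers still carry a status code worth reporting.
        return error.code
    except (urllib.error.URLError, OSError):
        return None


def check_sites(urls, fetch_status=head_status):
    """Map each URL to its status code using the given fetch function."""
    return {url: fetch_status(url) for url in urls}


def failing(statuses):
    """Return the sorted URLs that were unreachable or answered with a 5xx."""
    return sorted(
        url for url, code in statuses.items() if code is None or code >= 500
    )
```

Passing `fetch_status` as a parameter keeps the checker testable without network access; a caller would normally leave it at the default and feed `failing(check_sites(urls))` into whatever reporting they already have.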
[13:35:38] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2479965 (10hashar) [13:35:48] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: Connection refused [13:35:59] <_joe_> uh urandom it's you I guess [13:36:30] yup [13:36:48] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: Allow RB to be programmatically pooled/depooled during restarts - https://phabricator.wikimedia.org/T140895#2479978 (10mobrovac) [13:37:01] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: Allow RB to be programmatically pooled/depooled during restarts - https://phabricator.wikimedia.org/T140895#2479990 (10mobrovac) p:05Triage>03High [13:37:09] (03PS2) 10Yuvipanda: grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) [13:37:10] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: Connection refused eevans Restarting [13:37:41] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2479613 (10BBlack) Nothing is effortless - our cognitive and alerting load is high as it is. The problem with operations monitoring sites we don't control is that there isn't an... [13:38:12] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: Allow RB to be programmatically pooled/depooled during restarts - https://phabricator.wikimedia.org/T140895#2479999 (10Joe) @mobrovac which user on the rb-systems should be able to repool/depool the server itself? Once I have that info, it's a 10 minu... [13:38:35] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, remember to add to private.git too unless you've done it already!" 
[puppet] - 10https://gerrit.wikimedia.org/r/299999 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:38:58] (03CR) 10jenkins-bot: [V: 04-1] grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:39:11] (03CR) 10Filippo Giunchedi: [C: 031] grafana: Mark role explicitly as production [puppet] - 10https://gerrit.wikimedia.org/r/300004 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:40:16] (03PS1) 10Yuvipanda: Add new secrets for grafana labs / prod instances [labs/private] - 10https://gerrit.wikimedia.org/r/300007 (https://phabricator.wikimedia.org/T120295) [13:41:05] (03PS3) 10Yuvipanda: grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) [13:41:27] (03PS2) 10ArielGlenn: set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 [13:41:45] (03PS2) 10Yuvipanda: Add new secrets for grafana labs / prod instances [labs/private] - 10https://gerrit.wikimedia.org/r/300007 (https://phabricator.wikimedia.org/T120295) [13:41:53] (03CR) 10Yuvipanda: [C: 032 V: 032] Add new secrets for grafana labs / prod instances [labs/private] - 10https://gerrit.wikimedia.org/r/300007 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:42:03] (03CR) 10Ema: "The tests should be ran as follows on v4:" [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [13:42:39] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet last ran 23 hours ago [13:42:59] (03PS3) 10Yuvipanda: grafana: Make role explicitly reference production secrets [puppet] - 10https://gerrit.wikimedia.org/r/299999 (https://phabricator.wikimedia.org/T120295) [13:43:08] (03CR) 10Yuvipanda: [C: 032 V: 032] grafana: Make role explicitly reference production 
secrets [puppet] - 10https://gerrit.wikimedia.org/r/299999 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:43:54] (03PS4) 10Giuseppe Lavagetto: puppetmaster: declare NameVirtualHost where expected [puppet] - 10https://gerrit.wikimedia.org/r/299752 [13:44:38] (03PS4) 10Yuvipanda: grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) [13:44:39] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:44:40] (03PS2) 10Yuvipanda: grafana: Mark role explicitly as production [puppet] - 10https://gerrit.wikimedia.org/r/300004 (https://phabricator.wikimedia.org/T120295) [13:45:15] 06Operations, 10RESTBase, 06Services, 15User-mobrovac: Allow RB to be programmatically pooled/depooled during restarts - https://phabricator.wikimedia.org/T140895#2480026 (10mobrovac) >>! In T140895#2479999, @Joe wrote: > @mobrovac which user on the rb-systems should be able to repool/depool the server its... 
[13:45:28] (03CR) 10Yuvipanda: [C: 032 V: 032] grafana: Mark role explicitly as production [puppet] - 10https://gerrit.wikimedia.org/r/300004 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:46:28] godog https://puppet-compiler.wmflabs.org/3409/krypton.eqiad.wmnet/ is puppet compiler result, seems like a noop [13:46:39] (03CR) 10jenkins-bot: [V: 04-1] grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:47:09] (03PS5) 10Yuvipanda: grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) [13:49:13] _joe_: subbu will play a bit with the codfw nodes and then we can proceed to the rest of eqiad if all's looking good to him too [13:49:26] <_joe_> mobrovac: makes sense [13:49:36] subbu: judging by the logs on wtp1001, things have been looking pretty good there [13:49:43] yes. [13:49:54] <_joe_> ok I will take a break [13:49:56] (03PS3) 10ArielGlenn: set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 [13:50:07] i've been looking at logs, ganglia, kibana since y'day .. looking good to me. [13:50:12] godog not sure how to deal with ssl though. I guess easiest would be to just use misc-varnish and procure a separate ssl cert, but I was thinking of using letsencrypt... [13:50:26] bblack do we use letsencrypt with misc varnish anywhere yet? Do we plan to? [13:51:29] <_joe_> yes, we do [13:51:35] no, we don't [13:51:39] <_joe_> no? [13:51:51] <_joe_> didn't planet move to letsencrypt [13:52:06] YuviPanda: what's this in reference to? 
we generally have our globalsign wildcards on misc-varnish [13:52:22] _joe_: not that I know of, unless that all happened while I was on vacation [13:52:52] <_joe_> bblack: nope, I clearly got it wrong, it's on misc-varnish and uses the wildcard [13:52:59] YuviPanda: misc-varnish already uses *.wikimedia.org so it would work [13:53:04] <_joe_> so it must have been something else I forgot [13:53:04] (would be impossible anyways I now realize, because LE doesn't do wildcards and planet is a wildcard cert) [13:53:14] <_joe_> yes, that too [13:53:17] <_joe_> just wrong memory [13:53:35] ah, didn't realize we had a global *.wikimedia.org [13:53:36] so nevermind [13:53:43] this was for grafana-labs.wikimedia.org [13:53:44] yeah usually when we move a non-cache-cluster one-off service into misc-varnish, we stop renewing the old cert because it falls under our wikimedia.org wildcard [13:54:50] in theory we could use that wildcard for other one-off public hosts, but we don't for security reasons (those direct public hosts are more likely to be compromised than most, and we don't want that leaking a wildcard that covers e.g. login and commons .wikimedia.org) [13:55:26] so that's why we buy smaller certs for them when they're not through the cache clusters in general [13:55:28] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: puppet fail [13:55:51] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10MF-Warburg) [13:56:41] bblack, yeah that makes sense... [13:56:47] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480093 (10MF-Warburg) [13:57:01] and I just verified cp* hosts can hit labmon, so this can go behind misc [13:57:10] cool [13:57:43] <_joe_> subbu, mobrovac so I am going afk now, will be back in ~ 20, ok? 
[13:57:56] <_joe_> it won't take us long to migrate, hopefully [13:58:14] YuviPanda: usually we just add a stanza in modules/role/manifests/cache/misc.pp alongside all the other services there, which defines the public hostname and the backend service hostname [13:58:33] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480096 (10MarcoAurelio) [13:58:54] YuviPanda: and optionally force all traffic as "pass" for some services, because maybe they don't emit good caching headers and caching between users would be bad, which would be in the obvious place in template/varnish/misc-common.inc.vcl.erb [13:58:56] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM! PCC, https://puppet-compiler.wmflabs.org/3409/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:59:23] (03CR) 10Yuvipanda: [C: 032 V: 032] grafana: Refactor production role into base role [puppet] - 10https://gerrit.wikimedia.org/r/300005 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [13:59:35] oh you already have a patch heh [13:59:52] and it's already going to the existing krypton, nice [13:59:53] bblack right. 
I'll copy the config for grafana.wikimedia.org and propose a patch soon [14:00:09] RECOVERY - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is OK: TCP OK - 0.005 second response time on port 9042 [14:00:25] (03PS1) 10Yuvipanda: grafana: Add and provision labs grafana role [puppet] - 10https://gerrit.wikimedia.org/r/300020 (https://phabricator.wikimedia.org/T120295) [14:00:46] bblack I think labs grafana will live on labmon along with labs graphite rather than krypton [14:00:54] godog ^ [14:00:55] oh ok [14:01:02] godog https://gerrit.wikimedia.org/r/#/c/300020/ I mean [14:01:37] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480107 (10MarcoAurelio) IMPORTANT: @Coren @jcrespo - As per wikitech:Add a wiki, informing you that this is happening, so the storage layer can be prepared and checked (labs... [14:02:11] (03PS4) 10ArielGlenn: set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 [14:02:57] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480111 (10jcrespo) Thanks. 
[14:03:03] (03PS1) 10Yuvipanda: cache: Add labs grafana behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/300021 (https://phabricator.wikimedia.org/T120295) [14:04:53] (03PS1) 10Yuvipanda: Add grafana-labs and grafana-labs-admin domains [dns] - 10https://gerrit.wikimedia.org/r/300023 (https://phabricator.wikimedia.org/T120295) [14:04:56] bblack ^ [14:05:07] bblack added a DNS change and misc varnish change [14:05:32] (03CR) 10Filippo Giunchedi: [C: 031] grafana: Add and provision labs grafana role [puppet] - 10https://gerrit.wikimedia.org/r/300020 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [14:05:39] yep looks good YuviPanda [14:05:42] !log Resuming failed bootstrap on restbase1013-c.eqiad.wmnet : T134016 [14:05:43] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [14:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:03] godog I ran puppet on Krypton, looks like a no-op [14:06:21] (03CR) 10Yuvipanda: [C: 032] grafana: Add and provision labs grafana role [puppet] - 10https://gerrit.wikimedia.org/r/300020 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [14:07:35] yup, thanks! 
[14:07:59] jynus: let me know when it iss a good time to talk about mysql and how to get shard info and so on [14:08:43] (03PS5) 10ArielGlenn: set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 [14:09:32] bblack I've added you to the cache and DNS change as reviewer as well, let me know when you have time to look at them :) [14:10:57] _joe_: subbu: going afk for 15 mins or so (relocating back home) [14:11:04] k [14:12:09] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2479654 (10Gehel) I tried running `checkPad.js` and it runs without error (or without any output actually): gehel@etherpad1001:/usr/share/... [14:12:22] (03PS1) 10MarcoAurelio: DNS configuration changes for creating tcy.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) [14:12:41] YuviPanda: I've seen there a couple of places in templates/varnish where grafana is mentioned too btw [14:12:57] looking [14:13:01] (03CR) 10jenkins-bot: [V: 04-1] DNS configuration changes for creating tcy.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [14:13:16] godog I see it [14:14:37] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2479613 (10jcrespo) > websites of some Wikimedia chapters that use their own infrastructure, servers or hosting services I do **not** speak on behalf of WMF, but I would like to... 
[14:14:40] (03PS2) 10Yuvipanda: cache: Add labs grafana behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/300021 (https://phabricator.wikimedia.org/T120295) [14:14:42] godog amended [14:15:54] YuviPanda: yup, LGTM but bblack seems more qualified to sign off on it :) [14:16:28] yeah I'll wait :) [14:17:40] godog, let's do it [14:17:42] (03PS2) 10MarcoAurelio: DNS configuration changes for creating tcy.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) [14:19:24] jynus: ok! so my idea would be to associate all metrics coming from a mysql server with its shard, though the shard isn't really mediawiki-specific correct? [14:19:42] meaning we have non-mw things on m1 for example [14:19:43] <_joe_> subbu, mobrovac let's start at 14:30 utc then [14:19:44] (03CR) 10MarcoAurelio: "Fow whoever merges, please run afterwards authdns-update. Query the DNS servers to make sure it has been correctly deployed. Thank you." [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [14:20:02] godog, you are correct [14:20:18] works for me. [14:20:34] !log Restarting bootstrap on restbas1013.eqiad.wmnet (duplicate resume?) 
: T134016 [14:20:34] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [14:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:46] but there are host with several shards at the same time, godog [14:20:48] (03PS6) 10ArielGlenn: set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 [14:21:01] what I do not want to to have multiple sources of truth [14:21:07] we have mediawiki [14:21:09] we have puppet [14:21:14] we have tendril's database [14:21:28] (we do not need to fix those issues) [14:21:33] but think about the future [14:22:07] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480135 (10MarcoAurelio) [14:22:22] jynus: makes sense, what's the most accurate usually? [14:22:47] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [14:23:01] jynus: also in the multiple shards per machine, is it still a single mysql server instance running? [14:23:39] godog, none [14:23:53] and the second question is: both cases [14:24:16] if you are ok, let's continue talking on #wikimedia-databases [14:24:18] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused [14:24:27] got it [14:24:30] ^^^ [14:24:30] (anyone else is invited, as it is still public) [14:24:33] sure jynus [14:24:45] but that way we do not spam this for ops/releng [14:24:56] as it will not be a short conversation :-P [14:26:38] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused eevans Bootstrapping. - The acknowledgement expires at: 2016-07-22 14:26:12. 
[14:27:59] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480145 (10MarcoAurelio) p:05Triage>03Normal [14:28:48] _joe_: subbu: here, we can start whenever you want [14:29:08] <_joe_> ok let's go [14:29:10] ok [14:29:12] <_joe_> wtp1003 [14:29:56] _joe_: should we depool and do 2 to 3 at the same time? [14:31:25] <_joe_> mobrovac: let's do 1 [14:31:30] <_joe_> and ofc I am depooling [14:31:39] k [14:31:46] <_joe_> done [14:31:48] wtp1003 done already! [14:31:50] nice [14:31:55] checked the logs, looking good [14:32:00] <_joe_> yes it's fast [14:32:10] <_joe_> give me the green light to proceed [14:32:49] i think we can go [14:32:50] subbu: ? [14:33:14] Hi all, is it possible to just clone one puppet module subdirectory, for example: https://gerrit.wikimedia.org/r/p/operations/puppet/openldap? This works for /nginx/, because it's in it's own repo at it seems, but fails for /openldap/. I'm aware of git sparse-checkout and shallow clone, but just wanted to ask if there is a "cleaner way"? TIA, Georg [14:33:15] let me quickly run one test. and we can go with the rest quickly. [14:33:52] (03CR) 10ArielGlenn: [C: 032] set up proper dump monitor role and add to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/300006 (owner: 10ArielGlenn) [14:34:13] <_joe_> georg-: not that I know of [14:34:21] _joe_, mobrovac go ahead [14:34:33] <_joe_> subbu: ok I'll do 04 and 05 [14:35:35] (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/298500 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [14:37:00] (03PS3) 10Jgreen: SPF/DKIM records for wikipedia.org and domains sharing that zonefile. 
[dns] - 10https://gerrit.wikimedia.org/r/298500 (https://phabricator.wikimedia.org/T135410) [14:37:31] <_joe_> 04 done [14:37:38] <_joe_> starting 06 [14:38:19] (03PS1) 10Andrew Bogott: Revert "Disable user creation of new VMs until we increase capacity." [puppet] - 10https://gerrit.wikimedia.org/r/300028 [14:38:33] 04 looks fine. [14:38:48] (03CR) 10Jgreen: [C: 032 V: 031] SPF/DKIM records for wikipedia.org and domains sharing that zonefile. [dns] - 10https://gerrit.wikimedia.org/r/298500 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [14:38:51] <_joe_> 05 done, doing 07 [14:39:23] 04 and 05 looking good for me [14:39:42] 06Operations, 10Traffic, 13Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#2480158 (10BBlack) Copying in an old argument never recorded, I think from @faidon: While many services on cache_misc are obvious targets for "mid", phabricator its... [14:40:15] !log running authdns-update to deploy SPF/DKIM records for wikipedia.org [14:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:36] <_joe_> 06 done, doing 08 [14:40:49] 06Operations, 10EventBus, 07Regression, 15User-mobrovac, 07Wikimedia-log-errors: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848#2480161 (10demon) Everything looks good to me now. Train un-halted. Thanks everyone! [14:41:54] <_joe_> 07 done, doing 09 [14:41:56] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2480166 (10Jgreen) DNS changes are deployed, let's leave this ticket open until the fundraising mailing tests are complete... 
[14:42:15] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2480168 (10Jgreen) p:05High>03Low [14:42:51] 06 and 07 looking good [14:44:02] 06Operations, 06Labs: Move labs graphite to graphite-labs.wikimedia.org - https://phabricator.wikimedia.org/T140899#2480171 (10yuvipanda) [14:44:14] (03PS1) 10Yuvipanda: Add graphite-labs.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/300030 (https://phabricator.wikimedia.org/T140899) [14:44:51] <_joe_> 08 and 09 done, doing 10 and then pausing [14:45:08] ema I've added you to a misc varnish patch as well as a reviewer :) [14:45:59] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480193 (10MarcoAurelio) Please note T125501#2021343 for reference when creating a new project. [14:46:55] (03PS1) 10Yuvipanda: graphite: Move labs graphite to graphite-labs.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/300031 (https://phabricator.wikimedia.org/T140899) [14:47:47] _joe_: Thanks! I was afraid of this.. 
:) [14:48:13] <_joe_> georg-: if we were any better, we'd publish some of those modules on puppet-forge [14:48:51] <_joe_> or, more people taking care of this; and figuring out a procedure for publication that doesn't kill our own development [14:49:20] Yeah...I guess it's the common problems of large orgs with an large complex infra which grew over some time [14:49:25] <_joe_> subbu, mobrovac do a thorough test of wtp1002-10 and I'll go on with the rest if those are ok [14:49:48] <_joe_> georg-: I don't think there are a lot of orgs with their whole puppet tree that is FLOSS [14:49:51] <_joe_> either [14:50:13] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2480205 (10Danny_B) [14:50:15] 06Operations, 10Traffic, 13Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#2480206 (10BBlack) Going into that list a little deeper, though, there's a secondary pragmatic issue. Most (all?) of the servers for the services above are still i... [14:51:26] _joe_: Yeah, I think so too, didn't want to imply that [14:52:09] <_joe_> georg-: what I meant is, it's hard to make your own puppet tree reusable by others [14:52:20] <_joe_> we tried that in the past creating submodules [14:52:20] _joe_: Yes! :) [14:52:24] <_joe_> (like nginx) [14:52:33] but it didn't work out? [14:52:35] (03CR) 10Rush: [C: 031] "sure yeah this makes more sense I imagine" [puppet] - 10https://gerrit.wikimedia.org/r/300031 (https://phabricator.wikimedia.org/T140899) (owner: 10Yuvipanda) [14:52:40] <_joe_> but submodules are a pain for us to manage day-to-day [14:52:47] <_joe_> subbu, mobrovac ok to go on? [14:52:57] did you thought or have a look at r10k for example? 
[14:53:21] <_joe_> honestly, no [14:53:28] it's work a look, at least [14:53:36] -work + worth [14:53:43] <_joe_> also, we're still on 3.4 :P [14:53:44] _joe_: looking good for me, subbu? [14:54:05] yes, looking good. [14:54:08] _joe_: it makes managing all modules in their own repos so much more comfortable that submodules or subtrees [14:54:15] -that +than [14:54:24] <_joe_> georg-: yeah I know about r10k [14:54:43] <_joe_> georg-: again, lack of time, and practical purposes, mainly [14:54:55] <_joe_> sorry, gotta get back to business [14:55:10] <_joe_> subbu, mobrovac ok I'll transition the remaining 14 hosts then [14:55:17] <_joe_> I'll tell you when they're done [14:55:20] k [14:55:49] _joe_: All good, thanks for your help. My words weren't anything like critic, just a recommendation how this could be handled in a nice and comfortable manner. Good luck and have a nice day! [14:56:04] <_joe_> georg-: thanks a lot :) [14:56:21] (03PS1) 10Chad: group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300033 [14:56:30] <_joe_> georg-: and sorry if I seemed dry/irritated, I'm just doing 4 things at the same time [14:56:47] _joe_: Yeah...same for me, thus my typos! :) [14:56:48] (03CR) 10Chad: [C: 04-2] "Just prepping for train later. Choo choo." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300033 (owner: 10Chad) [14:56:49] 06Operations, 10Monitoring, 13Patch-For-Review: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#2409174 (10elukey) @ema: afaiu we just patched the TCP collector but we still need to investigate the overall diamond behavior? [14:56:51] <_joe_> eheh [14:57:09] _joe_: Ciao and thanks again! [14:57:11] <_joe_> ostriches: "choo choo"? [14:57:14] <_joe_> :P [14:57:22] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [14:57:32] _joe_: We've been meaning to get train conductor hats :p [14:57:43] And those little wooden whistles. 
[14:57:49] (03CR) 10BBlack: [C: 031] cache: Add labs grafana behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/300021 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [14:57:56] <_joe_> ostriches: oh man. [14:58:40] (03CR) 10ArielGlenn: [C: 032] fix typo in runner settings reference [dumps] - 10https://gerrit.wikimedia.org/r/299995 (owner: 10ArielGlenn) [14:58:42] I presume people know about the spike in dberrors around 14:28: https://logstash.wikimedia.org/goto/78b55fe6c3e95b51eb1a0c7f915e52b7 [14:58:44] (03CR) 10BBlack: [C: 031] Add graphite-labs.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/300030 (https://phabricator.wikimedia.org/T140899) (owner: 10Yuvipanda) [14:58:48] <_joe_> wtp1011,12 done [14:58:55] bblack is just merging it enough (for misc varnish change)? do I need to do anything special? does varnish get reloaded by puppet? [14:59:18] 06Operations, 10MediaWiki-General-or-Unknown: 503 error raises again while trying to load a Wikidata page - https://phabricator.wikimedia.org/T140879#2480250 (10elukey) p:05Triage>03High [14:59:36] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2480251 (10Nuria) mmm... need to consult with @ottomata. I think sudo-ing as the cron... [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160720T1500). [15:00:05] Nikerabbit, mafk, Amir1, and Addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:07] YuviPanda: yeah the varnish config gets reloaded by puppet, but it will take a full puppet cycle before it hits all the caches [15:00:16] I'm around [15:00:16] * YuviPanda nods [15:00:17] * mafk is here [15:00:18] \o7 [15:00:24] * addshore is here [15:00:27] YuviPanda: and if you hit it with a browser before it's done, it will cache the 404 for a while too :) [15:00:38] I'll remember to not do that ;) [15:01:00] I can SWAT, looks like a busy one. [15:01:01] (03CR) 10Elukey: "This would also fix some terbium cronspam that we receive:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [15:01:12] (03PS2) 10Thcipriani: Compact Language Links: To beta in ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299751 (owner: 10Nikerabbit) [15:01:14] (03PS2) 10Yuvipanda: Add grafana-labs and grafana-labs-admin domains [dns] - 10https://gerrit.wikimedia.org/r/300023 (https://phabricator.wikimedia.org/T120295) [15:01:21] (03CR) 10Yuvipanda: [C: 032 V: 032] Add grafana-labs and grafana-labs-admin domains [dns] - 10https://gerrit.wikimedia.org/r/300023 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [15:01:21] <_joe_> wtp1013/14 done [15:01:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299751 (owner: 10Nikerabbit) [15:01:32] (03PS2) 10Yuvipanda: Add graphite-labs.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/300030 (https://phabricator.wikimedia.org/T140899) [15:01:40] (03CR) 10Yuvipanda: [C: 032 V: 032] Add graphite-labs.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/300030 (https://phabricator.wikimedia.org/T140899) (owner: 10Yuvipanda) [15:02:04] (03PS1) 10ArielGlenn: never write run settings unless we are actually running dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/300034 [15:02:09] (03Merged) 10jenkins-bot: Compact Language Links: To beta in ruwikivoyage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/299751 (owner: 10Nikerabbit) [15:02:54] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2480268 (10elukey) [15:02:55] Nikerabbit: 299751 in on mw1099 please check there [15:03:14] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2480267 (10elukey) [15:03:15] !log RESTBase Cassandra: raising stream throughput to 25Mbit/s; lowering compaction throughput to 10MB/s : T134016 [15:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:21] (03PS3) 10Yuvipanda: cache: Add labs grafana behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/300021 (https://phabricator.wikimedia.org/T120295) [15:03:22] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:03:27] 06Operations, 10Deployment-Systems, 10MediaWiki-extensions-WikimediaMaintenance, 13Patch-For-Review: WikimediaMaintenance refreshMessageBlobs: wmf-config/wikitech.php requires non existing /etc/mediawiki/WikitechPrivateSettings.php - https://phabricator.wikimedia.org/T140889#2479747 (10elukey) p:05Triage... [15:03:30] (03CR) 10Yuvipanda: [C: 032 V: 032] cache: Add labs grafana behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/300021 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [15:03:33] checking [15:03:43] (03CR) 10jenkins-bot: [V: 04-1] never write run settings unless we are actually running dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/300034 (owner: 10ArielGlenn) [15:03:48] bblack can you +1 https://gerrit.wikimedia.org/r/#/c/300031/ as well? 
[15:04:48] !log Ammending my last to include 'Rack b' [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:15] thcipriani: first time I am using the chrome plugin, but I am not sure if the issue is there or on the config [15:05:28] (03CR) 10BBlack: [C: 031] graphite: Move labs graphite to graphite-labs.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/300031 (https://phabricator.wikimedia.org/T140899) (owner: 10Yuvipanda) [15:05:37] <_joe_> wtp1015/16 done [15:05:39] bblack thanks! [15:05:41] Nikerabbit: whoops forgot to rebase, one sec [15:05:41] Nikerabbit: looks like wikivoyage=>false is overriding [15:05:48] oh. :) [15:06:07] Nikerabbit: try now, sorry about that [15:06:16] ok [15:06:29] the overide should be easy to check with echo 'var_dump( $wgULSCompactLanguageLinksBetaFeature );' | mwscript eval.php ruwikivoyage [15:06:34] Nikerabbit: sorry for noise. WFM. [15:06:38] (03PS3) 10Thcipriani: Configuration changes for he.wikinews.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299446 (https://phabricator.wikimedia.org/T140544) (owner: 10MarcoAurelio) [15:06:42] Nikerabbit: thanks. Noted. 
[15:06:51] <_joe_> subbu, mobrovac I will go on with the remaining 8 servers [15:06:51] thcipriani: works on mw1099 [15:06:55] <_joe_> 17-24 [15:07:00] k [15:07:00] Nikerabbit: kk, thanks, going to prod [15:07:01] k [15:07:10] (03PS2) 10ArielGlenn: never write run settings unless we are actually running dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/300034 [15:08:11] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:299751|Compact Language Links: To beta in ruwikivoyage]] (duration: 00m 33s) [15:08:15] ^ Nikerabbit double check live please [15:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:51] thcipriani: disabled chrome plugin, still looks as expected [15:09:01] Nikerabbit: ack, thanks for checking [15:09:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299446 (https://phabricator.wikimedia.org/T140544) (owner: 10MarcoAurelio) [15:09:13] * mafk blinks [15:09:38] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2480344 (10Gehel) p:05Triage>03High I'm too much lost in Etherpad to be of much help. Let's wait for @akosiaris, unless it is urgent to rec... [15:09:38] (03PS1) 10Yuvipanda: redirector: Pass along request_uri to new location as well [puppet] - 10https://gerrit.wikimedia.org/r/300038 [15:09:47] (03Merged) 10jenkins-bot: Configuration changes for he.wikinews.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299446 (https://phabricator.wikimedia.org/T140544) (owner: 10MarcoAurelio) [15:10:08] 06Operations, 10Traffic, 13Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#2480349 (10Dzahn) gerrit will be replaced by https://gerrit-new.wikimedia.org/r/#/q/status:open soon-ish , then it will be jessie (and use Letsencrypt for certs).... 
[15:10:31] mafk: change is live on mw1099, please check [15:10:43] thcipriani: on it [15:11:27] (03PS1) 10Reedy: 5 more to extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300039 (https://phabricator.wikimedia.org/T139800) [15:11:31] _joe_: subbu: up to wtp1016 looking good to me [15:11:41] <_joe_> 17-18 done too, btw [15:12:12] I stopped looking after 1014 ... but, let me start verifying each one after that. [15:12:18] thcipriani: no, it's not working on he.wikinews [15:12:21] i don't expect any issues at this point. [15:12:48] (03CR) 10BryanDavis: "The easiest thing to do here would be to use include_once rather than require_once I think. The config file in question is only provisione" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [15:12:59] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2480365 (10Dzahn) a:03Dzahn [15:13:05] 06Operations: setup server osmium as parse benchmarking server - https://phabricator.wikimedia.org/T83861#2480367 (10Dzahn) [15:13:07] 06Operations, 10VisualEditor, 13Patch-For-Review, 07Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2201664 (10Dzahn) 05Open>03Resolved [15:13:12] <_joe_> subbu: well it's a good idea to check every server at the end of the procedure [15:13:22] will do. [15:14:01] thcipriani: I guess it's because of '+hewikinews', which should be just 'hewikinews' [15:14:13] unless the change isn't live yet? [15:14:33] mafk: hmm, I see the patroller group on mw1099, it's only live there [15:15:02] thcipriani: it's been some time since I don't do SWAT, can you point me where mw1099 is so I can see? [15:15:04] are you checking with: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug ? 
[15:15:29] mafk: ah, yeah, we moved to staging on mw1099 before rolling out everywhere, you'll need to use the X-Wikimedia-Debug header (linked above) to check [15:15:38] (03CR) 10ArielGlenn: [C: 032] never write run settings unless we are actually running dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/300034 (owner: 10ArielGlenn) [15:16:15] (03PS3) 10Thcipriani: Let bureaucrats in fawiki remove sysop user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299992 (https://phabricator.wikimedia.org/T140810) (owner: 10Ladsgroup) [15:16:28] thcipriani: oh, well, didn't knew that... [15:19:15] thcipriani: done - it's all ok at special:listgrouprights at he.wikinews.org [15:19:31] mafk: ack, thanks for checking, rolling to production now [15:19:51] thcipriani: thanks for deploy [15:20:25] <_joe_> subbu, mobrovac : transition done AFAICT [15:20:39] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:299446|Configuration changes for he.wikinews.org]] (duration: 00m 28s) [15:20:40] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299992 (https://phabricator.wikimedia.org/T140810) (owner: 10Ladsgroup) [15:20:40] ok .. \o/ .. will now verify the rest. [15:20:48] ^ mafk should be live everywhere now [15:20:57] checking again [15:21:00] <_joe_> now I am left with https://gerrit.wikimedia.org/r/#/c/299960/ [15:21:09] <_joe_> mobrovac: I'll look at your comment and amend [15:21:17] (03Merged) 10jenkins-bot: Let bureaucrats in fawiki remove sysop user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299992 (https://phabricator.wikimedia.org/T140810) (owner: 10Ladsgroup) [15:21:24] thnx _joe_ [15:21:37] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[15:21:38] all looks ok [15:22:00] mafk: thanks for checking and jumping through the new hoops :) [15:22:17] (03CR) 10Giuseppe Lavagetto: service::node: add git as deployment method (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [15:22:20] mobrovac: did you have any memory issue after changes for precaching? [15:22:27] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:22:34] Amir1: https://gerrit.wikimedia.org/r/#/c/299992/ is live on mw1099 check please [15:22:37] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:22:48] thcipriani: sure, on it [15:22:50] <_joe_> who forgot merging? [15:22:52] <_joe_> :) [15:23:13] <_joe_> mobrovac: see my response [15:23:16] Amir1: mem is still tight on scb nodes [15:23:50] Amir1: you told _joe_ and me you'd have an answer today about the refactoring work that would bring mem pressure down [15:23:55] Amir1: any news on that front? [15:23:58] thcipriani: heh, thanks - now I can go to sleep (bad night, etc.) [15:24:05] <_joe_> mobrovac: we have more than 30% of free memory [15:24:24] and also biggest problem was the spikes [15:24:28] are they resolved? 
[15:24:53] thcipriani: my chrome extension doesn't work [15:25:05] it might take some time to see what's going on [15:26:05] !log SWAT for {{gerrit|299446}} done, which fixes T140544 [15:26:06] T140544: Activate patrol system for hewikinews - https://phabricator.wikimedia.org/T140544 [15:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:36] thcipriani: works just fine [15:26:56] Amir1: ack, kk, rolling out everywhere [15:27:22] thanks [15:28:13] (03PS3) 10Thcipriani: Beta: Test OresEnabledNamespaces on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299976 (owner: 10Ladsgroup) [15:28:16] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:299992|Let bureaucrats in fawiki remove sysop user group (T140810)]] (duration: 00m 25s) [15:28:17] T140810: Allow bureaucrats to remove 'sysop' user group on fawiki - https://phabricator.wikimedia.org/T140810 [15:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:23] ^ Amir1 should be live everywhere [15:29:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299976 (owner: 10Ladsgroup) [15:29:05] testing [15:30:40] (03Merged) 10jenkins-bot: Beta: Test OresEnabledNamespaces on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299976 (owner: 10Ladsgroup) [15:30:59] thcipriani: thanks, works just fine [15:31:09] Amir1: cool, thanks for checking [15:32:23] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:299976|Beta: Test OresEnabledNamespaces on enwiki]] (duration: 00m 25s) [15:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:36] ^ Amir1 merged, should go out to beta with next beta-scap-eqiad [15:32:52] thanks :) [15:33:04] I need to test it carefully [15:33:19] later, then we might have some stuff for prod [15:34:10] _joe_, mobrovac lgtm ... 
i looked at rcstream on a bunch of wikis and verified VE diffs there, did a test edit, verified new logs exist on all servers. [15:34:23] \o/ [15:34:54] <_joe_> subbu: great, I will now try to fix ruthenium [15:35:08] all tests ran to completion on ruthenium as well. [15:35:18] (03PS3) 10Giuseppe Lavagetto: service::node: add git as deployment method [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) [15:35:22] <_joe_> yeah I mean the fact it's a trebuchet target [15:35:30] ok. [15:35:37] <_joe_> subbu: I'm unsure what will happen when you decide to deploy btw [15:35:46] right, that was my next question .. [15:35:51] <_joe_> we didn't test trebuchet after all our mess [15:36:17] i was wondering earlier today if we shoudl do a trivial-change test deploy tomorrow after you have a chance to look at it. [15:36:30] <_joe_> seems like a good idea [15:37:34] <_joe_> mobrovac: I'm merging 299960 if I am confident it will work on ruthenium, then I am left with removing ruthenium from the trebuchet minions and I should be done, right? [15:37:40] (03PS2) 10Yuvipanda: redirector: Pass along request_uri to new location as well [puppet] - 10https://gerrit.wikimedia.org/r/300038 [15:37:42] (03PS2) 10Yuvipanda: graphite: Move labs graphite to graphite-labs.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/300031 (https://phabricator.wikimedia.org/T140899) [15:38:15] (03CR) 10Yuvipanda: [C: 032 V: 032] redirector: Pass along request_uri to new location as well [puppet] - 10https://gerrit.wikimedia.org/r/300038 (owner: 10Yuvipanda) [15:38:27] _joe_: right [15:38:46] addshore: https://gerrit.wikimedia.org/r/#/c/299974/ is live on mw1099 please test there [15:38:53] {{doing}} [15:38:59] <_joe_> mobrovac: seems fine [15:39:04] <_joe_> I'll merge [15:39:08] kk [15:39:30] thcipriani: looks good! 
[15:39:34] <_joe_> I really want to take a break earlier today :P [15:39:41] addshore: kk, rolling out everywhere [15:39:41] (03PS4) 10Giuseppe Lavagetto: service::node: add git as deployment method [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) [15:39:44] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [15:40:00] mobrovac, what is the story with parsoid on the beta cluster now .. i saw that hashar removed your cherrypick earlier today. [15:40:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] service::node: add git as deployment method [puppet] - 10https://gerrit.wikimedia.org/r/299960 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [15:41:18] !log thcipriani@tin Synchronized php-1.28.0-wmf.11/extensions/RevisionSlider/modules/ext.RevisionSlider.HelpDialog.js: SWAT: [[gerrit:299974|Open links in the "tutorial" in the new window (T140875)]] (duration: 00m 27s) [15:41:19] T140875: Last minute found errors on RevisionSlider - https://phabricator.wikimedia.org/T140875 [15:41:21] ^ addshore live everywhere [15:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:23] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [15:41:52] thcipriani: looks good :) [15:42:30] addshore: kk, once i18n update merges I'll start a full scap [15:42:47] Awesome! :) [15:42:53] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[15:42:58] <_joe_> mobrovac: change was a noop on ruthenium [15:43:04] <_joe_> so it's ok I guess [15:43:09] :) [15:43:29] <_joe_> mobrovac: as for restbase restart, we do have one issue [15:43:56] <_joe_> all users created by service::node have home /nonexistent [15:44:18] _joe_, mobrovac i'm going to be afk for ~10 mins [15:44:33] <_joe_> so we can't install the etcd credentials in their home dir [15:44:34] _joe_: yes, and /bin/false [15:44:45] <_joe_> /bin/false is not a problem [15:45:41] _joe_: would root be an option? [15:45:58] !log thcipriani@tin Started scap: SWAT: [[gerrit:299973|Fix spelling of RevisionSlider (T140875)]] [15:46:00] T140875: Last minute found errors on RevisionSlider - https://phabricator.wikimedia.org/T140875 [15:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:13] thanks thcipriani :) [15:46:53] addshore: yw [15:48:40] <_joe_> !log removed ruthenium from the list of trebuchet minions [15:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:00] <_joe_> mobrovac: I am thinking of using a special user [15:49:10] <_joe_> mobrovac: on the mediawiki hosts we have mw-deploy [15:49:44] <_joe_> what user is using scap3 to deploy changes/execute commands? [15:49:47] <_joe_> restbase? [15:49:57] we use ansible for deploying restbase [15:50:09] <_joe_> well, what user you deploy as? 
[15:50:41] <_joe_> anyways, I suggest we move away from /nonexistent and use the restbase user [15:51:17] <_joe_> ok, off for now, although we have a meeting in 10 [15:51:18] <_joe_> sigh [15:52:39] (03PS1) 10Chad: WIP: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 [15:52:43] !log Resuming failed bootstrap on restbase2003.codfw.wmnet : T134016 [15:52:44] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [15:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:37] 06Operations, 10vm-requests, 13Patch-For-Review, 05Prometheus-metrics-monitoring: eqiad/codfw: 4 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2480542 (10fgiunchedi) 05Open>03stalled stalling waiting on more disk space on ganeti eqiad, see also {T138414} [15:53:54] 06Operations, 10vm-requests, 13Patch-For-Review, 05Prometheus-metrics-monitoring: eqiad/codfw: 4 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2480551 (10fgiunchedi) [15:54:08] (03CR) 10Paladox: [C: 031] WIP: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [15:57:55] (03CR) 10jenkins-bot: [V: 04-1] WIP: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [15:58:18] _joe_: we deploy as root [15:58:35] <_joe_> mobrovac: ok, root it is then :P [15:58:49] yay [15:59:00] subbu|afk: checking parsoid in beta now [15:59:17] (03PS2) 10Chad: WIP: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 [16:01:08] (03CR) 10Paladox: [C: 031] WIP: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [16:06:54] (03PS2) 10Reedy: 5 more to extension.json [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/300039 (https://phabricator.wikimedia.org/T139800) [16:07:13] (03PS3) 10Reedy: 5 more to extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300039 (https://phabricator.wikimedia.org/T139800) [16:07:27] (03CR) 10jenkins-bot: [V: 04-1] WIP: Gerrit: Greatly simplify directory management on host [puppet] - 10https://gerrit.wikimedia.org/r/300048 (owner: 10Chad) [16:07:30] (03CR) 10Reedy: [C: 032] 5 more to extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300039 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [16:08:25] (03Merged) 10jenkins-bot: 5 more to extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300039 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [16:08:50] subbu: all ok with parsoid in beta, just deployed the latest code [16:09:17] !log thcipriani@tin Finished scap: SWAT: [[gerrit:299973|Fix spelling of RevisionSlider (T140875)]] (duration: 23m 18s) [16:09:18] T140875: Last minute found errors on RevisionSlider - https://phabricator.wikimedia.org/T140875 [16:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:28] ^ addshore check please [16:09:33] *chekcs* [16:09:50] looks good thcipriani ! :D [16:09:51] mobrovac, great. thanks. [16:09:54] !log reedy@tin Synchronized wmf-config/extension-list: More to extension registration for l10n (duration: 00m 27s) [16:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:31] _joe_: we still have to change the remote on ruthenium, it still points to tin [16:11:31] cc subbu [16:11:45] mobrovac: congratulations! [16:11:58] hashar: have i won something? [16:11:59] :D [16:12:17] * subbu awaits the detailed hashar explanation like last time. 
:) [16:13:06] mobrovac: yeah about parsoid [16:13:14] subbu: oh I have done nothing :-} [16:13:23] ah that [16:13:34] hashar: kudos go to _joe_ and subbu as well [16:13:37] and arlorla [16:14:12] next would be to have scap3 on beta to be driven by Jenkins [16:15:11] yup, we need to fix this up after we move all the relevant repos to scap3 [16:17:43] (03PS23) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (https://phabricator.wikimedia.org/T138501) [16:21:26] 06Operations, 10ops-codfw, 10hardware-requests: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2480753 (10RobH) [16:23:20] <_joe_> mobrovac: oh right [16:25:24] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [16:30:07] (03PS1) 10Eevans: RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) [16:32:06] (03PS2) 10Eevans: RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) [16:39:37] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2480961 (10MeganHernandez_WMF) Thank you! I approve the access for CPS. 
[16:40:59] 06Operations, 06Performance-Team, 06Services, 07Availability, and 3 others: Create BagOStuff subclass for HTTP - https://phabricator.wikimedia.org/T137272#2480975 (10Krinkle) [16:47:56] (03PS1) 10Eevans: Disable `streaming_socket_timeout_in_ms` setting [puppet] - 10https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) [16:50:49] (03PS2) 10Gehel: Configure new relevance forge servers [puppet] - 10https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256) [16:51:34] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:05:22] <_joe_> mobrovac: done btw (change the remote on ruthenium) [17:05:25] <_joe_> I am off now [17:05:28] yay [17:05:29] thnx [17:05:57] _joe_: one last thing [17:05:57] _joe_: have you removed ruthenium from the list of trebuchet minions? [17:09:05] <_joe_> mobrovac: yes [17:09:14] awesome! thnx [17:09:21] really appreciate your help on this _joe_ [17:21:37] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2481082 (10mmodell) a:05mmodell>03None I'm not sure if my patch fix... 
[17:22:35] 06Operations, 10Ops-Access-Requests: Platonides access to #mediawiki_security - https://phabricator.wikimedia.org/T140288#2481086 (10RobH) a:05RobH>03None [17:23:05] !log Stopping restbase1013-c bootstrap (pending better timeouts) : T134016 [17:23:06] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:25:30] (03PS1) 10BBlack: ssl_ciphersuite: auto-downgrade to compat when necc [puppet] - 10https://gerrit.wikimedia.org/r/300065 (https://phabricator.wikimedia.org/T118181) [17:27:00] (03CR) 10jenkins-bot: [V: 04-1] ssl_ciphersuite: auto-downgrade to compat when necc [puppet] - 10https://gerrit.wikimedia.org/r/300065 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [17:27:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [17:27:28] ^^^ on it [17:29:25] ACKNOWLEDGEMENT - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Service disabled (bootstrap aborted) - The acknowledgement expires at: 2016-07-22 17:29:02. [17:31:44] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:58] (03PS1) 10Mobrovac: Parsoid: clean up the manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/300067 (https://phabricator.wikimedia.org/T90668) [17:35:05] ACKNOWLEDGEMENT - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures eevans cassandra-c unit is masked pending a proper solution to disabling - The acknowledgement expires at: 2016-07-22 17:34:14. 
[17:41:01] (03CR) 10Mobrovac: "PCC is happy - https://puppet-compiler.wmflabs.org/3415/" [puppet] - 10https://gerrit.wikimedia.org/r/300067 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [17:45:35] (03PS1) 10Eevans: Disable 1013-c instance [puppet] - 10https://gerrit.wikimedia.org/r/300070 (https://phabricator.wikimedia.org/T134016) [17:48:51] _joe_: do you still want a release date? We prevented spikes, improved memory usage, fixed memory leaks, etc. Because I really don't want to rush that patch into prod. It might make a big mess [17:49:33] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2481184 (10CCogdill_WMF) Thanks @Jgreen, everything looks good on our end. We're still pushing IBM for security specs. We... [17:53:43] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:54:14] ^^^ this is going to be a problem [17:54:31] if someone wants to merge https://gerrit.wikimedia.org/r/#/c/300070, we can make it stop [17:55:12] (03PS2) 10BBlack: ssl_ciphersuite: auto-downgrade to compat when necc [puppet] - 10https://gerrit.wikimedia.org/r/300065 (https://phabricator.wikimedia.org/T118181) [17:56:42] (03CR) 10Eevans: "Puppet compiler output here: http://puppet-compiler.wmflabs.org/3416/" [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [18:00:57] 06Operations, 10ops-eqiad: Rack/Setup Carbon/Apt Server Replacement - https://phabricator.wikimedia.org/T139171#2481219 (10Cmjohnson) Confirmed that the pxe is enabled on the 10G NIC, disabled the 1G NICS in bios. The servers is connected via fiber w/SFP+'s, going to remove and use a DAC cable. 
[18:02:12] (03CR) 10Eevans: [C: 031] "Puppet compiler output: http://puppet-compiler.wmflabs.org/3417/" [puppet] - 10https://gerrit.wikimedia.org/r/300059 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [18:06:16] 06Operations, 10ops-eqiad, 10DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2481228 (10Cmjohnson) Rebuild is extremely slow.... Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 45% in 391 Minutes. [18:08:13] 06Operations, 06Services: Move all Node.JS services to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#2481237 (10Jdforrester-WMF) [18:11:31] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [18:12:42] (03PS1) 10BBlack: Ciphersuite upgrades for one-off sites [puppet] - 10https://gerrit.wikimedia.org/r/300071 (https://phabricator.wikimedia.org/T118181) [18:19:04] (03CR) 10Chad: [C: 031] Add conduit_token to the .arcrc on nodepool slaves [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [18:19:24] (03CR) 10BBlack: [C: 032] "Compiler confirms this works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/300065 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [18:23:11] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:23:46] (03PS2) 10BBlack: ssl_ciphersuite: drop non-FS AES256 options [puppet] - 10https://gerrit.wikimedia.org/r/299997 (https://phabricator.wikimedia.org/T118181) [18:24:53] (03CR) 10Subramanya Sastry: [C: 031] Parsoid: clean up the manifests and files [puppet] - 10https://gerrit.wikimedia.org/r/300067 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [18:25:37] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: drop non-FS AES256 options [puppet] - 10https://gerrit.wikimedia.org/r/299997 (https://phabricator.wikimedia.org/T118181) (owner: 10BBlack) [18:26:19] 06Operations, 10ops-codfw, 10hardware-requests: codfw old 
mw app server decomission - https://phabricator.wikimedia.org/T135468#2481272 (10RobH) a:05RobH>03Papaul So ge-4/0/11 shows up, even though the server that should be in it is in the decommissioned rack. @papaul: Can you investigate what system is... [18:34:08] (03PS1) 10Yuvipanda: grafana: Expand edit access in labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/300076 (https://phabricator.wikimedia.org/T120295) [18:37:07] (03PS3) 10Yuvipanda: graphite: Move labs graphite to graphite-labs.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/300031 (https://phabricator.wikimedia.org/T140899) [18:37:09] (03PS2) 10Yuvipanda: grafana: Expand edit access in labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/300076 (https://phabricator.wikimedia.org/T120295) [18:37:39] (03CR) 10Yuvipanda: [C: 032 V: 032] grafana: Expand edit access in labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/300076 (https://phabricator.wikimedia.org/T120295) (owner: 10Yuvipanda) [18:37:51] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#2481313 (10ssastry) [18:37:54] 06Operations, 10Parsoid, 06Services, 10service-runner, and 2 others: Replace custom server.js with service-runner - https://phabricator.wikimedia.org/T90668#2481307 (10ssastry) 05Open>03Resolved This is now done and the production cluster has migrated over. There is a bunch of followup to do including... [18:40:55] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [18:41:24] (03PS1) 10Andrew Bogott: Disable instance rebuild in Horizon. 
[puppet] - 10https://gerrit.wikimedia.org/r/300077 (https://phabricator.wikimedia.org/T140259) [18:42:22] 06Operations, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2481328 (10demon) [18:43:23] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2481341 (10demon) [18:44:43] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2481348 (10GWicke) [18:47:18] 06Operations, 10ops-codfw, 10hardware-requests: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2481381 (10RobH) a:05Papaul>03RobH irc update: @papaul checked and mw2250 is plugged into ge-4/0/11 So it seems that racks port descriptions are incorrect. That being noted, i... [18:49:27] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2465141 (10MeganHernandez_WMF) Approved, thank you! [18:49:34] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2481437 (10MeganHernandez_WMF) approved thank you! [18:49:45] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2481440 (10MeganHernandez_WMF) Approved, thank you! 
[18:52:44] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:56:55] 06Operations, 10ops-codfw, 10netops: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2481487 (10RobH) [19:00:00] (03PS19) 10Thcipriani: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [19:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160720T1900). Please do the needful. [19:04:30] (03PS2) 10Chad: group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300033 [19:05:44] (03PS1) 10Reedy: Apply WMF specific SiteMatrix config in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300081 (https://phabricator.wikimedia.org/T132125) [19:08:50] (03CR) 10Chad: [C: 032] group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300033 (owner: 10Chad) [19:09:27] (03Merged) 10jenkins-bot: group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300033 (owner: 10Chad) [19:09:35] PROBLEM - puppet last run on mw2230 is CRITICAL: CRITICAL: Puppet has 1 failures [19:09:55] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Puppet has 2 failures [19:10:35] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:36] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:44] PROBLEM - puppet last run on lvs1012 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:55] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures [19:12:04] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail [19:12:06] !log demon@tin Started scap: group1 to wmf.11 [19:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:35] 
!log demon@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="fawiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.KjzXNQRvAU" ' returned non-zero exit status 1 (duration: 00m 28s) [19:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:31] Blahhhhh [19:13:33] I hate that file [19:13:48] It's probably my fault [19:13:51] But I didn't see it break beta [19:14:52] Extension /srv/mediawiki-staging/php-1.28.0-wmf.10/extensions/XAnalytics/extension.json doesn't exist [19:14:56] grr [19:15:05] Guess that one wasn't merged in time, but it's going away [19:15:40] ostriches: s/extension.json/XAnalytics.php/ as a live hack? [19:15:52] Yeah [19:16:03] .10 is being undeployed today, right? [19:16:08] Tomorrow [19:16:16] I'll make it as a commit then [19:17:09] !log demon@tin Started scap: group1 to wmf.11 (2nd best try) [19:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:52] (03PS1) 10Reedy: Revert XAnalytics extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300084 [19:18:40] !log demon@tin scap aborted: group1 to wmf.11 (2nd best try) (duration: 01m 30s) [19:18:42] (03PS1) 10Hashar: Revert "contint: tidy Nodepool slaves config history" [puppet] - 10https://gerrit.wikimedia.org/r/300085 (https://phabricator.wikimedia.org/T126552) [19:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:49] (03PS2) 10Hashar: Revert "contint: tidy Nodepool slaves config history" [puppet] - 10https://gerrit.wikimedia.org/r/300085 (https://phabricator.wikimedia.org/T126552) [19:18:59] (03PS1) 10Ladsgroup: ORES score edits in main and Property namespaces in wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300086 (https://phabricator.wikimedia.org/T139660) [19:20:08] !log demon@tin Started scap: group1 to wmf.11 (3rd 
bestest try) [19:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:32] (03CR) 10Chad: [C: 032 V: 032] Revert XAnalytics extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300084 (owner: 10Reedy) [19:21:18] Reedy: Pulled in, thx [19:21:25] np, sorry for breaking it :P [19:24:15] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:15] *cough* https://phabricator.wikimedia.org/D288 [19:24:30] (03PS1) 10Dzahn: admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) [19:24:44] heh [19:25:02] Unfortunately, we've still got a handful of extension that won't work for [19:25:29] don't fit the normal patterns [19:25:51] I wonder if we could fix mergeMessageFileList to take both and combine them? [19:25:56] Seems easy enough [19:25:59] mmm, maybe [19:26:09] Or just have extension-list for the exceptions? [19:26:15] Yeah that was my thinking [19:26:22] And combine the two lists. [19:26:25] mmm [19:26:28] use what you can find, AND these too [19:26:41] siebrand did tidy up a lot of these entry point issues before... but there's new ones come in etc [19:26:47] Plus some with 2 entry points... [19:26:51] (03PS2) 10Dzahn: admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) [19:27:02] Finishing the extension.json migration generally would be a big win [19:27:10] (03PS3) 10Dzahn: admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) [19:27:14] Actually, you can do both. --extension-dir just overrides the former. 
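The "use what you can find, AND these too" idea from the exchange above could be sketched roughly as follows. This is a Python illustration only, not the logic of mergeMessageFileList.php; the helper name, the directory layout and the one-path-per-line list format are all assumptions made for the example.

```python
from pathlib import Path


def combined_entry_points(extension_dir, exceptions_file):
    """Hypothetical helper: merge discovered and explicitly listed entry points.

    Discovery covers extensions that follow the usual
    <extension_dir>/<Name>/extension.json pattern; the exceptions file
    (one path per line, '#' starts a comment) covers the ones that do not.
    """
    # Source 1: everything that can be found automatically.
    found = sorted(str(p) for p in Path(extension_dir).glob("*/extension.json"))

    # Source 2: the hand-maintained list of exceptions.
    listed = []
    for line in Path(exceptions_file).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            listed.append(line)

    # Keep first-seen order and drop duplicates, so an extension that shows
    # up in both sources is only read once.
    return list(dict.fromkeys(found + listed))
```

Deduplicating at the end is what would avoid the "we read the list of extensions twice" situation when the same entry point is reachable from both sources.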
[19:28:34] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:30:54] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481685 (10ori) [19:30:55] Reedy: Hehe, we read the list of extensions twice. [19:31:06] We pass it extension-list as a --list-file [19:31:14] But also define $wgExtensionEntryPointListFiles, so we read them again [19:31:17] You know, for good measure. [19:31:25] (03PS4) 10Dzahn: admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) [19:31:55] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:31:56] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: Connection refused [19:32:12] Aren't those just extra lists? [19:32:25] PROBLEM - cassandra-b service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [19:32:34] ie the wikidata and labs/wikitech ones? [19:32:56] Hmm true. [19:33:32] Shouldn't we add extension-list globally to the config, then we can skip passing it as a file in scap? The only reason that works is a happy accident really. [19:33:34] I think [19:33:46] but, I guess it does prove we can do "multiple sources" [19:33:54] i'd concur with that, yeah [19:34:17] extension-list-wikidata is loaded on all wikis. [19:34:33] remember, it's only for localisation stuffs anyway [19:34:48] and we build all the localisation caches into one [19:34:49] Oooh, downside actually. [19:34:55] Reedy: is-UK-still-online.co.uk? [19:35:05] i am told BT is broken [19:35:05] mutante: Apparently we have internet now [19:35:07] power cut [19:35:09] lolol [19:35:09] heh, good [19:35:11] Eh.... 
[19:35:12] ups failure [19:35:13] :) [19:35:19] looking at restbase1011 ^^^ [19:35:25] RECOVERY - puppet last run on mw2230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:35:28] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481712 (10bd808) There are some graphite metrics for authn things: https://grafana.wikimedia.org/dashboard/db/authentication-metrics One thing we noticed about lo... [19:35:45] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:16] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:36:26] RECOVERY - puppet last run on lvs1012 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:36:45] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:53] Reedy: http://downdetector.co.uk/problems/bt-british-telecom/map [19:36:56] :p [19:38:44] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:38:57] !log Starting Casssandra on restbase1011-b.eqiad.wmnet [19:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:24] Yeh i had a bt fault today [19:39:53] didnt even know it until it was on the news, phabricator was blocked for me but youtube was working which was strange, but all is working now. [19:39:54] (03PS1) 10Chad: Remove RevisionSlider from beta's extension-list. 
Already in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300091 [19:40:15] RECOVERY - cassandra-b service on restbase1011 is OK: OK - cassandra-b is active [19:40:15] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [19:41:12] Reedy ^^ and did you find it difficult to access phabricator today? I did, but managed to access google and youtube fine. [19:41:22] No, I don't use BT as an ISP [19:41:51] (03PS1) 10Hashar: contint: tidy Nodepool slaves config history [puppet] - 10https://gerrit.wikimedia.org/r/300092 (https://phabricator.wikimedia.org/T126552) [19:42:00] Oh. [19:42:18] there are probably a bunch of reseller companies [19:42:23] using BT network [19:42:25] Yes [19:42:40] Actually, someone with 10 thousand users couldn't access the website [19:42:50] due to bt faults this morning [19:43:04] (03CR) 10Hashar: "Does not work as expected. I should have read the inline doc more carefully." [puppet] - 10https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [19:43:05] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures [19:43:16] (03CR) 10Hashar: "Switching to tmpreaper instead: https://gerrit.wikimedia.org/r/300092" [puppet] - 10https://gerrit.wikimedia.org/r/300085 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [19:43:33] mutante: There's resellers, but most use their own kit [19:43:46] BT terminate to wherever in london, and then it's transferred to their own networks [19:44:02] Plusnet are owned by BT... and use their kit, hence also having problems [19:44:22] They really need to fix whatever is broken in london [19:44:31] second time in a row they had a fault this year. [19:45:03] Different faults [19:45:06] (03CR) 10Hashar: "There is the utility "tmpreaper" that takes care of purging files older than X days. 
Ori has been kind enough to write a puppet module ar" [puppet] - 10https://gerrit.wikimedia.org/r/300092 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [19:45:11] This was a UPS failure at a datacentre they use [19:45:19] BT are just incompetent [19:45:58] Yes but that is the second time [19:46:25] First time was a power failure to one of the routers there, then this time something today with a power failure with the UPS [19:46:44] mutante ^^ [19:47:09] (puppet error on gallium was me, I Ctrl+C an ongoing run) [19:48:26] paladox: aha! UPS or somebody digging is common [19:48:34] Met office and bbc were affected, I couldn't access those today. [19:49:02] once UC Berkeley was down because people just stole the cable [19:49:05] to sell copper [19:49:14] Ha. [19:49:16] (03CR) 10GWicke: [C: 031] RESTBase Cassandra: Lower compaction throughput to 20MB/s [puppet] - 10https://gerrit.wikimedia.org/r/300056 (https://phabricator.wikimedia.org/T140825) (owner: 10Eevans) [19:49:45] you'll be glad to know the UK parliament voted to renew Trident, the UK's aging nuclear weapons program [19:49:48] Not much you can do about 3rd parties [19:49:54] ori: pew pew pew [19:50:00] heh [19:50:01] hashar: thanks for the headsup [19:50:20] The uk news is slow to pick that up [19:50:31] (03PS2) 10Dereckson: Allow maintenance scripts to work on wikitech without private settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) [19:50:32] http://news.sky.com still hasn't picked that up [19:50:36] ori ^^ [19:51:03] the british public is close to evenly divided on trident, parliament was not :/ [19:51:11] News agencies are biased and have agendas [19:51:35] Yes, I think so; they take so long to pick something up, but the government is keeping a tighter hand over bbc. [19:52:05] urandom: do you want restbase1013 disabled now/anytime/later? 
[19:52:16] "Disable 1013-c instance" [19:52:20] mutante: yeah, the sooner the beter [19:52:22] better [19:52:28] then we can stop the puppet noise [19:52:30] (03PS2) 10Dzahn: Disable 1013-c instance [puppet] - 10https://gerrit.wikimedia.org/r/300070 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:52:45] (03CR) 10Dzahn: [C: 032] Disable 1013-c instance [puppet] - 10https://gerrit.wikimedia.org/r/300070 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:53:06] (03CR) 10Dereckson: "PS2: simplify with an include_once per previous comment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299996 (https://phabricator.wikimedia.org/T140889) (owner: 10Dereckson) [19:53:31] * paladox really likes three mobile for unlimited data but also because of fewer outages. [19:53:43] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2481749 (10GWicke) Longer term effect on iowait: {F4292424} Overall read latency has dropped for the periods wher... [19:54:05] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:06] (Some mobile data prices from the cheapest operators in Belgium: 10 € for 1 GB, 15 € for 2 GB) [19:54:33] urandom: icinga-wm says recovery but it wasn't actually submitted yet, :p [19:54:49] yeah, it keeps flip-flopping [19:54:53] ah,ok [19:55:01] Oh, it is £25 here for unlimited data (Fair use 1tb plus 150mb for free) [19:55:01] (03CR) 10Dzahn: [V: 032] Disable 1013-c instance [puppet] - 10https://gerrit.wikimedia.org/r/300070 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:55:29] Pay as you go. 
[19:55:50] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2481769 (10GWicke) One more potential improvement I noticed while investigating this is that write IO is still rela... [19:55:52] urandom: alright, should be done on next run [19:55:54] Dereckson ^^ [19:56:11] mutante: thanks! [19:56:14] I've a fix merged in master for Warning: Division by zero in /srv/mediawiki/php-1.26wmf18/extensions/PagedTiffHandler/PagedTiffHandler_body.php on line 806 [19:56:14] yw [19:56:56] Dereckson: Awesome! [19:57:34] Should I cherry pick it for our current branches or do we wait next train? [19:57:52] We could cherry-pick it to wmf.11 [19:57:53] :) [19:57:55] k [19:58:31] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481777 (10Tgr) >>! In T140942#2481712, @bd808 wrote: > One thing we noticed about login failures especially during the SessionManager deployment is that this can s... [19:59:15] (03PS1) 10GWicke: Lower trickle_fsync_interval to 8mb [puppet] - 10https://gerrit.wikimedia.org/r/300100 (https://phabricator.wikimedia.org/T140825) [19:59:45] Reedy: You're right, finishing migration is best. [19:59:50] lol [19:59:55] I'm trying :D [20:00:05] gwicke, cscott, arlolra, subbu, bearND, and mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160720T2000). Please do the needful. 
[20:00:15] Then we can avoid requiring full extension entry just for a code path that might not even use it [20:00:42] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481780 (10Tgr) FWIW I think the obvious thing to alert on in this case would have been the thousands of exceptions per day in production. [20:02:45] no parsoid deploy now. [20:03:27] no mobileapps deployment. [20:06:11] !log demon@tin Finished scap: group1 to wmf.11 (3rd bestest try) (duration: 46m 03s) [20:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:41] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.002 second response time on port 9042 [20:10:34] 06Operations, 10Monitoring, 06Release-Engineering-Team: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481801 (10Tgr) It seems like T117470 would have helped as well (compare [[https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=(refreshInterval:(d... [20:11:31] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:47] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2481805 (10DaBPunkt) >>! In T140536#2474949, @Peachey88 wrote: > @DaBPunkt Do you know if the original thumbs before the first delete exhibited the black lines? No idea, sorry. [20:17:15] (03PS1) 10Chad: Remove OATHAuth from wikitech's extension-list, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300105 [20:24:05] i wonder if silver (wikitech) can be jessie already [20:24:19] probably [20:24:28] it is like an appserver [20:24:35] and in deployment groups.. right [20:25:38] (03PS1) 10Ppchelko: Change-prop: Ignore bot edits on ORES precache updates. 
[puppet] - 10https://gerrit.wikimedia.org/r/300108 [20:28:41] !log demon@tin Synchronized php-1.28.0-wmf.11/extensions/PagedTiffHandler/: (no message) (duration: 00m 25s) [20:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:42] bblack: archiva.wm.org also has "parts of the page are not secure" warning [20:30:00] mixed content [20:30:30] who is using archiva mostly? [20:32:07] (03CR) 10Ladsgroup: [C: 031] Change-prop: Ignore bot edits on ORES precache updates. [puppet] - 10https://gerrit.wikimedia.org/r/300108 (owner: 10Ppchelko) [20:39:06] 06Operations, 10ops-codfw, 10media-storage: ms-be2017 failed disk - https://phabricator.wikimedia.org/T140948#2481880 (10RobH) [20:39:55] (03PS3) 10Dzahn: Revert "contint: tidy Nodepool slaves config history" [puppet] - 10https://gerrit.wikimedia.org/r/300085 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [20:40:07] (03CR) 10Dzahn: [C: 032] Revert "contint: tidy Nodepool slaves config history" [puppet] - 10https://gerrit.wikimedia.org/r/300085 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [20:40:32] mutante: sorry about that tidy{} patch. I hadn't tested it and badly read the doc :( [20:41:22] hashar: no problem! i see there is a replacement already [20:41:40] (03CR) 10Jgreen: [C: 031] admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) (owner: 10Dzahn) [20:41:50] (03CR) 10EBernhardson: [C: 04-1] "I ended up manually running these in a loop outside of cron against testwiki, and everything worked out just fine. 
As such i don't think w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/297459 (owner: 10DCausse) [20:41:57] (03PS5) 10Gehel: Setup CirrusSearch continuous saneitization process to run via cron [puppet] - 10https://gerrit.wikimedia.org/r/297276 (https://phabricator.wikimedia.org/T139200) (owner: 10DCausse) [20:42:42] hashar: i never knew about that "class { '::tmpreader::reap':" [20:42:49] ostriches: uploaded to Commons a 2048px width file, no divide by zero error in the logs, looks good to me [20:43:36] Awesome :) [20:46:30] 06Operations, 10Parsoid, 06Services, 10service-runner, and 2 others: Replace custom server.js with service-runner - https://phabricator.wikimedia.org/T90668#2481920 (10GWicke) Congratulations, @arlolra, @ssastry, @mobrovac, @Joe! As they say, "That's one small step for [a] man, one giant leap for mankind.... [20:46:32] mutante: apparently not used but the puppet module is around at least :] [20:47:03] mutante: but maybe a daily cronjob with a find /path -wholename stuff -delete would be bette [20:47:03] r [20:47:09] (03CR) 10Gehel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/297276 (https://phabricator.wikimedia.org/T139200) (owner: 10DCausse) [20:47:42] (03CR) 10Dereckson: [C: 031] DNS configuration changes for creating tcy.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [20:47:48] hashar: gotta check that class. well. 
find/cron is used in several places already [20:47:52] (03PS3) 10Dereckson: DNS configuration change for creating tcy.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [20:49:12] (03CR) 10Gehel: [C: 032] Setup CirrusSearch continuous saneitization process to run via cron [puppet] - 10https://gerrit.wikimedia.org/r/297276 (https://phabricator.wikimedia.org/T139200) (owner: 10DCausse) [20:49:33] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2481926 (10Dereckson) [20:49:48] (03PS4) 10Dzahn: Revert "contint: tidy Nodepool slaves config history" [puppet] - 10https://gerrit.wikimedia.org/r/300085 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [20:49:53] (03PS1) 10Ppchelko: Change-Prop: Fix error ignoring config bug [puppet] - 10https://gerrit.wikimedia.org/r/300166 [20:49:56] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10Dereckson) [20:51:56] (03CR) 10Dzahn: "running authdns-update is not sufficient for a new language, see https://phabricator.wikimedia.org/T97051" [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [20:55:04] (03CR) 10Hashar: "diskimage-builder is merely to build an image, I usually run it from my desktop. 
It is using shell scripts and puppet to bring it up but" [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [20:56:42] (03CR) 10Dereckson: "Updated documentation to reflect that:" [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [20:57:15] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:57:38] (03CR) 1020after4: "hashar: It's more than just collecting results. The arcanist token is needed to even run `arc unit`" [puppet] - 10https://gerrit.wikimedia.org/r/298097 (owner: 1020after4) [21:02:07] 06Operations: Monitorize availability of Wikimedia websites that are not hosted by the WMF - https://phabricator.wikimedia.org/T140884#2481966 (10abian) 05Open>03declined I've created a simple tool for this purpose that is already running on Tool Labs. I will try to continue developing it in a few days. How... [21:03:27] jynus, are you able to come to the #wikimedia-office RFC discussion on cross-wiki watchlists? It's now. [21:03:29] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: puppet fail [21:03:41] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:12:27] (03CR) 10Dzahn: Add new user 'hjiang' for Helen Jiang (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [21:14:16] (03CR) 10Dzahn: [C: 04-1] "uid 15082 seems to be the existing wikitech user, let's use that please" [puppet] - 10https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: 10Elukey) [21:16:19] (03PS2) 10Andrew Bogott: Revert "Disable user creation of new VMs until we increase capacity." 
[puppet] - 10https://gerrit.wikimedia.org/r/300028 [21:16:21] (03PS5) 10Dzahn: admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) [21:16:58] (03CR) 10Dzahn: [C: 032] "has manager approval, had waiting period, has +1 from Jeff.. merging" [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) (owner: 10Dzahn) [21:19:37] (03CR) 10Andrew Bogott: [C: 032] Revert "Disable user creation of new VMs until we increase capacity." [puppet] - 10https://gerrit.wikimedia.org/r/300028 (owner: 10Andrew Bogott) [21:19:58] (03PS6) 10Dzahn: admin: add mpany to analytics-privatedata-users,researchers [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) [21:20:56] (03CR) 10Dzahn: [V: 032] "V+2 with a sigh" [puppet] - 10https://gerrit.wikimedia.org/r/300088 (https://phabricator.wikimedia.org/T140399) (owner: 10Dzahn) [21:21:50] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 294 bytes in 3.023 second response time [21:22:22] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2482070 (10Dzahn) [21:23:52] andrewbogott: ^ k8s etcd issues [21:24:16] seems like tools-k8s-etcd-01.eqiad.wmflabs [21:24:59] shall I just restart it, or do we want to save it for investigation? [21:25:36] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2463390 (10Dzahn) on bast1001: Notice: /Stage[main]/Admin/Admin::Hashuser[mpany]/Admin::User[mpany]/User[mpany]/ensure: creat... 
[21:26:19] I was thinking reboot atm as I don't have to dig i and I am no thrilled about leaving 2/3rds till ? [21:26:21] * andrewbogott reboots it [21:26:24] yuvi is in flight tomorrow [21:26:41] don't have time to dig in even [21:27:09] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482090 (10MarcoAurelio) I think I will try to do the rOMWC changes for initial configuration as well. [21:29:41] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.159 second response time [21:30:16] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: Clones from git.wikimedia.org are not redirected - https://phabricator.wikimedia.org/T139206#2422729 (10saper) I just run into this today by adding this to my `composer.local.json`: ```` { "require": { "mediawiki/vector-skin": "@dev"... [21:30:34] (03PS4) 10MarcoAurelio: DNS configuration change for creating tcy.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) [21:30:51] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2482100 (10Dzahn) @MeganHernandez_WMF thanks for approving. user has been created now @Mpany You should be able to SSH to our... [21:31:57] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: Clones from git.wikimedia.org are not redirected - https://phabricator.wikimedia.org/T139206#2482107 (10Paladox) Yes please. [21:32:20] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10Dereckson) You can put everythign in one change, that's easier to improve and review. 
[21:32:20] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:33:08] (03CR) 10Dzahn: planet: add phabricator releng blog feed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [21:33:40] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 381 bytes in 0.059 second response time [21:35:12] (03PS1) 10Dzahn: planet: "RelEng" (jargon)-> "WMF Release Engineering" [puppet] - 10https://gerrit.wikimedia.org/r/300172 [21:36:47] (03CR) 10Dzahn: planet: add phabricator releng blog feed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [21:38:32] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482118 (10MarcoAurelio) >>! In T140898#2482108, @Dereckson wrote: > You can put everythign in one change, that's easier to improve and review. Thank y... [21:39:20] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: puppet fail [21:39:40] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [21:41:23] * paladox wonders is https://en.wikipedia.org/wiki/File:Wikipedia_Search_April_2015.png that real. 
[21:41:30] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.093 second response time [21:46:11] (03CR) 10Dzahn: [C: 032] "yep, tcy is approved language code" [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [21:49:46] !log DNS authdns-gen-zones on all servers to add new language tcy (bug T97051) [21:49:47] T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051 [21:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:18] (03PS20) 10Thcipriani: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [21:53:20] (03PS1) 10Thcipriani: Prerequisites for logstash_checker use [puppet] - 10https://gerrit.wikimedia.org/r/300175 [21:53:54] (03CR) 10Dzahn: "root@radon:/etc/gdnsd/zones# grep tcy wikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/300025 (https://phabricator.wikimedia.org/T140898) (owner: 10MarcoAurelio) [21:54:29] PROBLEM - Disk space on ms-be3004 is CRITICAL: DISK CRITICAL - free space: / 2129 MB (3% inode=98%) [21:56:29] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2480071 (10Dzahn) - tcy has been added to langlist / DNS now https://tcy.wikipedia.org https://tcy.m.wikipedia.org/ https://tcy.zero.wikipedia.org/ [21:59:11] !log new language "tcy" (Tulu) has been approved and added today - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Tulu [21:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:55] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - 
https://phabricator.wikimedia.org/T140898#2482211 (10Dzahn) everything that was needed on T134017 to add "jam" recently, also needs to be done here. we can make a checklist out of the jam ticket. [22:04:23] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2482214 (10MarcoAurelio) >>! In T140898#2482211, @Dzahn wrote: > everything that was needed on T134017 to add "jam" recently, also needs to be done here... [22:05:57] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:08:31] (03CR) 10Smalyshev: "There is wdqs-test.wmflabs.org which has slightly different config due to being labs but still probably can be used for that. Will need so" [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [22:28:20] 06Operations, 10VisualEditor, 07Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2482280 (10Dzahn) [22:40:50] (03PS1) 10MarcoAurelio: [WIP] Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [22:48:30] (03CR) 10MarcoAurelio: [C: 04-1] "Do not merge yet. Logo issue still needs to be sorted out and before that, optiPNG the logo uploaded here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [22:49:21] paladox: ^ there you see somebody who is already into logos [22:50:06] Oh [22:50:13] I was too slow [22:51:33] 06Operations, 10Ops-Access-Requests: Requesting access to text caches for andyrussg - https://phabricator.wikimedia.org/T140958#2482358 (10AndyRussG) [22:54:58] 06Operations, 10Parsoid, 06Services, 10service-runner, and 2 others: Replace custom server.js with service-runner - https://phabricator.wikimedia.org/T90668#2482415 (10Jdforrester-WMF) [22:56:22] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2481328 (10mmodell) 2.9.2 is backported to trusty at https://launchpad.net/~git-core/+archive/ubuntu/ppa It seems fairly straightforward to backpor... [22:58:00] mutante paladox - any issues with that patch or logos? [22:58:40] paladox: no, it's for another project [22:58:52] Oh [22:59:17] mafk: we were just talking about uploading a logo for "tcy" wikipedia.. and svg -> png, using optipng [22:59:30] optipng for file size [22:59:32] that one is for mk.wiktionary which requested $wgSiteName and $wgMetaNamespaces [22:59:43] mutante: ah, right - I'm working on that too [22:59:54] so in Debian/Ubuntu it's package optipng [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160720T2300). [23:00:04] MaxSem, Amir1, and Pchelolo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:14] I'm around [23:00:15] I'll follow the suggestion by Dereckson and upload all the initial config in one patch [23:00:17] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 10Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2482428 (10demon) p:05Triage>03High Let's do this tomorrow morning maybe? [23:00:18] jynus: https://phabricator.wikimedia.org/T139055#2418115 can you tell me which query you ran here? perhaps you ran a query that's not impacted by https://www.percona.com/blog/2011/12/23/solving-information_schema-slowness/, the fix there doesn't work for my MariaDB 10.0 installation (if I'm gone please send a MemoServ message to IRC user 'Southparkfan') [23:00:27] mafk: ok, cool [23:00:30] but not today :) [23:00:50] yep [23:00:54] mutante: s3 for the new wiki is ok? [23:01:53] I'm here [23:02:12] yes, s3 seems correct. it has all the small wikis [23:02:34] ok, I'll write it down for when I write the patch [23:02:50] * mafk gone for zZzZ [23:02:56] okay, cu [23:04:03] Hi. [23:04:16] I can SWAT this evening. [23:04:22] hi Dereckson [23:04:27] Hi [23:07:36] Pchelolo: you need to create a new 299778, against origin/wmf/1.28.0-wmf.10 [23:07:47] (we merged and reverted it yesterday) [23:07:48] Dereckson: ok, one minute [23:08:10] perhaps you can revert the revert? [23:08:39] Dereckson: I can just create a new one [23:09:53] Amir1: 300086 (config) should be deployed before 300081 (extension) shouldn't it? 
[23:09:58] Pchelolo: as you prefer [23:10:20] no, the config depends on the wmf.11 patch [23:10:24] Dereckson: here we go: https://gerrit.wikimedia.org/r/#/c/300180/ [23:11:58] (PS5) Chad: WIP: Gerrit: Swap lead to point at production data [puppet] - https://gerrit.wikimedia.org/r/298673 [23:12:08] wmf.11 introduces a new config and the config patch uses it [23:12:17] Amir1: yeah, but the patch provision config, it doesn't require it's used, and that could help you to test [23:13:46] hmm, Even both deployed testing is difficult [23:14:02] (PS6) Chad: WIP: Gerrit: Swap lead to point at production data [puppet] - https://gerrit.wikimedia.org/r/298673 [23:14:09] there is no way to test them separately [23:14:54] Amir1: okay, so we'll deploy them both on mw1099, then config first, extension second in prod, so we push directly a configured code [23:15:51] Dereckson: sounds like a plan [23:22:26] (PS2) Dereckson: ORES score edits in main and Property namespaces in wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/300086 (https://phabricator.wikimedia.org/T139660) (owner: Ladsgroup) [23:22:33] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/300086 (https://phabricator.wikimedia.org/T139660) (owner: Ladsgroup) [23:23:17] (Merged) jenkins-bot: ORES score edits in main and Property namespaces in wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/300086 (https://phabricator.wikimedia.org/T139660) (owner: Ladsgroup) [23:24:47] (PS1) Paladox: Initialize configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [23:25:24] (PS2) Paladox: Initialize configuration for tcy.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/300182 (https://phabricator.wikimedia.org/T140898) [23:26:06] er [23:27:01] Pchelolo: you created it for wmf10, but if you look https://tools.wmflabs.org/versions/ 
we're tomorrow everywhere in 11, your change is already included .11? [23:27:27] Dereckson: yep [23:27:44] Okay, I remembered well from yesterday so. [23:29:06] Amir1: live on mw1099 [23:29:43] Dereckson: in order to test it I need to make three edits in three different namespaces (0, 2, 120) and then make db queries to see if it's okay [23:29:47] it might take time [23:29:58] Amir1: okay i'm going to push Pchelolo code in this case, so you've time to test [23:30:18] awesome [23:30:20] :) [23:32:23] Pchelolo: live on mw1099 [23:33:24] (PS2) Dereckson: Remove wmgUseContributionReporting, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/299941 (owner: MaxSem) [23:33:29] Dereckson: kk, 5 minutes I'll check smth [23:33:40] (CR) Dereckson: [C: 1] "I confirm, wmgUseContributionReporting isn't used in the repo." [mediawiki-config] - https://gerrit.wikimedia.org/r/299941 (owner: MaxSem) [23:34:00] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/299941 (owner: MaxSem) [23:34:14] \m/ [23:34:49] (Merged) jenkins-bot: Remove wmgUseContributionReporting, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/299941 (owner: MaxSem) [23:35:13] MaxSem: live on mw1099 [23:35:26] -labs ;) [23:36:20] anyway, no - no mysteros shit:) [23:36:22] So, good question for the future: should we pull on mw1099 labs only changes? [23:36:25] Dereckson: works like charm [23:36:28] no [23:36:40] Amir1: k [23:36:42] (PS2) MaxSem: Labs: remove wgRCWatchCategoryMembership - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299942 [23:37:08] Tested with 358631138, 358630967, 358630934 [23:37:08] Dereckson: no [23:37:22] Dereckson: mine works [23:37:27] ack [23:37:33] hrmm, maybe? 
if it's weird looking, but anything in -labs.php files don't need it [23:37:56] (CR) MaxSem: [C: 2] Labs: remove wgRCWatchCategoryMembership - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299942 (owner: MaxSem) [23:38:04] (PS2) MaxSem: Remove wgMFEnableBetaDiff, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/299943 [23:38:13] (CR) MaxSem: [C: 2] Remove wgMFEnableBetaDiff, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/299943 (owner: MaxSem) [23:38:21] (PS2) MaxSem: Labs: remove wmgMFUseCentralAuthToken - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299944 [23:38:32] (CR) MaxSem: [C: 2] Labs: remove wmgMFUseCentralAuthToken - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299944 (owner: MaxSem) [23:38:36] (Merged) jenkins-bot: Labs: remove wgRCWatchCategoryMembership - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299942 (owner: MaxSem) [23:38:43] (PS2) MaxSem: Labs: remove wmgEnableGeoData - matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/299945 [23:38:50] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: ORES score edits in main and Property namespaces in wikidatawiki ([[Gerrit:300086]]) (duration: 00m 33s) [23:38:50] (CR) MaxSem: [C: 2] Labs: remove wmgEnableGeoData - matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/299945 (owner: MaxSem) [23:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:58] (PS2) MaxSem: Labs: remove wmgGeoDataDebug - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299946 [23:39:17] (CR) MaxSem: [C: 2] Labs: remove wmgGeoDataDebug - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299946 (owner: MaxSem) [23:39:28] (Merged) jenkins-bot: Remove wgMFEnableBetaDiff, unused [mediawiki-config] - 
https://gerrit.wikimedia.org/r/299943 (owner: MaxSem) [23:39:35] (Merged) jenkins-bot: Labs: remove wmgMFUseCentralAuthToken - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299944 (owner: MaxSem) [23:39:45] (PS2) MaxSem: Labs: remove wmgUseCodeEditorForCore - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299947 [23:39:51] (CR) MaxSem: [C: 2] Labs: remove wmgUseCodeEditorForCore - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299947 (owner: MaxSem) [23:40:13] that's my project to reduce the differences between beta and prod, btw [23:40:18] !log dereckson@tin Synchronized php-1.28.0-wmf.11/extensions/ORES/: Let ORES extension score for some namespaces instead of all ([[Gerrit:300083]]]) (duration: 00m 30s) [23:40:22] Amir1: both changes are deployed in prod ^ [23:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:31] thanks [23:40:44] I monitor logs [23:40:51] (Merged) jenkins-bot: Labs: remove wmgEnableGeoData - matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/299945 (owner: MaxSem) [23:41:13] (Merged) jenkins-bot: Labs: remove wmgGeoDataDebug - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299946 (owner: MaxSem) [23:41:17] (Merged) jenkins-bot: Labs: remove wmgUseCodeEditorForCore - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/299947 (owner: MaxSem) [23:41:53] !log dereckson@tin Synchronized php-1.28.0-wmf.10/extensions/EventBus/extension.json: Add rev_by_bot flag to revision_create event (1/2) (duration: 00m 26s) [23:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:59] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 6 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[23:42:35] !log dereckson@tin Synchronized php-1.28.0-wmf.10/extensions/EventBus/EventBus.hooks.php: Add rev_by_bot flag to revision_create event (2/2) (duration: 00m 23s) [23:42:38] Pchelolo: live in prod ^ [23:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:03] Dereckson: Great, thank you. The event stream looks good [23:43:08] !log maxsem@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT, no-op (duration: 00m 24s) [23:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:51] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:45:03] (PS1) MaxSem: Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300183 [23:45:05] (PS1) MaxSem: Labs: remove wmgUseOATHAuth - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300184 [23:45:07] (PS1) MaxSem: Labs: remove wmgCirrusSearchUseCompletionSuggester [mediawiki-config] - https://gerrit.wikimedia.org/r/300185 [23:45:09] (PS1) MaxSem: Labs: remove wmgUseUrlShortener - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300186 [23:45:11] (PS1) MaxSem: Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300187 [23:45:13] (PS1) MaxSem: Labs: remove wmgUseBounceHandler - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300188 [23:45:15] (PS1) MaxSem: Labs: remove wmgUseApiFeatureUsage - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300189 [23:45:17] (PS1) MaxSem: Labs: remove wgUploadThumbnailRenderMethod - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300190 [23:45:19] (PS1) MaxSem: Labs: remove wgUploadThumbnailRenderMap - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300191 [23:45:21] (PS1) MaxSem: Labs: remove 
wmgUseEventBus - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300192 [23:45:23] (PS1) MaxSem: Labs: remove wmgUseContentTranslation - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300193 [23:45:25] (PS1) MaxSem: Labs: remove wmgUseEventLogging - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300194 [23:45:27] (PS1) MaxSem: Labs: remove wmgUseCampaigns - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300195 [23:45:49] oops:P [23:46:00] ebernhardson: SMalyshev: according fatalmonitor, the current leading error in prod is Undefined property: CirrusSearch\InterwikiSearcher::$searchContext, would you get a testing procedure to test that? If so, we can perhaps backport https://gerrit.wikimedia.org/r/#/c/300168/ [23:46:08] looks like labs is like prod for testing [23:46:44] Dereckson: probably [23:48:03] Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2482539 (Dzahn) a:Dzahn>None [23:48:10] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2482540 (Dzahn) a:Dzahn>None [23:48:17] Operations: reinstall maps-test200[1234] with RAID - https://phabricator.wikimedia.org/T140440#2482543 (Dzahn) a:Dzahn>None [23:48:24] Operations: reinstall snapshot100[1234].eqiad.wmnet with RAID - https://phabricator.wikimedia.org/T140439#2482544 (Dzahn) a:Dzahn>None [23:48:37] SMalyshev: if you're ready and okay to test it, I can backport and deploy it now, we've still time in the SWAT window [23:49:08] Dereckson: I'd rather ping ebernhardson, and I unfortunately will have to leave literally in 10 mins [23:49:13] sorry [23:49:39] okay [23:49:46] but ebernhardson should be able to check it, it's his patch anyway :) [23:56:13] (PS2) Dzahn: Add new user 'hjiang' for Helen Jiang [puppet] - https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: Elukey) 
[23:57:52] (CR) Gergő Tisza: [C: 1] Labs: remove wmgLogAuthmanagerMetrics - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300187 (owner: MaxSem) [23:58:06] (CR) Gergő Tisza: [C: 1] Labs: remove wgDisableAuthManager - matches prod [mediawiki-config] - https://gerrit.wikimedia.org/r/300183 (owner: MaxSem) [23:58:38] (CR) Dzahn: [C: 1] Add new user 'hjiang' for Helen Jiang [puppet] - https://gerrit.wikimedia.org/r/300003 (https://phabricator.wikimedia.org/T140659) (owner: Elukey)