[00:00:31] Dereckson: sorry for causing a long SWAT window!
[00:01:30] No problem. Try to have your cherry-pick ready before the SWAT next time, that will help to speed up the process.
[00:02:51] !log dereckson@tin Synchronized php-1.28.0-wmf.7/extensions/MobileFrontend/: Introduce config variable to control tagline (T138738) (duration: 00m 39s)
[00:02:52] T138738: [Regression] Wikidata description not showing in search results on mobile web - https://phabricator.wikimedia.org/T138738
[00:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:04:38] Dereckson: yeah, it was an odd situation today; didn't help that we didn't catch the UBN bug until half-way through my day
[00:05:41] !log dereckson@tin Synchronized php-1.28.0-wmf.6/extensions/MobileFrontend/: Introduce config variable to control tagline (T138738) (duration: 00m 29s)
[00:05:42] T138738: [Regression] Wikidata description not showing in search results on mobile web - https://phabricator.wikimedia.org/T138738
[00:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:07:46] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Introduce config variable to control tagline (no-op) (duration: 00m 27s)
[00:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:13] Dereckson: since we're past time, I would be okay with postponing my patch until tomorrow. I'm around though, if you can deploy it.
[00:08:34] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Introduce config variable to control tagline (T138738, 1/2) (duration: 00m 32s)
[00:08:35] T138738: [Regression] Wikidata description not showing in search results on mobile web - https://phabricator.wikimedia.org/T138738
[00:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:41] MatmaRex: it's merged, so it would be preferable to deploy it now
[00:08:48] alright
[00:09:20] !log dereckson@tin Synchronized wmf-config/mobile.php: Introduce config variable to control tagline (T138738, 2/2) (duration: 00m 27s)
[00:09:21] T138738: [Regression] Wikidata description not showing in search results on mobile web - https://phabricator.wikimedia.org/T138738
[00:09:22] jhobs: okay, all is merged on the production cluster, could you check if all still looks good to you?
[00:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:51] Dereckson: yes, looking good, thanks! Let's postpone https://gerrit.wikimedia.org/r/#/c/296242/ till tomorrow so you can get to MatmaRex's patch. It's not part of the unbreak now bug. Thanks for all your help!
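The staged sync above is the usual pattern for a config change: the setting definition in InitialiseSettings.php goes out before the file that reads it, so no server ever evaluates the new code against an undefined variable. A minimal sketch of the same sequence from the deployment host, assuming the scap sync-file entry point in use at the time (these are the commands that produce the "Synchronized" SAL lines):

    # on tin, in /srv/mediawiki-staging, after pulling the merged change
    scap sync-file wmf-config/InitialiseSettings.php 'Introduce config variable to control tagline (T138738, 1/2)'
    scap sync-file wmf-config/mobile.php 'Introduce config variable to control tagline (T138738, 2/2)'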
[00:11:24] (03PS4) 10Jhobs: Enable Wikibase descriptions on Catalan and Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 (https://phabricator.wikimedia.org/T135429)
[00:11:30] You're welcome jhobs
[00:11:54] jhobs: if you still have 5 minutes to test 296242, I'm okay to deploy it; tomorrow's SWAT is really packed with a lot of changes already
[00:12:07] Dereckson: ok, sure
[00:12:17] I just rebased it, so waiting on jenkins atm
[00:12:25] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 (https://phabricator.wikimedia.org/T135429) (owner: 10Jhobs)
[00:13:10] (03Merged) 10jenkins-bot: Enable Wikibase descriptions on Catalan and Polish wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296242 (https://phabricator.wikimedia.org/T135429) (owner: 10Jhobs)
[00:14:35] jhobs: live on mw1017
[00:17:47] Dereckson: this change only applies to cawiki & plwiki, does X-Wikimedia-Debug not work the same on non-enwiki wikis or something? Because it doesn't appear to be working
[00:17:48] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:17:57] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:17:59] it works on every wiki yes
[00:19:14] dereckson@mw1017:~$ mwrepl plwiki
[00:19:17] echo $wgMFDisplayWikibaseDescriptionsAsTaglines
[00:19:17] 1
[00:19:58] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 67244 bytes in 0.677 second response time
[00:20:08] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.058 second response time
[00:20:18] jhobs: I wonder if API calls correctly honor X-Wikimedia-Debug
[00:20:37] well nothing is broken on mw1017, we can send it to prod
[00:20:43] Dereckson: I'm trying to check the behavior in the browser by spoofing the header
[00:20:49] yeah I don't see anything broken either
[00:20:54] but I don't see it working as intended
[00:20:55] oh, there is an extension to help you with that
[00:21:02] oh yeah, I'm using one
[00:21:06] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Wikibase descriptions on Catalan and Polish wikis (T135429) (duration: 00m 26s)
[00:21:07] T135429: Deploy Wikidata descriptions to Catalan and Polish mobile web Wikipedias stable - https://phabricator.wikimedia.org/T135429
[00:21:08] it worked for the last one
[00:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
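Spoofing the header from the command line is a quick cross-check when a browser extension misbehaves; a minimal sketch, assuming the 2016-era routing where any X-Wikimedia-Debug value sent the request to mw1017 (later setups require a backend=... value and use dedicated mwdebug hosts):

    # Page view and API call through the debug backend; without the header
    # the API request would hit the regular production cluster:
    curl -so /dev/null -w '%{http_code}\n' -H 'X-Wikimedia-Debug: 1' 'https://pl.wikipedia.org/wiki/Polska'
    curl -s -H 'X-Wikimedia-Debug: 1' 'https://pl.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json' | head -c 300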
[00:22:49] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:23:13] MatmaRex: live on mw1017
[00:24:07] Dereckson: this is logging for something we couldn't reproduce :) so i can't test :D
[00:24:51] and a https://pt.wikinews.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases didn't break anything
[00:24:55] so it looks good
[00:26:24] !log dereckson@tin Synchronized php-1.28.0-wmf.7/includes/api/ApiMain.php: UsageException to try to catch T138585 issue (duration: 00m 27s)
[00:26:25] T138585: Spike of "unknown" errors experienced by users of UploadWizard after wmf.7 deployment - https://phabricator.wikimedia.org/T138585
[00:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:36] We're done.
[00:26:51] Dereckson: odd, I can see the change working on cawiki but not plwiki...
[00:28:32] not odd
[00:28:36] ca.wiki is 1.28.0-wmf.7
[00:28:54] pl.wiki is 1.28.0-wmf.6
[00:29:01] ahhhh ok
[00:29:13] pl wiki will be .7 on Thursday I take it?
[00:29:29] 00:05 logmsgbot: dereckson@tin Synchronized php-1.28.0-wmf.6/extensions/MobileFrontend/: Introduce config variable to control tagline (T138738) (duration: 00m 29s)
[00:29:29] T138738: [Regression] Wikidata description not showing in search results on mobile web - https://phabricator.wikimedia.org/T138738
[00:30:16] and the last change in /srv/mediawiki-staging/php-1.28.0-wmf.6/extensions/MobileFrontend is indeed "Introduce config variable to control tagline"
[00:30:21] Dereckson: thanks!
[00:30:26] You're welcome MatmaRex.
[00:32:51] jhobs: https://pl.wikipedia.org/wiki/Specjalna:Wersja
[00:33:02] 1.0.0 (9b2b845) 22:57, 13 June 2016
[00:33:24] 9b2b845 is a commit *before* "Add missing dependencies to nearby and editor" and *before* your change
[00:33:55] hmm nevermind, it's the commit before the 'Creating new wmf/1.28.0-wmf.6 branch' so I guess this info is legit
[00:34:59] jhobs: do you have any idea, or do we wait until 1.28.0-wmf.7 reaches pl.wiki?
[00:35:28] Dereckson: this change isn't as urgent, but I just need to be clear on the current state so I can inform the rest of the team appropriately. So cawiki is working fine because it's on 1.28.0-wmf.7, but plwiki won't catch up until it's on the same branch, which won't happen until when?
[00:36:34] as long as it's within the next few days or so, we should be fine to wait
[00:36:46] twentyafterfour: there?
[00:36:57] I just don't know where to check that info
[00:37:09] jhobs: you can ask twentyafterfour, he handles the MediaWiki train this week
[00:38:30] at the worst, the next pl.wikipedia version upgrade is scheduled for Thursday 19:00 UTC.
[00:38:39] ok, that's totally fine
[00:39:00] we can leave it as-is then
[00:39:12] thanks again for your help tonight Dereckson
[00:39:56] jhobs: according to https://phabricator.wikimedia.org/T136973#2404464 a CSS issue triggered a rollback to .6
[00:39:59] You're welcome.
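The "where to check that info" question has a concrete answer: the wiki-to-branch mapping lives in wikiversions.json on the deployment host, and a copy is published on noc.wikimedia.org. A minimal sketch, assuming the standard file layout of the time:

    # On the deployment host:
    grep '"plwiki"' /srv/mediawiki-staging/wikiversions.json
    # expected output along the lines of:  "plwiki": "php-1.28.0-wmf.6",
    # Without shell access, the published copy works too (URL assumed here):
    curl -s https://noc.wikimedia.org/conf/wikiversions.json | grep plwiki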
[01:31:13] So I could roll forward with the train at any time, but probably will do it tomorrow - roll out wmf.7 concurrently with branching wmf.8
[01:59:57] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:17:55] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: puppet fail
[02:24:26] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1500.0]
[02:26:45] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:27:41] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 10m 51s)
[02:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:46] (03PS2) 10Rush: tools: allow users to create unpriv chroot [puppet] - 10https://gerrit.wikimedia.org/r/296245
[02:37:25] (03CR) 10Rush: [C: 032] tools: allow users to create unpriv chroot [puppet] - 10https://gerrit.wikimedia.org/r/296245 (owner: 10Rush)
[02:44:24] RECOVERY - puppet last run on rdb2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:47:40] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.7) (duration: 08m 59s)
[02:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 28 02:54:56 UTC 2016 (duration 7m 16s)
[02:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:56] (03PS1) 10Rush: phab: phab.wm.net.ru stop cloning us thanks. [puppet] - 10https://gerrit.wikimedia.org/r/296365
[02:59:39] (03PS2) 10Rush: phab: phab.wm.net.ru stop cloning us thanks. [puppet] - 10https://gerrit.wikimedia.org/r/296365
[03:01:41] (03CR) 10Rush: [C: 032] phab: phab.wm.net.ru stop cloning us thanks. [puppet] - 10https://gerrit.wikimedia.org/r/296365 (owner: 10Rush)
[03:16:34] (03PS1) 10KartikMistry: apertium-kaz: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/296366 (https://phabricator.wikimedia.org/T107306)
[03:35:01] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: puppet fail
[03:41:13] is ipv6 down? I'm getting some huge packet loss
[03:54:12] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2410813 (10KartikMistry)
[04:00:01] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:05:21] Polsaker, no packet loss with ipv4?
[04:05:45] I can successfully ping the production bastions from my VPS over IPv6
[04:07:38] ipv4 works ok for me
[04:07:51] maybe it's my isp, I can ping google and other sites without any problems
[04:08:08] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2410817 (10KartikMistry)
[04:08:36] can you run traceroute Polsaker ?
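For this kind of path debugging, mtr gives per-hop loss statistics that a one-shot traceroute lacks; a quick sketch (the bastion hostname is just an example target that answered over v6 at the time):

    traceroute6 bast1001.wikimedia.org
    # or, aggregated loss per hop over 20 probes:
    mtr -6 --report --report-cycles 20 bast1001.wikimedia.org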
[04:10:11] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2410818 (10KartikMistry)
[04:10:47] running it now.
[04:12:07] oh well
[04:12:15] looks like my isp is tunneling ipv6 through he.net
[04:12:52] nevermind then
[04:24:23] (03PS1) 10KartikMistry: apertium-tat: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-tat] - 10https://gerrit.wikimedia.org/r/296367 (https://phabricator.wikimedia.org/T107306)
[04:25:02] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2410833 (10KartikMistry)
[04:30:16] (03PS1) 10KartikMistry: apertium-urd-hin: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-urd-hin] - 10https://gerrit.wikimedia.org/r/296368 (https://phabricator.wikimedia.org/T107306)
[04:32:50] (03PS1) 10KartikMistry: apertium-kaz-tat: Rebuild for Jessie and cleanup [debs/contenttranslation/apertium-kaz-tat] - 10https://gerrit.wikimedia.org/r/296369 (https://phabricator.wikimedia.org/T107306)
[04:33:28] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2410848 (10KartikMistry)
[04:34:40] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2367433 (10KartikMistry) @akosiaris All packages are updated for Jessie (some are new upst...
[04:51:32] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0]
[05:05:23] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0]
[05:23:32] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0]
[05:42:03] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [1000.0]
[05:45:52] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: puppet fail
[06:11:12] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:31:21] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:51] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:32:02] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[06:32:02] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: puppet fail
[06:32:32] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0]
[06:33:51] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:40:17] (03PS4) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129)
[06:48:29] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0]
[06:56:40] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:57:00] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:19] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:57:19] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:29] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:31] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:59] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:20] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:56] PROBLEM - HP RAID on ms-be2022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[07:07:41] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto)
[07:09:07] RECOVERY - HP RAID on ms-be2022 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[07:20:57] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[07:22:02] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2410948 (10mehtab.ahmed) Author is not responding anymore. :(
[07:30:27] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [1000.0]
[07:34:49] (03PS2) 10KartikMistry: apertium-hin: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-hin] - 10https://gerrit.wikimedia.org/r/296228 (https://phabricator.wikimedia.org/T107306)
[07:35:23] (03PS2) 10KartikMistry: apertium-urd: New upstream release and rebuild for Jessie [debs/contenttranslation/apertium-urd] - 10https://gerrit.wikimedia.org/r/296229 (https://phabricator.wikimedia.org/T107306)
[07:47:35] (03PS1) 10Giuseppe Lavagetto: etcd::backup: consider absence of logs [puppet] - 10https://gerrit.wikimedia.org/r/296371
[07:48:37] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:49:28] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures
[08:02:07] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd::backup: consider absence of logs [puppet] - 10https://gerrit.wikimedia.org/r/296371 (owner: 10Giuseppe Lavagetto)
[08:12:37] (03PS2) 10Elukey: mediawiki: add mw1295/6 to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/296260 (owner: 10Gehel)
[08:15:06] !log rebooting aqs100[23].eqiad for kernel upgrades
[08:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:58] (03CR) 10Elukey: [C: 032] mediawiki: add mw1295/6 to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/296260 (owner: 10Gehel)
[08:16:56] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0]
[08:17:06] !log elukey@palladium conftool action : set/pooled=no; selector: aqs1002.eqiad.wmnet
[08:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:25:09] PROBLEM - Host aqs1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:25:21] ah sorry this is me! --^
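The conftool SAL entries above correspond to depooling a node before its reboot and repooling it afterwards; a minimal sketch of that cycle with confctl, assuming the selector syntax in use at the time (run from the cluster-management host, palladium here):

    confctl select 'name=aqs1002.eqiad.wmnet' set/pooled=no
    # ...reboot the host, wait for it to come back and for checks to recover...
    confctl select 'name=aqs1002.eqiad.wmnet' set/pooled=yes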
[08:26:37] RECOVERY - Host aqs1002 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[08:30:08] !log elukey@palladium conftool action : set/pooled=yes; selector: aqs1002.eqiad.wmnet
[08:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:32:15] !log rolling reboot of mediawiki canaries for kernel security update
[08:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:42:07] (03PS1) 10Jcrespo: Prepare old servers for decom by sending all queries to new servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296372 (https://phabricator.wikimedia.org/T134476)
[08:42:30] (03PS1) 10Andrew Bogott: Assing IPs to labvirt1012, 1013, 1014 [dns] - 10https://gerrit.wikimedia.org/r/296373 (https://phabricator.wikimedia.org/T138509)
[08:43:10] (03PS2) 10Jcrespo: Prepare old servers for decom by sending all queries to new servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296372 (https://phabricator.wikimedia.org/T134476)
[08:44:13] (03CR) 10Jcrespo: [C: 032] Prepare old servers for decom by sending all queries to new servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296372 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo)
[08:45:16] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [1000.0]
[08:46:31] (03PS2) 10Andrew Bogott: Assign IPs to labvirt1012, 1013, 1014 [dns] - 10https://gerrit.wikimedia.org/r/296373 (https://phabricator.wikimedia.org/T138509)
[08:47:37] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Prepare old servers for decom by sending all queries to new servers (duration: 01m 39s)
[08:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:47:53] mw1018.eqiad.wmnet returned [255]: ssh: connect to host mw1018.eqiad.wmnet port 22: No route to host
[08:47:57] checking
[08:49:44] scheduled down, but it doesn't say by whom
[08:49:59] I am installing mw1297 and mw1298, last two I promise, I might miss icinga alerts
[08:50:28] ah, it says below, Mr. Muehlenhoff
[08:53:10] jynus: kernel security reboots, I can briefly hold to help you sync mw-config?
[08:53:43] mw1018 doesn't seem to have come up, though, looking into serial console
[08:54:39] don't hold
[08:55:02] just ping when done with it
[08:56:49] jynus: you can resync now, need to look into mw1018 anyway
[08:57:08] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1500.0]
[08:57:43] !log powercycling mw1018, didn't come up after reboot
[08:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:58:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/294058 (owner: 10Muehlenhoff)
[09:01:03] jynus: mw1018 back up
[09:06:00] * gehel is looking into elasticsearch slowdown on codfw
[09:06:01] can someone sync https://meta.wikimedia.org/wiki/Interwiki_map =
[09:06:02] *?
[09:08:18] Probably
[09:11:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Allow wdqs admins to control wdqs-updater service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295968 (https://phabricator.wikimedia.org/T138627) (owner: 10Smalyshev)
[09:13:47] !log elukey@palladium conftool action : set/pooled=no; selector: aqs1003.eqiad.wmnet
[09:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:15:14] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [1000.0]
[09:16:59] (03PS1) 10Reedy: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296374
[09:17:08] Steinsplitter: ^^
[09:18:21] (03CR) 10Steinsplitter: [C: 031] "looks fine, thanks :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296374 (owner: 10Reedy)
[09:23:17] (03CR) 10Reedy: [C: 032] Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296374 (owner: 10Reedy)
[09:23:51] (03Merged) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296374 (owner: 10Reedy)
[09:25:06] !log reedy@tin Synchronized wmf-config/interwiki.php: Updated IW map (duration: 00m 49s)
[09:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:25:33] !log powercycling mw1019, didn't come up after reboot
[09:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:25:41] heh, I was about to mention that :)
[09:28:01] !log elukey@palladium conftool action : set/pooled=yes; selector: aqs1003.eqiad.wmnet
[09:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:28:28] Reedy: it's back up
[09:28:56] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:29:34] mmm this is not me
[09:29:45] PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:29] memory pressure looks good from server-board, checking hhvm
[09:30:45] (03PS2) 10Filippo Giunchedi: prometheus: add init.pp class and expand documentation [puppet] - 10https://gerrit.wikimedia.org/r/296248 (https://phabricator.wikimedia.org/T126785)
[09:30:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] prometheus: add init.pp class and expand documentation [puppet] - 10https://gerrit.wikimedia.org/r/296248 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[09:31:25] RECOVERY - mediawiki-installation DSH group on mw1295 is OK: OK
[09:31:25] RECOVERY - mediawiki-installation DSH group on mw1296 is OK: OK
[09:32:15] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 65626 bytes in 0.193 second response time
[09:32:28] !log restarted hhvm on mw1238, memory pressure ok but hhvm stuck (hhvm-dump-debug in /tmp/hhvm.14788.bt.)
[09:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
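When an HHVM worker wedges like this, the usual triage is to capture a stack dump before restarting so the hang can be analyzed afterwards; a sketch, assuming the hhvm-dump-debug wrapper mentioned in the log (a WMF helper that writes all thread backtraces under /tmp):

    # On the affected appserver:
    sudo hhvm-dump-debug        # leaves something like /tmp/hhvm.<pid>.bt behind
    sudo service hhvm restart   # then confirm the Apache/HHVM checks recover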
[09:33:11] (03PS1) 10Alexandros Kosiaris: Introduce $::networks::constants::networks [puppet] - 10https://gerrit.wikimedia.org/r/296375
[09:33:13] (03PS1) 10Alexandros Kosiaris: Introduce $::networks::constants::labs_networks [puppet] - 10https://gerrit.wikimedia.org/r/296376
[09:35:19] a lot of threads waiting for
[09:35:21] #01 0x00007fdeebbeef2c in __lll_lock_wait () from /build/eglibc-3GlaMS/eglibc-2.19/nptl/../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
[09:35:46] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.030 second response time
[09:36:55] PROBLEM - Check size of conntrack table on mw1298 is CRITICAL: Timeout while attempting connection
[09:37:32] this one is mine, new imagescaler, just silenced --^
[09:37:44] (03PS1) 10Alexandros Kosiaris: otrs: Replace EXTERNAL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296377
[09:39:07] !log powercycling mw1021, didn't come up after reboot
[09:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
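Several hosts in this log (mw1018, mw1019, mw1021, later the elastic10xx nodes) needed a manual powercycle from their out-of-band management interface after failing to come back from a reboot. A generic sketch, assuming plain IPMI access to the host's .mgmt interface; the actual WMF management tooling and credentials may differ:

    ipmitool -I lanplus -H mw1021.mgmt.eqiad.wmnet -U root chassis power status
    ipmitool -I lanplus -H mw1021.mgmt.eqiad.wmnet -U root chassis power cycle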
[09:41:39] (03CR) 10Muehlenhoff: [C: 031] otrs: Replace EXTERNAL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296377 (owner: 10Alexandros Kosiaris)
[09:49:16] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2411057 (10Danny_B) >>! In T137224#2407860, @Dzahn wrote: > I merged...
[09:51:50] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2411064 (10Danny_B) I guess the cause may be the asterisk in `
(03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/3213/, +1ed already, merging. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/296377 (owner: 10Alexandros Kosiaris)
[09:56:23] (03PS2) 10Alexandros Kosiaris: otrs: Replace EXTERNAL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296377
[09:56:27] (03CR) 10Alexandros Kosiaris: [V: 032] otrs: Replace EXTERNAL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296377 (owner: 10Alexandros Kosiaris)
[10:01:34] (03PS1) 10Alexandros Kosiaris: ferm: Remove EXTERNAL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296379
[10:01:40] (03PS1) 10Ema: Add 0009-tcp-collector-specify-gauges.patch [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/296380 (https://phabricator.wikimedia.org/T138758)
[10:03:14] (03CR) 10Muehlenhoff: [C: 031] "Let's kill it" [puppet] - 10https://gerrit.wikimedia.org/r/296379 (owner: 10Alexandros Kosiaris)
[10:17:36] RECOVERY - Check size of conntrack table on mw1298 is OK: OK: nf_conntrack is 0 % full
[10:17:40] <_joe_> moritzm: leave mw1021 to die if it can't come up
[10:18:40] it came up after a powercycle
[10:20:25] <_joe_> I should actually start killing the old jobrunners
[10:41:56] (03PS1) 10Elukey: Add mw129[78] to the MediaWiki scap dsh list. [puppet] - 10https://gerrit.wikimedia.org/r/296381
[10:44:03] !log rolling reboot of mediawiki in codfw for kernel security update
[10:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:44:17] (03PS1) 10Jcrespo: Depool db1039, 72, 61, 68, 64, 35, 44, 63, 67, 72, 73 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296382 (https://phabricator.wikimedia.org/T134476)
[10:44:51] (03CR) 10Elukey: [C: 032] Add mw129[78] to the MediaWiki scap dsh list. [puppet] - 10https://gerrit.wikimedia.org/r/296381 (owner: 10Elukey)
[10:46:35] (03PS1) 10Giuseppe Lavagetto: jobrunners: decommission mw1001-mw1008 [puppet] - 10https://gerrit.wikimedia.org/r/296383
[10:51:33] (03PS2) 10Jcrespo: Depool db1023, 24, 33, 35, 39, 44, 52, 61, 64, 68, 63, 67, 72, 73 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296382 (https://phabricator.wikimedia.org/T134476)
[10:54:06] 06Operations, 07Availability: Depool all eqiad hosts from mediawiki - https://phabricator.wikimedia.org/T138810#2411102 (10jcrespo)
[10:55:01] 06Operations, 07Availability: Depool all eqiad hosts from mediawiki - https://phabricator.wikimedia.org/T138810#2411115 (10jcrespo)
[10:55:03] 06Operations, 10DBA, 13Patch-For-Review: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2411114 (10jcrespo)
[10:55:26] 06Operations, 07Availability: Depool all eqiad hosts from mediawiki - https://phabricator.wikimedia.org/T138810#2411102 (10jcrespo)
[10:55:41] (03PS1) 10Giuseppe Lavagetto: Remove mw1001-1008 [dns] - 10https://gerrit.wikimedia.org/r/296384
[10:55:54] 06Operations, 07Availability: Depool all eqiad hosts from mediawiki - https://phabricator.wikimedia.org/T138810#2411102 (10jcrespo)
[10:57:22] 06Operations, 07Availability: Depool all eqiad hosts from mediawiki - https://phabricator.wikimedia.org/T138810#2411122 (10jcrespo)
[10:59:34] <_joe_> depool all eqiad hosts? or all old eqiad hosts?
[11:02:49] all aka datacenter failover, but not necessarily that
[11:03:11] if there was another way of doing that (e.g. read-only maintenance)
[11:04:59] the reason is that I had to use older hosts last time because we didn't have enough new machines
[11:05:17] and the master always has to be the slower machine of the shard
[11:06:29] we may also want to have a way to test codfw in read-only mode, because some of those changes are dangerous
[11:09:34] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:15:39] (03PS1) 10Filippo Giunchedi: prometheus: add mysqld exporter [puppet] - 10https://gerrit.wikimedia.org/r/296385 (https://phabricator.wikimedia.org/T126757)
[11:19:54] (03PS1) 10Giuseppe Lavagetto: role::etcd: fix bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/296386
[11:20:22] (03CR) 10Filippo Giunchedi: "one nit, LGTM otherwise" (031 comment) [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/296380 (https://phabricator.wikimedia.org/T138758) (owner: 10Ema)
[11:21:26] (03PS2) 10Giuseppe Lavagetto: role::etcd: fix bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/296386
[11:21:30] !log rolling restart of elasticsearch eqiad
[11:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
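During a rolling restart like the one just logged, each node is typically restarted only once the cluster reports green again; a minimal sketch of that health gate against the standard Elasticsearch REST API (a locally reachable node on port 9200 is assumed):

    # Wait until the cluster is green before moving on to the next node:
    until curl -s localhost:9200/_cluster/health | grep -q '"status":"green"'; do
        sleep 10
    done
    curl -s 'localhost:9200/_cat/nodes?h=name,uptime'   # confirm the node rejoined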
[11:23:19] (03CR) 10Giuseppe Lavagetto: [C: 032] role::etcd: fix bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/296386 (owner: 10Giuseppe Lavagetto)
[11:25:04] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 4 failures
[11:31:05] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:31:53] PROBLEM - Host elastic1001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:32:29] ^elastic1001 is me, taking more time to reboot than expected. Looking...
[11:33:44] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[11:34:05] ACKNOWLEDGEMENT - Host elastic1001 is DOWN: PING CRITICAL - Packet loss = 100% Gehel server not rebooting as expected, looking into it
[11:35:04] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:36:36] (03PS4) 10Elukey: Add the -T VSL API timeout parameter plus the related formatter. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652
[11:37:18] 06Operations, 05Security: CVE-2016-4997 - https://phabricator.wikimedia.org/T138811#2411144 (10MoritzMuehlenhoff)
[11:38:45] moritzm: mw1291 to mw1298 (new imagescalers) ready to be pooled
[11:41:54] elukey: thanks, I'll do that tomorrow
[11:43:41] !log powercycling elastic1001 (server not coming up during restart - T138811)
[11:43:42] T138811: CVE-2016-4997 - https://phabricator.wikimedia.org/T138811
[11:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:48:14] RECOVERY - Host elastic1001 is UP: PING OK - Packet loss = 0%, RTA = 2.85 ms
[11:53:14] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:53:34] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:58:04] PROBLEM - Host elastic1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:59:23] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:59:37] ^ elastic1002 is me again
[12:00:19] !log powercycling elastic1002 (server not coming up during restart - T138811)
[12:00:21] T138811: CVE-2016-4997 - https://phabricator.wikimedia.org/T138811
[12:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:01:35] ACKNOWLEDGEMENT - Host elastic1002 is DOWN: PING CRITICAL - Packet loss = 100% Gehel Server not coming up at reboot, powercycling
[12:03:30] RECOVERY - Host elastic1002 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms
[12:04:30] PROBLEM - NTP on elastic1002 is CRITICAL: NTP CRITICAL: Offset unknown
[12:06:40] (03PS1) 10Gehel: Elasticsearch response time threshold now uses more_like query [puppet] - 10https://gerrit.wikimedia.org/r/296388
[12:06:42] RECOVERY - NTP on elastic1002 is OK: NTP OK: Offset 0.001044869423 secs
[12:09:12] (03CR) 10DCausse: [C: 031] Elasticsearch response time threshold now uses more_like query [puppet] - 10https://gerrit.wikimedia.org/r/296388 (owner: 10Gehel)
[12:14:25] (03CR) 10Gehel: [C: 032] Elasticsearch response time threshold now uses more_like query [puppet] - 10https://gerrit.wikimedia.org/r/296388 (owner: 10Gehel)
[12:24:59] 06Operations, 10ops-codfw: mw2098 not coming up after reboot - https://phabricator.wikimedia.org/T138812#2411196 (10MoritzMuehlenhoff)
[12:25:43] ACKNOWLEDGEMENT - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff hardware problem, T138812
[12:28:20] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:31:35] (03PS2) 10Gehel: Allow wdqs admins to control wdqs-updater service [puppet] - 10https://gerrit.wikimedia.org/r/295968 (https://phabricator.wikimedia.org/T138627) (owner: 10Smalyshev)
[12:32:09] (03CR) 10Gehel: [C: 04-1] "Still needs discussion in Ops weekly before merging" [puppet] - 10https://gerrit.wikimedia.org/r/295968 (https://phabricator.wikimedia.org/T138627) (owner: 10Smalyshev)
[12:33:09] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:37:43] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable updater service - https://phabricator.wikimedia.org/T138627#2411213 (10Gehel) As pointed out by @akosiaris, what is actually needed here is mask/unmask, not enable/disable.
[12:39:05] (03CR) 10Gehel: Allow wdqs admins to control wdqs-updater service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295968 (https://phabricator.wikimedia.org/T138627) (owner: 10Smalyshev)
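The mask/unmask distinction matters because a merely disabled unit can still be started by hand or by a dependency, while a masked unit is linked to /dev/null and cannot be started at all; a minimal sketch with systemctl, using the wdqs-updater unit under discussion:

    sudo systemctl mask wdqs-updater     # prevent any start, manual or automatic
    sudo systemctl stop wdqs-updater
    # ...maintenance window...
    sudo systemctl unmask wdqs-updater
    sudo systemctl start wdqs-updater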
[12:44:08] (03CR) 10Jcrespo: [C: 032] Depool db1023, 24, 33, 35, 39, 44, 52, 61, 64, 68, 63, 67, 72, 73 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296382 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo)
[12:49:00] (03PS1) 10Elukey: Allow communications between AQS and Analytics Hadoop on port 7000. [puppet] - 10https://gerrit.wikimedia.org/r/296389
[12:50:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1023, 24, 33, 35, 39, 44, 52, 61, 64, 68, 63, 67, 72, 73 (duration: 02m 39s)
[12:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:50:37] ssh: connect to host mw2123.codfw.wmnet port 22: Connection timed out
[12:50:42] ssh: connect to host mw2098.codfw.wmnet port 22: Connection timed out
[12:51:39] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:52:24] now I have to play picture cards with the available hosts
[12:52:30] mw2098 is T138812
[12:52:31] T138812: mw2098 not coming up after reboot - https://phabricator.wikimedia.org/T138812
[12:52:49] currently looking into 2123, connection to serial console times out for me
[12:53:03] ok, then I will put it out of pool and dsh
[12:53:15] (but I have a meeting first)
[12:53:20] (03PS2) 10Elukey: Allow communications between AQS and Analytics Hadoop on port 7000. [puppet] - 10https://gerrit.wikimedia.org/r/296389
[12:53:29] 2098 is already depooled
[12:53:41] thanks for that
[12:54:16] 06Operations, 10ops-codfw: mw2098 not coming up after reboot - https://phabricator.wikimedia.org/T138812#2411196 (10jcrespo) > 2098 is already depooled I will take it out of dsh so scap sync doesn't timeout.
[12:55:33] 06Operations, 10ops-codfw: mw2098 / mw2123 not coming up after reboot - https://phabricator.wikimedia.org/T138812#2411280 (10MoritzMuehlenhoff)
[12:56:28] RECOVERY - salt-minion processes on labstore2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:57:07] (03PS1) 10Jcrespo: Remove from dsh mw2098 [puppet] - 10https://gerrit.wikimedia.org/r/296390 (https://phabricator.wikimedia.org/T138812)
[12:57:16] (03PS5) 10Elukey: Add the -T VSL API timeout parameter plus the related formatter. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652
[12:57:21] ACKNOWLEDGEMENT - Host mw2123 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff hardware problem, T138812
[12:57:25] if someone can review and baby-sit that and/or the other one, nice, if not, I can do it in 1 hour
[12:57:42] 06Operations, 10ops-codfw, 13Patch-For-Review: mw2098 / mw2123 not coming up after reboot - https://phabricator.wikimedia.org/T138812#2411291 (10MoritzMuehlenhoff) Also depooled mw2123
[12:57:50] jynus: I'll do that
[12:57:51] (I do not want to deploy and then forget)
[12:58:09] also, it would not be the first time I "depool" the wrong host
[12:58:39] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:59:04] (03CR) 10Ema: [C: 031] "LGTM" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/295652 (owner: 10Elukey)
[12:59:19] jynus: does that need anything more than the mere puppet merge?
[12:59:54] I suppose apply it on tin/mira
[13:00:19] ok
[13:00:36] I'll extend that with mw2123 and merge
[13:00:52] ask for help from someone that deploys to +1 it if in doubt
[13:01:28] jynus has a meeting now :)
[13:03:49] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:09:04] (03CR) 10Elukey: [C: 031] "Had a chat with Ema about the change, LGTM (Ema also tried it with PCC and looks good)." [puppet/nginx] - 10https://gerrit.wikimedia.org/r/295937 (owner: 10Ema)
[13:10:36] (03CR) 10Elukey: "Puppet compiler: https://puppet-compiler.wmflabs.org/3216/" [puppet] - 10https://gerrit.wikimedia.org/r/296389 (owner: 10Elukey)
[13:12:42] (03CR) 10Ema: [C: 032 V: 032] Reload nginx upon config file modification [puppet/nginx] - 10https://gerrit.wikimedia.org/r/295937 (owner: 10Ema)
[13:13:18] (03PS13) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813)
[13:16:26] (03CR) 10Muehlenhoff: Allow communications between AQS and Analytics Hadoop on port 7000. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296389 (owner: 10Elukey)
[13:17:21] (03PS1) 10Ema: nginx module: add reload support [puppet] - 10https://gerrit.wikimedia.org/r/296393
[13:18:11] (03PS2) 10Muehlenhoff: Remove mw2098 and mw2123 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/296390 (https://phabricator.wikimedia.org/T138812) (owner: 10Jcrespo)
[13:20:35] (03PS3) 10Elukey: Allow communications between AQS and Analytics Hadoop on port 7000. [puppet] - 10https://gerrit.wikimedia.org/r/296389
[13:23:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove mw2098 and mw2123 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/296390 (https://phabricator.wikimedia.org/T138812) (owner: 10Jcrespo)
[13:23:27] (03PS4) 10Elukey: Allow communications between AQS and Analytics Hadoop on port 7000. [puppet] - 10https://gerrit.wikimedia.org/r/296389
[13:26:47] (03PS2) 10Ema: nginx module: add reload support [puppet] - 10https://gerrit.wikimedia.org/r/296393
[13:26:56] (03CR) 10Ema: [C: 032 V: 032] nginx module: add reload support [puppet] - 10https://gerrit.wikimedia.org/r/296393 (owner: 10Ema)
[13:28:37] chasemp: FYI labstore1004 has its network pegged since yesterday, known?
[13:29:33] godog: yeah it's replicating to the secondary, which is taking enough time to maybe not be practical
[13:29:36] but yeah it's not an issue
[13:29:53] oh ok, yeah saw nothing in SAL hence the asking
[13:32:23] that's on me, not serving any users and I'm looking to do some real IO benchmarking to see if this is going to work out, but I should have noted it
[13:34:14] chasemp: nice, replicating with drbd I see?
[13:34:36] yep
[13:38:11] one thing about DRBD is if the secondary gets far enough behind for protocol A it throws in the towel and starts replication from scratch
[13:38:34] so this replication taking eons may be a thing, I tested a basic setup but obv didn't have 20T to throw at it
[13:39:39] aye, though I suspect it should be done soon with 20T at 120MB/s
[13:40:35] !log elukey@palladium conftool action : set/pooled=yes; selector: aqs1001.eqiad.wmnet
[13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:41:00] godog: yeah I'm not too distraught since it's all things at once now and the actual tools operational data is only 9T going forward so we'll see
[13:42:39] yeah not too bad
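Protocol A is DRBD's asynchronous mode: a write is confirmed as soon as it hits the local disk and the TCP send buffer, which is why a secondary that falls too far behind forces a full resync. A minimal sketch of a resource configured that way, with hostnames, devices, and addresses as placeholders rather than the actual labstore setup:

    # /etc/drbd.d/tools.res -- illustrative only
    resource tools {
        protocol A;                 # async: fastest, weakest durability guarantee
        device    /dev/drbd0;
        disk      /dev/vg0/tools;
        meta-disk internal;
        on primary-host   { address 10.0.0.1:7788; }
        on secondary-host { address 10.0.0.2:7788; }
    }
    # resync progress is visible in /proc/drbd on either node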
[13:43:11] (03PS2) 10Filippo Giunchedi: Keep daily graphite data for 5 years [puppet] - 10https://gerrit.wikimedia.org/r/266567 (owner: 10EBernhardson)
[13:43:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Keep daily graphite data for 5 years [puppet] - 10https://gerrit.wikimedia.org/r/266567 (owner: 10EBernhardson)
[13:46:58] !log bounce carbon on graphite machines after applying https://gerrit.wikimedia.org/r/266567
[13:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:49:09] PROBLEM - Host mw2134 is DOWN: PING CRITICAL - Packet loss = 100%
[13:50:01] 06Operations, 10ops-codfw, 13Patch-For-Review: mw2098 / mw2123 / mw2134 not coming up after reboot - https://phabricator.wikimedia.org/T138812#2411378 (10MoritzMuehlenhoff)
[13:50:38] ACKNOWLEDGEMENT - Host mw2134 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff hardware problem, T138812
[13:51:29] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:54:12] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2411398 (10elukey) 05Resolved>03Open
[13:57:03] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2411402 (10elukey) Reopening this task after a chat with Alex. The Analytics team discovered that we should open port 7000, usually reserved for Cassandra internode...
[13:57:41] joal: --^ re-opened the task, going to wait for an answer from Alex about what to do
[13:58:01] elukey: means we are ready to open ports on switch?
[13:58:09] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:58:39] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:59:30] joal: we'd need to discuss it with Alex but it should be ok. The main weird thing is of course opening the Cassandra internode communication port to hadoop, but we are doing something completely new so it should be fine
[14:00:59] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2185.codfw.wmnet because of too many down!
[14:02:16] elukey: You're my man :)
[14:04:25] <_joe_> uh moritzm ^^
[14:04:31] <_joe_> it's codfw I know
[14:04:38] <_joe_> but still, something strange there
[14:05:00] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2196.codfw.wmnet because of too many down!
[14:05:38] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:06:26] hmm, I rebooted the same amount of servers per batch, but maybe it now triggers the threshold since three servers haven't come up? will reduce further for the remaining ones
[14:07:01] 06Operations, 07Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821#2411427 (10fgiunchedi)
[14:07:09] 06Operations, 07Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821#2411439 (10fgiunchedi) p:05Triage>03Normal
[14:07:29] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[14:07:50] <_joe_> moritzm: I'm taking a look anyways
[14:08:18] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[14:08:39] (03PS2) 10Ema: Add 0009-tcp-collector-specify-gauges.patch [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/296380 (https://phabricator.wikimedia.org/T138758)
[14:08:58] (03PS1) 10Jdrewniak: Bumping portals to master Removing survey banner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296399 (https://phabricator.wikimedia.org/T136874)
[14:19:03] moritzm, what was the final state, I see more servers didn't come up?
[14:20:31] 06Operations, 07Graphite: extend existing graphite whisper files retention to five years - https://phabricator.wikimedia.org/T138821#2411456 (10fgiunchedi) a:03fgiunchedi
[14:21:59] (03CR) 10Filippo Giunchedi: [C: 031] Add 0009-tcp-collector-specify-gauges.patch [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/296380 (https://phabricator.wikimedia.org/T138758) (owner: 10Ema)
[14:22:36] (03PS1) 10Muehlenhoff: Also remove mw2134 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/296400 (https://phabricator.wikimedia.org/T138812)
[14:24:22] joal: time to chat with akosiaris about Cassandra and streaming?
[14:24:51] sure elukey and akosiaris
[14:25:33] hey Joal
[14:25:39] Hi akosiaris
[14:25:44] akosiaris: howdy?
[14:25:49] quite well, you ?
[14:25:58] jynus: one more, I'm merging https://gerrit.wikimedia.org/r/296400 as we speak
[14:26:05] good as well, fighting with cassandra :)
[14:26:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also remove mw2134 from dsh [puppet] - 10https://gerrit.wikimedia.org/r/296400 (https://phabricator.wikimedia.org/T138812) (owner: 10Muehlenhoff)
[14:26:38] joal: so, as I was looking into T138609 I could not help noticing you want to use the intra-cluster protocol to ship the precompacted SSTables
[14:26:38] T138609: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609
[14:26:44] am I right ?
[14:26:49] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:26:51] correct akosiaris
[14:27:07] We finally went the direction suggested by urandom
[14:27:23] It is supposed to help a lot from a performance perspective
[14:27:28] akosiaris: --^
[14:28:04] ok, so my question is, since you will be basically shipping something quite internal to cassandra, how will you handle SSTable format changes ?
[14:29:10] akosiaris: we need to ensure version match from the loading and receiving side (jars / cassandra)
[14:29:21] (03CR) 10Ema: [C: 032 V: 032] Add 0009-tcp-collector-specify-gauges.patch [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/296380 (https://phabricator.wikimedia.org/T138758) (owner: 10Ema)
[14:29:42] joal: and will you be doing that for every cassandra upgrade ?
[14:29:55] and quite importantly, will it be worth it ?
[14:29:56] akosiaris: and, we also have the non-bulk loading way, in case jobs start failing
[14:30:21] akosiaris: it's worth it now as a one-off for sure, for loading 1 year of data
[14:30:52] after that, I can't really say: I'd need to test in real conditions to measure the gain
[14:30:54] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Traffic, 07HTTPS: https://ores.wikimedia.org redirects me to HTTP when I don't include a trailing slash - https://phabricator.wikimedia.org/T138682#2407209 (10Halfak) a:03Ladsgroup
[14:30:57] could be, I am not arguing with that. But what about the long run
[14:31:34] also, I am not following how we make sure the SSTables shipped are correct and will not be causing cassandra crashes
[14:31:36] akosiaris: from what I have seen from urandom managing clusters, upgrades are slow and well synchronised, so I'm not afraid
[14:32:19] akosiaris: we use SSTableWriter and SSTableLoader, pieces of code shipped with cassandra
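For reference, this is the bulk path being discussed: the Hadoop job writes SSTables with Cassandra's writer API, and the resulting directory is then streamed into the cluster with the sstableloader tool that ships with Cassandra, which is what needs the streaming/internode ports reachable. A minimal sketch, with the keyspace and table names invented for illustration:

    # Directory layout produced by the job: <keyspace>/<table>/*.db
    sstableloader -d aqs1004.eqiad.wmnet local_group_default_T_pageviews/data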
[14:32:44] joal: oh, I am very much afraid. Despite urandom's valiant efforts he has spent hours trying just to get cassandra working again after a cluster upgrade
[14:32:58] also akosiaris, it's the correct moment for us to try that: we have a new cluster not yet used to serve production data
[14:33:19] it will not be a new cluster soon though. it will be the production one very very soon
[14:33:24] arf, didn't know about that anecdote akosiaris
[14:33:39] akosiaris: it is supposed to, at least
[14:33:46] not an anecdote, there's a task about it. A quite long one btw
[14:33:55] with a lot of pain shared inside
[14:34:05] lemme see if I can find it
[14:34:08] akosiaris: we need to load 1 year of data, and going with the CQL loader, it takes about 1 week per month to load
[14:34:19] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:35:09] yeah, that the CQL loader is not a very performant process I know
[14:35:15] so akosiaris, hearing your concerns, I'm in favor of having both bulk and CQL, for different usage
[14:35:20] I'll tell you what it feels like to me
[14:35:53] akosiaris: For daily small-enough regular loading, CQL is fine; for big backlogs, bulk should be better
[14:35:56] it feels like we have this engine with not enough horsepower and we are starting to poke with a hammer at it trying to inject nitro into it to make it more powerful
[14:36:22] (need for speed pun)
[14:36:26] akosiaris: While I understand your point, I don't really agree
[14:36:49] to me it's more about fast-tracking bootstrap
[14:37:04] my description is obviously a bad one, but my point is that maybe the engine is not good enough
[14:37:57] and we are just trying to work around its problems
[14:38:05] anyway, I am not gonna block or something
[14:38:06] <_joe_> akosiaris: I don't think the upgrades are really related to the way you bootstrap new nodes
[14:38:06] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2411475 (10elukey) The growth factor didn't play any role into mc1007's hit ratio, double checked with last days of data.
[14:38:21] <_joe_> akosiaris: we do some similar tricks with mysql
[14:38:30] akosiaris: While this could be discussed (there is an existing task to keep looking at ScyllaDB), using engines with their capabilities is also a good thing to do
[14:38:50] You'd probably not even doubt using mysqldump for instance, this looks the same to me
[14:39:06] actually mysqldump dumps SQL
[14:39:19] but I would very highly doubt someone shipping innodb tables
[14:39:38] which is what this looks like to me from afar
[14:40:13] akosiaris: yes for the analogy, no from a user-feature perspective
[14:40:29] akosiaris: bulk loading is supposed to be something reasonable to do
[14:40:49] "supposed"
[14:40:55] that is something I love to hear!!!!
[14:41:04] <_joe_> akosiaris: well, we do load the buffer pool :)
[14:41:17] as always with new things to test: before having tested, it is always "supposed" in my mind :)
[14:41:40] <_joe_> I know it's not the same, but if urandom suggested that, I'd like to hear his opinion
[14:42:06] _joe_: works for me :)
[14:42:37] _joe_: I have to say that I started investigation in that direction following his advice
[14:42:44] anyway, I am glad my concerns have been heard. If urandom suggested it, it is probably gonna be ok. But I'd love to hear what happens in the long run
[14:42:53] <_joe_> yeah that's why I am giving benefit of the doubt :)
[14:42:58] <_joe_> akosiaris: me too
[14:43:11] <_joe_> I was just stating I'm not against the idea in principle
[14:43:26] <_joe_> but of course it raises some concerns
[14:43:33] _joe_, akosiaris : It's good you guys make sure concerns are heard :)
[14:44:57] <_joe_> joal: I am unsure when he'll be around though, maybe next week
[14:45:14] _joe_: not sure either, I think he was at wikimania
[14:45:34] <_joe_> joal: yes, and now at the team offsite
[14:45:41] <_joe_> (I was at wikimania too)
[14:45:43] Ah, didn't know about that one
[14:45:48] k _joe_ :)
[14:47:53] akosiaris, _joe_ : While waiting for urandom for discussion, do you agree with me making sure the thing works first (which means opening the port)?
[14:48:29] <_joe_> makes sense
[14:48:37] Cause for now, it has been tested on the new (nor old) cluster
[14:48:45] it has NOT sorry
[14:49:22] joal: yeah, but for what is worth, testing generally should happen in labs
[14:49:48] akosiaris: definitely
[14:50:17] akosiaris: Since we were doing performance testing, we used the new cluster, but now for functional testing, it can be done in labs
[14:50:28] PROBLEM - Host elastic1004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:28] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:51:32] ^ elastic1004 is me again. Seems it's not a good day for reboots
[14:52:46] !log powercycling elastic1004 (server not coming up during restart - T138811)
[14:52:47] T138811: CVE-2016-4997 - https://phabricator.wikimedia.org/T138811
[14:52:47] (03PS2) 10Thcipriani: scap: make deployment aware of canary machines [puppet] - 10https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068)
[14:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:54:40] ACKNOWLEDGEMENT - Host elastic1004 is DOWN: PING CRITICAL - Packet loss = 100% Gehel reboot failed - powercycling
[14:55:07] (03PS1) 10Urbanecm: Change project logo for enwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296403 (https://phabricator.wikimedia.org/T138801)
[14:56:19] RECOVERY - salt-minion processes on labstore2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:57:19] RECOVERY - Host elastic1004 is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms
[14:58:39] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:58:40] * James_F waves in advance of jouncebot asking.
[14:58:45] _joe_: I added https://gerrit.wikimedia.org/r/#/c/294742/1 for puppetswat. I don't know if you've had any time to take a look at it (doubtful with wikimania), but any feedback you have there would be appreciated.
[14:58:55] (03PS2) 10Urbanecm: Change project logo for enwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296403 (https://phabricator.wikimedia.org/T138801)
[15:00:05] anomie, ostriches, thcipriani, hashar, twentyafterfour, and Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160628T1500).
[15:00:05] James_F and jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:01:15] I can SWAT today.
[15:01:41] (03PS4) 10Thcipriani: Enable VisualEditor by default for all users of the Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [15:02:14] Woo. [15:02:41] thcipriani: The VE ones should all merge without needing manual rebases, BTW. [15:02:55] akosiaris: thinking again about testing, I'd really like it if we could take advantage of the new cluster [15:03:03] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [15:03:42] akosiaris: While the test is first functional (we have no proof it works), the second aspect of it is performance, for which a good bunch has already been done using the new cluster [15:03:48] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:04:44] James_F: hmm...I wonder if I need to remove your -2 to make zuul pay attention [15:04:50] thcipriani: Oh, oops, yeah. [15:04:54] * James_F does it. [15:04:58] joal: well, tests with multiple aspects are not really well designed tests. Especially if the functional part can be very easily split off into a different test [15:05:26] (03CR) 10Jforrester: [C: 031] "Now good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [15:05:33] (03CR) 10Jforrester: [C: 031] "Now good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) (owner: 10Jforrester) [15:05:38] (03CR) 10Jforrester: [C: 031] "Now good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) (owner: 10Jforrester) [15:05:41] (03CR) 10Jforrester: [C: 031] "Now good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) (owner: 10Jforrester) [15:05:50] performance testing is very important obviously, and it should be done on the new cluster, but we are not there yet, are we ? [15:06:12] we don't even know the TCP ports the bulk loading uses yet [15:06:13] akosiaris: if loading works, then performance is observable [15:06:20] (03CR) 10Jforrester: [C: 032] "*prod for Jenkins*" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [15:06:27] (03CR) 10Thcipriani: Enable VisualEditor by default for all users of the Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [15:06:34] well, it doesn't yet, does it ? [15:06:54] James_F: heh, I was just about to prod jenkins :P [15:06:57] akosiaris: indeed it does not [15:06:59] (03Merged) 10jenkins-bot: Enable VisualEditor by default for all users of the Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994) (owner: 10Jforrester) [15:07:15] * James_F grins. [15:07:51] akosiaris: possibly with port 7000 opened it might?
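For context on the bulk-versus-CQL thread above: the CQL loader pushes rows one statement at a time through a coordinator node, while the bulk path pre-builds SSTables offline and streams them straight to the replicas — which is why it is so much faster for a year-sized backfill, and why it draws the "shipping innodb tables" comparison. A minimal sketch of the two approaches, assuming stock Cassandra tooling (host names, keyspace, table, and file paths are placeholders, not the real cluster):

    # CQL path: row-at-a-time through the coordinator; simple but slow.
    cqlsh cassandra-host -e "COPY metrics.pageviews FROM 'month.csv' WITH HEADER = true"

    # Bulk path: stream pre-built SSTables directly into the cluster.
    sstableloader -d cassandra-host1,cassandra-host2 /staging/metrics/pageviews/

Which TCP ports sstableloader's streaming actually opens is exactly the open question in this exchange; 7000 is Cassandra's conventional inter-node/streaming port, which is presumably why it comes up here.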
[15:09:08] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: [[gerrit:292749|Enable VisualEditor by default for all users of the Italian Wikivoyage (T136994)]] (duration: 00m 25s) [15:09:09] T136994: Enable VisualEditor by default for all users of the Italian Wikivoyage - https://phabricator.wikimedia.org/T136994 [15:09:10] 06Operations, 10RESTBase, 06Services, 10Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken - https://phabricator.wikimedia.org/T138848#2411902 (10Krinkle) [15:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:31] ^ James_F [15:09:43] Yup, LGTM. [15:09:51] joal: could be. but that only reinforces my argument that we don't even know the TCP ports yet. Can we please find them out ? It should not be that much trouble. It's a tcpdump in a working bulk loading instance [15:09:53] (03PS4) 10Thcipriani: Enable VisualEditor by default for all users of the French Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) (owner: 10Jforrester) [15:10:11] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) (owner: 10Jforrester) [15:10:13] 06Operations, 10RESTBase, 06Services, 10Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken - https://phabricator.wikimedia.org/T138848#2411922 (10Krinkle) [15:10:33] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2411926 (10akosiaris) >>! In T138609#2411402, @elukey wrote: > Reopening this task after a chat with Alex. > > The Analytics team discovered that we should open por... [15:10:46] (03Merged) 10jenkins-bot: Enable VisualEditor by default for all users of the French Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993) (owner: 10Jforrester) [15:10:47] James_F: do you want these to go 1 at a time? Or can I just get them all out at once? [15:10:54] All at once is fine. 
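A rough sketch of the capture akosiaris is asking for — run on a host while a bulk load is in progress, it tallies the destination host:port pairs the loader actually connects to (interface name and destination network are placeholders):

    # Count outbound connection attempts (SYNs) toward the Cassandra nodes,
    # grouped by destination host.port:
    sudo tcpdump -nn -i eth0 'tcp[tcpflags] & tcp-syn != 0 and dst net 10.64.0.0/16' \
        | awk '{print $5}' | sort | uniq -c | sort -rn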
[15:11:10] (03PS4) 10Thcipriani: Enable VisualEditor by default for all users of the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) (owner: 10Jforrester) [15:11:23] 06Operations, 10RESTBase, 06Services, 10Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken / RESTBase not discoverable - https://phabricator.wikimedia.org/T138848#2411902 (10Krinkle) [15:11:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) (owner: 10Jforrester) [15:12:06] (03Merged) 10jenkins-bot: Enable VisualEditor by default for all users of the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992) (owner: 10Jforrester) [15:12:20] akosiaris: We will look into that - It's a bit more than just a tcpdump, given the settings that need to work for the streaming to even start, but we'll get you that information [15:12:24] (03PS4) 10Thcipriani: Enable VisualEditor by default for all users of the German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) (owner: 10Jforrester) [15:12:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) (owner: 10Jforrester) [15:13:10] (03Merged) 10jenkins-bot: Enable VisualEditor by default for all users of the German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991) (owner: 10Jforrester) [15:14:02] joal: thank you very much. greatly appreciated! [15:14:48] akosiaris: you've earned yourself the time for discussion with urandom with that :) [15:15:00] :-) [15:16:16] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: Enable VisualEditor by default for all users of the [[gerrit:292750|French (T136993)]], [[gerrit:292751|English (T136992)]], and [[gerrit:292752|German (T136991)]] Wikivoyage (duration: 00m 24s) [15:16:18] T136993: Enable VisualEditor by default for all users of the French Wikivoyage - https://phabricator.wikimedia.org/T136993 [15:16:18] T136992: Enable VisualEditor by default for all users of the English Wikivoyage - https://phabricator.wikimedia.org/T136992 [15:16:18] T136991: Enable VisualEditor by default for all users of the German Wikivoyage - https://phabricator.wikimedia.org/T136991 [15:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:22] ^ James_F check please [15:18:35] thcipriani: Yup, all three are good to go. [15:18:52] James_F: great, thanks for checking!
[15:19:21] jan_drewniak: ping for SWAT [15:19:37] o/ [15:19:49] thcipriani: o/ [15:20:11] jan_drewniak: ack, just looking over patch :) [15:21:32] (03PS2) 10Thcipriani: Bumping portals to master Removing survey banner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296399 (https://phabricator.wikimedia.org/T136874) (owner: 10Jdrewniak) [15:21:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296399 (https://phabricator.wikimedia.org/T136874) (owner: 10Jdrewniak) [15:22:22] (03Merged) 10jenkins-bot: Bumping portals to master Removing survey banner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296399 (https://phabricator.wikimedia.org/T136874) (owner: 10Jdrewniak) [15:24:51] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:296399|Bumping portals to master (T136874)]] (duration: 00m 24s) [15:24:52] T136874: Wikipedia.org Portal Survey: revisit for more participants - https://phabricator.wikimedia.org/T136874 [15:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:21] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:296399|Bumping portals to master (T136874)]] (duration: 00m 29s) [15:25:22] T136874: Wikipedia.org Portal Survey: revisit for more participants - https://phabricator.wikimedia.org/T136874 [15:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:28] ^ jan_drewniak check please [15:25:38] yuvipanda: o/ where does the PAWS code repo live? i'd like to add it to tools.wmflabs.org/hay/directory [15:26:15] niedzielski hey! awesome! it's at github.com/yuvipanda/paws, but needs to move to somewhere else soon [15:27:13] yuvipanda: sweet :) i'll make a pull request maybe this evening. [15:27:26] niedzielski \o/ <3 [15:29:15] <_joe_> yuvipanda: you mean to gerrit? [15:29:28] joe or differential [15:30:48] <_joe_> yeah whichever [15:32:45] yeah, probably a good idea to test this with diffusion / differential [15:33:09] yuvipanda: ostriches or twentyafterfour could probably make that happen :) [15:33:37] thcipriani: sorry for the delay, looks good! [15:33:42] my new labs console tool will be able to as well once it clears security review and gets deployed :) [15:34:18] jan_drewniak: np, thanks for checking! [15:50:45] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2412145 (10elukey) Time to make a summary now that we have a lot of data. The main difference between 1.4.21 and 1.4.25 is the way in which the cache is built a... [15:51:38] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:55:14] (03PS1) 10Mdann52: Change $wgMaxRedirects to 3 on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296406 [15:55:38] (03CR) 10jenkins-bot: [V: 04-1] Change $wgMaxRedirects to 3 on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296406 (owner: 10Mdann52) [15:56:28] RECOVERY - salt-minion processes on labstore2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:58:48] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160628T1600).
Please do the needful. [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:16] o/ [16:01:25] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2412208 (10mehtab.ahmed) Please provide us any alternative fonts like Lateef or Tahoma bold, if they don't have any issue o... [16:01:43] yo thcipriani, looking at your patch [16:02:00] _joe_: ^ [16:03:49] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:05:50] <_joe_> godog: you're swatting? ack [16:05:59] <_joe_> I didn't realize it's already 6 pm [16:06:22] _joe_: yup, I'd like your opinion on https://gerrit.wikimedia.org/r/#/c/294742/2 tho [16:06:30] (03PS1) 10Ema: diamond TCP collector: publish TFO-related metrics as gauges [puppet] - 10https://gerrit.wikimedia.org/r/296408 [16:06:41] <_joe_> ok looking [16:07:15] (03CR) 10Thcipriani: "Puppet compiler output: https://puppet-compiler.wmflabs.org/3220/" [puppet] - 10https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) (owner: 10Thcipriani) [16:07:19] thcipriani: will the listed canaries get a deploy twice if one is deploying to the canaries first? I guess even if so that'd be ok [16:08:05] godog: I don't have this code hooked up in scap just yet, we could probably prevent that, but, yeah, really no harm. [16:08:50] this is just the first step to make scap able to find app canaries and api canaries, once that's in the config I can hook it up in code. [16:09:28] I have the rough outline for how I want it to work in: https://phabricator.wikimedia.org/D248 [16:14:12] (03CR) 10Giuseppe Lavagetto: [C: 031] scap: make deployment aware of canary machines [puppet] - 10https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) (owner: 10Thcipriani) [16:14:25] thcipriani: thanks, yeah LGTM too [16:14:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] scap: make deployment aware of canary machines [puppet] - 10https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) (owner: 10Thcipriani) [16:15:43] godog: _joe_ awesome thanks! As I mentioned in the patch, I wasn't sure if adding dsh files was the right thing to do going forward, but it seemed right in the short term. [16:15:47] (03CR) 10Filippo Giunchedi: [C: 031] diamond TCP collector: publish TFO-related metrics as gauges [puppet] - 10https://gerrit.wikimedia.org/r/296408 (owner: 10Ema) [16:16:05] thcipriani: yup, that's what we update ATM when messing with mw hosts anyways [16:16:10] <_joe_> thcipriani: exactly [16:16:54] !log starting rolling restart of elasticsearch codfw cluster (T138811) [16:16:56] T138811: CVE-2016-4997 - https://phabricator.wikimedia.org/T138811 [16:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:59] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures [16:28:21] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable updater service - https://phabricator.wikimedia.org/T138627#2412266 (10Smalyshev) Maybe I misunderstand what mask/unmask does, but I thought it disables the service completely. That's not what I need...
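On the mask/unmask distinction in that last task comment, since the whole access request hinges on it — a quick sketch using the updater unit the task is about (standard systemd semantics; the exact unit name is assumed):

    # 'disable' only removes the boot-time symlinks; the unit can still be
    # started manually, by puppet, or as a dependency of another unit.
    sudo systemctl disable wdqs-updater

    # 'mask' points the unit at /dev/null, so nothing can start it at all
    # until it is unmasked -- i.e. it disables the service completely.
    sudo systemctl mask wdqs-updater
    sudo systemctl unmask wdqs-updater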
[16:31:01] 06Operations, 10Mail: Remove tchen alias - https://phabricator.wikimedia.org/T138860#2412282 (10JGulingan) [16:42:52] 06Operations: reclaim tmh2* as spares or into mw* pool - https://phabricator.wikimedia.org/T115950#2412346 (10RobH) 05Open>03Resolved [16:42:59] 06Operations, 10Mail: Remove tchen alias - https://phabricator.wikimedia.org/T138860#2412347 (10Krinkle) [16:45:50] 06Operations, 10Mail: Remove tchen alias - https://phabricator.wikimedia.org/T138860#2412352 (10Krinkle) [16:45:55] Meh, that doesn't work. [16:49:50] 06Operations, 10ops-codfw, 06DC-Ops: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#2412356 (10RobH) Short answer: Yes, we can resolve it. The humidity levels seem ok now in codfw. I have not resolved it so you can view the long answer below. (Feel free to resolve once you are awar... [16:51:22] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160628T1700). Please do the needful. [17:00:13] no parsoid deploy today. [17:00:27] all's quiet on interactive side [17:12:59] (03CR) 10Ladsgroup: "Swagger check is not enough. It doesn't hit worker nodes. Please correct me if I'm wrong" [puppet] - 10https://gerrit.wikimedia.org/r/296054 (owner: 10Dzahn) [17:15:19] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2412436 (10Dereckson) Lateef is absolutely doable, as already packaged for Debian. For Lateefi, y... [17:18:56] 06Operations, 10ops-eqiad: HP Warning on boot [Firmware Bug]: the BIOS has corrupted hw-PMU resources - https://phabricator.wikimedia.org/T136345#2412458 (10RobH) 05Open>03Resolved I've updated https://wikitech.wikimedia.org/wiki/HP_DL360Gen9 to reflect the required bios setup changes for power mgmt. [17:27:29] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:28:25] robh: https://wikitech.wikimedia.org/w/index.php?title=Platform-specific_documentation%2FHP_DL3N0_Gen8&type=revision&diff=694365&oldid=694360 isn't this page about the Dell PowerEdge RN20s? [17:28:44] SPF|Cloud: yeah i just messed up a bunch of moves [17:28:48] i'm fixing them now [17:28:59] i had too many tabs open for my copy and pasting =P [17:29:16] * robh is attempting to merge together and migrate a bunch of various documentation pages [17:30:27] though i keep getting errors stating someone is doing stuff [17:32:18] meh, rollbacks and auto reverts for moves don't seem to be very easy.... just moved it back and manually removing the redirection trail i left in my chaos. [17:34:25] ok, that's more sensible now. So yeah my project this AM has been to re-evaluate the HP documentation since it's not up to the standards of our Dell documentation. (Makes sense since we've been using Dells for 6 years longer than HP.)
[17:34:40] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:35:49] (03PS1) 10Dereckson: Short array syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296419 [17:42:17] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable updater service - https://phabricator.wikimedia.org/T138627#2412490 (10Gehel) Some discussion with Stas to clarify the need: * We do not declare a service in puppet, so puppet will not restart... [17:42:33] (03PS1) 10Giuseppe Lavagetto: First commit of the reorganized codebase [software/service-checker] - 10https://gerrit.wikimedia.org/r/296420 [17:42:51] <_joe_> mobrovac, gwicke ^^ [17:43:24] <_joe_> still misses the tox/jenkins setup though [17:43:47] so should i have just had a phab error? [17:44:06] tried to make task, got 404 [17:44:24] 06Operations: test task - https://phabricator.wikimedia.org/T138866#2412494 (10RobH) [17:44:29] second time works, bleh. [17:46:52] 06Operations: Update & standardize Platform-specific_documentation for HP servers - https://phabricator.wikimedia.org/T138866#2412507 (10RobH) [17:47:31] 06Operations: Update & standardize Platform-specific_documentation for HP servers - https://phabricator.wikimedia.org/T138866#2412494 (10RobH) I've been working on this (and will continue to do so) this AM. [17:49:24] (03PS1) 10Dzahn: DHCP: add MAC for zosma.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/296421 (https://phabricator.wikimedia.org/T138650) [17:50:18] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2412528 (10MoritzMuehlenhoff) @mehtab.ahmed : Can you name me an SVG which uses Sindhi, so that I can generate a test PNG with... [17:51:27] _joe_: *clap clap* [17:51:58] <_joe_> tests pass on both python 2.7 and 3.5 [17:52:20] <_joe_> I would *love* to use asyncio but I guess neon wouldn't be able to use it [17:52:31] nice! [17:52:35] heh [17:52:47] <_joe_> so we're blocked on the neon upgrade :P [17:53:01] (03PS1) 10Dzahn: consistently (no) FQDN in DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/296424 [17:53:12] <_joe_> so, I was wondering if I should use scap3 for this, as debs are overkill [17:54:07] <_joe_> but I guess that'd mean building wheels and then deploying them [17:54:13] (03CR) 10Dzahn: [C: 032] DHCP: add MAC for zosma.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/296421 (https://phabricator.wikimedia.org/T138650) (owner: 10Dzahn) [17:55:22] _joe_: https://wikitech.wikimedia.org/wiki/User:Ladsgroup/Shipping_dependencies_using_wheels [17:56:06] <_joe_> bd808: actually I don't need any dependency (I am strict about not using anything not in jessie at least) [17:56:24] <_joe_> I just need to run setup.py install into a virtualenv [17:56:57] <_joe_> but now it's time for beer :) [17:57:29] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:04:29] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:05:37] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable updater service - https://phabricator.wikimedia.org/T138627#2412544 (10Smalyshev) I'm fine with that.
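A sketch of the wheels-into-a-virtualenv idea _joe_ and bd808 are batting around above, under the assumption of a plain setup.py project with no dependencies outside the standard library (paths and package name are illustrative, not the real service-checker layout):

    # On the build host: produce a wheel once (needs the 'wheel' package installed).
    python3 setup.py bdist_wheel

    # On the target: install it into a clean virtualenv; no debs involved.
    python3 -m venv /srv/deployment/service-checker/venv
    /srv/deployment/service-checker/venv/bin/pip install dist/service_checker-*.whl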
[18:21:09] (03CR) 10Jdlrobson: [C: 04-1] Get descriptions from pageterms for Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296153 (https://phabricator.wikimedia.org/T138705) (owner: 10Aude) [18:23:35] !log zosma - fresh install, sign puppet certs, initial puppet run [18:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:53] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2412573 (10Dzahn) Yes, the problem appears when these rules are adde... [18:48:35] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2412605 (10Paladox) The redirect Bugzilla looks like https://phabric... [18:57:38] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Puppet has 1 failures [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160628T1900). [19:03:45] jouncebot: doing the needful [19:04:38] 07Blocked-on-Operations, 06Operations, 10Increasing-content-coverage, 06Research-and-Data-Backlog: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2412654 (10ori) @ellery pointed out the only piece of scikit-learn that ART is using is the [[ https://git... [19:06:52] I [19:07:11] I am going to push out wmf.7 ahead of wmf.8 to group0 [19:10:17] (03PS1) 1020after4: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296432 [19:18:56] (03CR) 1020after4: [C: 032] all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296432 (owner: 1020after4) [19:19:14] (03CR) 10Aude: Get descriptions from pageterms for Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296153 (https://phabricator.wikimedia.org/T138705) (owner: 10Aude) [19:19:26] (03Abandoned) 10Aude: Get descriptions from pageterms for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296153 (https://phabricator.wikimedia.org/T138705) (owner: 10Aude) [19:19:33] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296432 (owner: 1020after4) [19:21:43] and I was going to enquire about the status :) [19:21:49] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:23:10] hashar: just pushing out wmf.7 then wmf.8 shortly after :) [19:23:14] * hashar looks at a bunch of graphs from the saving regression https://phabricator.wikimedia.org/T138550#2410000 [19:23:15] 06Operations, 10ops-codfw, 13Patch-For-Review: mw2098 / mw2123 / mw2134 not coming up after reboot - https://phabricator.wikimedia.org/T138812#2412690 (10Papaul) @MoritzMuehlenhoff there is a task for mw2123 and mw2134 https://phabricator.wikimedia.org/T125088 [19:23:25] yeah that sounds good, would want to wait a few though [19:23:34] !log Deploying 1.28.0-wmf.7 to all wikis [19:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:58] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to
1.28.0-wmf.7 [19:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:29] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:25:28] gotta watch out for the save timing regression (graph links: https://phabricator.wikimedia.org/T138550#2410000 ) [19:26:19] there is a 20% surge of stashedit API requests [19:26:38] PROBLEM - HHVM rendering on mw2071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:48] hashar: should I roll back? [19:26:53] and the check_stash hit / miss counts have skyrocketed again https://graphite.wikimedia.org/render/?width=962&height=492&_salt=1467056269.826&target=MediaWiki.abusefilter.check_stash.*.count&target=secondYAxis(MediaWiki.AbuseFilter.tokenizerCache.hit.count)&from=-2hours [19:27:05] 1 hour view https://graphite.wikimedia.org/render/?width=962&height=492&_salt=1467056269.826&target=MediaWiki.abusefilter.check_stash.*.count&target=secondYAxis(MediaWiki.AbuseFilter.tokenizerCache.hit.count)&from=-1hours [19:27:44] ok rolling back [19:28:00] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:28:30] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:28:39] RECOVERY - HHVM rendering on mw2071 is OK: HTTP OK: HTTP/1.1 200 OK - 65613 bytes in 0.288 second response time [19:29:05] (03PS1) 1020after4: Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296434 [19:29:13] twentyafterfour: I haven't looked at the regression timing [19:29:44] hashar: timing? [19:29:59] 75p of edit saving https://phabricator.wikimedia.org/T138550 [19:30:35] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1467142229.576&target=MediaWiki.timing.editResponseTime.p75&from=-2hours [19:31:26] (03CR) 1020after4: [C: 032] Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296434 (owner: 1020after4) [19:31:38] gehel: SMalyshev I am around for WDQS scap3 deployment, FYI [19:31:40] thcipriani: I'm getting ready for wdqs / scap3 ... [19:31:48] :) [19:32:03] thcipriani: thanks, let's start then? [19:32:10] !log Rolling back to wmf.6: T138550 is still a problem [19:32:11] T138550: 1.28.0-wmf.7 save time regression - https://phabricator.wikimedia.org/T138550 [19:32:12] since we're all here... [19:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:28] twentyafterfour: so yeah basically it is not fixed [19:32:44] so, merge https://gerrit.wikimedia.org/r/#/c/295437/, run puppet on tin, puppet on wdqs100?, scap deploy on tin? [19:32:49] (03Merged) 10jenkins-bot: Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296434 (owner: 1020after4) [19:32:55] with p75 of edit saving time going from ~ 600ms to 800ms [19:33:24] gehel: I think so...
[19:33:36] * thcipriani double checks [19:33:40] 06Operations, 10Ops-Access-Requests, 05Security: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412705 (10Dzahn) [19:33:46] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: Rolling back to wmf.6: save time regression is still present in wmf.7 [19:33:48] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:33:48] wdq-beta is not going to move to scap3, right? [19:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:34:23] SMalyshev: hmm, I'm not seeing the scap/scap.cfg in /srv/deployment/wdqs/wdqs on tin, is that merged? [19:34:41] thcipriani: merged but probably not checked out [19:34:47] thcipriani: yes it is, just checked [19:34:57] it's in the repo but tin may not be up to date? [19:34:59] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:35:21] SMalyshev: would you update tin to the revision to be deployed (with the scap.cfg) [19:35:46] thcipriani: I'm updating wdq-beta for starters (just to make sure) and tin just after [19:36:13] then puppet changes can go. I don't *think* there would be any issue in the other order, but this is the way we've done it before and it worked :P [19:36:17] I think it's fine to pull on tin... we can check out an older one if it doesn't work [19:36:26] I'll update tin [19:36:42] ok, should be fine [19:37:16] * twentyafterfour isn't sure what to do about wmf.8 now [19:37:50] I guess I should go ahead with wmf.8 deployment and skip wmf.7 entirely? [19:37:52] 06Operations, 10Ops-Access-Requests, 05Security: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412723 (10Dzahn) We have talked about this at Wikimania. There are some security-tools that have traditionally been running on a personal laptop and we want... [19:37:58] SMalyshev: kk, if you would please run: scap deploy --init inside /srv/deployment/wdqs/wdqs please. This will create the /srv/deployment/wdqs/wdqs/.git/DEPLOY_HEAD file [19:37:59] SMalyshev: thcipriani wdq-beta updated [19:38:46] thcipriani: let gehel finish the beta update test... then (hopefully it's fine) we can proceed [19:38:49] (03PS6) 10Gehel: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [19:39:13] thcipriani: or should it happen before puppet? then I can do it now [19:39:23] SMalyshev: beta looks good [19:39:24] twentyafterfour: na the same behavior will happen for sure [19:39:44] SMalyshev: If you have something more pressing to do, I can probably handle the deploy. I'll ping you if something goes wrong... [19:39:46] ok, run scap deploy --init [19:39:59] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures [19:40:03] hashar: so just scrap wmf.8? [19:40:05] SMalyshev: sweet, just saw it run :) [19:40:18] (03PS1) 10Dzahn: admin: add new sectools-roots admin group [puppet] - 10https://gerrit.wikimedia.org/r/296438 (https://phabricator.wikimedia.org/T138873) [19:40:22] twentyafterfour: the issue is in wmf.7 so wmf.8 will have the same deal [19:40:33] thcipriani: where did you see it run?
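For readers following along, the scap3 sequence being assembled in this exchange, as run from the deploy host, is roughly this (the deploy message is illustrative):

    cd /srv/deployment/wdqs/wdqs
    git pull                            # bring tin to the revision to deploy
    scap deploy --init                  # write .git/DEPLOY_HEAD from scap/scap.cfg
    scap deploy 'first scap3 deploy'    # fetch and promote on the dsh targets
    scap deploy-log -v                  # in another window: per-target logs

In the session above, the puppet change is merged and run on tin and the targets between the --init step and the actual deploy.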
[19:40:38] hashar: I understand that [19:40:40] twentyafterfour: I am more inclined to have the AbuseFilter and AntiSpoof related changes reverted entirely [19:41:08] thcipriani, SMalyshev: I'm ready to merge the puppet change [19:41:15] gehel: I saw SMalyshev run scap deploy --init I'm watching using: scap deploy-log -v from /srv/deployment/wdqs/wdqs [19:41:23] hashar: ok I would agree with that too [19:41:38] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:42:46] SMalyshev, thcipriani: I'm waiting for your "go" to merge the puppet change... [19:42:59] twentyafterfour: or we could feature-disable the whole editstash thing *evil* [19:43:04] * gehel should probably just go ahead [19:43:07] gehel: puppet changes should be good from my view since .git/DEPLOY_HEAD is in place (scap3 provider looks for that). The puppet provider for scap3 shouldn't run as part of the puppet run, but better to be safe in this instance. [19:43:27] thcipriani: ok, merging [19:43:49] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:44:02] 06Operations, 10ops-codfw, 06DC-Ops: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#2412741 (10Papaul) @RobH I have no access to librenms yet. [19:44:49] (03CR) 10Gehel: [C: 032] Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [19:45:26] twentyafterfour: if we go the revert path, I guess we want to remove AbuseFilter change e91939f - Cache AbuseFilter::checkAllFilters during edit stashing (2 weeks ago) [19:45:35] twentyafterfour: deploy, and see what happens :( [19:45:42] SMalyshev, thcipriani: running puppet on tin [19:45:48] ack [19:46:10] twentyafterfour: not that I know what AbuseFilter is actually doing :( [19:47:04] hashar: I'd rather not revert changes without knowing exactly what the implications are. [19:47:07] OH NO [19:47:10] for god's sake [19:47:15] ? [19:47:16] the "fix" in master hasn't been backported [19:47:23] wth [19:47:35] well, that's nice [19:47:35] 06Operations, 10ops-codfw, 06DC-Ops: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1974148 (10MoritzMuehlenhoff) @papaul: mw2123 and mw2134 are now depooled so they can be shut down for troubleshooting at your convenience. [19:47:36] 1.28.0-wmf.7 has the presumably faulty patch [19:47:43] SMalyshev, thcipriani: puppet on tin complete, running puppet on wdqs100[12] [19:47:51] hashar: which one is the fix? I can backport [19:47:52] would want to backport * 6af0857 - Move the filter pre-caching outside of the DB lock (25 hours ago) [19:48:14] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/296255/ :) [19:48:33] let's cherry-pick that to wmf/1.28.0-wmf.7 and +2 it right away [19:48:43] then attempt a deploy again and see whether it actually fixes it [19:49:11] SMalyshev, thcipriani: puppet done. SMalyshev: you do the deploy? Or shall I? [19:49:18] hashar: already on it [19:49:22] gehel: on tin, blazegraph-service-0.2.2-dist.war is not checked out from git-fat - is that normal? [19:49:31] https://gerrit.wikimedia.org/r/#/c/296440/ [19:49:37] SMalyshev: I don't think so [19:49:39] it's 74 bytes instead of the real size. Does scap3 handle this? [19:49:54] gehel: just to confirm /srv/deployment/wdqs on the target machines should be owned by deploy-service, correct?
[19:50:40] thcipriani: yep, owned by deploy-service [19:50:41] SMalyshev: yeah, that 74 bytes is just a git-fat pointer, IIRC. scap3 should handle it like trebuchet was handling it [19:51:00] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:51:04] thcipriani: ok then. Just weird that other git-fat files are loaded and this one is not [19:51:07] thcipriani: so no need for manual fat pull [19:51:25] * gehel does not trust git fat all that much... [19:51:45] ok then, I think we can deploy [19:51:56] gehel: and by we I mean you :) [19:52:03] yeah, I don't know if I trust it too much either. In this instance, scap3 should handle the remote git fat init/pull [19:53:09] thcipriani: I just need to run "scap deploy"? Since the git pull was already done by SMalyshev ? [19:53:51] gehel: yup, run that command inside /srv/deployment/wdqs/wdqs and that should be it. You can watch detailed logs in another window with scap deploy-log -v if you'd like [19:53:57] * gehel is doing his first scap deployment! [19:53:58] (which is what I'm doing :)) [19:54:38] doesn't seem like anything happened in wdq1001 [19:54:51] I still see the old files [19:55:22] 19:54:21 [tin] Finished Deploy: wdqs/wdqs (duration: 00m 14s) [19:55:31] but files on wdq1001 did not change [19:56:00] SMalyshev: mtime seems to have changed. A specific file you are looking at? [19:56:01] hmm, /srv/deployment/wdqs/wdqs should now be a pointer to /srv/deployment/wdqs/wdqs-cache/revs/cdf74585a5303d973e53d0a0d379b75857f5855d [19:56:10] er, a symlink to [19:56:29] oh [19:56:39] ok, that explains it, I was looking at the old dir [19:56:47] wait, that doesn't work [19:56:54] old dir had symlinks [19:57:02] which aren't in the new dir :(
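An aside on the 74-byte file: git-fat commits a small fixed-size text stub in place of each large binary and fetches the real object out of band. Roughly what such a stub looks like, plus the manual equivalent of what scap3 is expected to do on each target (digest and size here are made up):

    $ cat blazegraph-service-0.2.2-dist.war
    #$# git-fat 0123456789abcdef0123456789abcdef01234567        52428800

    $ git fat init && git fat pull    # replace stubs with the real binaries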
[20:01:40] SMalyshev: puppet done, restarted blazegraph on wdqs1001 [20:01:52] SMalyshev: yes, they are cleaned every 5 deploys [20:01:56] gehel: the link is not back ;( [20:02:00] working on making that configurable. [20:02:30] ok, let's recreate that symlink manually and cleanup once service is restored [20:02:32] gehel: oh, now I have only the link in the dir... wtf??? [20:02:48] the rest of the files are gone [20:03:11] ohh... that's because puppet thinks it should be directory and not link :( [20:03:21] gehel: we need to redeploy and fix puppet [20:03:32] damn, just saw the logs. That would have been a good case of testing with --noop [20:03:48] gehel: great idea :) [20:04:09] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:04:22] adding this checks.yaml file in /srv/deployment/wdqs/wdqs/scap/checks.yaml should create the symlink before the restart https://gist.github.com/thcipriani/a72d6be7583adef127e3367f207b1ba0 [20:04:29] on tin [20:05:59] PROBLEM - Blazegraph Port on wdqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused [20:06:19] PROBLEM - Blazegraph process on wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [20:06:48] twentyafterfour: deploy again ? [20:06:53] so I'd say pause puppet, add checks.yaml, redeploy, fix puppet, and then re-enable puppet. [20:07:55] thcipriani: can I have more than one command there? [20:08:06] hashar: yep, just running sync-dir [20:08:25] !log disabling puppet on wdqs100[12] to cleanup after failed scap3 deplyoment [20:08:26] SMalyshev: you should create a new command block for more than 1 command per stage [20:08:30] I had to deal with rebasing some patches and submodule bumps that weren't committed on tin [20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:42] thcipriani: command block? [20:08:59] SMalyshev: yeah, like the 'create_symlink' block [20:09:29] !log deploying https://gerrit.wikimedia.org/r/#/c/296440/ to hopefully unblock wmf.7 deployments. 
refs T138550, T136973 [20:09:31] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973 [20:09:31] T138550: 1.28.0-wmf.7 save time regression - https://phabricator.wikimedia.org/T138550 [20:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:37] SMalyshev: so like: https://gist.github.com/thcipriani/a72d6be7583adef127e3367f207b1ba0 [20:09:52] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploying https://gerrit.wikimedia.org/r/#/c/296440/ refs T138550, T136973 (duration: 02m 06s) [20:09:54] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973 [20:09:54] T138550: 1.28.0-wmf.7 save time regression - https://phabricator.wikimedia.org/T138550 [20:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:35] (03PS1) 1020after4: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296444 [20:10:49] RECOVERY - Blazegraph process on wdqs1001 is OK: PROCS OK: 1 process with UID = 998 (blazegraph), regex args ^java .* blazegraph-service-.*-dist.war [20:11:01] (03CR) 1020after4: [C: 032] "Yet again rolling forward to wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296444 (owner: 1020after4) [20:11:25] thcipriani: SMalyshev: ok, service restored. Time to breath a deep breath and cleanup [20:11:26] (03PS17) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [20:11:37] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296444 (owner: 1020after4) [20:11:53] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.7 [20:12:00] (03PS1) 10Smalyshev: Fix wdqs deployment - now wdqs is a symlink [puppet] - 10https://gerrit.wikimedia.org/r/296446 [20:12:15] gehel, thcipriani: https://gerrit.wikimedia.org/r/#/c/296445/ [20:12:48] RECOVERY - Blazegraph Port on wdqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 [20:12:52] (03CR) 10jenkins-bot: [V: 04-1] Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [20:13:14] gehel, thcipriani: and this one for puppet [20:13:25] https://gerrit.wikimedia.org/r/#/c/296446/ [20:13:35] https://graphite.wikimedia.org/render/?width=962&height=492&_salt=1467056269.826&target=MediaWiki.abusefilter.check_stash.*.count&target=secondYAxis(MediaWiki.AbuseFilter.tokenizerCache.hit.count)&from=-20min doesn't look good [20:14:29] twentyafterfour: yeah [20:14:52] gotta paste a one hour graph showing both spikes [20:14:53] https://graphite.wikimedia.org/render/?width=962&height=492&target=MediaWiki.abusefilter.check_stash.*.count&target=secondYAxis(MediaWiki.AbuseFilter.tokenizerCache.hit.count)&from=-1hours [20:14:56] and I guess [20:15:07] revert the changes [20:15:08] gehel: the same has to be done on wdq2 [20:15:24] SMalyshev: +1'd [20:15:25] SMalyshev: on the puppet one, you still keep the file resource to be able to have other resources depending on it [20:15:27] ? [20:15:34] gehel: yes [20:15:42] SMalyshev: ok, looks good [20:15:50] gehel: since in labs setup it works differently [20:15:57] SMalyshev: I put wdqs2 on maintenance to give us time to think a bit... 
[20:16:28] (03CR) 10Gehel: [C: 031] Fix wdqs deployment - now wdqs is a symlink [puppet] - 10https://gerrit.wikimedia.org/r/296446 (owner: 10Smalyshev) [20:18:11] SMalyshev, thcipriani: so, to clean all that, we should be able to just merge those 2 patches and do a scap deploy again, right? [20:18:31] gehel: hopefully, yes :) [20:18:50] thcipriani: I wonder if it's possible to deploy on just one host? [20:18:57] gehel: that looks correct to me. One thing to note is we could also setup a canary deploy so it would only deploy to one machine, you can manually check, and then roll forward [20:19:06] thcipriani: we should be able to test that on just wdqs1002 by editing the scap config directly on tin, right? [20:19:21] * gehel likes canaries! [20:19:36] yeah, you can just comment out wdqs1001 in the scap/wdqs file directly on tin and that would work [20:19:41] twentyafterfour: at least there are less API REquests to stashedit :) [20:19:46] so the cache works [20:19:49] gehel: ok, let's do that then! [20:19:55] (03PS1) 1020after4: Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296450 (https://phabricator.wikimedia.org/T138550) [20:20:08] (03CR) 1020after4: [C: 032] Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296450 (https://phabricator.wikimedia.org/T138550) (owner: 1020after4) [20:20:10] which would indicate something else screw up [20:20:49] (03Merged) 10jenkins-bot: Revert "all wikis to 1.28.0-wmf.7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296450 (https://phabricator.wikimedia.org/T138550) (owner: 1020after4) [20:21:21] you can also make wdqs2001 a permanent canary by adding the lines: server_groups: canary, default; canary_dsh_targets: wdqs-canary to scap.cfg then creating a scap/wdqs-canary file with just wdqs2001 in it. [20:21:40] then it would do a full deploy to wdqs2001, wait for user input, then do wdqs1001 [20:22:40] gehel: to be on the safe-side, when you run scap deploy, run: scap deploy -f to force a deploy, since the folder for the rev is already there, scap might not be happy with just scap deploy [20:23:18] thcipriani, SMalyshev: ok, doing deplyoment just on 1002 [20:23:38] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 396 bytes in 0.016 second response time [20:23:52] boh. [20:23:58] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 396 bytes in 0.018 second response time [20:24:11] gehel: looks like the links are back [20:24:15] thcipriani: not working... [20:24:28] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: once again rolling back to wmf.6 refs T136973 T138550 [20:24:29] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973 [20:24:29] T138550: 1.28.0-wmf.7 save time regression - https://phabricator.wikimedia.org/T138550 [20:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:51] gehel: 1002 looks fine to me [20:25:08] SMalyshev: icinga disagrees, but I'm not sure why [20:25:23] gehel: did you set maint mode? [20:25:48] yes, might be the issue, but wondering why it alerts just during deployment [20:26:50] gehel: dunno, seems to be fine to me... 
[20:27:09] SMalyshev: ok, let's remove maintenance [20:27:45] twentyafterfour: so either revert en masse all suspicious patches, or try to narrow it down by progressively backport to the currently running version until the issue is triggered [20:27:56] twentyafterfour: it might well be something else entirely :(( [20:28:09] RECOVERY - WDQS HTTP on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9487 bytes in 0.003 second response time [20:28:38] RECOVERY - WDQS SPARQL on wdqs1002 is OK: HTTP OK: HTTP/1.1 200 OK - 9487 bytes in 0.009 second response time [20:30:25] SMalyshev, thcipriani: ok, looks good to me [20:30:29] (03CR) 10Legoktm: "It is possible that people shorten links to private wikis since *.wikimedia.org will be whitelisted, however the extension will not be ena" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [20:30:33] gehel: we still need to merge https://gerrit.wikimedia.org/r/#/c/296446/ and see that puppet is fine with new setup [20:31:04] (03CR) 10Gehel: [C: 032] Fix wdqs deployment - now wdqs is a symlink [puppet] - 10https://gerrit.wikimedia.org/r/296446 (owner: 10Smalyshev) [20:32:21] gehel: I think that file needs a bit of a rewrite now that we don't actually use git deploy but scap... but I think it'll do for now as is [20:32:29] I'll clean it up later [20:32:54] SMalyshev: yep, let's go back to a stable situation first... [20:33:50] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [20:34:19] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:34:47] SMalyshev: only change with --noop: /Stage[main]/Wdqs/File[/srv/deployment/wdqs/wdqs/updater-logs.xml]/ensure: current_value absent, should be present (noop) [20:35:11] gehel: hmm... let me check [20:36:03] SMalyshev: log4j config for updater. That seems to be important... [20:36:09] gehel: yeah that file is created by puppet [20:36:43] I'm kind of confused about how many things are moving from puppet to scap here... [20:37:24] SMalyshev: seems to me that we should have a better separation of configuration. That log4j config should probably not be in the package_dir [20:40:47] gehel: yeah maybe... right now it's not too critical - I think it does fine with defualt config. But I'll make a patch to fix it [20:41:14] gehel: I think we can move it to a better place like /etc/wdqs or something [20:42:01] gehel: all the rest seems to be fine with puppet? [20:42:19] SMalyshev: yep, that was the only issue... [20:42:35] SMalyshev: ok, running puppet for real [20:42:49] gehel: ok then, I think we can re-enable puppet and hope for the best :) [20:43:47] SMalyshev, thcipriani: I'll still do a scap deploy to ensure wdqs1001 is in the same state as wdqs1002... [20:44:04] sure [20:44:08] ack [20:44:24] thcipriani: I wonder if it's possible to arrange it so that both hosts are not deployed at the same time, but one after another? [20:44:38] that will ensure minimal disruption [20:44:40] SMalyshev: definitely [20:44:47] thcipriani: so how you do that? [20:44:54] SMalyshev: cannaries! [20:45:05] ah, that's the canary setup? [20:45:10] * SMalyshev goes to read the docs [20:45:26] yup, so if you add the line: server_groups: canary, default to scap.cfg [20:45:44] * gehel just knows it is possible, does not actually know how to configure canaries... 
[20:45:45] along with the line: canary_dsh_targets: wdqs-canary [20:46:08] then add a file in scap/wdqs-canary with the canary host as the contents [20:46:10] that's it [20:46:32] thcipriani: so wdqs and wdqs-canary should be disjoint, right? [20:46:52] I don't understand the question? [20:47:06] SMalyshev: yes [20:47:13] thcipriani: if I have two hosts, then one should be in wdqs-canary and the other in wdqs? [20:47:24] SMalyshev: yeah, that'll work [20:47:30] ok then [20:48:12] scap will deploy to a host in the first group that contains that host, so it's not critical that the two files are unique [20:48:14] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/296457/ [20:48:29] SMalyshev, thcipriani: actually, if I remember correctly, those 2 lists don't have to be disjoint. We could actually have the canary also defined in the main list. Correct? [20:48:45] gehel: yup [20:48:51] ah, good [20:49:38] ok, scap also ran on wdqs1001. Things seem to be mostly clean. [20:50:07] SMalyshev: gehel thanks for bearing with me, sorry this was a mess :( [20:50:10] The incident report will wait until tomorrow to be written. And I need to pack a few bags ... [20:50:38] thcipriani: that happens. And not easy to test this one before production. [20:51:04] thcipriani: is it possible to deploy our wdq-beta server also with scap3? What would be needed for that? [20:51:10] SMalyshev: should have clarified that scap3 makes new dirs/symlinks when we talked about moving WDQS to begin with. It's definitely helpful for quick rollbacks in cases where deploy checks fail, but it also requires some planning :( [20:52:09] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:52:10] gehel: where does that server live? [20:52:56] thcipriani: wdq-beta.eqiad.wmflabs (not sure what you mean by "where") [20:53:33] thcipriani: it's on labs, yes. Not sure if scap3 works on labs (I tried to set up git deploy and it proved harder than it's worth) [20:53:53] I meant to ask if it's in a labs project with a "tin"-type machine setup [20:53:57] but if we could test a full-cycle deploy that would be excellent [20:54:20] thcipriani: we don't have a deploy host set up, but we could if needed [20:55:35] thcipriani, gehel: https://gerrit.wikimedia.org/r/#/c/296459/ [20:55:37] the puppet roles for deploy hosts are still a bit gnarly, but it would be good to fix those up. There's also a long-term task to remove the apache dependency from scap which would make that easier. [20:55:40] SMalyshev, thcipriani: I need to leave you. I'll be around a bit more, so ping me if needed. [20:55:51] gehel: ack, thanks again. [20:56:02] gehel: thanks for staying late! [20:56:52] np, have fun! [20:58:21] hashar: one more patch to try [20:58:49] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:03:24] (03PS1) 10Odder: Update logo settings for Adyghe Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296460 [21:09:26] twentyafterfour: still around ? AaronSchulz has another candidate https://gerrit.wikimedia.org/r/296457 [21:10:07] twentyafterfour: the check_stash bump is expected, all related to enwiki having a ton of edits [21:11:12] AaronSchulz: poked Mukunda about it. It is a bit too late for me to continue :( [21:13:38] hashar: I'm still around [21:13:58] we can try Aaron's patch again?
https://gerrit.wikimedia.org/r/296457 [21:14:05] or just hold everything and wait for tomorrow ? [21:14:20] I'd try it, it's only 2pm here :) [21:14:21] your call really, cause I will probably not see the rest of this deploy window [21:14:56] at least Aaron is around :)) [21:15:24] I'll try it of course [21:15:32] hashar: thanks for your help [21:16:31] AaronSchulz: another idea I had in mind is: stashedit api calls having a high miss in parser cache and thus being slower. With stashedit being counted in the EditTiming metric. [21:16:41] AaronSchulz: cherry picking that patch over to wmf.7 and deploying. [21:16:48] AaronSchulz: but that is a total nonsense theory for sure (I don't even know what stashedit is meant for) [21:21:29] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:24:51] !log deploying wmf.7 yet again, once CI finishes testing https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973 [21:24:53] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973 [21:24:53] T138550: 1.28.0-wmf.7 save time regression - https://phabricator.wikimedia.org/T138550 [21:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:29] RECOVERY - salt-minion processes on labstore2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:27:40] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:28:00] twentyafterfour: CI done :) [21:28:18] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:28:54] twentyafterfour: ah no that is on master [21:29:06] forget me [21:29:11] I am confused / too late [21:31:45] hashar: it's all good, syncing the change [21:31:47] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.7/extensions/AbuseFilter/: deploy https://gerrit.wikimedia.org/r/#/c/296464/ refs T138550 T136973 (duration: 00m 36s) [21:31:49] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973 [21:31:49] T138550: 1.28.0-wmf.7 save time regression - https://phabricator.wikimedia.org/T138550 [21:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:10] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:32:36] (03PS1) 1020after4: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296467 [21:32:52] (03CR) 1020after4: [C: 032] "one more time," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296467 (owner: 1020after4) [21:33:28] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296467 (owner: 1020after4) [21:33:57] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.7 [21:34:10] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:34:44] ok it's deployed.
[21:34:10] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:34:44] ok, it's deployed, watching graphs now [21:35:34] AaronSchulz: deployed [21:35:49] the check_stash counter rises as expected per Aaron's comment https://graphite.wikimedia.org/render/?width=962&height=492&_salt=1467056269.826&target=MediaWiki.abusefilter.check_stash.*.count&target=secondYAxis(MediaWiki.AbuseFilter.tokenizerCache.hit.count)&from=-2hours [21:37:25] editResponseTime looks ok, I think: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1467142229.576&target=MediaWiki.timing.editResponseTime.p75&from=-4hours [21:37:26] and the rate of stashedit POSTs goes down magically :) [21:37:30] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: puppet fail [21:38:33] sometimes I wish we could namespace our statsd metrics per the wmf version in use [21:38:39] for ease of comparison [21:40:04] AaronSchulz: the MediaWiki.abusefilter.check_stash is some kind of cache, isn't it? With the "hit" one supposed to rise as the cache fills up? [21:41:01] It won't really fill up, since it's different for each edit [21:42:36] the 120 minutes one https://graphite.wikimedia.org/render/?width=800&height=600&target=MediaWiki.timing.editResponseTime.p75&from=-120minutes [21:43:24] maybe there is some slight change, but it is hard to know really [21:43:49] and over 20 days that looks fine [21:44:20] hence "Avoid using computed variables to determine stash keys" https://gerrit.wikimedia.org/r/296464 would have done the trick? [21:45:07] (03Abandoned) 10MarcoAurelio: Enable Translate extension on AffCom (chapcomwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: 10MarcoAurelio) [21:45:22] the hit rate is still low, though that's not urgent, just something to look into [21:47:30] AaronSchulz: we also have a bunch of notices for filters not existing though they should :( [21:47:57] Undefined index: XXX in /srv/mediawiki/php-1.28.0-wmf.7/extensions/AbuseFilter/AbuseFilter.class.php on line 754 [21:48:10] Notice: Undefined index: XXXX in /srv/mediawiki/php-1.28.0-wmf.7/extensions/AbuseFilter/AbuseFilter.class.php on line 826 [21:48:51] https://phabricator.wikimedia.org/T138528 and https://phabricator.wikimedia.org/T138529 [21:49:06] which might be related to your recent patch. Not sure how much of a concern it is [21:51:45] where? not seeing them at https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors [21:52:31] I got them from fluorine: ssh -C fluorine.eqiad.wmnet tail -F -n0 /a/mw-log/hhvm.log [21:52:43] they are notices, so I guess they are not showing on the mediawiki-errors dashboard [21:54:07] so essentially we have various code referring to self::$filters[$filter] [21:54:27] where "$filter" looks like an int but is not an existing index in self::$filters [21:55:12] ah, https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm [21:56:55] hmm [21:57:06] editResponseTime is down to 400ms :) [21:59:14] !log deploying ores beec291 [21:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:03:41] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:05:58] twentyafterfour: AaronSchulz: looks all good to me! [22:06:18] * AaronSchulz looks at the warnings [22:06:36] AaronSchulz: thanks a ton for the timing regression fixup!!!
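The Undefined index notices pasted above come from reads of self::$filters[$filter] where the key can be absent. This is not the actual AbuseFilter fix — just a generic sketch of the guard pattern that avoids that class of notice, with loadFilter() standing in as a hypothetical helper:

    // Hypothetical sketch: guard the static-cache lookup instead of
    // indexing blindly, which raises a PHP notice when the key is absent.
    if ( !isset( self::$filters[$filter] ) ) {
        // Repopulate the entry rather than trusting the mutated cache.
        self::$filters[$filter] = self::loadFilter( $filter ); // hypothetical helper
    }
    $row = self::$filters[$filter];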
[22:06:57] * AaronSchulz really dislikes mutating static variables [22:08:23] I am off myself [22:08:44] I will discover the fate of 1.28.0-wmf.8 tomorrow [22:17:23] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Traffic, 07HTTPS: https://ores.wikimedia.org redirects me to HTTP when I don't include a trailing slash - https://phabricator.wikimedia.org/T138682#2413170 (10Ladsgroup) ``` amir@amir-Lenovo-ideapad-300-15IBR:~/workspace/numerical methods$ curl "ht... [22:21:01] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:23:12] (03PS1) 10EBernhardson: Configuration changes for es 2.x on logstash1001-1006 [puppet] - 10https://gerrit.wikimedia.org/r/296475 [22:25:35] so I'm cherry-picking https://gerrit.wikimedia.org/r/#/c/296476/ to wmf.8 [22:27:42] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [22:34:26] 06Operations, 06Community-Liaisons, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-xx 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2413207 (10RobH) 05Open>03declined window and work cancelled. [22:39:57] (03PS1) 10EBernhardson: Import kibana .deb to apt repository [puppet] - 10https://gerrit.wikimedia.org/r/296477 (https://phabricator.wikimedia.org/T129138) [22:40:42] (03PS7) 10EBernhardson: Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) [22:52:38] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Traffic, 07HTTPS: https://ores.wikimedia.org redirects me to HTTP when I don't include a trailing slash - https://phabricator.wikimedia.org/T138682#2413248 (10Ladsgroup) 05Open>03Resolved [22:57:29] (03CR) 10EBernhardson: "files have moved to modules/aptrepo/files/" [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [22:58:10] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:00:05] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160628T2300). [23:00:05] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:21] (03PS1) 10Catrope: Add $wmgEchoTransition setting for Echo transition flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296479 [23:00:23] (03PS1) 10Catrope: Enable Echo transition flags in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296480
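Catrope's two patches follow the usual wmf-config shape: InitialiseSettings.php declares a wmg variable with per-wiki values, and a later config file maps it onto the extension's real globals. A hypothetical sketch of that shape — the actual settings live in the Gerrit changes above, and the names on the right-hand side are placeholders:

    // wmf-config/InitialiseSettings.php -- per-wiki values (sketch)
    'wmgEchoTransition' => [
        'default' => false,  // off everywhere unless overridden
        'testwiki' => true,  // enable on a test wiki first
    ],

    // wmf-config/CommonSettings.php (or an extension stanza) -- map the
    // wmg variable onto the extension's configuration
    if ( $wmgEchoTransition ) {
        $wgEchoTransitionFlag = true; // placeholder; the real patch sets Echo's actual flags
    }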
[23:00:47] looks like it's only me? I can deploy [23:01:07] !log Updating MW version on wikitech-static to 1.27 (LTS) - https://lists.wikimedia.org/pipermail/mediawiki-announce/2016-June/000191.html [23:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:47] (03PS2) 10Catrope: Enable Echo transition flags in production for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296480 [23:04:19] !log ebernhardson@tin Synchronized php-1.28.0-wmf.7/extensions/EventBus/EventBus.php: SWAT: EventBus: Match the expected format of response log key (duration: 00m 31s) [23:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:13] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:05:15] ffs SMW [23:05:27] * ebernhardson feels like that's not the first time ;) [23:05:33] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:06:54] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures [23:07:49] (03PS1) 1020after4: group0 to 1.28.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296482 [23:08:26] SWAT completed [23:09:50] (03PS8) 10EBernhardson: Update kibana module for kibana 4 [puppet] - 10https://gerrit.wikimedia.org/r/296279 (https://phabricator.wikimedia.org/T129138) [23:10:10] !log syncing new branch 1.28.0-wmf.8 refs T137492 [23:10:11] T137492: MW-1.28.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T137492 [23:10:12] !log wikitech-static working now, poke me on IRC or file a #wikitech.wikimedia.org ticket if you find any issues [23:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:55] !log twentyafterfour@tin Started scap: sync new branch, testwiki to php-1.28.0-wmf.8 refs T137492 [23:10:56] T137492: MW-1.28.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T137492 [23:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:44] RECOVERY - salt-minion processes on labstore2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:34:42] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:36:22] PROBLEM - salt-minion processes on labstore2004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:51:42] RECOVERY - salt-minion processes on labstore2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:57:14] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2413337 (10mehtab.ahmed) @Dereckson kindly check whether I have done the work up to expectations. https://github.com/solan... [23:58:32] PROBLEM - salt-minion processes on labstore2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
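The salt-minion PROBLEM/RECOVERY lines that flap throughout this window come from an Icinga process check. Roughly the invocation behind that output, assuming the stock monitoring-plugins check_procs; the path and threshold are the conventional Debian defaults and are illustrative:

    # alert CRITICAL unless at least one process matches the regex
    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/salt-minion'
    # output on success:
    #   PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion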