[00:00:38] i don't actually see jenkins-bot voting though [00:00:45] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [00:00:47] re: contint1001 [00:01:41] (03PS1) 10Chad: Stop forcing php5 in `mwscript` [puppet] - 10https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) [00:04:19] the init script also looks ugly because it needs to support alot of os. like the mac. [00:04:45] (03PS3) 10Aaron Schulz: Capture messages on 'autoloader' debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356841 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [00:04:47] (03CR) 10Aaron Schulz: [C: 032] Capture messages on 'autoloader' debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356841 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [00:06:09] (03Merged) 10jenkins-bot: Capture messages on 'autoloader' debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356841 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [00:06:26] (03CR) 10jenkins-bot: Capture messages on 'autoloader' debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356841 (https://phabricator.wikimedia.org/T166759) (owner: 10Ori.livneh) [00:09:44] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: Capture messages on 'autoloader' debug log channel (duration: 00m 44s) [00:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:05] (03CR) 10Dzahn: [C: 032] replace references to RT tickets with Phab ticket numbers [puppet] - 10https://gerrit.wikimedia.org/r/358500 (https://phabricator.wikimedia.org/T165733) (owner: 10Dzahn) [00:13:23] jenkins now back, ack [00:15:56] 10Operations, 10Patch-For-Review, 10Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3346713 (10Dzahn) I'm unsure whether this is resolved now and 
that's simply as good as it gets.. or if we want to remove the remaining references or manually "import" those tickets. [00:16:48] 10Operations, 10Patch-For-Review, 10Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3346715 (10Dzahn) p:05Normal>03Low [00:29:25] (03PS2) 10Dzahn: Limit Electron memory usage to 2G [puppet] - 10https://gerrit.wikimedia.org/r/358888 (https://phabricator.wikimedia.org/T167834) (owner: 10GWicke) [00:31:13] (03CR) 10Dzahn: [C: 032] "merging. to prevent T167819 from happening again" [puppet] - 10https://gerrit.wikimedia.org/r/358888 (https://phabricator.wikimedia.org/T167834) (owner: 10GWicke) [00:33:06] (03CR) 10Dzahn: "a refresh of the pdfrender service has been triggered by puppet run on scb1001" [puppet] - 10https://gerrit.wikimedia.org/r/358888 (https://phabricator.wikimedia.org/T167834) (owner: 10GWicke) [00:33:39] (03PS4) 10Dzahn: keyholder: add stretch support, fix key validity check [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) [00:34:07] (03CR) 10Chad: [C: 032] ExtensionDistributor: Add REL1_29, drop REL1_23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [00:35:08] (03Merged) 10jenkins-bot: ExtensionDistributor: Add REL1_29, drop REL1_23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [00:35:43] (03CR) 10jenkins-bot: ExtensionDistributor: Add REL1_29, drop REL1_23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [00:39:23] (03CR) 10Dzahn: [C: 032] keyholder: add stretch support, fix key validity check [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [00:39:33] !log demon@tin Synchronized wmf-config/CommonSettings.php: extdist update (duration: 00m 44s) [00:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:47] RECOVERY - pdfrender on scb1003 is 
OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time [00:44:57] !log naos: disarm keyholder and armed it again to proof i didn't break anything on jessie by fixing keyholder on stretch with gerrit:358884 [00:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:32] (03CR) 10Dzahn: "17:48 < mutante> !log naos: disarm keyholder and armed it again to proof i didn't break anything on jessie by fixing keyholder on stretch " [puppet] - 10https://gerrit.wikimedia.org/r/358884 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [00:47:24] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3346776 (10Benoit_Rochon) Thank you Reddy for your commitment. Everyboby here a... [00:52:32] PROBLEM - pdfrender on scb2004 is CRITICAL: connect to address 10.192.16.36 and port 5252: Connection refused [00:52:32] PROBLEM - pdfrender on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 5252: Connection refused [00:58:47] PROBLEM - pdfrender on scb1004 is CRITICAL: connect to address 10.64.48.29 and port 5252: Connection refused [01:00:26] !log netmon1002 - locally "git clone /var/lib/rancid/GIT/core" into /var/lib/rancid (i rsynced that but it's a bare repository without a work tree. 
work tree is /var/lib/rancid/core (after this) (T159756) [01:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:36] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [01:01:59] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3346794 (10RobH) yep, walked away and now its working and installer is running on dumpsdata1002 [01:20:43] !log netmon1002 - copied missing router.db, routers.all/.down/.up over from netmon1001 to /var/lib/rancid/core. routers.db is an untracked file, the others are in .gitignore. this is all like on netmon1001 as well. adding routers.db to .gitignore file on both, like the other router* files already were (T159756) [01:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:52] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [01:22:32] 10Operations, 10netops: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3346480 (10BBlack) What are the real pros and cons on this? We could even go in the other direction and have a unique ASN per region/continent. How does the impact future anycasting? Note https://tools.ietf.org/ht... 
[01:48:11] !log netmon1002 - chown rancid:rancid /var/lib/rancid ; touch /var/lib/rancid/.gitconfig, let rancid write to config, then git config --global user.email and user.name as the rancid user | fix permissions on .git/objects files, let rancid user own them all | re-commit .gitingore change | SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/lib/rancid/bin/rancid-run as user "rancid" runs clean, [01:48:17] finally and ends with "All routers succesfully completd" (T159756) [01:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:19] T159756: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756 [02:02:55] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3339555 (10Johan) (As this has been marked with user-notice.) How tentative is that date? Is it worth spreading the word mentioning at specific date yet? [02:06:37] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3346838 (10RobH) a:05RobH>03ArielGlenn [02:07:11] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (10RobH) ready for service implementation, assigned to ariel for followup. (this task can either be used to track its implementation, or can be resolved.) [02:32:16] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 07m 58s) [02:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:46] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3346884 (10Dzahn) < mutante> !log netmon1002 - chown rancid:rancid /var/lib/rancid ; touch /var/lib/rancid/.gitconfig, let rancid write to config, then git config --global user.email and user.name as th... 
[03:07:32] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.5) (duration: 14m 52s) [03:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 14 03:14:28 UTC 2017 (duration 6m 56s) [03:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:26] (03PS7) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [04:27:55] (03PS8) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [04:40:56] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:41:29] (03CR) 10Chad: bd808's dotfiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) [04:42:46] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:43:26] RainbowSprinkles: would that alias be "push origin HEAD:refs/for/production" ? [04:44:41] Yep [04:47:22] Oh, and *.pyc for gitignore. Does *anyone* ever want to commit those? [04:47:25] :p [04:55:05] (03PS9) 10BryanDavis: bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 [04:56:48] (03CR) 10BryanDavis: bd808's dotfiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) [05:00:27] bd808: We should build dotfiles as a service ;-) [05:06:12] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3341896 (10Marostegui) Once the database and tables are there, let us (DBAs) know... 
[05:56:36] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=296.30 Read Requests/Sec=2603.20 Write Requests/Sec=7.60 KBytes Read/Sec=24381.20 KBytes_Written/Sec=3539.60 [06:06:36] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.60 Read Requests/Sec=0.40 Write Requests/Sec=1.00 KBytes Read/Sec=9.60 KBytes_Written/Sec=10.80 [06:32:17] !log installing remaining libtasn security updates [06:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:10] !log restart pdfrender on scb1004 (xpra race condition) [07:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:46] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.078 second response time [07:04:41] !log restart pdfrender on scb200[2,4] (xpra race condition) [07:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:06] RECOVERY - pdfrender on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [07:06:06] RECOVERY - pdfrender on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time [07:08:37] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban), 10Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3347026 (10Marostegui) p:05Triage>03Normal [07:09:58] 10Operations, 10Electron-PDFs, 10Services, 10Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3347028 (10elukey) Restarted `pdfrender` on scb1004, scb200[2,4] today, xpra race condition traces in `/srv/log/pdfrender` [07:10:44] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3347029 (10Marostegui) p:05Triage>03Normal [07:11:13] (03PS3) 10Ema: varnish mobile 
redirects: allow for dashes in first label [puppet] - 10https://gerrit.wikimedia.org/r/358028 (https://phabricator.wikimedia.org/T167492) (owner: 10BBlack) [07:11:38] (03CR) 10Ema: [V: 032 C: 032] varnish mobile redirects: allow for dashes in first label [puppet] - 10https://gerrit.wikimedia.org/r/358028 (https://phabricator.wikimedia.org/T167492) (owner: 10BBlack) [07:14:02] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3347033 (10Marostegui) @MarcoAurelio just ping me when you want to do this. We did (T167597) Monday morning (European morning) and we had no issues, so maybe around... [07:19:12] 10Operations, 10Electron-PDFs, 10Services, 10Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3347040 (10MoritzMuehlenhoff) p:05Normal>03High I think the only reliable way to fix this would be to add a systemd unit to xp... [07:27:37] (03PS5) 10Ema: VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 [07:32:09] !log installing zziplib security updates on jessie [07:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:02] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Mobile, 10Patch-For-Review: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org - https://phabricator.wikimedia.org/T167492#3347046 (10ema) 05Open>03Resolved a:03ema As @Legoktm mentio... 
[07:59:20] !log Drop table updates on s3 - T139342 [07:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:30] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342 [08:06:23] (03PS1) 10Filippo Giunchedi: graphite: introduce graphite::whisper_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/358903 (https://phabricator.wikimedia.org/T1075) [08:07:25] (03CR) 10jerkins-bot: [V: 04-1] graphite: introduce graphite::whisper_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/358903 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [08:09:33] (03PS2) 10Filippo Giunchedi: graphite: introduce graphite::whisper_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/358903 (https://phabricator.wikimedia.org/T1075) [08:10:38] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3347081 (10jcrespo) [08:11:06] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10jcrespo) Taking db1096 for T167567 [08:11:41] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3347085 (10jcrespo) [08:17:38] (03PS25) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [08:18:44] (03CR) 10jerkins-bot: [V: 04-1] maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (owner: 10Gehel) [08:19:38] (03PS26) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [08:23:38] (03CR) 10Hashar: [C: 031] bd808's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) [08:23:47] (03PS1) 10Gehel: maps - add dummy redis password [labs/private] - 10https://gerrit.wikimedia.org/r/358906 [08:24:10] (03CR) 10Gehel: [V: 032 C: 032] maps - add dummy redis password [labs/private] - 
10https://gerrit.wikimedia.org/r/358906 (owner: 10Gehel) [08:25:16] (03PS1) 10Jcrespo: Add temporary parsercache machines to both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/358907 (https://phabricator.wikimedia.org/T167567) [08:25:26] (03CR) 10Filippo Giunchedi: [C: 031] "PCC https://puppet-compiler.wmflabs.org/6766/graphite1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/358903 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [08:25:30] (03CR) 10Filippo Giunchedi: [C: 032] graphite: introduce graphite::whisper_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/358903 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [08:25:34] (03PS3) 10Filippo Giunchedi: graphite: introduce graphite::whisper_cleanup [puppet] - 10https://gerrit.wikimedia.org/r/358903 (https://phabricator.wikimedia.org/T1075) [08:27:04] (03PS2) 10Jcrespo: Add temporary parsercache machines to both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/358907 (https://phabricator.wikimedia.org/T167567) [08:29:42] (03CR) 10Gehel: "Actually, we are already on 5.3.3 on the logstash production cluster (and on deployment-prep). 
So "really soon (tm)" is actually "already " [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [08:31:44] (03CR) 10Marostegui: [C: 031] "Looks good: https://puppet-compiler.wmflabs.org/6770/" [puppet] - 10https://gerrit.wikimedia.org/r/358907 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [08:33:33] (03PS3) 10Jcrespo: Add temporary parsercache machines to both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/358907 (https://phabricator.wikimedia.org/T167567) [08:34:21] (03PS1) 10Filippo Giunchedi: role: cleanup zuul data in graphite [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) [08:35:39] (03PS6) 10Ema: VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 [08:35:46] (03CR) 10Ema: [V: 032 C: 032] VCL: use resp.reason for synthetic responses generation [puppet] - 10https://gerrit.wikimedia.org/r/358057 (owner: 10Ema) [08:37:06] (03PS4) 10Jcrespo: Add temporary parsercache machines to both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/358907 (https://phabricator.wikimedia.org/T167567) [08:39:37] (03CR) 10Jcrespo: [C: 032] Add temporary parsercache machines to both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/358907 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [08:41:55] (03CR) 10Hashar: [C: 04-1] "\O/ Let me find out the retention time per metric namespace :] And I guess I will just amend the patch." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [08:45:07] PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[pt-heartbeat] [08:45:30] jynus ^ [08:46:08] (03CR) 10Filippo Giunchedi: role: cleanup zuul data in graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [08:46:12] yes, yes [08:46:18] it is WIP [08:46:34] heartbeat cannot run without mysql running or something [08:46:52] yes, just giving you the heads up for that failure :) [08:46:53] thanks [08:56:29] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 2026443 [08:57:36] (03CR) 10Filippo Giunchedi: "Hashar, also I'm assuming we could do the same for nodepool?" [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [09:01:56] (03PS27) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [09:12:58] (03PS1) 10Jcrespo: mariadb-prometheus: Add temporary hosts to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/358913 (https://phabricator.wikimedia.org/T167567) [09:13:21] (03PS2) 10Jcrespo: mariadb-prometheus: Add temporary hosts to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/358913 (https://phabricator.wikimedia.org/T167567) [09:15:20] 10Operations, 10RESTBase-API, 10Traffic, 10Patch-For-Review: [feature request] Redirect root API path to docs page - https://phabricator.wikimedia.org/T125226#3347209 (10ema) p:05Triage>03Normal [09:17:01] (03CR) 10Marostegui: [C: 031] mariadb-prometheus: Add temporary hosts to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/358913 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [09:17:13] (03PS2) 10Ema: Redirect /api/rest_v1 to RESTBase docs page [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) (owner: 10Ppchelko) [09:18:17] !log delete files older than 365d from 'servers' graphite hierarchy [09:18:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:52] 10Operations, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Switch CI tests back to HHVM 3.18 - https://phabricator.wikimedia.org/T167493#3347232 (10MoritzMuehlenhoff) Can be closed, right? [09:21:56] (03PS28) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [09:22:31] 10Operations, 10HHVM, 10Patch-For-Review, 10Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3347233 (10MoritzMuehlenhoff) The migration to HHVM 3.18 is almost complete; the two remaining systems (terbium/wasat) will be updated tomorrow. [09:23:26] (03PS3) 10Ema: Redirect /api/rest_v1 to RESTBase docs page [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) (owner: 10Ppchelko) [09:24:52] 10Operations, 10MediaWiki-Vagrant, 10HHVM: hhvm-gdb on mediawiki vagrant gives ImportError: No module named unwinder - https://phabricator.wikimedia.org/T165101#3347281 (10MoritzMuehlenhoff) [09:25:49] 10Operations, 10MediaWiki-Vagrant, 10HHVM: hhvm-gdb on mediawiki vagrant gives ImportError: No module named unwinder - https://phabricator.wikimedia.org/T165101#3256886 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff gdb in jessie-wikimedia has been upgraded to 7.11.1 to support debugging... [09:26:36] 10Operations, 10HHVM: HHVM segfault with mediawiki/core / Scribunto - https://phabricator.wikimedia.org/T166550#3347301 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff That was caused by a bug in the memory allocation code in HHVM, which is fixed in HHVM 3.18.2+wmf5, closing. 
[09:27:53] 10Operations, 10Datasets-General-or-Unknown, 10Wikidata, 10HHVM, and 2 others: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3347305 (10MoritzMuehlenhoff) We're now using 3.18 in production (except snapshot* and video scalers hosts, which are still on trusty) [09:28:47] (03CR) 10Jcrespo: [C: 032] mariadb-prometheus: Add temporary hosts to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/358913 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [09:30:09] (03PS29) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [09:31:39] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 2023555 [09:46:54] !log Rename table titlekey before dropping it on enwiki - db1089 - T164949 [09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:05] T164949: Drop titlekey table from all wmf databases - https://phabricator.wikimedia.org/T164949 [09:50:29] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 213015.01 seconds [09:51:07] (03PS1) 10Jcrespo: mariadb: Enable file-per-table option on parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/358918 (https://phabricator.wikimedia.org/T167567) [09:51:59] did icinga fail again or just an expiration? [09:52:42] Expiration I think [09:53:15] do I renew it, how much time? [09:53:23] I will take care of it [09:53:25] no worries [09:53:30] ok, thanks [10:07:03] (03CR) 10Marostegui: [C: 031] mariadb: Enable file-per-table option on parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/358918 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [10:07:44] Did you update something last night? 
[10:07:50] https://en.wikisource.org/w/index.php?title=Page:The_Innocents_Abroad_(1869).pdf/683&action=edit&redlink=1 is VERY slow
[10:07:56] (03CR) 10Jcrespo: [C: 032] mariadb: Enable file-per-table option on parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/358918 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo)
[10:08:54] ShakespeareFan00: It loaded quite fast on my side
[10:09:00] Odd
[10:09:04] Cache delay?
[10:09:23] Because it's VERY slow for me
[10:10:00] can you ping en.wikisource.org?
[10:10:10] and give us the ip you get?
[10:10:49] 91.198.174.192
[10:11:31] that is text-lb.esams.wikimedia.org, same dc we are using
[10:11:51] ShakespeareFan00: I am reviewing what was deployed last night fyi
[10:12:21] jynus: It may be related to an image server not en.wikisource.org as such
[10:12:42] Or some interaction with what does the thumbnailing for djvu/PDF
[10:12:52] yeah, I only wanted to know which datacenter you were using
[10:12:59] Naturally I've disabled any ad-blocker
[10:13:00] I had a quick look too, pinging en.wikisource.org, I'm getting 91.198.174.192 but after one or two successful pings, get "Request timeout for icmp_seq" errors.
[10:13:23] page loaded instantly for me though.
[10:13:46] It's not a big issue for me, after all it takes time to proofread pages, but it would be nice to know why it's slow
[10:14:16] the ping was giving a time around 32ms
[10:14:33] what if you reload, does it happen again?
[10:14:36] ShakespeareFan00: same for me
[10:14:42] but the page loads straightaway
[10:14:50] jynus: Still get slow loading times on a reload
[10:15:20] It james up on "tools-static.wmflabs.org" in the status bar
[10:15:27] *jams
[10:16:16] well, then highly likely it could be some extra javascript added on your user page or done by the admins at the wiki
[10:17:19] That's plausible
[10:17:25] I am not being too successful scanning yesterday's train changes and looking for something that might be related
[10:19:55] Disabled user javascript and it's STILL slow
[10:20:54] Can you enable the developer tools on the browser and check the network to see what exactly is slow maybe?
[10:21:45] It seems to be slow when it's requesting a scanned page for the first time..
[10:21:55] Once cached it loads almost instantly
[10:21:56] tools-static.wmflabs.org is not included
[10:22:04] on the default page rendering
[10:22:13] so it is clearly some user additions
[10:22:30] or wiki additions
[10:22:43] Right
[10:22:56] So the question is how to fix it?
[10:23:24] that would be offtopic here, wikimedia-tech would be the way to get help
[10:23:39] #wikimedia-tech
[10:23:39] ShakespeareFan00: This is what I was checking btw: https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.5
[10:24:08] they most certainly will be able to help you there :-)
[10:25:45] 10Operations, 10HHVM, 10Patch-For-Review, 10Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3347402 (10hashar)
[10:25:49] 10Operations, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Switch CI tests back to HHVM 3.18 - https://phabricator.wikimedia.org/T167493#3347400 (10hashar) 05Open>03Resolved Indeed nothing reported. Thank you very much !
[10:28:08] ShakespeareFan00: this is what you should get: https://phab.wmfusercontent.org/file/data/c3du6xlxekvo72ysccoj/PHID-FILE-kc6lphxqdtntjp2gxd3r/web_profiling.png
[10:28:23] with no external references which may be slowing down your experience
[10:32:54] hashar: around?
[10:35:19] jynus: yes
[10:35:39] and working?
[10:37:17] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3347409 (10Samwalton9) Not directly, but I might have someone who does have direct contact. Could you give m...
[10:38:16] jynus: what do you mean? Whether I am fully functioning?
[10:38:32] available, I mean
[10:38:39] yes yes !!!
[10:38:44] sorry you got me confused :)
[10:38:47] he he
[10:38:51] sorry
[10:39:17] Did some checking in logs - mw1297.eqiad.wmnet is where it thinks the scanned images are from
[10:39:25] I can't ping that server though
[10:39:30] so I wanted to tell you that I am going to take over deployment for a few pages
[10:39:36] *hours
[10:39:43] I will add it to deployment
[10:39:52] and do it now so I do not interrupt other works
[10:40:02] deployments later
[10:40:15] ShakespeareFan00: You cannot ping that server, it doesn't have a public iface
[10:40:23] Fair enough
[10:40:30] hashar: hope you are ok with that
[10:40:31] but that's where the lag seems to be occurring
[10:40:35] let me see if there is something wrong with it
[10:41:05] It seems to be intermittent..
[10:41:18] As once a scan image is cached, it loads very quickly
[10:41:23] jynus: most probably
[10:41:50] it is going to be a bit risky, but not doing it will be worse
[10:41:51] jynus: note that gehel has scheduled "switch Cirrus search traffic to codfw" at 13:00 UTC / 15:00 CEST
[10:41:58] mmm
[10:42:05] I hadn't seen that
[10:42:06] but I guess we can push it to a bit later
[10:42:15] https://wikitech.wikimedia.org/wiki/Deployments#Wednesday.2C.C2.A0June.C2.A014 :D
[10:42:20] no
[10:42:22] I can wait
[10:42:29] oh
[10:42:35] it is 15 CEST
[10:42:38] I can do it before
[10:42:41] 2 hours + from now
[10:42:47] GO GO GO !!!
[10:42:51] this should take 20 minutes
[10:43:01] and either it works or breaks completely
[10:43:03] marostegui: I don't know if that server (or related) IS the cause of the slow performance, but thought I better mention that's where the lag seemed to be
[10:43:20] so either 20 minutes or 2 days ? -:}
[10:43:36] ShakespeareFan00: Sure. I am checking en.wikisource.org events and everything seems to be normal error-wise
[10:43:50] actually yes, the other route will take 3x30 days
[10:44:01] I see some references to deprecated ResourceLoader stuff this end though
[10:45:24] jynus: so in short yes sounds good
[10:45:43] hashar: sorry to bother you, but I prefer to warn someone and bother people than the opposite
[10:46:04] I will take care of revert and monitoring, that shouldn't be an issue
[10:46:18] marostegui: It does seem to be the image that causes the lag
[10:46:33] yeah it is always nice to announce changes ahead of time. And I am quite happy to be the listener / support you
[10:46:40] Does it take a while for the backend to render up a thumbnail from PDF is there's not one existing?
[10:46:48] *if
[10:47:32] I've had this slow performance on "new" thumbnails from djvu/pdf before..
[10:49:21] But it shouldn't take over 1 min to render a scan should it?
[10:49:42] hashar: added it here https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T1100
[10:50:22] marostegui: Thanks
[10:53:23] ShakespeareFan00: I cannot really find anything wrong there :(
[10:53:34] Then I am puzzled
[10:53:47] Because I should not be waiting over 1 min for an image to load
[10:54:03] When I've switched out anything that could be delaying
[10:54:56] Could it be your connection? I mean, 3 of us didn't have issues loading that page, could it be your end? Don't know
[10:55:11] Let me ask someone else to see if they have issues too
[10:56:36] I don't see how it could be the connection this end when the main site loads instantly
[10:56:44] It's only the images that lag
[10:56:49] "intermittently"
[10:57:01] I have asked two more people, one in Italy and another one in England, and it loaded fine too :(
[10:57:14] Then I remain puzzled
[10:57:20] Me too
[10:57:28] There shouldn't be anything locally causing an over 1 min lag
[10:57:29] Let me try with a proxy
[10:58:42] The only possible thing I can think of is that being in the UK, WMF sites are being "filtered" for adult content by an ISP
[10:58:48] But I have no control over that
[10:59:06] jynus: nice :-}
[10:59:08] ShakespeareFan00: One of the ones I asked is based in England (no idea about his ISP)
[10:59:21] ShakespeareFan00: Trying now with a proxy over Norway and Germany. Let me try a UK one
[10:59:23] Thanks
[10:59:55] It only seems to occur on first access..
[11:00:04] jynus: Dear anthropoid, the time has come. Please deploy One time deployment of patch to try to fix parsercaches issues (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T1100).
[11:00:05] jynus: A patch you scheduled for One time deployment of patch to try to fix parsercaches issues is about to be deployed. Please be available during the process.
[11:00:15] Once an image is in the cache (either locally or at WMF) it loads instantly [11:01:51] lol [11:02:03] I can see the issue ShakespeareFan00 is referring to. The PDF thumbnailer seems extremely slow [11:02:23] tto: I figured it may be something related to that [11:03:28] It's not a particularly hi-res PDF... maybe it's just the overall file size (61 MB) that is making it choke [11:03:59] For wikisource documents 61MB is not that large [11:04:04] So ... [11:04:31] To me it keeps working fine (just tried a UK proxy) [11:04:33] And bear in mind the scans were being asked for page by page [11:04:37] (03PS1) 10Jcrespo: parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) [11:05:02] marostegui: Bear in mind, as I said, once an image is thumbnailed/generated it loads very quickly [11:05:03] marostegui, try https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/The_Innocents_Abroad_%281869%29.pdf/page704-788px-The_Innocents_Abroad_%281869%29.pdf.jpg - but replace 788 by a new number to force a new image to be generated [11:05:34] ProofreadPage doesn't use normal thumbnail sizes, so the images are not pre-generated and must be created from scratch the first time a book is proofread [11:06:12] ShakespeareFan00, I'd suggest filing a Phabricator task [11:06:27] tto: yep, that is slow [11:08:18] ShakespeareFan00: yes, please file a Phabricator task with the details so we can track it and act on it [11:11:21] I'll certainly consider doing that if it persists.. [11:11:30] Do you mind if I quote you? [11:11:46] You can quote me :) [11:12:30] tto:?
[11:12:44] Fine [11:14:47] Okay, Ill give it few hours to clear if it does , but will now be considering a phab ticket , Thanks :) [11:14:51] * ShakespeareFan00 out [11:14:56] ShakespeareFan00: Thanks [11:16:48] 10Operations, 10Wikimedia-IRC-RC-Server, 10User-notice: Reboot irc.wikimedia.org for kernel upgrades - https://phabricator.wikimedia.org/T167643#3347462 (10akosiaris) It's dependent on a kernel upgrade due to be released on the 19th. That has already been rescheduled once. I wish I could provide a degree of... [11:16:56] (03CR) 10Alexandros Kosiaris: [C: 032] irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [11:17:03] (03PS5) 10Alexandros Kosiaris: irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [11:17:06] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] irc echo: Convert package from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/358626 (owner: 10Paladox) [11:17:48] thanks :) [11:19:03] (03CR) 10Marostegui: [C: 031] parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [11:19:48] !log Deploy alter table s4 - labsdb1011 - T166206 [11:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:58] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [11:26:59] (03PS2) 10Jcrespo: parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) [11:27:03] jynus, hashar: I was on lunch break, but yes, there is no emergency for my deployment [11:28:00] don't worry, we are going to do it now, and either finish or revert immediately [11:28:30] (03CR) 
10Marostegui: [C: 031] parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [11:28:53] so I am going to deploy that [11:29:10] if it fails, I will rebase to HEAD~1 and redeploy immediately [11:29:15] not waiting for gerrit [11:29:22] sounds good [11:29:27] I do not think it would be an immediate problem [11:29:36] I think IF there is a problem [11:29:49] it would be something more like a slowdown over some period [11:30:04] sounds good! [11:30:14] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [11:30:20] memcache is up, and there are some failsafes [11:30:29] write load may be an issue [11:30:43] but that is why I chose a monster 500GB server as temporary [11:30:54] And SSDs! [11:31:10] the load, however is 1000 reads/s [11:31:15] which should be doable [11:31:42] the problem would be either the parser workflow [11:31:50] or the es servers [11:31:59] (03PS1) 10KartikMistry: Fix ContentTranslationTargetNamespace value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358929 (https://phabricator.wikimedia.org/T167865) [11:32:12] as you see, it is a highly calculated risk [11:32:24] and done in a way that is very easy to revert [11:32:53] maybe I am worrying too much and all keys are actually logically deleted [11:33:01] and ignored [11:33:11] in that case, we will not have any issue [11:33:22] (03CR) 10Jcrespo: [C: 032] parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [11:34:16] !log about to deploy performance-impacting change on the parsercache persistent storage T167567 [11:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:24] T167567: Migrate 
parsercache hosts to file per table - https://phabricator.wikimedia.org/T167567 [11:35:55] I will keep an eye on the several pc hosts and mediawiki errors [11:36:09] I am keeping an eye on the new hosts [11:36:10] (03Merged) 10jenkins-bot: parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [11:36:16] someone should keep an eye on performance of app servers [11:37:02] I can keep an eye on some grafana dashboards [11:37:09] I have performance metrics open already [11:37:14] and maybe the ^ [11:37:25] yeah, that is certainly going to get a bit worse [11:37:28] for a while [11:38:02] actually [11:38:25] can we run the script [11:38:35] the one for codfw failover now? [11:39:06] no, that wouldn't work, because it would only be useful if it regenerated the caches [11:39:16] better doing it like this, ok deploying now [11:40:09] (03PS2) 10KartikMistry: Remove ContentTranslationTargetNamespace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358929 (https://phabricator.wikimedia.org/T167865) [11:40:13] go [11:41:09] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Switchover pc1004 to db1096 (duration: 00m 54s) [11:41:16] I will wait for codfw [11:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:21] not reason if it doesn't work [11:41:23] 10Operations, 10netops: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3347521 (10akosiaris) [11:41:26] *no [11:42:09] spike of connections, but mostly ok, I think [11:42:15] yes, so far so good [11:42:24] stable at 180-200 [11:42:26] (03CR) 10Santhosh: [C: 031] Remove ContentTranslationTargetNamespace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358929 (https://phabricator.wikimedia.org/T167865) (owner: 10KartikMistry) [11:42:29] the problem is performance now [11:43:01] volans: performance metrics okish?
[11:43:23] it's a 1m metric... so early to say [11:43:26] so far so good [11:43:27] I do not see 5XX [11:43:40] which is expected, but would be a reason to revert immediately [11:44:14] so far it is looking good [11:44:32] writes are very slow [11:44:47] like 10 per second or so [11:44:51] that can be good or bad [11:45:15] (03CR) 10jenkins-bot: parsercache: Switchover pc1004 and pc2004 to db1096 and db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358926 (https://phabricator.wikimedia.org/T167567) (owner: 10Jcrespo) [11:45:41] if someone sees a reason to revert, shout [11:45:48] I will keep looking [11:46:12] nothing weird so far [11:46:26] that is weird by itself :-/ [11:46:56] well, it is a massive server so… [11:47:08] the problem is not so much the server [11:47:17] but the large invalidation we are forcing [11:47:41] the server is a tank, but if we are forcing app servers to work more, that is a problem [11:47:49] * volans wonders if we're using those caches at all :D [11:48:04] I will check the ES servers [11:48:36] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3347537 (10Mvolz) >>! In T165105#3347409, @Samwalton9 wrote: > Not directly, but I might have someone who do... [11:48:47] jynus: i was checking those and so far so good [11:49:12] slightly higher load than usual, but almost impossible to appreciate [11:49:25] I can see some increased metrics in HHVM, but noone seems problematic to me [11:49:28] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&from=now-1h&to=now [11:49:31] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3347538 (10Samwalton9) Yes - thanks! Do you have the IP address, so I can pass it along? 
[11:49:42] yes, that is expected [11:50:02] but hopefully it should have a downwards trend [11:50:35] yeah [11:51:03] pc1 has actually lower load [11:51:12] because less data, presumably [11:51:19] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=parsercache&var-shard=pc1&var-role=All [11:51:49] the problem is we could not discard a time bomb- request getting delayed [11:52:10] it could also be the less deletes being sent [11:52:16] *fewer [11:52:24] 10Operations, 10Citoid, 10VisualEditor, 10Services (blocked), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3347539 (10Mvolz) >>! In T165105#3347538, @Samwalton9 wrote: > Yes - thanks! Do you have the IP address, so... [11:52:40] https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&orgId=1&from=now-3h&to=now-5m this is expected I would say [11:53:03] yes [11:53:08] why is it not going to zero? [11:53:20] because we have only emptied 33% of the keys [11:53:32] less than that, actually [11:53:35] like 30% [11:53:36] pc1 shard should be zero no? [11:53:44] in usage [11:53:51] no, it was replicating from the other shard [11:54:01] plus that includes memcache, probably [11:54:10] which we haven't touched [11:54:19] pc hosts are only the persistence layer of the cache [11:54:24] the disk cache [11:54:51] so you can see that the risk was not that large [11:55:09] wait, I'm referring to the link you posted jynus, mysql-aggregated, I'm assuming the temp DBs are not there [11:55:18] they are [11:55:42] then I would have expected an increase in writes, not a lowering of them [11:55:50] no more deletes!
[11:56:00] because everything is purged [11:56:06] technically, 1/2 of the writes [11:56:37] rows are not inserted but no longer deleted [11:56:45] because no more old keys [11:56:46] I'm not sure I follow, before they were updated [11:56:56] and old keys not deleted, that was the bug AFAIK :) [11:57:05] anomie added the functionality to purge old keys on insert of the new ones [11:57:14] I tried to explain that to you before [11:57:15] and it was deployed? when? [11:57:24] on sunday? [11:58:39] cannot be, it was merged on gerrit at 6AM this morning: https://gerrit.wikimedia.org/r/#/c/358239/ [11:58:40] also, less rows, means no purging [11:58:42] so I doubt is in prod [11:58:49] or if it is nobody told us :D [11:58:59] it doesn't matter [11:59:08] this will make that unneeded [11:59:08] actually, is +2, not submitted [11:59:14] I know [11:59:22] but doesn't explain the graphs to me :D [11:59:28] no deletes [11:59:31] no purges [11:59:44] purges happen at the same time than inserts [11:59:45] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-spa-ita: New upstream release [debs/contenttranslation/apertium-spa-ita] - 10https://gerrit.wikimedia.org/r/358570 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [11:59:47] *same rate [11:59:55] no rows >30 days [11:59:56] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/358517 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [11:59:58] right now [12:00:33] also less reads because there are less cached rows [12:01:03] rows read are engine statistics, not query stats [12:01:19] if a query reads 0 rows, 0 are added instead of the usual 1 [12:01:26] same for writes [12:02:30] I will now deploy codfw and switch replication direction [12:02:38] ok [12:03:03] the app servers load is a bit high [12:03:56] can we handle it?
[12:04:12] 10Operations, 10hardware-requests: codfw: (1) labtest puppetmaster - https://phabricator.wikimedia.org/T164515#3347568 (10faidon) [12:04:25] 10Operations, 10Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3347571 (10faidon) [12:04:27] 10Operations, 10Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#3347572 (10faidon) [12:05:00] i think so [12:05:14] let's wait 5 minutes [12:05:17] before going for codfw [12:06:53] 10Operations, 10Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3347577 (10faidon) [12:09:42] a few random hits doesn't make me feel uncached requests are much more slower [12:09:44] 10Operations, 10hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3347590 (10faidon) [12:10:54] but app load is still high [12:10:59] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 2079203 [12:11:05] (03CR) 10Ladsgroup: "I spent around several hours to find a solution and I couldn't find one, this is the best thing I've got https://serverfault.com/questions" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [12:13:15] !log upload apertium-fra-cat_1.2.0~r78602-1+wmf to apt.wikimedia.org/jessie-wikimedia/main [12:13:22] !log upload apertium-spa-ita_0.2.0~r78826-1+wmf to apt.wikimedia.org/jessie-wikimedia/main [12:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:26] kart_: ^ [12:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:52] so in reality we only have an extra 10% miss [12:14:59] 10Operations: Fix config file 
handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306#3347594 (10MoritzMuehlenhoff) p:05Triage>03Low a:05MoritzMuehlenhoff>03None As long as we build our own HHVM packages, we can live without that, lowering priority. That was a problem in the package fr... [12:15:19] because we have to wait for memcache to expire the strings [12:15:24] jynus yeah, I don't think we are in trouble [12:15:47] we are making the apps work a bit more [12:16:16] the problem is that the misses here are very expensive, depending on the page [12:18:18] should we wait for codfw then? [12:18:28] I do not see why [12:18:32] there is nothing going on on codfw [12:18:54] no writes should be done there, plus there is no replication either [12:19:49] Just to make sure everything is fine and then we can go and deploy codfw and keep working on it [12:20:01] disk space is growing very slowly: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=db2072&var-network=bond0&from=1497431977411&to=1497442777411 [12:20:14] (click on /srv) [12:20:46] 10Operations, 10Discovery, 10Maps, 10Interactive-Sprint: Refactor maps puppet code to the role / profile paradigm - https://phabricator.wikimedia.org/T167871#3347632 (10Gehel) [12:20:56] not bad yeah [12:21:05] 10Operations, 10Discovery, 10Maps, 10Interactive-Sprint: Refactor maps puppet code to the role / profile paradigm - https://phabricator.wikimedia.org/T167871#3347632 (10Gehel) p:05Triage>03High [12:21:54] we have now around 3% of the previous keys [12:22:00] I can add more manually [12:22:07] (03PS30) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 (https://phabricator.wikimedia.org/T167871) [12:22:17] so we do not have to wait a full month [12:22:50] I guess you want to copy only keys generated after wmf.4 was deployed right?
[12:23:05] yes [12:23:16] I can copy based on timestamp [12:23:36] yeah exptime > deploy date + 30d [12:24:20] !log jynus@tin Synchronized wmf-config/db-codfw.php: Switchover pc2004 to db2072 (duration: 00m 43s) [12:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:40] but anyway will be a long process given that we can use only new keys [12:25:09] unless we have a way to know which keys where unchanged with the MW change and copy oldest one too that don't change the keyname [12:25:24] 10Operations, 10Discovery, 10Maps, 10Interactive-Sprint, 10Patch-For-Review: Refactor maps puppet code to the role / profile paradigm - https://phabricator.wikimedia.org/T167871#3347677 (10Gehel) During this refactoring, we want to reuse the standard `profile::redis::master`. This does not allow anonymou... [12:26:14] (03PS1) 10Muehlenhoff: Reimage mw2118 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/358943 [12:28:32] I will talk to the deployer to see if we can make an heuristic [12:29:27] (03CR) 10Nemo bis: "Diego, thanks again for your patch! Do you think you can amend the patch as suggested above (code comments and so on)?" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr) [12:31:22] jynus, marostegui: I need to get grab something for lunch now [12:31:36] volans: enjoy! thanks for your help [12:31:46] feel free to ping me if needed ;) [12:31:55] akosiaris: thanks! 
[12:49:46] there are some long open connections on parsercaches, I think they render pages with the connection open [12:50:34] which is bad, but gives you an idea of how much times it takes to render- 0-20 seconds [12:51:49] jouncebot: next [12:51:49] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T1300) [12:52:07] hashar: there is only one commit for eu swat, but it looks a bit scary :) https://gerrit.wikimedia.org/r/#/c/358625/ [12:52:26] "Test swapping traffic over to recreate the circumstances that have previously caused the error." [12:53:36] (03PS1) 10Gehel: maps - add dummy redis password for tilerator / tileratorui [labs/private] - 10https://gerrit.wikimedia.org/r/358950 (https://phabricator.wikimedia.org/T167871) [12:56:28] (03PS2) 10Alexandros Kosiaris: apertium: Update package list [puppet] - 10https://gerrit.wikimedia.org/r/358575 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [12:56:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apertium: Update package list [puppet] - 10https://gerrit.wikimedia.org/r/358575 (https://phabricator.wikimedia.org/T167247) (owner: 10KartikMistry) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T1300). Please do the needful. [13:00:04] gehel: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:35] thanks jouncebot! I'm getting ready to push it and waiting for dcausse to be available (any time now) [13:01:14] zeljkof: oops. I added my patch in wrong SWAT window. [13:01:23] zeljkof: can I added to list quickly? [13:01:33] kart_: sure [13:01:36] o/ [13:01:59] zeljkof: you are doing this swat? 
[13:02:14] gehel: I'm around... [13:02:23] just wanted to ask you if you are pushing your change [13:02:43] gehel: zeljkof: jynus had a database maintenance [13:03:02] hashar: eu swat postponed? [13:03:05] not sure whether he has completed the maintenance/patch deployment he was doing [13:03:08] zeljkof: I have not pushed it yet, but I can do it myself if you have other things to do. [13:03:23] zeljkof: done. [13:03:29] jynus: are you done on the DB stuff? Let me know when I can start ... [13:03:34] gehel: please do :) but first check with jynus [13:03:35] zeljkof: it is simple config patch. [13:03:37] hashar: the patch was deployed already - we were monitoring the thingy [13:03:52] hashar: let's wait for jynus to confirm it is fine to proceed with swat [13:03:56] ok [13:03:59] lets hold a bit :} [13:04:05] fine by me! [13:04:06] I think it is fine, but I would like him to confirm [13:04:10] hashar: I've patch for SWAT :) [13:04:20] hashar: are you doing the swat? [13:04:31] zeljkof: you can do it :} [13:04:39] hashar: sure, just checking :) [13:04:41] I have just stepped in to warn about the database stuff [13:04:47] thanks [13:04:57] I was not aware of that [13:05:08] kartik patch can be deployed ( https://gerrit.wikimedia.org/r/#/c/358929/ ) [13:05:09] kart_: do you want to push your commit yourself? or should I do it? 
[13:05:19] zeljkof: do that :) [13:05:21] I dont see how it will overlap with the database thing [13:05:58] hashar: ok, deploying kart_'s change then [13:06:02] (03CR) 10Hashar: [C: 031] Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [13:07:25] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358929 (https://phabricator.wikimedia.org/T167865) (owner: 10KartikMistry) [13:09:09] (03Merged) 10jenkins-bot: Remove ContentTranslationTargetNamespace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358929 (https://phabricator.wikimedia.org/T167865) (owner: 10KartikMistry) [13:09:19] (03CR) 10jenkins-bot: Remove ContentTranslationTargetNamespace config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358929 (https://phabricator.wikimedia.org/T167865) (owner: 10KartikMistry) [13:09:25] hashar: uh oh, fatalmonitor says: [13:09:27] Every 2.0s: tail -n 1000 /srv/mw-log/hhvm.log | [13:09:34] tail: cannot open ‘/srv/mw-log/hhvm.log’ for reading: No such file or directory [13:09:48] (at mwlog1001) [13:11:19] kart_: can you test 358929 at mwdebug1002? or should I do a full deployment? [13:11:20] ;( [13:11:42] last entry is Jun 12 14:58 hhvm.log-20170613 [13:12:05] (03PS1) 10KartikMistry: Remove unneeded ContentTranslationTargetNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358958 [13:12:06] zeljkof: testing. [13:12:17] zeljkof: I've another follow-up patch too. [13:12:35] kart_: wait, did not push to mwdebug yet, in a sec [13:12:38] zeljkof: https://gerrit.wikimedia.org/r/358958 - it removed variable from CommonSettings.php [13:12:56] zeljkof: if you can merge and push it too, that'll be good. [13:13:00] (03CR) 1020after4: [C: 031] "bump?" [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [13:13:18] kart_: that's why I ask, is there anything to test at mwdebug? 
or should I do a full deploy? [13:13:57] 10Operations, 10Release-Engineering-Team, 10HHVM: mwlog1001.eqiad.wmnet ‘/srv/mw-log/hhvm.log’ for reading: No such file or directory - https://phabricator.wikimedia.org/T167878#3347823 (10hashar) [13:14:01] zeljkof: yes. need to test. give me a minute. [13:14:05] filled the task for lack of hhvm.log file [13:14:56] zeljkof: let me know once you push to mwdebug [13:15:17] kart_: 352567 should be at mwdebug1002, but there were some errors... pasting [13:15:59] hashar: moar errors :( https://phabricator.wikimedia.org/P5577 [13:16:26] scap pull at mwdebug1002 said cannot delete non-empty directory: php-1.30.0-wmf.3(...) [13:16:34] for several directories [13:16:45] zeljkof: doesn't seems to work. so, I guess due to these errors? [13:16:51] 10Operations, 10Release-Engineering-Team, 10HHVM: mwlog1001.eqiad.wmnet ‘/srv/mw-log/hhvm.log’ for reading: No such file or directory - https://phabricator.wikimedia.org/T167878#3347841 (10hashar) Most probably 1fd05f19dce1591f7c63c9cfd579379c7573caaa in puppet by @elukey ``` commit 1fd05f19dce1591f7c63c9cf... [13:17:08] kart_: could be [13:17:27] elukey: looks like the rsyslog patch for hhvm on Monday causes the logs to no more be relayed to mwlog1001.eqiad.wmnet :( https://phabricator.wikimedia.org/T167878 [13:18:26] hashar: he's out today btw, back next week I think [13:18:31] zeljkof: Okay. So, https://gerrit.wikimedia.org/r/#/c/358958/ also need to go along with it. It is followup patch. But since we are in different errors, lets wait. 
[13:18:54] (03CR) 10Hashar: hhvm: force rsyslog config to create log files with www-data perms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358318 (https://phabricator.wikimedia.org/T146464) (owner: 10Elukey) [13:20:24] but yeah I agree it seems indeed the stop statement [13:21:13] kart_: please add it to the deployments calendar [13:21:49] hashar: I am not sure should I continue, with fatalmonitor problems and scap pull problem https://phabricator.wikimedia.org/P5577 [13:22:15] zeljkof: Okay. Adding. [13:22:24] I'll look into fixing the hhvm.log issue [13:23:14] zeljkof: my patch can wait until next swat, so no pressure from me! Let's get back in a clean state first... [13:24:16] (03Abandoned) 1020after4: Add 90.231.10.86 to phabbanlist, this crawler is causing outages. [puppet] - 10https://gerrit.wikimedia.org/r/349342 (owner: 1020after4) [13:24:35] (03PS1) 10Hashar: hhvm: do not stop rsyslog rules processing too early [puppet] - 10https://gerrit.wikimedia.org/r/358959 (https://phabricator.wikimedia.org/T146464) [13:24:40] (03CR) 1020after4: [C: 031] "BUMP:" [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) (owner: 1020after4) [13:24:43] godog: found it [13:24:49] (03PS5) 1020after4: Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) [13:24:50] godog: https://gerrit.wikimedia.org/r/358959 if you dont mind [13:25:07] hashar: yeah I think that's it too, I'll merge it [13:25:19] godog: the rule that got introduced has a "stop" in it, that prevents a later rule that forwards to logstash from running. 
I should have caught that on review :(((((((((((((((((((((( [13:26:05] (03CR) 10Filippo Giunchedi: [C: 032] hhvm: do not stop rsyslog rules processing too early [puppet] - 10https://gerrit.wikimedia.org/r/358959 (https://phabricator.wikimedia.org/T146464) (owner: 10Hashar) [13:26:19] zeljkof: so, should we revert the patch? [13:26:50] kart_: not sure, waiting for hashar [13:26:58] OK! [13:27:21] hashar: heh, it happens :) monitoring for missing files or logs is doable but tricky I guess to maintain [13:27:37] logs to hhvm.log should be back shortly, on the next puppet run [13:28:25] I can force a run though if you are waiting on that too [13:28:36] na it is ok :) [13:28:58] hashar: did you see this? https://phabricator.wikimedia.org/P5577 [13:29:05] godog: puppet runs once per hour is it it? [13:29:08] more problems... [13:29:14] zeljkof: known scap issue [13:29:18] hashar: no half an hour [13:29:33] date [13:29:35] zeljkof: https://phabricator.wikimedia.org/T162207 [13:29:35] err [13:29:38] zeljkof: just ignore it. [13:30:17] hashar: ok, thanks continuing with swat then [13:30:49] kart_: can you test again at mwdebug1002? [13:31:01] zeljkof: yes. doing it. [13:31:15] (03PS1) 10Cmjohnson: Adding mac addresses for wtp1025-48 T165520 [puppet] - 10https://gerrit.wikimedia.org/r/358960 [13:33:27] zeljkof: looks good. [13:33:35] zeljkof: please also merge 2nd patch. 
[13:33:46] ok, deploying the first one and merging the second one [13:34:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358958 (owner: 10KartikMistry) [13:35:01] !log zfilipin@tin scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [13:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:48] kart_: deployment failed :( [13:35:52] :/ [13:36:10] (03CR) 10Cmjohnson: [C: 032] Adding mac addresses for wtp1025-48 T165520 [puppet] - 10https://gerrit.wikimedia.org/r/358960 (owner: 10Cmjohnson) [13:36:15] (03Merged) 10jenkins-bot: Remove unneeded ContentTranslationTargetNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358958 (owner: 10KartikMistry) [13:36:25] (03CR) 10jenkins-bot: Remove unneeded ContentTranslationTargetNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358958 (owner: 10KartikMistry) [13:36:33] 500 Undefined variable: wmgContentTranslationTargetNamespace in /srv/mediawiki/wmf-config/CommonSettings.php on line 3030 [13:36:38] zeljkof: kart_ ^^ [13:36:51] hashar: fixed in followup patch already. [13:37:03] hashar, kart_: just looking at it [13:37:05] godog: we have logs again \O/ :} [13:37:49] hashar: so, what to do, deploy the fix (https://gerrit.wikimedia.org/r/#/c/358958) before https://gerrit.wikimedia.org/r/#/c/358929/ [13:38:01] zeljkof: yes. [13:38:13] 10Operations, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwlog1001.eqiad.wmnet ‘/srv/mw-log/hhvm.log’ for reading: No such file or directory - https://phabricator.wikimedia.org/T167878#3347934 (10hashar) 05Open>03Resolved a:03hashar Fixed, the logs are back on mwlog1001 :} Than... 
[13:38:31] (03PS1) 10Alexandros Kosiaris: Revert "Switch einsteinium and tegmen roles" [puppet] - 10https://gerrit.wikimedia.org/r/358962 (https://phabricator.wikimedia.org/T164206) [13:38:41] kart_: can you test it at mwdebug1002? or should I do a full deploy? [13:39:01] and hurry up [13:39:15] because the whole cluster is spamming logs / erroring at lightning speed! [13:39:45] zeljkof: looks good. go ahead. [13:39:49] though that is only on the canaries for now :-} [13:40:19] hashar: that's why we've canaries :) [13:41:34] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:358958|Remove unneeded ContentTranslationTargetNamespace (T167865)]] (duration: 00m 44s) [13:41:36] it is gone [13:41:39] (the spam) [13:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:43] T167865: [1.30.0-wmf.5] testwiki: Console error on Translate page: "[CX] Invalid publishing namespace configuration. Namespace does not exist: Main" - https://phabricator.wikimedia.org/T167865 [13:42:10] hashar, kart_ : deployed 358958, deploying 358929 [13:42:34] zeljkof: thanks. [13:42:35] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:358929|Remove ContentTranslationTargetNamespace config (T167865)]] (duration: 00m 43s) [13:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:44] gehel: I have deployed kart_'s patches, there are 15 more minutes in the swat, do you want to deploy your change, or move it to the next swat? [13:43:53] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3347975 (10akosiaris) Up to now we have had not been able to reproduce this in any way, only notice it when it happens. We have no clear repro... [13:44:38] zeljkof: next swat does not seem to be overloaded, let's move it there...
[13:45:04] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Web-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3347979 (10Anomie) [13:45:19] zeljkof: please let me know when swat is done. I 'd like to deploy https://gerrit.wikimedia.org/r/#/c/358962/ but I don't want to cause trouble to swat [13:45:37] (03PS1) 10Ema: VCL: add Retry-After header to 429 responses [puppet] - 10https://gerrit.wikimedia.org/r/358965 (https://phabricator.wikimedia.org/T163233) [13:45:39] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1036" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358966 [13:45:49] (03CR) 10Marostegui: [C: 04-2] "Wait for the ALTERs to finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358966 (owner: 10Marostegui) [13:50:03] PROBLEM - Disk space on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [13:50:26] akosiaris: sorry, just saw this [13:50:31] !log eu swat finished [13:50:37] akosiaris: done with swat [13:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:43] PROBLEM - configured eth on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [13:50:55] zeljkof: ok thanks! [13:51:03] PROBLEM - dhclient process on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [13:51:03] PROBLEM - Check systemd state on neon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:51:08] ACKNOWLEDGEMENT - Check systemd state on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:08] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:08] ACKNOWLEDGEMENT - DPKG on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:08] ACKNOWLEDGEMENT - Disk space on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:08] ACKNOWLEDGEMENT - Disk space on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:08] ACKNOWLEDGEMENT - configured eth on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:09] ACKNOWLEDGEMENT - dhclient process on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:09] ACKNOWLEDGEMENT - puppet last run on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:10] ACKNOWLEDGEMENT - salt-minion processes on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris still in setup [13:51:10] ACKNOWLEDGEMENT - Check systemd state on neon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
alexandros kosiaris still in setup [13:52:06] (03PS2) 10Alexandros Kosiaris: Revert "Switch einsteinium and tegmen roles" [puppet] - 10https://gerrit.wikimedia.org/r/358962 (https://phabricator.wikimedia.org/T164206) [13:53:03] PROBLEM - configured eth on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [13:53:21] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Switch einsteinium and tegmen roles" [puppet] - 10https://gerrit.wikimedia.org/r/358962 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [13:53:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Switch einsteinium and tegmen roles" [puppet] - 10https://gerrit.wikimedia.org/r/358962 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [13:53:24] PROBLEM - dhclient process on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [13:53:24] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[kube-apiserver] [13:53:33] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [13:53:53] PROBLEM - salt-minion processes on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [13:54:13] PROBLEM - Check systemd state on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [13:54:33] PROBLEM - Check the NTP synchronisation status of timesyncd on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [13:54:43] PROBLEM - DPKG on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [13:55:35] (03PS1) 10Alexandros Kosiaris: Revert "Switchover icinga.wikimedia.org to tegmen" [dns] - 10https://gerrit.wikimedia.org/r/358967 (https://phabricator.wikimedia.org/T164206) [13:57:53] is the kubernetes thingy related to icinga or has nothing to do? 
[13:59:18] nothing to do with it [13:59:43] I 've just reused neon as an element name for the kubernetes staging master [14:03:56] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:03:56] PROBLEM - configured eth on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:03:57] PROBLEM - salt-minion processes on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:04:06] PROBLEM - dhclient process on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:04:06] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[kube-apiserver] [14:04:06] PROBLEM - configured eth on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:04:07] PROBLEM - Disk space on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:04:16] PROBLEM - DPKG on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:04:17] PROBLEM - Check systemd state on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:04:17] PROBLEM - puppet last run on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:04:26] PROBLEM - dhclient process on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:04:26] PROBLEM - Check systemd state on neon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:04:36] PROBLEM - Disk space on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:04:36] PROBLEM - Check systemd state on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:04:36] PROBLEM - salt-minion processes on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:05:32] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Switchover icinga.wikimedia.org to tegmen" [dns] - 10https://gerrit.wikimedia.org/r/358967 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [14:06:36] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-icinga],Exec[acme-setup-acme-tendril] [14:06:47] Found another thumbnailer issue - https://en.wikisource.org/w/index.php?title=Page:Library_Construction,_Architecture,_Fittings,_and_Furniture.djvu/5&action=edit&redlink=1 [14:06:55] For me shows a blank scan page... [14:07:17] Clicking image shows it clearly has text and the text layer is retrieved correctly [14:07:31] Noted on phabricator [14:07:36] PROBLEM - DPKG on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:08:26] PROBLEM - tcpircbot_service_running on einsteinium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [14:09:43] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3348115 (10akosiaris) DNS switched over, hosts exchanged roles, failover is done. einsteinium is once again the icinga host. Now let's see if...
[14:10:26] RECOVERY - tcpircbot_service_running on einsteinium is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [14:12:36] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:13:33] (03PS2) 10Krinkle: Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 (owner: 10Tim Starling) [14:13:39] (03CR) 10Krinkle: [C: 031] Remove EtcdConfig from beta cluster for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351767 (owner: 10Tim Starling) [14:13:45] akosiaris: seems that icinga-wm was not "killed" on tegmen [14:13:54] we have two [14:14:23] yeah, but solving a different issue right now [14:14:35] still related to that [14:14:43] if I can help let me know ;) [14:17:18] (03PS1) 10Alexandros Kosiaris: Update mentions of irclib => irc in tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/358968 [14:17:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update mentions of irclib => irc in tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/358968 (owner: 10Alexandros Kosiaris) [14:18:56] PROBLEM - Check the NTP synchronisation status of timesyncd on kubestagetcd1001 is CRITICAL: Return code of 255 is out of bounds [14:20:31] (03PS1) 10Faidon Liambotis: Kill ori's version of ack [puppet] - 10https://gerrit.wikimedia.org/r/358969 [14:20:39] bd808: ^ [14:20:55] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Kill ori's version of ack [puppet] - 10https://gerrit.wikimedia.org/r/358969 (owner: 10Faidon Liambotis) [14:21:36] PROBLEM - Check the NTP synchronisation status of timesyncd on kubestagetcd1002 is CRITICAL: Return code of 255 is out of bounds [14:29:44] (03CR) 10Faidon Liambotis: [C: 04-1] "See inline: basically ack we already ship fleet-wide, and the Vim plugins we're not, but can easily by adding vim-scripts to standard_pack" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) 
[14:31:52] cmjohnson1: what rack is analytics1069 in? [14:32:01] d8 [14:32:05] it's in racktables [14:32:20] thanks! didn't check this morning [14:33:51] (03PS1) 10Alexandros Kosiaris: ircecho: add support for the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/358975 [14:41:00] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3348220 (10alanajjar) @MarcoAurelio please see the comment of the user in Meta! [14:44:41] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3348249 (10Cmjohnson) [14:46:18] 10Operations, 10Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#3348255 (10RobH) 05Open>03Resolved we've purchased this system, and setup is via T167160 [14:46:39] 10Operations, 10Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3348261 (10RobH) 05Open>03Resolved We've purchased this system, setup via T167160. [14:47:06] 10Operations, 10hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3348266 (10RobH) 05Open>03Resolved This system has been purchased and is being setup via T167820. 
[14:47:59] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: add support for the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/358975 (owner: 10Alexandros Kosiaris) [14:48:03] (03PS2) 10Alexandros Kosiaris: ircecho: add support for the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/358975 [14:48:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ircecho: add support for the ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/358975 (owner: 10Alexandros Kosiaris) [14:57:56] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:58:44] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3348350 (10Nuria) 05Open>03Resolved [15:00:28] !log restart varnish-backend on cp2014 [15:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:53] > Warning: LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.4/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329 [15:01:56] PROBLEM - Host kubestagetcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:57] seems to be trending in logstash [15:02:09] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=now-1h&to=now [15:02:13] 503 is high, 500 is back to normal [15:03:32] 503s in upload are due to cp2014 misbehaving, they should go down soon [15:05:00] 10Operations, 10Beta-Cluster-Infrastructure, 10DBA, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Better mysql command prompt info for Beta - https://phabricator.wikimedia.org/T157714#3348386 (10hashar) 05Open>03Resolved Solved via https://gerrit.wikimedia.org/r/#/c/336964/ / 35098e1e363e... 
[15:05:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3348392 (10Cmjohnson) [15:05:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3287641 (10Cmjohnson) racked in A6/B7/D1 [15:05:40] (03CR) 10EBernhardson: "in this case thogh antoine is testing some potential future logstash stuff, by logging to relforge and visualizing with a kibana instance " [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [15:06:07] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#963609 (10herron) It looks like this task has been stalled for some time. Today DMARC aggregate reports are being sent to an active dmarcian account which provid... [15:07:05] 10Operations, 10ops-eqiad: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3348413 (10Cmjohnson) [15:07:36] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 0 [15:08:26] RECOVERY - Host kubestagetcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [15:08:28] 10Operations, 10LDAP-Access-Requests, 10Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3348420 (10hashar) The summary is roughly: Only service groups still have `sillyshell` as a login. OpenStackManager is no more adding it and de... 
[15:08:39] !log installing systemd bugfix updates from jessie point update [15:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:21] there was a spike of errors [15:09:34] Oh, I arrive late [15:11:36] RECOVERY - dhclient process on kubestagetcd1001 is OK: PROCS OK: 0 processes with command name dhclient [15:11:36] RECOVERY - Disk space on kubestagetcd1001 is OK: DISK OK [15:11:56] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:11:56] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:12:07] RECOVERY - salt-minion processes on kubestagetcd1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:12:16] RECOVERY - configured eth on kubestagetcd1001 is OK: OK - interfaces up [15:12:16] RECOVERY - DPKG on kubestagetcd1001 is OK: All packages OK [15:14:07] !log restart varnish-backend on cp2017 [15:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:06] RECOVERY - dhclient process on kubestagetcd1002 is OK: PROCS OK: 0 processes with command name dhclient [15:15:16] RECOVERY - Disk space on kubestagetcd1002 is OK: DISK OK [15:15:26] RECOVERY - puppet last run on kubestagetcd1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:15:36] RECOVERY - salt-minion processes on kubestagetcd1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:15:38] RECOVERY - DPKG on kubestagetcd1002 is OK: All packages OK [15:15:46] RECOVERY - configured eth on kubestagetcd1002 is OK: OK - interfaces up [15:17:27] marostegui: Hello [15:17:52] Look what i found -" 21:59 gwicke: restarted pdfrender on scb1003; was spinning on CPU & using 15G of memory (!) 
[15:17:54] 21:58 gwicke: restarted pdfrender on scb1002 and scb1004; was spinning on CPU" [15:18:00] in the log [15:18:10] So something broke [15:18:46] Interesting... [15:18:48] gwicke ^ [15:18:56] RECOVERY - Check the NTP synchronisation status of timesyncd on kubestagetcd1001 is OK: OK: synced at Wed 2017-06-14 15:18:51 UTC. [15:19:06] Also this morning : - "" 07:04 elukey: restart pdfrender on scb200[2,4] (xpra race condition) [15:19:07] 07:03 elukey: restart pdfrender on scb1004 (xpra race condition)" [15:19:53] marostegui: I was also seeing a Djvu that was returning blank scans in Proofread page, but was returning the right images on the Image tag [15:20:08] Noted in a phabricator ticket but may be related [15:20:24] ShakespeareFan00: You've got the ticket number handy? [15:21:08] https://phabricator.wikimedia.org/T167887 [15:21:15] Thank you! [15:21:22] which proved to be a user side error [15:21:36] RECOVERY - Check the NTP synchronisation status of timesyncd on kubestagetcd1002 is OK: OK: synced at Wed 2017-06-14 15:21:28 UTC. [15:22:04] Ah - so it was already identified [15:22:45] Yeah - but I will be filing a related bug/enhancement request [15:22:54] https://phabricator.wikimedia.org/T167877 is the ticket for the PDF handling issue [15:23:24] Thanks [15:25:36] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 0 [15:27:11] hmm did something happen on the apt repos? [15:27:12] marostegui: https://phabricator.wikimedia.org/T167896 is a followup to the 1px width issue... [15:27:14] i get this error [15:27:15] W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/jessie-wikimedia/main/source/Sources Hash Sum mismatch [15:27:15] W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/jessie-wikimedia/main/binary-amd64/Packages Hash Sum mismatch [15:27:36] E: Some index files failed to download. They have been ignored, or old ones used instead.
[15:27:55] ShakespeareFan00: Ah good, thank you [15:27:55] that is usually transient. Can you try again ? :) [15:28:02] apt-get update [15:28:09] yep, i did [15:28:31] happened again after doing apt-get update. [15:28:42] On other systems I've used there's a warning given about "silly" or clearly nonsensical values [15:28:59] I was surprised mediawiki didn't have similar safeguards/user nags.. [15:29:59] paladox: have you tried an apt-get clean too? [15:30:09] yep [15:31:05] the problem started at 1:15pm (bst) [15:31:16] RECOVERY - Check systemd state on kubestagetcd1002 is OK: OK - running: The system is fully operational [15:31:23] i didnt notice until now as i didnt look in the channel where it reports the errors for me until now. [15:31:27] seems that one of the files in /var/lib/apt/lists/ might be corrupted [15:31:44] hmm [15:31:48] it doesn't happen to me on other labs instances for example [15:31:50] it downloaded stuff like this [15:31:50] -rw-r--r-- 1 root root 184000 Jun 6 14:33 mirrors.wikimedia.org_debian_dists_jessie-backports_non-free_i18n_Translation-en [15:33:23] running sudo rm -rf /var/lib/apt/lists/* fixed it :) [15:33:42] you could just delete the broken ones [15:33:52] ;) [15:33:59] Oh, i just did /var/lib/apt/lists/* :) [15:36:44] (03PS1) 10Chad: group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358980 [15:37:00] (03CR) 10Chad: [C: 04-2] "For later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358980 (owner: 10Chad) [15:39:29] (03PS1) 10Alexandros Kosiaris: site.pp: Fix typo for kubernetes stating etcd host [puppet] - 10https://gerrit.wikimedia.org/r/358981 [15:40:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] site.pp: Fix typo for kubernetes stating etcd host [puppet] - 10https://gerrit.wikimedia.org/r/358981 (owner: 10Alexandros Kosiaris) [15:41:07] akosiaris: I see what you did there (tm) [15:41:36] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args
^/usr/bin/python /usr/bin/nova-compute [15:42:36] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [15:43:05] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review: Eliminate SPOFs in the existing eqiad kubernetes infrastructure - https://phabricator.wikimedia.org/T162040#3348579 (10akosiaris) [15:43:33] (03CR) 10Ema: [C: 031] hieradata: a/a for swift [puppet] - 10https://gerrit.wikimedia.org/r/358620 (owner: 10Filippo Giunchedi) [15:43:50] (03CR) 10Ema: [C: 031] hieradata: move swift codfw to passive [puppet] - 10https://gerrit.wikimedia.org/r/358621 (owner: 10Filippo Giunchedi) [15:44:04] (03CR) 10Ema: [C: 031] hieradata: point esams to swift eqiad [puppet] - 10https://gerrit.wikimedia.org/r/358622 (owner: 10Filippo Giunchedi) [15:44:21] (03PS2) 10Filippo Giunchedi: hieradata: a/a for swift [puppet] - 10https://gerrit.wikimedia.org/r/358620 [15:44:54] !log point varnish upload back to swift eqiad [15:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:19] (03PS1) 10Alexandros Kosiaris: Add kubernetes staging etcd records [dns] - 10https://gerrit.wikimedia.org/r/358982 (https://phabricator.wikimedia.org/T162045) [15:46:11] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: a/a for swift [puppet] - 10https://gerrit.wikimedia.org/r/358620 (owner: 10Filippo Giunchedi) [15:46:20] (03CR) 10BryanDavis: ">" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) [15:46:31] akosiaris: merging your change too [15:48:24] (03PS2) 10Filippo Giunchedi: hieradata: move swift codfw to passive [puppet] - 10https://gerrit.wikimedia.org/r/358621 [15:49:33] 10Operations, 10netops: Merge AS14907 with AS43281 - https://phabricator.wikimedia.org/T167840#3348613 (10faidon) >>! In T167840#3346810, @BBlack wrote: > What are the real pros and cons on this? We could even go in the other direction and have a unique ASN per region/continent. 
How does the impact future an... [15:50:04] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: move swift codfw to passive [puppet] - 10https://gerrit.wikimedia.org/r/358621 (owner: 10Filippo Giunchedi) [15:51:08] (03CR) 10Alexandros Kosiaris: [C: 032] Add kubernetes staging etcd records [dns] - 10https://gerrit.wikimedia.org/r/358982 (https://phabricator.wikimedia.org/T162045) (owner: 10Alexandros Kosiaris) [15:51:43] godog: thanks [15:53:08] np! [15:55:15] !log mobrovac@tin Started deploy [restbase/deploy@4c1cdd0]: (no justification provided) [15:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:23] bd808: ack-grep: /usr/bin/ack [15:56:31] bd808: I think there is no mangled name anymore [15:56:50] oh. that's nice :) [15:56:56] well let me verify that on trusty [15:57:06] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[etcd] [15:57:15] yeah, not the case in trusty :( [15:57:30] poor old trusty [15:57:36] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [15:57:47] mobrovac: ^ [15:57:50] so trusty ships ack-grep containing /usr/bin/ack-grep, jessie ships ack-grep containing both /usr/bin/ack and ack-grep, and stretch ships with "ack" and a transitional ack-grep package [15:57:53] related to the deploy? 
[15:57:54] fyi :) [15:57:59] akosiaris: known, yup [15:58:02] ok [15:58:18] and stretch doesn't have an /usr/bin/ack-grep :) [15:58:30] so you need to place the alias under an if, sorry :) [15:59:28] bd808: hey and sorry for all the trouble with what should probably have been a trivial patchset :( [15:59:37] I probably looked at it more closely than I should have :) [16:00:05] !log mobrovac@tin Finished deploy [restbase/deploy@4c1cdd0]: (no justification provided) (duration: 04m 51s) [16:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:01] (03PS2) 10Filippo Giunchedi: hieradata: point esams to swift eqiad [puppet] - 10https://gerrit.wikimedia.org/r/358622 [16:02:06] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:03:13] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: point esams to swift eqiad [puppet] - 10https://gerrit.wikimedia.org/r/358622 (owner: 10Filippo Giunchedi) [16:03:39] !log point varnish upload in esams back to eqiad [16:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:03] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3348651 (10Papaul) Dear Mr Papaul Tshibamba, Hewlett Packard Enterprise Reference Number: 5320469389 STATUS: Customer Self Repair Part has been shipped Part/s shipped: 765867-001 Part description: SPS-DRV... [16:05:12] Anyone got a sec for a little trivial trebuchet cleanup? https://gerrit.wikimedia.org/r/#/c/356496/ is just removing some old crap we don't care about. [16:06:56] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures.
Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],File[/etc/bacula/ssl] [16:07:57] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3348668 (10Aklapper) @alanajjar: Which one and where exactly? Clear links welcome for statements, to avoid misunderstandings. :) [16:08:01] (03CR) 10Alexandros Kosiaris: [C: 032] Drop test/testrepo from trebuchet-deployed repos [puppet] - 10https://gerrit.wikimedia.org/r/356496 (owner: 10Chad) [16:08:06] (03PS2) 10Alexandros Kosiaris: Drop test/testrepo from trebuchet-deployed repos [puppet] - 10https://gerrit.wikimedia.org/r/356496 (owner: 10Chad) [16:08:08] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Drop test/testrepo from trebuchet-deployed repos [puppet] - 10https://gerrit.wikimedia.org/r/356496 (owner: 10Chad) [16:08:10] (03PS1) 10Alexandros Kosiaris: Require installation of bacula-fd before cert exposure [puppet] - 10https://gerrit.wikimedia.org/r/358987 [16:08:31] RainbowSprinkles: merged [16:08:44] (03PS2) 10Alexandros Kosiaris: Require installation of bacula-fd before cert exposure [puppet] - 10https://gerrit.wikimedia.org/r/358987 [16:08:55] I 'll cleanup tin/naos [16:09:11] All of /srv/test/ can go [16:09:19] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Require installation of bacula-fd before cert exposure [puppet] - 10https://gerrit.wikimedia.org/r/358987 (owner: 10Alexandros Kosiaris) [16:09:19] Er, /srv/deployment/test/ [16:10:29] thcipriani: Do you have any idea what /srv/deployment/STALE/ is about? [16:11:06] 10Operations, 10DNS, 10Traffic: en.wiki domain owned by us, but isn't hosted by us?? 
- https://phabricator.wikimedia.org/T167060#3348680 (10ema) [16:12:27] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3348682 (10Nuria) [16:12:37] RainbowSprinkles: yes, /me digs for phab ticket [16:12:42] RainbowSprinkles: done [16:12:53] Thanks! [16:12:55] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:13:12] $trebuchet_repos-- [16:13:13] RainbowSprinkles: https://phabricator.wikimedia.org/T129290#3275279 [16:13:15] \o/ [16:13:15] ACKNOWLEDGEMENT - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused Marko Obrovac figuring out the new cassandra driver bug - The acknowledgement expires at: 2017-06-16 16:12:36. [16:13:24] nice :) [16:13:55] Ah, stuff planning to drop [16:13:58] Makes sense [16:13:58] yup [16:16:35] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3348702 (10Paladox) [16:16:42] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051483 (10Paladox) [16:17:25] PROBLEM - Check systemd state on kubestagetcd1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:17:25] PROBLEM - puppet last run on kubestagetcd1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair] [16:18:25] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 6 [16:18:34] (03PS1) 10Alexandros Kosiaris: base::expose_puppet_certs: Specify ordering for keypair [puppet] - 10https://gerrit.wikimedia.org/r/358989 [16:22:21] (03CR) 10Alexandros Kosiaris: [C: 032] base::expose_puppet_certs: Specify ordering for keypair [puppet] - 10https://gerrit.wikimedia.org/r/358989 (owner: 10Alexandros Kosiaris) [16:24:35] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.079 second response time [16:25:51] (03PS1) 10Ottomata: Setup analytics1069 as a Hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/358990 [16:26:05] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:14] (03PS1) 10Alexandros Kosiaris: base::expose_puppet_certs: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/358991 [16:26:15] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] base::expose_puppet_certs: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/358991 (owner: 10Alexandros Kosiaris) [16:26:45] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:26:55] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:27:05] (03PS2) 10Ottomata: Setup analytics1069 as a Hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/358990 [16:27:15] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:27:25] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:28:15] PROBLEM - puppet last run on kubetcd2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:35] PROBLEM - puppet last run on etcd1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:55] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:55] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:55] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:15] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:18] (03CR) 10Ottomata: [C: 032] Setup analytics1069 as a Hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/358990 (owner: 10Ottomata) [16:29:22] (03PS3) 10Ottomata: Setup analytics1069 as a Hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/358990 [16:29:24] (03CR) 10Ottomata: [V: 032 C: 032] Setup analytics1069 as a Hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/358990 (owner: 10Ottomata) [16:32:23] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T167667#3348738 (10Marostegui) Oh nice - you got it in the end. Thanks for handling it! 
[16:32:25] RECOVERY - Check systemd state on kubestagetcd1002 is OK: OK - running: The system is fully operational [16:32:26] RECOVERY - puppet last run on kubestagetcd1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:37:59] (03PS1) 10Alexandros Kosiaris: k8s staging cluster: interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/358992 [16:39:16] (03PS2) 10Alexandros Kosiaris: k8s staging cluster: interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/358992 [16:39:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s staging cluster: interface::add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/358992 (owner: 10Alexandros Kosiaris) [16:44:13] 10Operations, 10DNS, 10Traffic: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (10Dzahn) Check this for ALL of the other language prefixes too. We once had them all, was a long discussion with the owner of .wiki years ago. Then WMF decided to not use... [16:45:04] 10Operations, 10DNS, 10Traffic: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3348768 (10Dzahn) de.wiki, fr.wiki, it.wiki, etc etc... [16:45:09] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[python3-numpy],Package[python-numpy] [16:45:29] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:45:53] 10Operations, 10DNS, 10Traffic: en.wiki domain owned by us, but isn't hosted by us?? 
- https://phabricator.wikimedia.org/T167060#3348771 (10Dzahn) T145907 and T88873 [16:45:59] RECOVERY - Check systemd state on kubestagetcd1001 is OK: OK - running: The system is fully operational [16:47:19] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [16:48:44] (03PS1) 10Ottomata: Update python3-numpy backports version to install on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/358995 [16:49:08] (03PS2) 10Ottomata: Update python3-numpy backports version to install on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/358995 [16:49:11] (03CR) 10Ottomata: [V: 032 C: 032] Update python3-numpy backports version to install on Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/358995 (owner: 10Ottomata) [16:49:19] RECOVERY - Check systemd state on neon is OK: OK - running: The system is fully operational [16:49:46] (03CR) 10Dzahn: "adding more reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/355869 (https://phabricator.wikimedia.org/T164810) (owner: 10Dzahn) [16:51:09] RECOVERY - puppet last run on analytics1069 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:52:29] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:52:34] (03PS1) 10Alexandros Kosiaris: Specify correct certificate path in hiera for k8s [puppet] - 10https://gerrit.wikimedia.org/r/358996 [16:52:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify correct certificate path in hiera for k8s [puppet] - 10https://gerrit.wikimedia.org/r/358996 (owner: 10Alexandros Kosiaris) [16:53:14] (03PS6) 10Dzahn: Phab: create some task types and corresponding custom fields. 
[puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) (owner: 1020after4) [16:53:19] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.083 second response time [16:54:00] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:54:25] (03CR) 10Dzahn: [C: 032] Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) (owner: 1020after4) [16:54:29] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:54:44] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3348775 (10akosiaris) [16:55:09] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:55:09] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:55:29] RECOVERY - puppet last run on conf2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:55:29] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:55:49] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:55:59] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:56:07] (03PS1) 10Alexandros Kosiaris: Kubernetes staging key path typo fix [puppet] - 10https://gerrit.wikimedia.org/r/358997 [16:56:22] (03CR) 10Dzahn: [C: 032] "well.. more than one "simple" config change has broken things in the past. but i'll assume this has been tested." 
[puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) (owner: 1020after4) [16:56:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Kubernetes staging key path typo fix [puppet] - 10https://gerrit.wikimedia.org/r/358997 (owner: 10Alexandros Kosiaris) [16:56:29] RECOVERY - puppet last run on kubetcd2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:57:09] RECOVERY - puppet last run on etcd1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:57:29] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:57:59] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:59:13] (03PS7) 10Dzahn: Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) (owner: 1020after4) [17:04:03] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3348807 (10Nuria) [17:06:07] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1798302 (10Nuria) >as well as the pageview API, which is currently low on backend capacity. Correction: pageview API has been rebuild since last comment and it can handle a LOT... [17:07:13] (03CR) 10Dzahn: [C: 031] "i think it's ok. what do others think?" 
[puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: 10Paladox) [17:07:40] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (10Nuria) The fact that no requests have been throttled of late in PageviewAPI (see 429 graph below) kind of tells me that PageviewAPI has received too f... [17:10:31] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3348822 (10Dzahn) @akosiaris should we get back to it and unstall? [17:14:29] (03PS9) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [17:15:54] (03CR) 10Daniel Kinzler: [C: 04-1] Make /entity/ redirect internal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [17:25:51] (03PS10) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [17:27:21] (03PS2) 10Krinkle: [WIP] mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) [17:28:07] (03PS3) 10Krinkle: [WIP] mediawiki: Fix error page template issues [puppet] - 10https://gerrit.wikimedia.org/r/358430 (https://phabricator.wikimedia.org/T113114) [17:29:27] (03PS11) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [17:47:55] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3348984 (10MoritzMuehlenhoff) 
a:05MoritzMuehlenhoff>03None [17:47:59] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3348985 (10alanajjar) @Aklapper of course I talked about the comment by Idh0854 on Steward requests/Username changes. [[ https://meta.wikimedia.org/w/index.php?titl... [17:49:18] !log installing mongodb update from jessie point release on tungsten [17:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:36] (03CR) 10Ladsgroup: Make /entity/ redirect internal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [17:55:35] !log restarting hhvm on mw1261-mw1265 to pick up libxslt update [17:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T1800). [18:00:04] gehel and aharoni: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:10] I can SWAT! [18:00:29] Niharika: ! [18:00:42] And it's a pretty interesting patch for you today. [18:00:53] Indirectly related to Compact Language Links. [18:01:06] (03CR) 10Niharika29: [C: 032] Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [18:01:15] :D [18:01:43] Niharika: In case you haven't heard it, Compact Language Links now also pull user's languages from the Babel box. [18:02:06] ebernhardson: gehel: Anyone around to test? [18:02:22] aharoni: I have been keeping an eye on the tickets. It's so exciting. 
[18:02:24] :) [18:02:26] But the Babel extension is not configured in all the wikis yet, so my patch today is supposed to fix it. [18:02:51] Niharika: yup [18:02:55] Ah. [18:03:19] After this, all that's left is making it non-beta on German and English Wikipedias. (Of course, "all that's left" is a major understatement when we're talking about German and English Wikipedias.) [18:03:27] (03CR) 10Daniel Kinzler: [C: 04-1] Make /entity/ redirect internal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [18:03:33] dcausse: are you around? [18:03:52] 10Operations, 10hardware-requests: codfw: (1) labtest puppetmaster - https://phabricator.wikimedia.org/T164515#3349031 (10RobH) 05Open>03Resolved This was purchased and is being setup on T167157 [18:04:39] (03PS3) 10Niharika29: Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [18:05:00] (03CR) 10Niharika29: [V: 032 C: 032] Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [18:06:04] !log installing unzip security updates [18:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:38] (03Merged) 10jenkins-bot: Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [18:06:48] (03CR) 10jenkins-bot: Test elastic2020 does not fall out of cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358625 (owner: 10EBernhardson) [18:08:17] Niharika: I'm here, sorry for the delay [18:09:23] gehel: Your changes should be on mwdebug1002 now. [18:09:31] Niharika: ok, testing... 
[18:10:42] Niharika: it looks like it is working fine [18:11:56] 10Operations, 10ops-codfw: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3349060 (10Papaul) [18:12:17] Niharika: can you push to the rest of the cluster? [18:12:41] gehel: On it. [18:12:47] thanks! [18:13:08] (03PS5) 10Daniel Kinzler: Make /entity/ redirect internal [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [18:13:57] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/358625/ Test elastic2020 does not fall out of cluster (duration: 00m 44s) [18:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:06] (03CR) 10Daniel Kinzler: "PS5 is my attempt at solving this with a redirect=force flag, see I13373c8859be." [puppet] - 10https://gerrit.wikimedia.org/r/357985 (https://phabricator.wikimedia.org/T119536) (owner: 10Ladsgroup) [18:15:24] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3349071 (10RobH) [18:15:32] Niharika: I see traffic flowing to elasticsearch codfw, no obvious error, this looks good [18:15:36] 10Operations, 10hardware-requests: eqiad: (2) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#3349089 (10RobH) 05stalled>03Resolved a:03RobH This was ordered and setup task is T167905. 
[18:16:55] (03PS3) 10Niharika29: Sort wmgBabelMainCategory alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358006 (owner: 10Amire80) [18:17:04] !log niharika29@tin Synchronized tests/cirrusTest.php: https://gerrit.wikimedia.org/r/#/c/358625/ Test elastic2020 does not fall out of cluster (duration: 00m 43s) [18:17:12] (03CR) 10Niharika29: [C: 032] Sort wmgBabelMainCategory alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358006 (owner: 10Amire80) [18:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:43] gehel: All done. [18:18:02] Niharika: thanks for the help! And sorry again to be late, some miscommunication... [18:18:11] No worries. [18:18:11] io load on 2020 is up, but not crashing yet :) [18:18:38] ebernhardson: I'm preparing the patch to come back to eqiad. I'll let you watch over it... [18:18:46] gehel: should just be a revert of existing patch [18:18:53] yep [18:20:18] (03PS1) 10Gehel: Revert "Test elastic2020 does not fall out of cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359007 (https://phabricator.wikimedia.org/T149006) [18:21:06] 10Operations, 10ops-eqiad: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3349106 (10Cmjohnson) [18:21:58] 10Operations, 10ops-eqiad: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10Cmjohnson) Racked A6 and A7 B7 and B8 C3 and C4 D3, D4 and D6 [18:22:02] (03CR) 10jenkins-bot: Sort wmgBabelMainCategory alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358006 (owner: 10Amire80) [18:22:23] (03CR) 10Florianschmidtwelzow: [C: 031] Generate FancyCaptchas in 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736) (owner: 10Reedy) [18:22:31] (03PS2) 10Florianschmidtwelzow: Generate FancyCaptchas in 4 threads [puppet] - 10https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736) 
(owner: 10Reedy) [18:22:34] Niharika: just checking—I'm patiently waiting, right? [18:23:16] aharoni: Indeed. Your patch https://gerrit.wikimedia.org/r/#/c/358006/ is on mwdebug1002. [18:23:22] Anything to test? [18:23:48] Niharika: it's trivial, just sorting lines. The next one is more interesting. [18:24:06] Ideally, both should be deployed together. [18:24:57] !log reimporting data from pc1004 to db1096 [18:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:36] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Sort wmgBabelMainCategory alphabetically https://gerrit.wikimedia.org/r/#/c/358006/ (duration: 00m 44s) [18:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:54] (03PS3) 10Niharika29: Add wmgBabelMainCategory for many languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358007 (owner: 10Amire80) [18:28:56] (03CR) 10Niharika29: [C: 032] Add wmgBabelMainCategory for many languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358007 (owner: 10Amire80) [18:29:04] (03CR) 10jenkins-bot: Add wmgBabelMainCategory for many languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358007 (owner: 10Amire80) [18:29:11] aharoni: The second one is up there too. Anything to test for this? [18:29:29] 10Operations, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10GWicke) [18:29:30] yes, let me see... [18:30:31] Niharika: tested, works! [18:30:37] this is even easier than I thought it would be :) [18:30:42] aharoni: Great! 
[18:30:43] :) [18:31:37] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Web-Backlog, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3349135 (10Jdlrobson) [18:32:03] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:32:10] 10Operations, 10Analytics, 10Traffic, 10Patch-For-Review: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3349142 (10GWicke) [18:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:14] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3349137 (10GWicke) 05Open>03Resolved a:03GWicke @bblack and myself looked into this yesterday after the deployment of the more aggressive global limits, and found that leg... [18:33:35] Niharika: is it fully deployed now? [18:34:38] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/358007/ Add wmgBabelMainCategory for many languages (duration: 00m 43s) [18:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:55] aharoni: It is! [18:35:02] Niharika: thank you! [18:36:12] Niharika: if you're curious, you can try running `mw.config.get( 'wgULSBabelLanguages' )` in the JS console on your favorite wiki. (Of course, you need to have {{#babel|WHATEVER}} on your user page first.) [18:37:12] Caveat: A small number of wikis still won't have it, because they don't have a category for Babel. [18:37:37] They need to set it up... if they didn't have it till now, maybe now they'll finally set it up?.. [18:39:34] aharoni: This is awesome. I can see it being super useful for compact language links. 
[18:39:47] Did we ever come up with an acronym for that? :P [18:39:58] Niharika: the task has four thumbs-up and one heart in the badges section ;) [18:40:08] Niharika: we say "CLL" sometimes :) [18:40:15] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3349165 (10Ottomata) 05stalled>03Resolved [18:40:28] :) I'm glad so many people are using it now! [18:40:50] Making it non-beta on the English Wikipedia will be the true revolution. [18:41:25] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Web-Backlog, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3349181 (10Krinkle) [18:41:33] aharoni: Is it enabled for anons? [18:41:49] Niharika: yes, on all Wikipedias except German and English. [18:41:52] 10Operations, 10ArchCom-RfC, 10Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3349184 (10Krinkle) [18:41:55] German and English—some time soon. [18:42:16] 10Operations, 10Performance-Team, 10Traffic: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#3349189 (10Krinkle) [18:42:23] Nice! 
[18:43:44] (SWAT's all done) [18:45:06] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [18:45:36] 10Operations, 10Performance-Team, 10Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3349198 (10Krinkle) a:03Peter [18:48:11] 10Operations, 10Performance-Team: HTTP responses from app servers sometimes stall for >1s - https://phabricator.wikimedia.org/T164248#3349217 (10Krinkle) a:03Krinkle [18:49:53] !log otto@tin Started deploy [eventlogging/analytics@1ce446d]: (no justification provided) [18:49:58] !log otto@tin Finished deploy [eventlogging/analytics@1ce446d]: (no justification provided) (duration: 00m 04s) [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:59] (03PS1) 10Chad: Jenkins slave: Ensure group exists before trying to make the user [puppet] - 10https://gerrit.wikimedia.org/r/359012 [18:53:46] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2211188 [18:58:26] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [18:59:52] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3349292 (10Dzahn) 05stalled>03Open [19:00:05] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T1900). 
[19:02:06] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:02:26] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15560 bytes in 0.083 second response time [19:04:14] (03CR) 10Chad: [C: 032] group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358980 (owner: 10Chad) [19:05:32] (03Merged) 10jenkins-bot: group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358980 (owner: 10Chad) [19:05:47] (03CR) 10jenkins-bot: group1 to wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358980 (owner: 10Chad) [19:08:26] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.5 [19:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:39] 10Operations, 10HHVM, 10Patch-For-Review, 10Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3028990 (10GWicke) 3.18 might resolve {T97192} as well. This needs to be verified. [19:08:58] 10Operations, 10Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3349354 (10Krinkle) @MoritzMuehlenhoff asked about osmium earlier this week. Osmium is set-up as a MediaWiki app server, but not part of the production wiki cluster in any way. It... [19:16:10] (03Draft1) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 [19:16:12] (03PS2) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [19:16:53] (03CR) 10Paladox: "i've only tested the script. I doint run this puppet class so i can't test the class. 
But the script works :)" [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [19:17:05] !log restart varnish backend on cp1074 [19:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:36] (03CR) 10jerkins-bot: [V: 04-1] Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [19:18:28] (03PS3) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [19:19:57] (03PS3) 10Hashar: systemd: add spec [puppet] - 10https://gerrit.wikimedia.org/r/339176 [19:20:15] (03CR) 10Thcipriani: [C: 031] Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [19:20:23] (03CR) 10Hashar: [C: 031] "Rebased. I had to add ::logrotate to the .fixture.yml list since rsyslog now depends on it." [puppet] - 10https://gerrit.wikimedia.org/r/339176 (owner: 10Hashar) [19:21:22] (03PS7) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [19:22:05] (03CR) 10Hashar: [C: 031] "Rebased. specs pass." 
[puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [19:23:26] !log demon@tin Synchronized php: symlink bump (duration: 00m 43s) [19:23:30] (03PS4) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [19:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:46] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [19:27:14] (03PS5) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [19:27:16] (03PS4) 10Hashar: interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420 [19:27:18] (03PS7) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 [19:27:37] 10Operations, 10Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) a:05Krinkle>03None [19:28:30] (03CR) 10Hashar: [C: 031] "Rebased. Spec still pass. This patch is still applied on the beta cluster puppet master to unbreak puppet there." [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [19:29:19] gehel: doh sorry... 
I completely forgot about the codfw patch [19:30:27] dcausse: no problem, I was almost on time :) [19:30:32] (03PS2) 10Hashar: apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 [19:31:49] (03PS3) 10Hashar: apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 [19:32:35] jouncebot: next [19:32:35] In 0 hour(s) and 27 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T2000) [19:33:07] (03PS1) 10Chad: Remove Dashiki from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359022 [19:33:19] Nothing for ORES today [19:33:49] (03PS1) 10Chad: Remove Linter from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359023 [19:34:06] (03CR) 10Chad: [C: 032] Remove Dashiki from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359022 (owner: 10Chad) [19:34:12] (03CR) 10Chad: [C: 032] Remove Linter from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359023 (owner: 10Chad) [19:34:16] RainbowSprinkles: Are you mostly done with the train? [19:34:18] Oh, haha [19:34:29] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3349480 (10Gilles) [19:34:31] 10Operations, 10MW-1.30-release-notes, 10Performance-Team, 10Thumbor, and 2 others: Limit maximum x-content-dimension size to avoid hitting nginx limits - https://phabricator.wikimedia.org/T167034#3349478 (10Gilles) 05Open>03declined No header, no problem! :) [19:34:36] Slightly relatedly to those... 
https://gerrit.wikimedia.org/r/#/c/358639/ [19:35:12] (03PS1) 10Chad: Remove LoginNotify from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359024 [19:35:19] (03Merged) 10jenkins-bot: Remove Dashiki from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359022 (owner: 10Chad) [19:35:27] (03Merged) 10jenkins-bot: Remove Linter from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359023 (owner: 10Chad) [19:35:31] 10Operations, 10Performance-Team, 10Thumbor: Write graceful rolling restart script for Thumbor - https://phabricator.wikimedia.org/T162875#3349490 (10Gilles) a:03Gilles [19:35:33] (03CR) 10Chad: [C: 032] Remove LoginNotify from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359024 (owner: 10Chad) [19:35:58] (03CR) 10jenkins-bot: Remove Dashiki from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359022 (owner: 10Chad) [19:36:02] (03PS2) 10Hashar: contint: PHP packages cleanup [puppet] - 10https://gerrit.wikimedia.org/r/346165 [19:36:04] (03CR) 10Chad: [C: 032] Remove duplicate config from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358639 (owner: 10Reedy) [19:36:31] (03Merged) 10jenkins-bot: Remove LoginNotify from extension-list-labs, redundant to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359024 (owner: 10Chad) [19:37:04] (03Merged) 10jenkins-bot: Remove duplicate config from CommonSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358639 (owner: 10Reedy) [19:37:17] (03CR) 10Hashar: "Rebased on top of the addition of php5-gmp / php7.0-gmp." 
[puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [19:37:41] (03Draft1) 10Paladox: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 [19:37:44] !log demon@tin Synchronized wmf-config/extension-list-labs: No-op (duration: 00m 44s) [19:37:44] (03PS2) 10Paladox: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 [19:37:46] Ta [19:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:00] (03CR) 10Chad: [V: 032 C: 032] Configuring git-fat to work with Archiva [software/gerrit] - 10https://gerrit.wikimedia.org/r/356482 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [19:38:35] (03CR) 10Chad: [V: 032 C: 032] Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [19:38:45] (03PS4) 10Chad: Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) [19:38:48] (03CR) 10Chad: [V: 032 C: 032] Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [19:39:13] (03CR) 10Hashar: "I reopened this patch since apparently that would have prevented the outage of PdfHandler described in T164145." 
[puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [19:42:22] !log running mwscript initSiteStats.php srnwiki --update [19:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:56] (03CR) 10Hashar: build: allow usage of a different puppet version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338633 (owner: 10Hashar) [19:43:03] (03PS4) 10Hashar: build: allow usage of a different puppet version [puppet] - 10https://gerrit.wikimedia.org/r/338633 [19:44:50] (03CR) 10Gergő Tisza: "I don't see how it would have. The outage was cause by firejail outputting logging stuff like "Reading profile /etc/firejail/mediawiki-con" [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [19:47:27] !log demon@tin Synchronized wmf-config/CommonSettings-labs.php: no-op (duration: 00m 44s) [19:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:48] (03CR) 10Hashar: "One can check T146914 for all the details. In short that causes a lot of unnecessary salt calls each time puppet run and might well trigge" [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) (owner: 10Hashar) [19:47:55] (03PS3) 10Hashar: salt: fix grain-ensure comparison [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) [19:47:57] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3349561 (10Cmjohnson) [19:48:54] 10Operations, 10Analytics, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3349563 (10Nuria) >which matches metrics end points explicitly limited at 100/s per client IP. mmm... looking at pageview API dashboard i can see some of lawful traffic (spike... 
[19:49:57] (03CR) 10Paladox: [C: 04-1] "Doesn't work." [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [19:57:14] (03CR) 10Hashar: [C: 04-1] "Sorry I need to rebase this one carefully." [puppet] - 10https://gerrit.wikimedia.org/r/342635 (https://phabricator.wikimedia.org/T134381) (owner: 10Hashar) [19:59:24] (03PS3) 10Paladox: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 [19:59:41] (03CR) 10Paladox: "Looks like it fixes it now :)." [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T2000). [20:00:12] no parsoid deploy today [20:00:31] (03PS4) 10Reedy: Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) [20:00:44] (03CR) 10Reedy: [C: 032] Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [20:02:33] (03Merged) 10jenkins-bot: Promote CollaborationKit to the big leagues; deploy on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343697 (https://phabricator.wikimedia.org/T138326) (owner: 10Reedy) [20:03:49] Still nothing for ORES today [20:03:57] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Add CollaborationKit to testwiki (duration: 00m 44s) [20:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:14] !log reedy@tin Synchronized wmf-config/CommonSettings.php: CollaborationKit loader code (duration: 00m 43s) [20:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:41] (03PS1) 10Chad: 
Install jenkins on releases.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/359029 [20:06:08] !log reedy@tin Synchronized wmf-config/CommonSettings-labs.php: noop (duration: 00m 43s) [20:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:45] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349597 (10GWicke) [20:18:01] 10Operations: Impending load test - https://phabricator.wikimedia.org/T167920#3349637 (10Haiku-narrative) [20:20:58] 10Operations, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349656 (10Reedy) [20:22:29] 10Operations, 10hardware-requests: codfw: (2) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3349661 (10chasemp) [20:30:11] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349706 (10chasemp) [20:32:03] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349637 (10Reedy) Can you advise what API queries you're actually making? And any suggestion of magnitude? 
[20:34:36] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [20:38:01] (03PS1) 10Reedy: [WIP] Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) [20:40:00] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [20:41:06] (03PS2) 10Reedy: [WIP] Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) [20:44:48] (03PS1) 10Reedy: Remove $stdlogo comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359037 [20:48:15] (03PS1) 10Andrew Bogott: labspuppetbackend: add api methods to query by role [puppet] - 10https://gerrit.wikimedia.org/r/359041 (https://phabricator.wikimedia.org/T151522) [20:50:19] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349772 (10Haiku-narrative) This will be the general format of the vast majority of calls: https://en.wikipedia.org/w/api.php?action=query&format=json&redirects=&continue=&prop=extracts%7... 
[20:52:39] RoanKattouw: isReservedDataAttribute is removing the class, it should be called before adding classes to the other attribs [20:52:55] I can write a patch if you are not working on one already [20:52:57] I just discovered that too [20:53:01] It's filtering backwards [20:53:09] Filtering *in* reserved attrs instead of filtering them *out* [20:53:15] Please do [20:53:40] OldChangesList adds classes after filtering, but the filtering is backwards there too [20:54:10] Maybe we need Sanitizer::isSafeDataAttribute() that returns !isReservedDataAttribute [20:54:27] > $attrs = [ 'class' => ['foo', 'bar', 'baz'], 'data-ooui' => 'foo', 'data-revid' => 123 ]; [20:54:27] > var_dump(wfArrayFilterByKey($attrs, [ Sanitizer::class, 'isReservedDataAttribute' ] ) ); [20:54:27] array(1) { [20:54:27] 'data-ooui' => [20:54:27] string(3) "foo" [20:54:27] } [20:57:12] (03CR) 10Krinkle: "Does the feature still exist? If so, might make sense to remove that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359037 (owner: 10Reedy) [20:57:47] (03CR) 10Krinkle: "Can we have a 2x logo from the start?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [20:58:40] (03PS1) 10Jdlrobson: Remove dead config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) [20:58:44] (03PS3) 10Reedy: [WIP] Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) [20:59:11] (03CR) 10Reedy: "We could, but how do we make them? Just request a 270px image from our thumbnailer and use that?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [20:59:40] (03CR) 10Krinkle: "Yeah. That's how you got the PNG, right?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [21:00:50] (03PS1) 10Jdlrobson: Remove Cards from the cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) [21:02:07] (03CR) 10Chad: [C: 032] Remove Cards from the cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [21:02:19] (03CR) 10Reedy: "Indeed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [21:02:33] (03CR) 10Chad: "Nvm, tomorrow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359045 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [21:03:09] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351286 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [21:03:18] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Create pp_stage0.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351286 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [21:03:38] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [21:03:47] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [21:04:58] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/359047/ [21:05:52] (03PS4) 10Reedy: [WIP] Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) [21:06:12] (03CR) 10Chad: "I think it's absolutely absurd that we have to have these giant arrays all saying the same thing. $stdLogo was so much simpler." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/359037 (owner: 10Reedy) [21:12:19] (03PS5) 10Reedy: Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) [21:12:24] (03CR) 10Reedy: [C: 032] Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [21:14:03] (03Merged) 10jenkins-bot: Add atjwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359035 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [21:15:12] !log reedy@terbium Started scap: (no justification provided) [21:15:13] !log reedy@terbium scap aborted: (no justification provided) (duration: 00m 01s) [21:15:19] stupid thing [21:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:31] [aaa36b49d1b152f4c3d078a6] [no req] RuntimeException from line 3158 of /srv/mediawiki/php-1.30.0-wmf.4/includes/libs/rdbms/database/Database.php: Could not open "/srv/mediawiki/php-1.30.0-wmf.4/extensions/AccountAudit/accountaudit.sql". [21:16:32] Brilliant [21:17:58] (03CR) 10Krinkle: "Once we have it for all wikis, it should be fairly simple to drop the project-specific ones and just default to /dbname.png." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/359037 (owner: 10Reedy) [21:18:29] (03PS6) 10Bearloga: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) [21:19:08] (03PS2) 10Jdlrobson: Remove dead config variable MinervaPrintStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359043 (https://phabricator.wikimedia.org/T166408) [21:19:47] (03CR) 10Krinkle: [C: 031] db-readonly: Change the read only message for something generic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) (owner: 10Jcrespo) [21:20:19] 10Operations, 10Maps (Maps-data): Epic: backup vector tiles - https://phabricator.wikimedia.org/T159770#3349893 (10debt) Moving off the sprint board - the Discovery team won't be able to do this work at this time. [21:20:43] (03CR) 10Krinkle: [C: 031] db-readonly: Change the read only message for something generic (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356584 (https://phabricator.wikimedia.org/T166345) (owner: 10Jcrespo) [21:22:04] 10Operations, 10Discovery, 10Maps, 10Interactive-Sprint: import_waterlines is broken - https://phabricator.wikimedia.org/T159771#3349922 (10debt) p:05Triage>03Normal We should probably take a look at this again to see if it needs to be done once T153282 is done. Leaving in backlog for now. 
[21:22:37] !log reedy@tin Synchronized php-1.30.0-wmf.4/extensions/WikimediaMaintenance/addWiki.php: Remove accountaudit (duration: 00m 44s) [21:22:39] (03PS4) 10Paladox: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 [21:22:46] PROBLEM - puppetmaster https on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 8140: Connection refused [21:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:39] (03CR) 10Paladox: "See https://github.com/jaraco/irc/blob/6dc684d47519c41716036f3674ce29de838244d6/irc/client.py#L818" [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [21:25:43] !log reedy@tin Synchronized dblists/: add atjwiki T167714 (duration: 00m 42s) [21:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:54] T167714: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714 [21:26:18] !log reedy@tin rebuilt wikiversions.php and synchronized wikiversions files: Add atjwiki T167714 [21:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:34] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: atjwiki T167714 (duration: 00m 43s) [21:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:47] (03PS2) 10Andrew Bogott: labspuppetbackend: add api methods to query by role [puppet] - 10https://gerrit.wikimedia.org/r/359041 (https://phabricator.wikimedia.org/T151522) [21:29:00] !log reedy@tin Synchronized static/images/project-logos/: atjwiki T167714 (duration: 00m 43s) [21:29:06] 10Operations, 10Discovery, 10Maps, 10Traffic: Make maps active / active - https://phabricator.wikimedia.org/T162362#3349967 (10debt) Moving off the sprint board - the Discovery team won't be able to do this work at this time. 
[21:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:42] tgr: Sorry, got distracted with other things, looking at your patch now [21:29:55] !log reedy@tin Synchronized langlist: Add atj T167714 (duration: 00m 43s) [21:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:31] 10Operations, 10Maps (Maps-data): Epic: backup vector tiles - https://phabricator.wikimedia.org/T159770#3349976 (10debt) 05Open>03declined And, looks like we can decline this one - based on this [[ https://phabricator.wikimedia.org/T159770#3078098 | comment ]]. [21:30:44] RoanKattouw: I forgot to update the tests apparently [21:30:58] Oh I see, you renamed the function and updated its use everywhere [21:31:07] I'm in the IRC RfC discussion now, can fix up in 30 mins if that's OK [21:31:14] For a moment I doubted whether it was used wrong in every single place, but then I remembered you only introduced it last week anyway [21:31:18] I'll fix now [21:33:38] (03PS1) 10Reedy: Update interwiki.php, at atkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359054 (https://phabricator.wikimedia.org/T167714) [21:33:47] (03PS2) 10Reedy: Update interwiki.php, at atkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359054 (https://phabricator.wikimedia.org/T167714) [21:33:49] (03CR) 10Reedy: [C: 032] Update interwiki.php, at atkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359054 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [21:34:50] 10Operations, 10ops-codfw: Rack/setup codfw spare systems - https://phabricator.wikimedia.org/T167705#3350026 (10RobH) [21:35:01] (03Merged) 10jenkins-bot: Update interwiki.php, at atkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359054 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [21:35:10] I guess https://github.com/wikimedia/parsoid/blob/master/tools/fetch-sitematrix.js needs to be run via node or something? 
:/ [21:35:22] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3350029 (10faidon) What kind of User-Agent will you be using? Please have a look at the [[ https://www.mediawiki.org/wiki/API:Etiquette | API Etiquette ]] and especially the User-Agent sec... [21:36:38] !log reedy@tin Synchronized wmf-config/interwiki.php: Update interwiki map for atjwiki T167714 (duration: 00m 44s) [21:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:49] T167714: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714 [21:45:46] RECOVERY - puppetmaster https on labtestcontrol2001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.184 second response time [21:53:12] (03PS7) 10Bearloga: Add Shiny Server module and Discovery Dashboards role/profile [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) [21:54:14] (03PS3) 10Andrew Bogott: labspuppetbackend: add api methods to query by role [puppet] - 10https://gerrit.wikimedia.org/r/359041 (https://phabricator.wikimedia.org/T151522) [21:54:16] (03PS1) 10Andrew Bogott: labspuppetbackend: Add an alternative read-only port for queries [puppet] - 10https://gerrit.wikimedia.org/r/359057 [21:55:20] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: Add an alternative read-only port for queries [puppet] - 10https://gerrit.wikimedia.org/r/359057 (owner: 10Andrew Bogott) [22:01:06] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:19:00] tgr: Are you fixing the broken unit tests for https://gerrit.wikimedia.org/r/#/c/359047 ? 
We should probably get some sort of fix out for the SWAT at 4pm Pacific [22:19:29] I'll have to rewrite the whole patch, I messed up [22:19:44] I had the right idea with https://phabricator.wikimedia.org/T167922#3349773 [22:20:24] isReservedDataAttribute was there to make sure hooks can only set attributes which cannot be imitated in wikitext [22:20:37] Given that the data attributes aren't being added correctly anyway, should I just revert the main change in wmf5? [22:21:10] I won't revert it in master because you can probably fix it well before the wmf6 cut [22:21:33] sure, let's do that [22:22:44] OK, I'll work on deploying a revert [22:23:23] In the SWAT that is [22:25:07] (03PS1) 10Andrew Bogott: labspuppetbackend: Set port 8100 to read-only and 8101 read/write [puppet] - 10https://gerrit.wikimedia.org/r/359064 [22:27:14] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: Set port 8100 to read-only and 8101 read/write [puppet] - 10https://gerrit.wikimedia.org/r/359064 (owner: 10Andrew Bogott) [22:28:13] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3341896 (10Reedy) A quick look suggests you could do with translating the namespa... 
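[editor's note] The "filtering backwards" problem discussed at 20:52-20:54 can be sketched in a few lines. This is a rough Python analogue of the PHP shell output pasted at 20:54:27, not MediaWiki code: `is_reserved_data_attribute`, `filter_by_key` and the `RESERVED_PREFIXES` list are hypothetical stand-ins, and the prefix list is an assumption chosen only so that the output matches what the log shows. The inverted predicate is the idea floated at 20:54:10; per the 22:19-22:20 messages the eventual fix may well look different.

```python
# Assumed prefixes that count as "reserved"; hypothetical, for illustration only.
RESERVED_PREFIXES = ("data-ooui", "data-mw", "data-parsoid")


def is_reserved_data_attribute(name):
    """Stand-in for Sanitizer::isReservedDataAttribute (assumed behaviour)."""
    return name.lower().startswith(RESERVED_PREFIXES)


def filter_by_key(attrs, predicate):
    """Stand-in for wfArrayFilterByKey: keep entries whose key passes predicate."""
    return {key: value for key, value in attrs.items() if predicate(key)}


attrs = {"class": ["foo", "bar", "baz"], "data-ooui": "foo", "data-revid": 123}

# Backwards: using the "is reserved" check directly as the keep-predicate
# filters the reserved attributes *in* and drops everything else (including class).
backwards = filter_by_key(attrs, is_reserved_data_attribute)

# Inverted predicate, as suggested at 20:54:10: filter reserved attributes *out*.
inverted = filter_by_key(attrs, lambda key: not is_reserved_data_attribute(key))

print(backwards)
print(inverted)
```

With the sample array from 20:54:27, the backwards call keeps only `data-ooui`, which is the same shape as the `var_dump` output pasted in the channel.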
[22:29:52] (03PS1) 10Reedy: Add atjwiki meta namespace talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359066 (https://phabricator.wikimedia.org/T167714) [22:30:06] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:30:43] (03CR) 10Reedy: [C: 032] Add atjwiki meta namespace talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359066 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [22:31:57] (03Merged) 10jenkins-bot: Add atjwiki meta namespace talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359066 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [22:33:15] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: meta namespace talk for atjwiki (duration: 00m 44s) [22:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:36] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350154 (10Dzahn) [22:38:35] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350157 (10Reedy) The wiki is now for all intents and purposes, created. There's... 
[22:41:40] (03PS5) 10Paladox: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 [22:43:57] !log reedy@tin Synchronized php-1.30.0-wmf.5/extensions/WikimediaMaintenance/addWiki.php: Remove accountaudit (duration: 00m 44s) [22:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:16] (03PS1) 10Catrope: Enable $wgStructuredChangeEnableExperimentalViews in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359071 (https://phabricator.wikimedia.org/T164130) [22:45:35] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350162 (10Benoit_Rochon) Thank you Reedy, Dzahn and Marostegui. It's funny be... [22:45:53] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Impending load test - https://phabricator.wikimedia.org/T167920#3349637 (10GWicke) You could consider using https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title instead, which is a fully cached version of the page summary informa... [22:46:54] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350167 (10Reedy) >>! In T167714#3350162, @Benoit_Rochon wrote: > About f/u on Wi... [22:47:34] (03PS2) 10Andrew Bogott: labs dnsrecursor: add atjwiki [puppet] - 10https://gerrit.wikimedia.org/r/358412 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [22:49:08] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350174 (10Dzahn) I created the Wikidata item for it. 
Feel free to add more state... [22:51:20] (03CR) 10Andrew Bogott: [C: 032] labs dnsrecursor: add atjwiki [puppet] - 10https://gerrit.wikimedia.org/r/358412 (https://phabricator.wikimedia.org/T167714) (owner: 10Reedy) [22:57:18] (03PS1) 10Dzahn: rename mwreleases1001 to releases1001 [dns] - 10https://gerrit.wikimedia.org/r/359073 [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170614T2300). [23:00:04] ebernhardson and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:27] I'll SWAT [23:03:02] (03CR) 10Catrope: [C: 032] Enable $wgStructuredChangeEnableExperimentalViews in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359071 (https://phabricator.wikimedia.org/T164130) (owner: 10Catrope) [23:05:05] (03Merged) 10jenkins-bot: Enable $wgStructuredChangeEnableExperimentalViews in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359071 (https://phabricator.wikimedia.org/T164130) (owner: 10Catrope) [23:06:31] ebernhardson: You here for your SWAT change? 
[23:06:37] (switch Cirrus search traffic back to eqiad) [23:06:51] (03PS3) 10Jforrester: Cleanup ORES config: Drop wgOresExtensionStatus (default) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354732 [23:06:53] (03PS1) 10Jforrester: Cleanup ORES config: Alphasort wmgUseORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359075 [23:07:41] (03PS4) 10Catrope: Cleanup ORES config: Drop wgOresExtensionStatus (default) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354732 (owner: 10Jforrester) [23:07:46] (03CR) 10Catrope: [C: 032] Cleanup ORES config: Drop wgOresExtensionStatus (default) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354732 (owner: 10Jforrester) [23:07:52] (03CR) 10Mobrovac: [C: 031] "It makes sense to me to have this in Varnish since most of the logic related to RB is here, Apache only has the URL rewrite part." [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) (owner: 10Ppchelko) [23:07:54] (03PS2) 10Catrope: Cleanup ORES config: Alphasort wmgUseORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359075 (owner: 10Jforrester) [23:07:56] (03PS6) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [23:08:02] (03PS7) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [23:08:04] (03CR) 10Catrope: [C: 032] Cleanup ORES config: Alphasort wmgUseORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359075 (owner: 10Jforrester) [23:09:26] (03Merged) 10jenkins-bot: Cleanup ORES config: Drop wgOresExtensionStatus (default) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354732 (owner: 10Jforrester) [23:10:11] (03Merged) 10jenkins-bot: Cleanup ORES config: Alphasort wmgUseORES [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359075 (owner: 10Jforrester) [23:13:54] (03PS2) 10Dzahn: rename 
mwreleases1001 to releases1001 [dns] - 10https://gerrit.wikimedia.org/r/359073 (https://phabricator.wikimedia.org/T164030) [23:14:00] (03PS3) 10Dzahn: rename mwreleases1001 to releases1001 [dns] - 10https://gerrit.wikimedia.org/r/359073 (https://phabricator.wikimedia.org/T164030) [23:15:43] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3350238 (10Dzahn) We talked about this on IRC and came to the conclusion that we want to use this new VM for not just mediawiki releases bu... [23:17:22] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350239 (10Benoit_Rochon) Thank you Dzahn. I'm trying to link atj home page to... [23:20:13] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350242 (10Reedy) >>! In T167714#3350239, @Benoit_Rochon wrote: > Thank you Dzahn... [23:22:38] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350259 (10Dzahn) >>! In T167714#3350239, @Benoit_Rochon wrote: > I'm trying to l... 
[23:23:37] !log catrope@tin Synchronized wmf-config/: ORES config cleanups (duration: 00m 46s) [23:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:32] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350261 (10Dzahn) You could upload the logo image to commons. And then link it to... [23:25:24] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3350263 (10Dzahn) Sorry if i added confusion but my creation of the wikidata item... [23:25:54] RoanKattouw: doh i've been distracted. here now [23:26:05] RoanKattouw: i can deploy it if you're done [23:26:10] I'm still going so no worries [23:26:23] (03PS2) 10Catrope: Revert "Test elastic2020 does not fall out of cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359007 (https://phabricator.wikimedia.org/T149006) (owner: 10Gehel) [23:26:37] (03CR) 10Catrope: [C: 032] Revert "Test elastic2020 does not fall out of cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359007 (https://phabricator.wikimedia.org/T149006) (owner: 10Gehel) [23:27:35] (03Merged) 10jenkins-bot: Revert "Test elastic2020 does not fall out of cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359007 (https://phabricator.wikimedia.org/T149006) (owner: 10Gehel) [23:29:03] ebernhardson: On mwdebug1002 now; is this testable there? 
[23:29:08] If not that's OK [23:29:12] RoanKattouw: not really, it should be fine though [23:30:19] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Send search traffic back to eqiad T149006 (duration: 00m 44s) [23:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:30] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [23:30:42] 23:30:15 Synchronized wmf-config/InitialiseSettings.php: Send search traffic back to eqiad T149006 (duration: 00m 44s) [23:30:42] * RoanKattouw waits for logmsgbot ... [23:31:11] can see traffic starting to switch already [23:33:11] (03PS1) 10Dzahn: rename mwreleases1001 to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/359077 (https://phabricator.wikimedia.org/T164030) [23:33:21] !log catrope@tin Synchronized php-1.30.0-wmf.5/includes/: Unbreak watchlist highlighting T167922 (duration: 01m 30s) [23:33:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3350284 (10EBernhardson) Tested a switchover and about 5 hours of traffic, elastic2020 seemed happy enough and acted like the rest of the serve... 
[23:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:31] T167922: [Regression wmf.5] On enhanced watchlists, boldness styling shows regardless of diffs being seen, and a gadget is broken - https://phabricator.wikimedia.org/T167922 [23:40:39] (03CR) 10Mobrovac: [C: 04-1] (in progress) Update recommendation-api module and role (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/358026 (https://phabricator.wikimedia.org/T167113) (owner: 10Nschaaf) [23:41:15] !log mwreleases1001 - scheduled downtime, shutdown, kill VM, re-install as releases1001 (T164030) [23:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:25] T164030: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030 [23:46:00] (03CR) 10Dzahn: [C: 032] rename mwreleases1001 to releases1001 [dns] - 10https://gerrit.wikimedia.org/r/359073 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [23:47:48] (03CR) 10Dzahn: [C: 032] rename mwreleases1001 to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/359077 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [23:49:21] !log ganeti: removed instance mwreleases1001, created new instance releases1001 with same parameters (2 VCPUS,4G memory, 1 x 128G disk) (T164030) [23:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:32] T164030: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030 [23:51:35] (03PS6) 10Dzahn: Ircecho: Fix bot not to use carriage returns [puppet] - 10https://gerrit.wikimedia.org/r/359025 (owner: 10Paladox) [23:55:39] !log mwreleases: revoke puppet cert, delete salt key, remove from icinga. releases1001 still syncing disks for a while (50m), being created... 
T164030 [23:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:49] T164030: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030 [23:58:23] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3350322 (10Dzahn)