[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T0000). Please do the needful. [00:00:04] Smalyshev: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:05:39] SMalyshev, I can do it [00:06:05] yt? [00:08:32] ebernhardson, can you be SMalyshev's stand-in for this deployment? [00:11:46] can we also SWAT https://gerrit.wikimedia.org/r/#/c/335561/ ? [00:11:56] I must have misplaced it somewhere in that huge table [00:12:40] makes sense. can you put it in the right place while I'm fixing this? [00:12:50] s/fixing/deploying/ [00:13:38] (03CR) 10MaxSem: [C: 032] Do not throw away $wgRateLimitsExcludedIPs defaults when there is a wiki-specific setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335561 (https://phabricator.wikimedia.org/T87841) (owner: 10Gergő Tisza) [00:16:49] (03Merged) 10jenkins-bot: Do not throw away $wgRateLimitsExcludedIPs defaults when there is a wiki-specific setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335561 (https://phabricator.wikimedia.org/T87841) (owner: 10Gergő Tisza) [00:17:37] (03CR) 10jenkins-bot: Do not throw away $wgRateLimitsExcludedIPs defaults when there is a wiki-specific setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335561 (https://phabricator.wikimedia.org/T87841) (owner: 10Gergő Tisza) [00:20:59] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/335561/ (duration: 01m 00s) [00:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:08] tgr, ^ [00:21:31] SMalyshev or ebernhardson - ping me if you still want that deployed [00:23:06] thanks MaxSem! 
[00:23:07] (03PS6) 10Dzahn: aptrepo: add cron to rsync APT data automatically [puppet] - 10https://gerrit.wikimedia.org/r/334241 (https://phabricator.wikimedia.org/T84380) [00:23:56] MaxSem: ping [00:24:03] sorry, had to go for a while [00:24:08] pong [00:24:10] ready? [00:24:15] yes [00:24:26] (03PS2) 10MaxSem: Add most frequently used aliases for filetype: search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335265 (https://phabricator.wikimedia.org/T156413) (owner: 10Smalyshev) [00:24:36] (03CR) 10MaxSem: [C: 032] Add most frequently used aliases for filetype: search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335265 (https://phabricator.wikimedia.org/T156413) (owner: 10Smalyshev) [00:26:12] (03Merged) 10jenkins-bot: Add most frequently used aliases for filetype: search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335265 (https://phabricator.wikimedia.org/T156413) (owner: 10Smalyshev) [00:27:15] SMalyshev, pulled on mwdebug1002, please test [00:28:02] checking [00:28:07] (03CR) 10jenkins-bot: Add most frequently used aliases for filetype: search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335265 (https://phabricator.wikimedia.org/T156413) (owner: 10Smalyshev) [00:28:55] MaxSem: seems to be working fine [00:29:35] 06Operations, 06Labs, 07Tracking: Cleanup tools nfs share on labstore1004/5 - https://phabricator.wikimedia.org/T156982#2991854 (10madhuvishy) [00:30:15] !log maxsem@tin Synchronized wmf-config/CirrusSearch-common.php: https://gerrit.wikimedia.org/r/#/c/335265/2 (duration: 00m 40s) [00:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:51] SMalyshev, please test ^ [00:31:59] MaxSem: all is fine, thanks! 
[00:32:10] weee [00:33:02] 06Operations, 06Labs, 07Tracking: Cleanup tools nfs share on labstore1004/5 - https://phabricator.wikimedia.org/T156982#2991870 (10madhuvishy) Finding all log/err/out files >1G by doing `find /srv/tools/shared/tools/ -type f -size +1G > gt1g` `cat gt1g | grep -e \.log -e \.error -e \.err -e debug\.txt -e \.o... [00:33:28] (03PS7) 10Dzahn: aptrepo: add cron to rsync APT data automatically [puppet] - 10https://gerrit.wikimedia.org/r/334241 (https://phabricator.wikimedia.org/T84380) [00:34:56] (03CR) 10Dzahn: [C: 032] aptrepo: add cron to rsync APT data automatically [puppet] - 10https://gerrit.wikimedia.org/r/334241 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [00:45:26] (03CR) 10Tim Landscheidt: [C: 04-1] "webservicemonitor uses "/usr/bin/sudo -i -u $tool /usr/bin/webservice restart" to start a webservice that is down. I think this change wo" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/335569 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [00:49:25] (03PS1) 10Dzahn: installserver: fix reverse IPv6 lookup for install1002/2002 [dns] - 10https://gerrit.wikimedia.org/r/335577 [00:51:07] (03CR) 10Dzahn: [C: 032] "3.5.0.0.3.5.1.0.0.8.0.0.8.0.2.0.2.0.0.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer install2002.wikimedia.org." [dns] - 10https://gerrit.wikimedia.org/r/335577 (owner: 10Dzahn) [00:51:17] (03PS2) 10Dzahn: installserver: fix reverse IPv6 lookup for install1002/2002 [dns] - 10https://gerrit.wikimedia.org/r/335577 [00:52:06] so, I'm still working on the conclusions and actionables, but here's a start: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170201-Phabricator [00:53:07] but I've been working for ~7 hours without a break so I'm gonna step away for a minute... I'll finish up the incident report tonight though. 
[00:53:18] greg-g: jynus ^ [00:57:13] (03PS1) 10Dereckson: Add images.metmuseum.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335578 (https://phabricator.wikimedia.org/T156855) [00:58:09] MaxSem: can you deploy that too if you're still swatting? ^ [00:58:28] yeah [00:59:08] thanks, adding to the table [00:59:20] (03CR) 10MaxSem: [C: 032] Add images.metmuseum.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335578 (https://phabricator.wikimedia.org/T156855) (owner: 10Dereckson) [01:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T0100). [01:00:19] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2991922 (10mmodell) incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170201-Phabricator [01:01:01] (03Merged) 10jenkins-bot: Add images.metmuseum.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335578 (https://phabricator.wikimedia.org/T156855) (owner: 10Dereckson) [01:01:09] (03CR) 10jenkins-bot: Add images.metmuseum.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335578 (https://phabricator.wikimedia.org/T156855) (owner: 10Dereckson) [01:01:46] Dereckson, pulled on mwdebug1002 [01:02:44] Testing if mwdebug1002 is okay to give me the form [01:02:51] ah here we are [01:03:38] MaxSem: works [01:04:13] 06Operations, 10ops-ulsfo: cp4008 and cp4012 running on single PSU - https://phabricator.wikimedia.org/T151275#2991932 (10RobH) [01:04:32] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/335578/ (duration: 00m 42s) [01:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:42] Dereckson, ^ [01:06:01] Thanks. 
[01:11:11] (03CR) 10BryanDavis: "> webservicemonitor uses "/usr/bin/sudo -i -u $tool /usr/bin/webservice" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/335569 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [01:11:25] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5311/" [puppet] - 10https://gerrit.wikimedia.org/r/334467 (owner: 10Dzahn) [01:26:29] (03PS3) 10Dzahn: aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - 10https://gerrit.wikimedia.org/r/334467 [01:27:31] (03CR) 10jenkins-bot: [V: 04-1] aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - 10https://gerrit.wikimedia.org/r/334467 (owner: 10Dzahn) [01:27:46] (03PS4) 10Dzahn: aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - 10https://gerrit.wikimedia.org/r/334467 [01:29:16] (03PS5) 10Dzahn: aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - 10https://gerrit.wikimedia.org/r/334467 [01:30:25] (03CR) 10Dzahn: [C: 032] aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - 10https://gerrit.wikimedia.org/r/334467 (owner: 10Dzahn) [01:33:16] 06Operations, 13Patch-For-Review: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380#2991984 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/335577/ https://gerrit.wikimedia.org/r/#/c/334467/ [01:33:21] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2991985 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/335577/ https://gerrit.wikimedia.org/r/#/c/334467/ [01:39:11] (03CR) 1020after4: [C: 031] Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [01:42:57] (03PS1) 10Dzahn: installserver: add firewall hole for rsync also for IPv6 [puppet] - 
10https://gerrit.wikimedia.org/r/335585 (https://phabricator.wikimedia.org/T84380) [01:44:20] (03CR) 10Dzahn: [C: 032] installserver: add firewall hole for rsync also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/335585 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [01:45:04] 06Operations, 10Ops-Access-Requests: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2991994 (10MusikAnimal) [01:54:32] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2992028 (10Dzahn) [01:54:47] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2991994 (10bd808) +1 from me. Getting @MusikAnimal access to both Hadoop and the DB replicas will help with many projects in #c... [01:55:07] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2992033 (10MusikAnimal) [01:56:34] duuuuh... now this took way too long to figure out [01:57:02] i was wondering "why the heck does rsyncing not work even though i clearly open the ferm hole" [01:57:22] answer is because carbon has multiple IPv6 addresses [01:57:42] the "unmapped" one it had first.. and then the "mapped" one we added later [01:57:53] and when sending data it comes from the unmapped one.. man... 
[02:01:51] lovely [02:02:09] !log carbon - remove unmapped IPv6 address making ferm rules fail, use only the _mapped_ IP (ip addr del 2620:0:861:1:7a2b:cbff:fe09:ea0/64 dev eth0) (T84380 T132757) [02:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:18] T84380: Setup install server in codfw - tftp done, but not apt and other install services - https://phabricator.wikimedia.org/T84380 [02:02:19] T132757: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757 [02:02:50] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/0/1: down - Core: cr2-eqiad:xe-3/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave] [02:05:10] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1813.79926 Seconds [02:06:10] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 32.816381 Seconds [02:36:00] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.9) (duration: 14m 39s) [02:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:46] 06Operations, 10MediaWiki-ResourceLoader, 06Performance-Team, 10Traffic, 13Patch-For-Review: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657#2992075 (10Krinkle) @BBlack I'd like to apply a similar patch to prod puppet. First... [02:40:06] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2992076 (10Milimetric) +1 @MusikAnimal should have this access. But I think you need a +1 from Danny too. cc @DannyH [02:50:47] (03PS1) 10Andrew Bogott: Horizon: Only display puppet roles that have filtertags in the puppet comments. 
[puppet] - 10https://gerrit.wikimedia.org/r/335593 (https://phabricator.wikimedia.org/T149589) [02:52:05] 06Operations, 07Puppet, 10Horizon, 06Labs, 13Patch-For-Review: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2992084 (10Andrew) Updates: * The cache is working properly * Rendering in version Mitaka is much faster, so this will be less of an issue as soon as we're abl... [03:03:28] (03PS1) 10Dzahn: aptrepo: disable autoconfigured EUI64 addresses [puppet] - 10https://gerrit.wikimedia.org/r/335594 (https://phabricator.wikimedia.org/T84380) [03:04:51] (03PS2) 10Dzahn: aptrepo: disable autoconfigured EUI64 addresses [puppet] - 10https://gerrit.wikimedia.org/r/335594 (https://phabricator.wikimedia.org/T84380) [03:05:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [03:08:22] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 13m 49s) [03:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:06] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Feb 2 03:14:06 UTC 2017 (duration 5m 44s) [03:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:50] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/0/1: down - Core: cr2-eqiad:xe-3/2/3 (Zayo, OGYX/120003//ZYO, 36ms) {#11519} [10Gbps wave] [03:17:52] 06Operations, 07Puppet, 10Horizon, 06Labs, 13Patch-For-Review: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2992120 (10scfc) I'm not particular fond of that idea (but can offer no alternative) because then "click here to test a role" becomes a) "submit a patch to add fi... 
[03:18:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [03:31:19] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2992146 (10bd808) Actually he needs @Kaldari to sign off as his manager. [03:36:10] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:49:50] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:05:10] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:18:50] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [04:37:10] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [04:38:10] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3218297 keys, up 93 days 20 hours - replication_delay is 0 [04:44:43] Hey, cloning from https://phabricator.wikimedia.org/diffusion/1913/research-ores-editquality.git or updating an existing clone (git pull) ends up with 500 errors. I wonder how that can get fixed [04:45:05] error: RPC failed; result=22, HTTP code = 504 [04:45:05] fatal: The remote end hung up unexpectedly [04:46:12] google helped me: http://stackoverflow.com/questions/6842687/the-remote-end-hung-up-unexpectedly-while-git-cloning [04:46:15] nvm [05:06:51] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:21:10] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:10] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:10] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:10] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:10] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:10] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:21:11] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:22:00] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:22:00] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:22:00] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:22:01] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:22:01] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:22:01] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:22:01] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:22:01] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:35:50] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet 
is currently enabled, last run 26 seconds ago with 0 failures [06:11:50] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:28:11] PROBLEM - carbon-cache@e service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@e is failed [06:28:11] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:29:00] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:32:10] RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89941.16 seconds [06:36:20] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:39:50] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:21] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2992309 (10kaldari) +1 from me. 
[06:52:01] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:54:00] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:55:10] RECOVERY - carbon-cache@e service on graphite1003 is OK: OK - carbon-cache@e is active [06:55:10] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [07:00:43] (03PS3) 10Yuvipanda: docker: Gently wade into new coding guidelines [puppet] - 10https://gerrit.wikimedia.org/r/335278 [07:00:45] (03PS3) 10Yuvipanda: docker: Allow using docker.io's repo directly for debs [puppet] - 10https://gerrit.wikimedia.org/r/335299 [07:06:20] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:09:30] (03PS4) 10Yuvipanda: docker: Gently wade into new coding guidelines [puppet] - 10https://gerrit.wikimedia.org/r/335278 [07:09:32] (03PS4) 10Yuvipanda: docker: Allow using docker.io's repo directly for debs [puppet] - 10https://gerrit.wikimedia.org/r/335299 [07:15:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2371 [07:20:10] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 225052 Threads: 1 Questions: 1857167 Slow queries: 953 Opens: 2681 Flush tables: 1 Open tables: 451 Queries per second avg: 8.252 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:30:51] (03PS2) 10Muehlenhoff: Remove otto from piwik-roots [puppet] - 10https://gerrit.wikimedia.org/r/335010 (https://phabricator.wikimedia.org/T142836) [07:31:25] (03CR) 10Giuseppe Lavagetto: [C: 031] docker: Gently wade into new coding guidelines [puppet] - 10https://gerrit.wikimedia.org/r/335278 (owner: 10Yuvipanda) [07:31:38] (03CR) 10Yuvipanda: [C: 032] docker: Gently wade into 
new coding guidelines [puppet] - 10https://gerrit.wikimedia.org/r/335278 (owner: 10Yuvipanda) [07:33:23] (03PS3) 10Muehlenhoff: Remove otto from piwik-roots [puppet] - 10https://gerrit.wikimedia.org/r/335010 (https://phabricator.wikimedia.org/T142836) [07:34:20] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:35:38] (03CR) 10Muehlenhoff: [C: 032] Remove otto from piwik-roots [puppet] - 10https://gerrit.wikimedia.org/r/335010 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [07:36:46] (03PS2) 10Muehlenhoff: Remove otto from aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/335012 (https://phabricator.wikimedia.org/T142836) [07:36:56] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#2992331 (10Marostegui) In addition to what @jcrespo mentioned, in general, we are not completely happy if we decommission servers before running pt-table-checksum (T154485) wh... [07:38:17] (03CR) 10Muehlenhoff: [C: 032] Remove otto from aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/335012 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [07:41:54] !log Restart MySQL on db2012 to tune some innodb_ft flags - T156905 [07:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:58] T156905: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905 [07:43:50] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:48:10] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89925.42 seconds [07:55:14] PROBLEM - MariaDB disk space on dbstore1001 is CRITICAL: DISK CRITICAL - free space: /srv 387056 MB (5% inode=99%) [07:55:39] I will take care of that [07:57:20] thanks marostegui ! 
[07:59:07] (03PS2) 10Giuseppe Lavagetto: Initial commit of etcd-mirror [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/335460 [08:01:30] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:08:37] !log legoktm@tin Synchronized php-1.29.0-wmf.10/includes/api/ApiQueryAllUsers.php: Make last remaining user_groups queries honor $wgDisableUserGroupExpiry https://gerrit.wikimedia.org/r/#/c/335587/ (T156995) (duration: 00m 58s) [08:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:42] T156995: Request for Special:ActiveUsers with some parameters gives an error in non-Wikipedia wikis - https://phabricator.wikimedia.org/T156995 [08:08:50] (03PS18) 10Filippo Giunchedi: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [08:10:15] !log legoktm@tin Synchronized php-1.29.0-wmf.10/includes/specials/pagers/ActiveUsersPager.php: Make last remaining user_groups queries honor $wgDisableUserGroupExpiry https://gerrit.wikimedia.org/r/#/c/335587/ (T156995) (duration: 00m 51s) [08:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:50] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:12:23] (03CR) 10Filippo Giunchedi: [C: 032] Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [08:17:10] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:17:31] that's me ^ [08:17:32] silencing [08:19:30] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:20:00] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:20:00] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:20:00] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:20:20] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:20:20] racing puppet heh [08:21:00] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus/jmx_exporter] [08:22:20] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[prometheus/jmx_exporter] [08:23:29] !log filippo@tin Started deploy [prometheus/jmx_exporter@23a8f0b]: (no justification provided) [08:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:41] !log filippo@tin Finished deploy [prometheus/jmx_exporter@23a8f0b]: (no justification provided) (duration: 00m 12s) [08:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:49] no justification! 
[08:24:10] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:24:33] !log Remove /srv/tmp/db1067.tar.gz from dbstore1001 to gain disk space [08:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:58] !log Transfer /srv/tmp/db1063.tar.gz from dbstore1001 to db1095:/srv/tmp to gain disk space [08:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:50] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:26:22] !log filippo@tin Started deploy [prometheus/jmx_exporter@23a8f0b]: (no justification provided) [08:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:29] !log filippo@tin Finished deploy [prometheus/jmx_exporter@23a8f0b]: (no justification provided) (duration: 00m 07s) [08:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:30] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:30:18] !log installing ntfs-3g security update on labnodepool (other servers had it deinstalled) [08:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:51] (03PS2) 10Muehlenhoff: Remove ntfs-3g on precise/trusty [puppet] - 10https://gerrit.wikimedia.org/r/335444 [08:35:11] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [08:35:20] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:35:30] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:37:50] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:39:00] RECOVERY 
- puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:39:51] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:41:47] (03PS5) 10Yuvipanda: [WIP] tools: Use docker engine profile [puppet] - 10https://gerrit.wikimedia.org/r/335299 [08:42:01] !log filippo@tin Started deploy [prometheus/jmx_exporter@23a8f0b]: jmx_exporter deploy [08:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:05] !log filippo@tin Finished deploy [prometheus/jmx_exporter@23a8f0b]: jmx_exporter deploy (duration: 00m 04s) [08:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:18] (03CR) 10Muehlenhoff: [C: 032] Remove ntfs-3g on precise/trusty [puppet] - 10https://gerrit.wikimedia.org/r/335444 (owner: 10Muehlenhoff) [08:43:43] (03PS1) 10Filippo Giunchedi: scap: use cassandra dsh group [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/335619 [08:44:19] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] scap: use cassandra dsh group [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/335619 (owner: 10Filippo Giunchedi) [08:44:41] (03PS6) 10Yuvipanda: [WIP] tools: Use docker engine profile [puppet] - 10https://gerrit.wikimedia.org/r/335299 [08:54:31] (03CR) 10Filippo Giunchedi: "> > IMO on balance these two things don't offset introducing (and" [puppet] - 10https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: 10Mobrovac) [08:57:02] !log deploying schema change to page_assessments_projects on enwiki T156305 [08:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:07] T156305: Update page_assessments_projects schema for subprojects in production - https://phabricator.wikimedia.org/T156305 [09:01:07] (03CR) 10Ema: "> If we can selectively call the inspect for errors only, I am ok" 
[debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334567 (owner: 10Ema) [09:02:16] (03PS3) 10Ema: Use caller function module name as default log prefix [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334567 [09:02:58] (03PS19) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [09:13:15] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2990060 (10fgiunchedi) thanks @Krinkle ! I have some questions m... [09:15:01] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2992491 (10MoritzMuehlenhoff) jessie-wikimedia/backports has the same priority as main, if it gets added there, it also upgrades all labs instances... 
[09:17:12] !log Remove dbstore1001:/srv/tmp/db1063.tar.gz after it has been transferred to db1095:/srv/tmp/db1063.tar.gz to get more disk space [09:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:57] !log upgrading remaining canary servers to new HHVM packages [09:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:28] !log deploying schema change to page_assessments_projects on enwikivoyage T156305 [09:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:34] T156305: Update page_assessments_projects schema for subprojects in production - https://phabricator.wikimedia.org/T156305 [09:22:00] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [09:22:37] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2992505 (10fgiunchedi) FWIW we're using the same method via `jessie-wikimedia/experimental` on cache boxes for e.g. nginx or the kernel, this is the... [09:27:13] 06Operations, 06Operations-Software-Development: Puppet compiler: abort on git rebase conflict - https://phabricator.wikimedia.org/T157001#2992511 (10Volans) [09:31:29] 06Operations, 06Operations-Software-Development: Puppet compiler: re-add the concurrency option NUM_THREADS - https://phabricator.wikimedia.org/T157002#2992528 (10Volans) [09:36:22] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2992548 (10hashar) ARGHGHGHG. At the risk of derailing completely: Debian.org has 500 by default and backports at 100 | codename/component | Pri... 
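[editor's note] The T140927 comments above turn on apt pin priorities. A minimal sketch of why that matters, assuming only apt's documented rule that the candidate version comes from the source with the highest pin priority, with the highest version breaking ties. The source labels and version numbers below are made up for illustration, versions are simplified to integer tuples rather than real Debian version strings, and the installed-version and priority-above-1000 rules are left out.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    source: str      # hypothetical "codename/component" label
    priority: int    # apt pin priority of that source
    version: tuple   # simplified version, e.g. (2, 8, 1)


def pick_candidate(candidates):
    """Pick the preferred candidate under the simplified rule:
    highest pin priority first, highest version as the tie-breaker."""
    return max(candidates, key=lambda c: (c.priority, c.version))


# Backports pinned below main: the older version in main stays the candidate.
pinned_low = [
    Candidate("example/main", 500, (2, 1, 4)),
    Candidate("example/backports", 100, (2, 8, 1)),
]

# Backports at the same priority as main: the newer version wins on every host.
same_priority = [
    Candidate("example/main", 500, (2, 1, 4)),
    Candidate("example/backports", 500, (2, 8, 1)),
]

print(pick_candidate(pinned_low).source)
print(pick_candidate(same_priority).source)
```

Under these assumptions, a component that shares main's priority behaves like main: anything newer placed there becomes the upgrade candidate for every host that has the component enabled.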
[09:38:53] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2992552 (10MoritzMuehlenhoff) > I am not sure though why `jessie-wikimedia` has `backports` and `main` components since they end up being at the sam... [09:54:49] !log rolling restart of logstash cluster to pick up openjdk/NSS security updates [09:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:33] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#2992609 (10Marostegui) Hi, Some tests I have been doing: - Executed the same query on db1048 (slave that crashed - OOM). Query took 2 minutes (as it did yester... [10:19:58] (03PS3) 10Elukey: Replace mc2001 with mc2019 in Mediawiki Redis/Memcached shards [puppet] - 10https://gerrit.wikimedia.org/r/335449 (https://phabricator.wikimedia.org/T155755) [10:23:14] !log rolling restart of nginx tls terminators running on mw* application servers in eqiad to pick up openssl 1.1 update [10:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:50] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [10:34:50] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [10:34:58] !log restarted hadoop-mapreduce-historyserver on analytics1001 [10:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:03] !log Swap mc2001 with mc2019 (Redis codfw replicas) - T155755 [10:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:07] T155755: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755 [10:53:13] 
06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2992752 (10fgiunchedi) For redundancy purposes it would be nice if mediawiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know if mediawiki is already able to do that... [10:53:29] (03PS1) 10Filippo Giunchedi: Allow mwlog[12]001 on datasets/dumps/eventlog/logstash [puppet] - 10https://gerrit.wikimedia.org/r/335623 (https://phabricator.wikimedia.org/T123728) [10:53:31] (03PS1) 10Filippo Giunchedi: scap: move udp2log from fluorine to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335624 (https://phabricator.wikimedia.org/T123728) [10:53:33] (03PS1) 10Filippo Giunchedi: udp2log: mirror traffic to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) [10:53:53] (03CR) 10Elukey: [C: 032] Replace mc2001 with mc2019 in Mediawiki Redis/Memcached shards [puppet] - 10https://gerrit.wikimedia.org/r/335449 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [10:55:11] some alarms might trigger (Redis replication and ip sec) [10:55:59] Creating service restbase/cassandra [10:55:59] WARNING:conftool:Setting default_values to the default value {'pooled': 'no', 'weight': 0} [10:56:02] Creating service aqs/cassandra [10:56:05] WARNING:conftool:Setting default_values to the default value {'pooled': 'no', 'weight': 0} [10:56:08] Creating service maps/cassandra [10:56:08] godog: --^ [10:56:11] WARNING:conftool:Setting default_values to the default value {'pooled': 'no', 'weight': 0} [10:56:14] this looks weird [10:56:25] elukey: just now? 
[10:56:48] yeah, I merged my code review [10:56:55] but it is totally unrelated [10:57:25] elukey: odd, yeah I got the same messages earlier today when merging mine which was expected, anyways let me know when it finishes [10:57:43] godog: finished, just wanted to check with you that everything was ok [10:58:59] https://config-master.wikimedia.org looks good [10:59:00] elukey: yeah it looks like conftool-merge tries to create the services at each invocation [10:59:17] not sure why though [11:03:12] also it outputs weight: 0 but conftool-data has weight: 10 as default value [11:04:24] mmm [11:04:25] _joe_: ^ have you seen this before? namely conftool-merge trying to create services at each run [11:04:44] <_joe_> wat? [11:05:12] I'm assuming it is related to https://gerrit.wikimedia.org/r/#/c/332535 and its conftool-data changes heh [11:05:15] <_joe_> godog: did anyone change conftool-data? [11:05:35] <_joe_> uff [11:05:38] <_joe_> let me see [11:06:03] I'm ok reverting the conftool-data part of that change if that makes things easier [11:06:18] <_joe_> wait [11:06:45] <_joe_> so I do see a problem there [11:07:53] <_joe_> I think I know what's going on [11:08:00] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [11:08:10] <_joe_> and I think I also said so in a CR [11:08:47] <_joe_> default_values are needed as they're used to fill the "node" objects with default values. 
[11:08:56] <_joe_> eric answeered "done" [11:09:01] PROBLEM - IPsec on mc2001 is CRITICAL: Strongswan CRITICAL - ok: 1 not-conn: mc1001_v4 [11:09:17] ah, I see the typo now, defaul vs default [11:09:23] <_joe_> but it doesn't seem like they've been added [11:09:38] fixing that now [11:10:20] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [11:10:21] <_joe_> "defaul_values": {"pooled": "no", "weight": 10}} [11:10:24] <_joe_> oook [11:10:32] <_joe_> that doesn't look right :P [11:10:33] checking mc2001 [11:10:38] (03CR) 10Juniorsys: [C: 031] Fix typo in realm.pp [puppet] - 10https://gerrit.wikimedia.org/r/335548 (owner: 10Tim Landscheidt) [11:10:49] but mc1001 -> mc1019 works well :) [11:10:53] ah ok now I got it [11:11:08] mc2001 is now not able to connect via IPsec to mc1001 [11:11:24] (03PS1) 10Filippo Giunchedi: conftool-data: fix typo defaul_values vs default_values [puppet] - 10https://gerrit.wikimedia.org/r/335626 [11:12:00] RECOVERY - IPsec on mc2001 is OK: Strongswan OK - 1 ESP OK [11:12:13] ran puppet --^ [11:12:14] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] conftool-data: fix typo defaul_values vs default_values [puppet] - 10https://gerrit.wikimedia.org/r/335626 (owner: 10Filippo Giunchedi) [11:18:06] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2992765 (10elukey) Warning: I am swapping all the mc2* codfw hos... [11:23:14] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#2992780 (10Joe) Correct me if I'm wrong, but I think the Main pa... 
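[editor's note] The `defaul_values` vs `default_values` typo found above is the kind of mistake a key check can catch before merge. A minimal sketch, not conftool's actual code: `KNOWN_KEYS`, `FALLBACK_DEFAULTS`, `check_keys` and `node_defaults` are hypothetical names, and the real conftool-data schema is not shown in this log. It also illustrates how a misspelled key could explain the `weight: 0` in the warning even though the file says `weight: 10`.

```python
import difflib

# Hypothetical set of keys a service entry may contain (assumption).
KNOWN_KEYS = {"default_values", "port", "datacenters"}

# Hypothetical fallback applied when "default_values" is missing (assumption).
FALLBACK_DEFAULTS = {"pooled": "no", "weight": 0}


def check_keys(entry):
    """Return (unknown_key, suggestion) pairs for keys outside KNOWN_KEYS."""
    problems = []
    for key in entry:
        if key not in KNOWN_KEYS:
            close = difflib.get_close_matches(key, sorted(KNOWN_KEYS), n=1)
            problems.append((key, close[0] if close else None))
    return problems


def node_defaults(entry):
    """Defaults a node object would be filled with for this service entry."""
    return entry.get("default_values", FALLBACK_DEFAULTS)


# The misspelled entry quoted in the channel.
entry = {"defaul_values": {"pooled": "no", "weight": 10}}

print(check_keys(entry))      # flags the misspelled key with a suggestion
print(node_defaults(entry))   # the fallback is used, so weight is 0, not 10
```

Rejecting unknown keys outright, rather than silently falling back, would turn this class of typo into a failed check instead of a recurring warning.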
[11:24:39] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2992781 (10hashar) I managed to reproduce locally with HHVM 3.12.11+dfsg-1+wmf1 from apt.wikimedia.org and its l... [11:28:23] (03PS1) 10Elukey: Replace mc200[23] Redis/Memcached codfw shards with mc202[01] [puppet] - 10https://gerrit.wikimedia.org/r/335627 (https://phabricator.wikimedia.org/T155755) [11:29:48] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942#2992787 (10Addshore) [11:29:50] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2802094 (10Addshore) 05Open>03Resolved [11:32:38] !log joal@tin Started deploy [analytics/refinery@bc4b4ed]: (no justification provided) [11:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:08] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 983 MB (2% inode=82%) [11:35:41] !log joal@tin Finished deploy [analytics/refinery@bc4b4ed]: (no justification provided) (duration: 03m 03s) [11:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:46] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5315/" [puppet] - 10https://gerrit.wikimedia.org/r/335627 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [11:39:47] (03PS3) 10Giuseppe Lavagetto: Initial commit of etcd-mirror [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/335460 [11:40:52] !log Swap mc2002 with mc2020, mc2003 with mc2021 (Redis codfw replicas) - T155755 [11:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:56] T155755: codfw:rack/setup mc2019-mc2036 - 
https://phabricator.wikimedia.org/T155755 [11:45:57] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [11:53:07] RECOVERY - Disk space on stat1002 is OK: DISK OK [11:53:43] this is me --^ [11:58:14] (03PS2) 10Nschaaf: (in progress) Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) [11:58:30] !log joal@tin Started deploy [analytics/refinery@bc4b4ed]: (no justification provided) [11:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:45] !log joal@tin Finished deploy [analytics/refinery@bc4b4ed]: (no justification provided) (duration: 01m 14s) [11:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:08] (03CR) 10Nschaaf: [C: 04-1] "I did not notice the refinery-drop-hourly-partitions script before. I've updated the commit to utilize that instead. I still need to test " [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [12:07:10] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:25:27] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2992912 (10hashar) Bash limits the size of core files to 0 which means no core get generated. Gotta set it to `u... 
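[editor's note] The last comment above is cut off by the bot, so the exact command used there is not known. As a hedged illustration of the underlying point (a core size limit of 0 means no core file is written), this sketch inspects and raises the soft limit from Python using the standard `resource` module on a Unix host. It is one possible approach, not necessarily what was done on the ticket.

```python
import resource

# Current soft and hard limits on core file size for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core size limit (soft, hard):", soft, hard)

# With a soft limit of 0 the kernel does not write a core file on a crash.
# Raise the soft limit as far as the hard limit permits; child processes
# started afterwards inherit the new value.
if soft != hard:
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

new_soft, new_hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core size limit now (soft, hard):", new_soft, new_hard)
```

An unprivileged process can raise its soft limit only up to the hard limit, so if the hard limit is itself 0 this changes nothing and the limit has to be lifted by whatever launches the process.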
[12:27:07] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2992915 (10Milimetric) Sorry, should've looked that up [12:30:11] (03PS4) 10Gehel: elasticsearch - reimage elasticsearch relforge servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) [12:32:12] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elasticsearch relforge servers to jessie [puppet] - 10https://gerrit.wikimedia.org/r/323156 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [12:36:09] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:43:07] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2992971 (10hashar) Installed on mwdebug1001 under /home/hashar/mediawiki-core ``` git clone --depth 1 --single-... [12:51:29] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2992981 (10hashar) On deployment-tin additionally did: git clone --depth 1 --single-branch https://gerrit.wikim... [12:52:59] 06Operations, 10ops-eqiad: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2992982 (10fgiunchedi) [12:53:01] !log starting reimage to jessie of elasticsearch relforge - T151326 [12:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:06] T151326: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326 [12:54:59] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:55:37] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2992999 (10hashar) Deployment-tin does not segfault. I am thus upgrading HHVM: Unpacking hhvm-dbg (3.12.11+dfsg... [12:55:51] (03PS15) 10Elukey: Add JVM Heap usage alarms for basic Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/330154 (https://phabricator.wikimedia.org/T88640) [13:10:09] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5316/" [puppet] - 10https://gerrit.wikimedia.org/r/330154 (https://phabricator.wikimedia.org/T88640) (owner: 10Elukey) [13:13:57] (03PS1) 10Volans: Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 [13:17:43] (03CR) 10Volans: "I've found them doing a full puppet compiler:" [labs/private] - 10https://gerrit.wikimedia.org/r/335643 (owner: 10Volans) [13:17:52] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2993037 (10hashar) Fast repro is: hhvm -v Eval.Jit=false tests/phpunit/phpunit.php tests/phpunit/includes/i... [13:19:48] (03CR) 10Volans: "Last changeset was just a rebase due to conflicts." 
[puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [13:22:59] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:29:30] graphite.w.o is acting weird for me [13:30:29] very slow and when it loads the visualization is messed up [13:31:30] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2990073 (10MoritzMuehlenhoff) I think I've identified the problem, a new package is building on copper ATM. [13:32:13] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:30] yeah.. [13:32:52] at this point I am not sure if https://gerrit.wikimedia.org/r/330154 caused it [13:33:03] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.009 second response time [13:34:20] when it breaks I can see a "OperationalError: database is locked" on the UI [13:35:32] elukey: not sure but it may be a clue that https://graphite-labs.wikimedia.org/ is ok even if a little slow [13:35:42] not sure how tied the two are but I think it's the same role etc [13:38:03] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [13:39:03] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.009 second response time [13:39:29] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2993094 (10hashar) [13:46:06] (03PS1) 10Elukey: Revert "Add JVM Heap usage alarms for basic Hadoop daemons" [puppet] - 10https://gerrit.wikimedia.org/r/335648 [13:46:51] I can't find 
anything super obvious from the host metrics but the timeline correlate too much with my change + puppet runs, reverting [13:47:33] (03CR) 10Elukey: [C: 032] Revert "Add JVM Heap usage alarms for basic Hadoop daemons" [puppet] - 10https://gerrit.wikimedia.org/r/335648 (owner: 10Elukey) [13:50:07] 06Operations, 10ops-esams, 10hardware-requests: reclaim hooft to spares - https://phabricator.wikimedia.org/T131560#2170895 (10faidon) Let's do the opposite, see T156506. [13:51:23] ok also running puppet on all the Hadoop nodes to speed up [13:52:47] Grafana is also reporting errors.. [13:55:42] elukey: I suspect it is https://phabricator.wikimedia.org/T157022 [13:56:45] godog: yeah it seems freezing once in a while [13:56:50] 06Operations, 10netops: asw-d-codfw public1-vlan addition review (blocks gerrit2001) - https://phabricator.wikimedia.org/T156957#2993128 (10faidon) Yes, this sounds fine. If you don't know this already, a tip to make sure that the config is applying correctly is to run `show interfaces ge-5/0/10 | display inhe... [13:57:03] but not sure how/if my changes triggered more load [13:57:50] could be but unlikely, depending on the query [13:57:58] I assume I'll be okay to do swat in 3 mins even with these possible graphite issues? [13:58:12] addshore: I think so yeah [13:58:19] ack, thanks! 
[13:58:42] (03CR) 10Addshore: Enable ElectronPdfService extension on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) (owner: 10Addshore) [13:58:50] (03PS4) 10Addshore: Enable ElectronPdfService extension on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) [13:58:59] (03PS2) 10Addshore: Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) (owner: 10Urbanecm) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T1400). [14:00:04] addshore and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:08] Urbanecm: here? :) (I'll mine first) [14:00:13] (03CR) 10Addshore: [C: 032] Enable ElectronPdfService extension on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) (owner: 10Addshore) [14:00:26] addshore: you are in charge of eu swat today? [14:00:39] zeljkof: will do (as one of the two changes is mine) :) [14:00:44] (03CR) 10Filippo Giunchedi: [C: 031] Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 (owner: 10Volans) [14:00:49] addshore: great :) [14:00:57] You can have a day off! 
;) [14:01:40] (03Merged) 10jenkins-bot: Enable ElectronPdfService extension on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) (owner: 10Addshore) [14:01:49] (03CR) 10jenkins-bot: Enable ElectronPdfService extension on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) (owner: 10Addshore) [14:01:59] (03CR) 10Addshore: [C: 032] Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) (owner: 10Urbanecm) [14:02:33] ah swat [14:02:48] godog: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=graphite1001&from=now-7d&to=now looks nice :) [14:02:49] Urbanecm: Urbanecm_ here? [14:02:50] \^^/ [14:02:57] (03CR) 10Faidon Liambotis: [C: 04-2] "No, that's wrong:" [puppet] - 10https://gerrit.wikimedia.org/r/335594 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [14:03:01] (03CR) 10Addshore: Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) (owner: 10Urbanecm) [14:03:28] addshore: yes, today. [14:03:36] godog: indeed something is broken [14:03:45] Urbanecm_: brilliant! [14:03:50] (I'm at library and they block IRC port...) [14:03:52] (03CR) 10Addshore: [C: 032] Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) (owner: 10Urbanecm) [14:04:00] 06Operations, 10ops-eqiad: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2993154 (10fgiunchedi) model `INTEL SSDSC2BB600G4` [14:04:28] ok I ran puppet in all hadoop nodes, my patch is fully reverted [14:04:30] Urbanecm_: that is a terrible library! 
[14:04:32] elukey: heh for some definition of 'nice' heh [14:04:45] hashar: for SWATting yes :) [14:04:55] (but they allow SSH btw) [14:05:37] hashar: I got a bunch of "cannot delete non-empty directory: php-1.29.0-wmf.3" on mwdebug1002 when doing scap pull, im guessing i can ignore that? [14:05:48] doh [14:05:55] addshore: permission issue I guess. Worth reporting to a task [14:05:59] but yeah probably harmless [14:06:07] Okay will do after! [14:06:12] (03Merged) 10jenkins-bot: Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) (owner: 10Urbanecm) [14:06:17] 14:05:24 Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 93% OVER_THRESHOLD (Avg. Error rate: Before: 0.01, After: 1.00, Threshold: 0.06) [14:06:18] hmmm [14:06:27] eeek [14:07:11] 06Operations, 10ops-eqiad: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2992982 (10elukey) From Grafana: {F5453869} {F5453880} [14:07:17] 1.00 means 100% errror rate? [14:07:33] jynus: it write 93 %. But almost 100% [14:07:35] (03CR) 10jenkins-bot: Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335481 (https://phabricator.wikimedia.org/T156737) (owner: 10Urbanecm) [14:07:40] no idea [14:07:44] mw1278 doesnt look bad on logstash [14:08:16] and tested on mwdebug1002 and everything look clear [14:08:29] maybe it had a single error? :D [14:08:32] addshore: so test it at mw1278 if you can :D [14:08:43] 2017-02-02T14:05:11 enwiki Model contains an error for 465982874: CommentDeleted: Comment deleted (datasource.revision.comment) [14:09:00] so potentially it had zero errors before [14:09:08] hashar: how is eswikinews related to enwiki? [14:09:16] and one error happened exactly during the 20 seconds log check [14:09:30] I guess I'll resync? 
[14:09:33] yeah [14:10:00] I'm kind of suprised the failure of the scap didn't log actually [14:10:24] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:324489|Enable ElectronPdfService extension on dewiki]] T150942 (duration: 00m 41s) [14:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:28] T150942: Deploy ElectronPdfService Extension to dewiki - https://phabricator.wikimedia.org/T150942 [14:10:56] Okay, looks clear! :) [14:11:30] addshore: thanks :) [14:11:41] Urbanecm_: that was my change, your is coming up now! [14:12:09] addshore: I'm confused by the merged :D [14:12:11] Urbanecm_: on mwdebug1002 ! [14:12:49] addshore: Can't test import as I'm not a sysop. [14:13:01] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:13:04] ack, the site looks up :) syncing now [14:13:19] ok [14:14:16] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:335481|Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews]] T156737 (duration: 00m 40s) [14:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:20] T156737: Add Wikinews languages (en, pt, ca, fr, de, it) as import sources on eswikinews - https://phabricator.wikimedia.org/T156737 [14:14:26] Urbanecm_: all done! and looks good to me! [14:14:29] Thanks [14:14:41] !log EU SWAT finished [14:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:46] addshore: sorry for bothering but I didn't add one patch to the calendar :D. It's 334997. Can we reopen SWAT and do it? Or should I use Monday SWAT? [14:16:00] It's for T156621 [14:16:01] T156621: Create Wikiprojekti namespace on Finnish Wikipedia - https://phabricator.wikimedia.org/T156621 [14:16:58] Urbanecm_: add it to the calendar! I guess we can squeeze it in! 
[14:17:03] Ok [14:17:07] (03PS2) 10Addshore: Create Wikiprojekti namespace on fiwiki and enable VE in it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334997 (https://phabricator.wikimedia.org/T156621) (owner: 10Urbanecm) [14:18:11] addshore: added [14:18:17] (03CR) 10Addshore: [C: 032] Create Wikiprojekti namespace on fiwiki and enable VE in it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334997 (https://phabricator.wikimedia.org/T156621) (owner: 10Urbanecm) [14:18:45] (03CR) 10Hashar: [C: 031] "When a snapshot is being generated, an instance is booted but Nodepool does not consider it to be part of the pool. So it reports it as an" [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [14:19:03] 06Operations, 10scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#2993201 (10Addshore) [14:19:46] (03Merged) 10jenkins-bot: Create Wikiprojekti namespace on fiwiki and enable VE in it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334997 (https://phabricator.wikimedia.org/T156621) (owner: 10Urbanecm) [14:19:54] (03CR) 10jenkins-bot: Create Wikiprojekti namespace on fiwiki and enable VE in it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334997 (https://phabricator.wikimedia.org/T156621) (owner: 10Urbanecm) [14:20:31] (03CR) 10Andrew Bogott: [C: 031] "Is there anything distinctive about the snapshot so we can detect it as such? (e.g. 
the lack of an IP?)" [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [14:22:28] Urbanecm_: checked and syncing [14:22:43] thx [14:23:03] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:334997|Create Wikiprojekti namespace on Finnish Wikipedia]] T156621 (duration: 00m 41s) [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:08] T156621: Create Wikiprojekti namespace on Finnish Wikipedia - https://phabricator.wikimedia.org/T156621 [14:23:11] !log EU SWAT really finished [14:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:10] urandom: addshore: congrats :] [14:28:42] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 10scap, 15User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#2993250 (10hashar) [14:33:08] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#2801021 (10hashar) mwscript is hardcoded to use php5. We no more add Zend PHP extensions (such as php-memcached) on the app servers. When we... [14:33:30] (03CR) 10Rush: "wondering the same thing as andrew, I did catch an alien tuesday during initial testing and it looked like:" [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [14:33:35] (03PS6) 10Rush: wip nodepool: track and alert on age of instance states [puppet] - 10https://gerrit.wikimedia.org/r/335373 [14:36:16] 06Operations, 07Puppet, 10Horizon, 06Labs, 13Patch-For-Review: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2993261 (10Andrew) @scfc part of why I'm conflicted about this issue is that wmf Ops keep asserting that no one would ever use this gui to discover roles that the... 
[14:37:56] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#2993282 (10hashar) [14:41:09] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:45:01] (03PS1) 10Elukey: Replace Redis/Memcached shards mc200[4567] with mc202[2345] [puppet] - 10https://gerrit.wikimedia.org/r/335655 (https://phabricator.wikimedia.org/T155755) [14:51:47] !log manually fail sdc on graphite1001 - T157022 [14:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:51] T157022: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022 [14:53:13] (03CR) 10Hashar: [C: 031] "The alien ci-jessie-wikimedia-498353 was an instance that nova somehow could not delete but it is gone now. The long story is at T156636 " [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [14:53:19] PROBLEM - MD RAID on graphite1001 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 2, Spare: 0 [14:53:21] ACKNOWLEDGEMENT - MD RAID on graphite1001 is CRITICAL: CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T157034 [14:53:25] 06Operations, 10ops-eqiad: Degraded RAID on graphite1001 - https://phabricator.wikimedia.org/T157034#2993319 (10ops-monitoring-bot) [14:54:25] 06Operations, 10ops-eqiad: Degraded RAID on graphite1001 - https://phabricator.wikimedia.org/T157034#2993325 (10fgiunchedi) [14:54:27] 06Operations, 10ops-eqiad: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2993328 (10fgiunchedi) [14:56:53] 06Operations, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#2993373 (10MoritzMuehlenhoff) [14:57:59] PROBLEM - 
mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=451.70 Read Requests/Sec=448.30 Write Requests/Sec=437.40 KBytes Read/Sec=32384.00 KBytes_Written/Sec=3198.40 [15:01:39] (03CR) 10Elukey: [C: 032] Replace Redis/Memcached shards mc200[4567] with mc202[2345] [puppet] - 10https://gerrit.wikimedia.org/r/335655 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [15:01:58] !log Replace Redis/Memcached shards mc200[4567] with mc202[2345] [15:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:59] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=16.10 Read Requests/Sec=2.20 Write Requests/Sec=122.50 KBytes Read/Sec=31.60 KBytes_Written/Sec=803.20 [15:11:06] !log rebooting mc203[56] (not taking any traffic) to test if they come up cleanly [15:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:43] !log upgrading firejail on eqiad imagescalers [15:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:40] (03PS1) 10Andrew Bogott: bootstrap_vz and vmbuilder: Do apt-get update before first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/335661 [15:35:02] !log upgrading firejail on wtp1001 [15:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:51] !log upgrading firejail on remaining app servers [15:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:53] (03CR) 10Zhuyifei1999: "> interactive restarts" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/335569 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [16:00:30] !log rebooting mc203[01234] (not serving any traffic) to see if they come up cleanly [16:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:48] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm] [16:03:38] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:05:11] (03CR) 10Hashar: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605) (owner: 10Zhuyifei1999) [16:10:48] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:13:08] !log rebooting mc202[6789] (not serving any traffic) to see if they come up cleanly [16:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:38] PROBLEM - Host mc2026 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:48] PROBLEM - Host mc2027 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:48] PROBLEM - Host mc2028 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:58] ahhh sorry didn't silence icinga [16:14:58] PROBLEM - Host mc2029 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:10] np, thank you for taking care of that [16:15:18] RECOVERY - Host mc2028 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [16:15:22] the reboots and checks, I mean [16:15:28] RECOVERY - Host mc2026 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [16:15:28] RECOVERY - Host mc2027 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [16:15:38] RECOVERY - Host mc2029 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [16:15:48] :) [16:16:15] if you do not create alerts from time to time [16:16:30] it means you are not working on interesting things [16:16:42] haha [16:16:52] :D [16:19:21] (03CR) 10Hashar: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/334978 (https://phabricator.wikimedia.org/T156605) (owner: 10Zhuyifei1999) [16:24:39] !log reboot mc2019->mc2025 to see if they come up cleanly (currently codfw replicas of eqiad redis shards) [16:24:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:08] (03PS1) 10Papaul: DNS/Decom: Remove mgmt dns productions DNS for db2015,db2025-db2027 Bug:T156342,T149102 [dns] - 10https://gerrit.wikimedia.org/r/335667 [16:26:28] (03CR) 10Dzahn: "@Faidon 1) the issue is on carbon ( inet6 2620:0:861:1:7a2b:cbff:fe09:ea0)" [puppet] - 10https://gerrit.wikimedia.org/r/335594 (https://phabricator.wikimedia.org/T84380) (owner: 10Dzahn) [16:26:35] these ones should have probably been done before replacing the old codfw shards, but it is also a good preliminary test [16:27:21] I am starting with mc2019, silenced also mc1001 for 10 minutes so no icinga alers for redis replication should fire (or IPSec failures) [16:27:48] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2993752 (10Papaul) [16:27:48] PROBLEM - Host mc2019 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:38] RECOVERY - Host mc2019 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:28:48] icinga didn't like the downtime -.- [16:29:39] 06Operations, 10ops-codfw, 10hardware-requests: decomission db2015 - https://phabricator.wikimedia.org/T149102#2742190 (10Papaul) a:05Papaul>03RobH port information: ge-6/0/18 [16:30:14] 06Operations, 06Operations-Software-Development: Puppet compiler: sync newest facts only - https://phabricator.wikimedia.org/T157052#2993763 (10Volans) [16:30:55] (03PS1) 10Volans: Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) [16:31:01] will wait a bit and then proceed with the rest [16:31:38] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:32:40] (03PS2) 10Volans: Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) [16:33:23] 06Operations, 13Patch-For-Review: Upgrade fluorine to 
trusty/jessie - https://phabricator.wikimedia.org/T123728#2993785 (10bd808) >>! In T123728#2992752, @fgiunchedi wrote: > For redundancy purposes it would be nice if mediawiki could send udp2log traffic to udp2log receivers in both datacenters. I don't know... [16:33:47] (03PS3) 10Volans: Puppet compiler: sync newest facts only [puppet] - 10https://gerrit.wikimedia.org/r/335670 (https://phabricator.wikimedia.org/T157052) [16:34:16] (03PS3) 10Reedy: Escape period in wiki.phtml rewrites [puppet] - 10https://gerrit.wikimedia.org/r/331944 [16:36:58] 06Operations, 10ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2993800 (10Papaul) a:05Papaul>03RobH decommission complete port information db2025 ge-5/0/9 db2026 ge-5/0/10 db2027 ge-5/0/15 [16:39:28] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 649 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3244043 keys, up 94 days 8 hours - replication_delay is 649 [16:39:28] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 650 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3244060 keys, up 94 days 8 hours - replication_delay is 650 [16:40:43] hey guys, jeff hobson has wrapped up with the foundation. it's not an emergency, but what's the appropriate way to ensure +2 rights and server access are pulled? [16:40:45] this is not me --^ [16:40:52] I am working on mc hosts [16:42:39] dr0ptp4kt: https://office.wikimedia.org/wiki/VerboseOffboard [16:44:33] bd808: thx. the first two sections are done or in flight. for the ops part is there a particular sort of phab ticket i should file or something like that? i don't mean to be lazy, except for the purpose of doing it the right way. [16:45:21] Ops-Access-Requests IIRC? [16:45:32] Or, it's probably as good as any.. 
[16:45:32] I'd say to hit up whoever is on clinic duty, but the slot in the topic seems to be blank [16:46:06] heh [16:48:29] (03PS2) 10Andrew Bogott: bootstrap_vz and vmbuilder: Do apt-get update before first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/335661 [16:48:31] (03PS1) 10Andrew Bogott: vmbuilder and boostrap_vz: Include many more packages in base image [puppet] - 10https://gerrit.wikimedia.org/r/335672 [16:51:28] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3230611 keys, up 94 days 8 hours - replication_delay is 0 [16:52:28] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3230502 keys, up 94 days 8 hours - replication_delay is 0 [16:56:28] !log reedy@tin Synchronized php-1.29.0-wmf.10/extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php: Fix inclusion path (duration: 00m 41s) [16:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:48] hasharAway: any issue with jenkins? it took 22m to run the jobs for a puppet CR [17:00:05] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T1700). [17:00:05] reedy and thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:26] * thcipriani present [17:00:53] Wonder who's doing it. 
AFAIK joe is gone for the weekend [17:05:08] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [17:08:07] !log restbase deploying 634faea2 [17:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:59] 06Operations, 10Ops-Access-Requests, 10Analytics, 10Analytics-Cluster: Requesting access to analytics-privatedata-users for musikanimal - https://phabricator.wikimedia.org/T156986#2991994 (10Nuria) Approved on my end. [17:10:08] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [17:10:08] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [17:10:08] PROBLEM - check_puppetrun on frqueue1002 is CRITICAL: CRITICAL: puppet fail [17:11:53] mobrovac: ok, updated graph: https://grafana-admin.wikimedia.org/dashboard/db/eventstreams?from=1486018402114&to=1486052894764 [17:11:58] has count as separate graph [17:12:08] so ya, those both seem correct [17:12:17] 1 request (ever) in this time (no traffic yet :) ) [17:12:26] why is the rate for v2_stream... so crazy? [17:12:48] no idea man [17:13:06] haha [17:13:07] ok [17:13:16] but yeah, you can drop your manual metrics, you do the exact same thing the template [17:13:24] sorry, can't investigate now [17:13:26] on other stuff [17:13:33] ok, so the response metric will be sent to statsd before the request finishes? 
[17:13:56] the count should [17:14:02] ok cool [17:14:07] yeah then totally redundant, will remove it [17:15:08] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [17:15:08] PROBLEM - check_puppetrun on frqueue1002 is CRITICAL: CRITICAL: puppet fail [17:15:20] thx Reedy [17:15:46] (03CR) 10Jcrespo: [C: 032] logstash_checker: provide an absolute threshold [puppet] - 10https://gerrit.wikimedia.org/r/335558 (owner: 10Thcipriani) [17:15:59] (03PS2) 10Jcrespo: logstash_checker: provide an absolute threshold [puppet] - 10https://gerrit.wikimedia.org/r/335558 (owner: 10Thcipriani) [17:20:08] RECOVERY - check_puppetrun on frqueue1002 is OK: OK: Puppet is currently enabled, last run 220 seconds ago with 0 failures [17:20:08] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 182 seconds ago with 0 failures [17:21:44] (03PS1) 10Elukey: Replace codfw Memcached/Redis mc1008->mc1011 with mc2026->mc2029 [puppet] - 10https://gerrit.wikimedia.org/r/335676 (https://phabricator.wikimedia.org/T155755) [17:22:00] 06Operations, 10Ops-Access-Requests: Offboarding for Jeff Hobson - https://phabricator.wikimedia.org/T157058#2993963 (10dr0ptp4kt) [17:22:37] thcipriani, about to merge [17:22:43] in production, I mean [17:22:50] jynus: ok [17:23:43] which machines do I run puppet now? [17:24:10] logstash1001? [17:24:29] jynus: ah only needed on deploy hosts [17:24:33] tin and mira [17:24:36] ah, ok [17:24:38] doing [17:25:42] okie doke. I'll test after puppet run. [17:26:10] I did tin [17:26:16] I suppose mira can wait? [17:26:26] 30 minutes? [17:26:33] yeah, mainly needed for the active deployment machine [17:26:36] it looks ok there [17:26:38] only place it runs regularly [17:26:40] * thcipriani checks [17:26:41] on tin, I mean [17:27:05] jynus: works as expected! [17:27:07] thank you! 
[17:29:45] (03PS2) 10Elukey: Replace codfw Memcached/Redis mc1008->mc1011 with mc2026->mc2029 [puppet] - 10https://gerrit.wikimedia.org/r/335676 (https://phabricator.wikimedia.org/T155755) [17:30:14] (03CR) 10Jcrespo: [C: 031] "Looks ok to me." [puppet] - 10https://gerrit.wikimedia.org/r/331944 (owner: 10Reedy) [17:30:52] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07HHVM: New HHVM 3.12.11 segfault at end of MediaWiki PHPUnit tests - https://phabricator.wikimedia.org/T156923#2994029 (10MoritzMuehlenhoff) I backed out the bzip2-segfault-sweep.patch introduced in 3.12.11+dfsg-1 and that... [17:32:53] !log restbase deploy end of 634faea2 [17:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:46] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#2994039 (10Paladox) [17:33:48] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:16] twentyafterfour ^^ [17:38:52] (03PS3) 10Elukey: Replace codfw Memcached/Redis mc2008->mc2011 with mc2026->mc2029 [puppet] - 10https://gerrit.wikimedia.org/r/335676 (https://phabricator.wikimedia.org/T155755) [17:43:55] !log upgrade & restart of db2052 T111654 [17:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:00] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [17:50:11] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2994110 (10RobH) [17:50:14] 06Operations, 10netops: asw-d-codfw public1-vlan addition review (blocks gerrit2001) - https://phabricator.wikimedia.org/T156957#2994108 (10RobH) 05Open>03Resolved Thanks! Merged change. 
[17:51:39] 06Operations, 10Ops-Access-Requests: Offboarding for Jeff Hobson - https://phabricator.wikimedia.org/T157058#2994119 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff @dr0ptp4kt This is already taken care of. I've removed his cluster shell access and LDAP permissions yesterday. (The offboardi... [17:52:07] thanks much moritzm [17:56:56] !log upgrade & restart of db2059 T111654 [17:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:00] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [17:58:39] !log bsitzmann@tin Started deploy [mobileapps/deploy@09101f7]: (no justification provided) [17:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T1800). [18:00:36] !log bsitzmann@tin Finished deploy [mobileapps/deploy@09101f7]: (no justification provided) (duration: 01m 56s) [18:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:21] !log ^ reverted previous deploy due to incorrect links in the news endpoint [18:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:41] 06Operations, 10ops-eqiad: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#2994155 (10Gehel) [18:01:45] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2994154 (10Gehel) [18:02:11] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2814411 (10Gehel) reimaging of relforge1001 should wait until disk issues are solved (T156663). 
[18:02:48] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:03:33] !log bsitzmann@tin Started deploy [mobileapps/deploy@09101f7]: try previous deploy again (at least on canary) [18:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:25] !log bsitzmann@tin Finished deploy [mobileapps/deploy@09101f7]: try previous deploy again (at least on canary) (duration: 00m 51s) [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:37] 06Operations, 10ops-eqiad: Degraded RAID on relforge1001 - https://phabricator.wikimedia.org/T156663#2994160 (10Gehel) disk info: ``` gehel@relforge1001:~$ sudo smartctl -i /dev/sdc smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-100-generic] (local build) Copyright (C) 2002-13, Bruce Allen, Christian Frank... [18:05:45] (03PS6) 10ArielGlenn: dumps: Modernize design of the index page [puppet] - 10https://gerrit.wikimedia.org/r/334856 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [18:07:03] (03CR) 10ArielGlenn: [C: 032] dumps: Modernize design of the index page [puppet] - 10https://gerrit.wikimedia.org/r/334856 (https://phabricator.wikimedia.org/T155697) (owner: 10Ladsgroup) [18:10:39] apergos: \o/ [18:10:54] Thanks. I make the patch for other pages right now [18:10:59] ok! [18:11:08] I'll look at them tomorrow [18:13:25] thanks [18:16:45] (03CR) 10Rush: "I can select for and exclude based on 'ciimage' if we are willing to say that's the canonical scheme." [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [18:17:04] godog: about? [18:17:14] have a q about the mwlogs dataset patchset [18:17:26] apergos: sure, for a little while still [18:17:38] is there an intent to rsync from/to dataset1001? 
[18:18:06] go [18:18:09] urgh [18:18:11] godog: [18:18:12] apergos: yeah, mwlog is going to replace fluorine, so same set of rsync in/out [18:18:16] ah [18:18:35] ok, you need the same cron job on whichever one of those is primary then [18:18:47] not that this needs to be part of the patchset, just to bear in mind [18:20:28] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:21:14] if the mwloh hosts will have ipv6 then you should also add them to the dumps::rsync_clients_ipv6 stanza in role/common/dumps/server.yaml, godog [18:21:27] *mwlog [18:21:59] apergos: ack, I don't know offhand is it fluorine -> dataset or the other way around? who pulls data from where? [18:22:10] or push, whichever [18:22:12] the cron is on fluorine to push [18:25:41] apergos: ok thanks! yeah I'll make sure the cronjobs are moved too [18:25:51] yep [18:26:01] just don't push from both hosts :-P [18:26:25] haha not even if the data is the same?! [18:27:11] I'll keep it in mind tho [18:28:15] apergos: can you comment on the ipv6 bits on the patchset too so I won't forget? [18:28:20] sure sure [18:28:23] apergos: if you are around. Where progress.html is used? [18:28:27] https://dumps.wikimedia.org/progress.html returns 404 [18:28:36] Amir1 wait 2 mins please [18:28:45] sure, no rush [18:30:34] (03CR) 10ArielGlenn: "Two things: there's a cron job on fluorine that shoves data over, you'll need to make sure whichever host is the primary, picks up that jo" [puppet] - 10https://gerrit.wikimedia.org/r/335623 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [18:30:43] godog: done [18:30:49] apergos: \o/ thansk [18:38:44] Amir1: you can do report.html if you want to get the per-directory index html pages... the progress.html file is left around only as a sample. 
[18:39:27] apergos: I see [18:39:28] oka [18:39:30] *okay [18:43:14] (03PS1) 10Ladsgroup: dumps: More UI cleanup [puppet] - 10https://gerrit.wikimedia.org/r/335684 (https://phabricator.wikimedia.org/T155697) [18:44:14] apergos: ^ For when you have some time, I'll tackle progress report in another patch, since I want to test it throughfully and it's not easy [18:44:54] sure [18:45:02] thanks for doing the work! [18:48:28] thanks for reviewing [18:48:30] o/ [18:49:28] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:57:55] (03PS1) 10Jdlrobson: RelatedArticles enabled on French Wikipedia (mobile only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335685 (https://phabricator.wikimedia.org/T156362) [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T1900). Please do the needful. [19:00:27] No patches, all done [19:00:31] Yay, successful swat [19:01:55] (03PS1) 10Jdlrobson: Related pages is shown to 90% of mobile users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681) [19:04:23] (03PS1) 10Jdlrobson: Limit page images on beta cluster to images in the lead section [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335687 (https://phabricator.wikimedia.org/T152115) [19:07:18] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89995.75 seconds [19:08:37] (03PS1) 10Jdlrobson: Update apple touch icon for Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335689 (https://phabricator.wikimedia.org/T152538) [19:10:18] bd808: Is this the right place to talk about 2FA on labs? 
[19:10:51] jdlrobson: #wikimedia-labs is probably better [19:13:07] !log Gracefully restarting Jenkins [19:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:59] hmm [19:19:11] Jenkins restart in less than a minute [19:24:30] !log reset wikitech 2FA for jdlrobson [19:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:35] (03PS1) 10Ottomata: Fix typo on zookeeper package conditional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/335690 [19:28:52] (03CR) 10Ottomata: [V: 032 C: 032] Fix typo on zookeeper package conditional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/335690 (owner: 10Ottomata) [19:29:21] !log reset wikimedia 2FA for jdlrobson [19:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:26] (03PS1) 10Ottomata: Update cdh module with zookeeper package conditional typo fix [puppet] - 10https://gerrit.wikimedia.org/r/335691 [19:35:31] (03CR) 10Ottomata: [V: 032 C: 032] Update cdh module with zookeeper package conditional typo fix [puppet] - 10https://gerrit.wikimedia.org/r/335691 (owner: 10Ottomata) [19:36:35] ostriches: well, in case you want to do something in the swat time, there's a whole lot of users in aawiki with the 'inactive' flag which comes from nowhere... maybe you could do some sql magic and remove that flag for them so I can sleep in peace at night ;) [19:37:41] aawiki keeps you awake at night? [19:37:47] You should see a doctor about that ;-) [19:37:52] see T150538 [19:37:52] T150538: Cleanup 'inactive' usergroup on aawiki - https://phabricator.wikimedia.org/T150538 [19:38:06] it was a joke, I usually sleep well :) [19:40:04] I don't see the problem, really. 
[19:40:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [19:40:08] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 17206 MB (48% inode=91%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 148036 MB (10% inode=99%) [19:40:10] It's an unused user group [19:40:15] On a closed wiki [19:40:32] yep, it's just a little thing to do when you're bored and dunno what to do :) [19:40:42] not urgent, etc [19:40:44] ah chasemp! https://gerrit.wikimedia.org/r/#/c/316577/ [19:40:54] 06Operations, 10Traffic, 07HTTPS, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#2994540 (10Bawolff) [19:40:57] @chasemp: This updated also modules/cdh and modules/mariadb which does not seem to have been intended [19:42:16] (03CR) 10Ottomata: "Whoa, this commit also included a submodule bump up past a few versions. It looks like it was accidentally reverted to an older version i" [puppet] - 10https://gerrit.wikimedia.org/r/335691 (owner: 10Ottomata) [19:43:16] ottomata: danggg [19:43:26] you are right that is entirely my bad [19:44:26] i think it hasn't bitten us because we hadn't restarted nodemanagers in a while, but elukey just did...so we will watch out! 
[19:44:52] ottomata: I'm a meeting, totally good with a revert if you are goign that way or failing forward whatever is best for you guys [19:45:08] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 17206 MB (48% inode=91%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 90678 MB (6% inode=99%) [19:45:09] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [19:50:08] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 17207 MB (48% inode=91%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 54828 MB (3% inode=99%) [19:50:18] ACKNOWLEDGEMENT - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 17207 MB (48% inode=91%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 54828 MB (3% inode=99%): Jeff_Green known. fixing momentarily. [19:56:17] Hey, is our mail queue backed up? [19:57:59] volans: ping [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170202T2000). Please do the needful. 
[20:02:22] volans: will need to get back later, gotta fix something [20:04:00] (03PS1) 1020after4: all wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335692 [20:04:02] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335692 (owner: 1020after4) [20:04:35] !log deploying MediaWiki 1.29.0-wmf.10 to all wikis [20:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:35] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335692 (owner: 1020after4) [20:06:51] (03CR) 10jenkins-bot: all wikis to 1.29.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335692 (owner: 1020after4) [20:09:11] hmmm ... Notice: Undefined index: HASH in /srv/mediawiki/php-1.29.0-wmf.10/includes/cache/MessageCache.php on line 594 [20:09:21] Krinkle: sure, I'm cooking dinner ;) [20:10:48] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.10 [20:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:55] twentyafterfour: There's at least one bug filed about that [20:12:59] twentyafterfour: Yep, filed, Aaron and I will look at it now. I'd recommend staying on wmf.9 for now. [20:13:06] Just found out a few minutes ago [20:13:12] https://phabricator.wikimedia.org/T156996 [20:13:26] I don't know what it affects yet, but it's message cache. 
Better safe than sorry to avoid a mess that's hard to recover from [20:13:39] ok rolling back [20:14:21] !log rolling back to wmf.9 due to T156996 [20:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:26] T156996: Multiple PHP warnings in master [FormattedRCFeed] - https://phabricator.wikimedia.org/T156996 [20:15:07] (03PS1) 1020after4: group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335694 [20:15:09] (03CR) 1020after4: [C: 032] group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335694 (owner: 1020after4) [20:16:44] (03Merged) 10jenkins-bot: group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335694 (owner: 1020after4) [20:16:52] (03CR) 10jenkins-bot: group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335694 (owner: 1020after4) [20:17:19] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 wikis to 1.29.0-wmf.9 [20:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:37] Syntax Error: Top-level pages object is wrong type (null) [20:18:13] Model contains an error for 409001483: TextDeleted: Text deleted (datasource.revision.parent.text) [20:19:41] twentyafterfour: Something to do with pdfs. [20:19:56] ahh, more of that eh [20:19:59] Either something's busted or someone's been uploading some weird PDFs we can't handle right [20:20:29] All of those "Syntax Error" thing we've been getting are related to that [20:21:29] uhm, "model contains error" is ores [20:21:48] Oh, dunno about that one :) [20:21:48] /srv/mediawiki/php-1.29.0-wmf.10/extensions/ORES/includes/Cache.php:130 [20:21:51] But the first one is [20:21:54] PDFs [20:21:58] I didn't mean to paste the first one ;) [20:22:51] o/ [20:23:01] I hope my betalabs woes aren't causing other people trouble. 
[20:23:10] We have a repo that seems to be too big for scap/diffusion [20:23:31] no, but there are ores warnings in production [20:24:04] Like, something new? [20:24:32] We have a few warnings that are very difficult to deal with on our backlog. [20:24:53] see above [20:25:07] Model contains an error for 409001483: TextDeleted: Text deleted (datasource.revision.parent.text) [20:25:18] that's in wmf.10 [20:25:21] which was rolled back [20:25:45] halfak: is that new or dupe? if it's new I'll create a task with a stack trace [20:26:04] twentyafterfour, that's normal and expected. [20:26:06] (03PS7) 10Yuvipanda: tools: Use docker engine profile in tools builder [puppet] - 10https://gerrit.wikimedia.org/r/335299 [20:26:12] We can probably make that quieter. [20:28:43] (03PS8) 10Yuvipanda: tools: Use docker engine profile in tools builder [puppet] - 10https://gerrit.wikimedia.org/r/335299 [20:29:00] halfak: yeah if it's expected it shouldn't be logging an error [20:29:10] that's logspam of the worst kind [20:29:25] log at info level? [20:29:55] Hmm... yeah... I guess we could do that at the info level. Let me try to see if I can set up a task that targets that kind of warning. [20:35:53] !log carbon - disabling puppet (to stop it from re-adding second IPv6 address causing issues with ferm rules) [20:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:03] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2994611 (10yuvipanda) I just manually imported docker-engine version tools is using into aptly, and it all seems to be ok. [20:41:50] (03PS9) 10Yuvipanda: tools: Use docker engine profile in tools builder [puppet] - 10https://gerrit.wikimedia.org/r/335299 [20:46:38] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:49:56] Hi! 
I'm Ranko Nikolić, administrator and bureaucrat on Serbian Wikipedia. Two months ago, I turned on 2FA. Few days ago I lost access to my accounts on Wikipedia, because my authentication device is broken. Does someone can help me? [20:51:36] Ranko: try this first: " [20:51:36] Go to Special:OATH or preferences on the project you enabled 2FA on. If you are no longer in groups that are permitted to enroll, you can still disable via Special:OATH. [20:52:01] https://meta.wikimedia.org/wiki/Help:Two-factor_authentication#Disabling_two-factor_authentication [20:52:51] next thing would be "do you have those one-time scratch tokens" [20:56:50] I can not log in, so you do not have access to Special:OATH or preferences. I don't have them and I don't remember that these tokens offered when you enable this option. [20:58:47] Ranko: ok, hold on, let me find out what else can be done [20:59:25] (03PS2) 10Andrew Bogott: vmbuilder and boostrap_vz: Include many more packages in base image [puppet] - 10https://gerrit.wikimedia.org/r/335672 [20:59:27] (03PS3) 10Andrew Bogott: bootstrap_vz and vmbuilder: Do apt-get update before first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/335661 [20:59:29] (03PS1) 10Andrew Bogott: vmbuilder: Hack around a bug with sudoers [puppet] - 10https://gerrit.wikimedia.org/r/335695 [21:01:04] (03CR) 10Andrew Bogott: "Tim, since I see that you've been fighting with this bug as well, here is my stupid, temporary fix FYI." 
[puppet] - 10https://gerrit.wikimedia.org/r/335695 (owner: 10Andrew Bogott) [21:02:20] (03CR) 10Andrew Bogott: [C: 032] vmbuilder: Hack around a bug with sudoers [puppet] - 10https://gerrit.wikimedia.org/r/335695 (owner: 10Andrew Bogott) [21:04:32] (03PS3) 10Andrew Bogott: New labs instances: Fail more obviously if we have DNS issues [puppet] - 10https://gerrit.wikimedia.org/r/334365 [21:07:38] (03CR) 10Andrew Bogott: [C: 032] New labs instances: Fail more obviously if we have DNS issues [puppet] - 10https://gerrit.wikimedia.org/r/334365 (owner: 10Andrew Bogott) [21:10:42] (03PS3) 10Andrew Bogott: vmbuilder and boostrap_vz: Include many more packages in base image [puppet] - 10https://gerrit.wikimedia.org/r/335672 [21:12:41] (03CR) 10Andrew Bogott: [C: 032] vmbuilder and boostrap_vz: Include many more packages in base image [puppet] - 10https://gerrit.wikimedia.org/r/335672 (owner: 10Andrew Bogott) [21:12:57] (03CR) 10Andrew Bogott: [C: 032] bootstrap_vz and vmbuilder: Do apt-get update before first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/335661 (owner: 10Andrew Bogott) [21:13:09] (03PS4) 10Andrew Bogott: bootstrap_vz and vmbuilder: Do apt-get update before first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/335661 [21:14:38] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [21:16:24] mutante: we can strip 2FA, policy and process is on wikitech [21:16:57] technically, it's easy [21:19:26] p858snake: thanks, but that isn't just for Wikitech login itself? 
[21:19:42] this is about a wikipedia [21:20:14] https://wikitech.wikimedia.org/wiki/Password_reset/Confirming_identities [21:20:30] ah, right, i just mentioned the "committed identity" thing [21:26:25] (PS1) RobH: fixing my dns typo [dns] - https://gerrit.wikimedia.org/r/335700 [21:27:00] (CR) RobH: [C: 2] fixing my dns typo [dns] - https://gerrit.wikimedia.org/r/335700 (owner: RobH) [21:46:52] (PS1) Aaron Schulz: Enable wgEnableWANCacheReaper in beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/335704 [21:48:31] (Draft1) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [21:48:34] (PS2) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [21:48:43] (CR) Hashar: "recheck" [software/tools-manifest] - https://gerrit.wikimedia.org/r/293150 (owner: Yuvipanda) [21:49:42] (CR) jerkins-bot: [V: -1] Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 (owner: Paladox) [21:50:27] !log reimaging relforge1002.eqiad.wmnet [21:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:31] (PS3) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [21:52:29] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2814411 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['relforge1002.eqiad.wmnet'] ``... [21:53:03] (PS4) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [21:54:05] Krinkle: what's the status of T156996? 
[21:54:06] T156996: Multiple PHP warnings in master [FormattedRCFeed] - https://phabricator.wikimedia.org/T156996 [21:54:16] twentyafterfour: fixed in master [21:54:22] (CR) jerkins-bot: [V: -1] Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 (owner: Paladox) [21:54:47] thanks, cherry-picking [21:54:53] twentyafterfour: but that particular one isn't even in prod [21:55:00] won't cherry-pick [21:55:05] ? [21:55:07] oh [21:55:18] "\nWarning: __construct() expects exactly 1 parameter, 0 given in /srv/mediawiki/tags/2017-02-02_07:24:01/includes/rcfeed/FormattedRCFeed.php on line 37 [21:55:18] " [21:55:22] is only on master as of 1 day ago [21:55:27] and now fixed [21:55:32] the issue is the HASH notice [21:55:35] which got its own ticket [21:55:38] (PS5) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [21:55:40] oh [21:55:44] * twentyafterfour missed that [21:56:22] twentyafterfour: https://phabricator.wikimedia.org/T157033 [21:58:09] (CR) Hashar: "recheck" [software/tools-manifest] - https://gerrit.wikimedia.org/r/293150 (owner: Yuvipanda) [21:58:12] (PS1) Jforrester: admin: Re-apply jforrester ssh key [puppet] - https://gerrit.wikimedia.org/r/335705 [21:58:45] (CR) jerkins-bot: [V: -1] admin: Re-apply jforrester ssh key [puppet] - https://gerrit.wikimedia.org/r/335705 (owner: Jforrester) [22:03:25] (PS6) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [22:04:35] (PS2) Jforrester: admin: Re-apply jforrester ssh key [puppet] - https://gerrit.wikimedia.org/r/335705 [22:07:28] (PS7) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [22:08:11] twentyafterfour, Krinkle : https://gerrit.wikimedia.org/r/#/c/335707/1 [22:10:14] AaronSchulz: awesome, thanks [22:12:07] 
cherry picked https://gerrit.wikimedia.org/r/#/c/335707/1 [22:12:26] (PS8) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [22:12:43] but ... needs code review prior to deploying that with the train [22:13:59] Anyone know if varnish or hhvm imposes a limit on GET url length? Trying to track down where a limit is imposed… seems not to be MW itself [22:15:07] (PS3) GuerellaNuke23: admin: Re-apply jforrester ssh key [puppet] - https://gerrit.wikimedia.org/r/335705 (owner: Jforrester) [22:18:30] Operations, ops-eqiad: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#2994871 (fgiunchedi) Note that the same behaviour is now showing up on `sdb` too. I've asked @robh to bump quantity to order in T157065, assuming worst case all SSDs will eventually show the same beha... [22:18:58] coreyfloyd: doesn't the HTTP spec impose a limit? [22:19:05] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2994879 (RobH) [22:19:23] twentyafterfour: i thought so but looked it up and it said no [22:20:09] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:25:21] (CR) RobH: [C: 2] admin: Re-apply jforrester ssh key [puppet] - https://gerrit.wikimedia.org/r/335705 (owner: Jforrester) [22:26:19] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2994888 (RobH) [22:32:28] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2994909 (RobH) a:RobH>demon Assigning this task to Chad. Once he is aware that this system is all theirs, he can resolve. 
[22:32:35] Operations, Gerrit, Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2994911 (RobH) [22:34:38] (PS4) RobH: admin: Re-apply jforrester ssh key [puppet] - https://gerrit.wikimedia.org/r/335705 (owner: Jforrester) [22:35:26] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:36:16] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:42:48] (Draft1) Paladox: Up max_execution to 20 from 10 in phabricator/php.ini.erb [puppet] - https://gerrit.wikimedia.org/r/335714 [22:42:50] (PS2) Paladox: Up max_execution to 20 from 10 in phabricator/php.ini.erb [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) [22:43:00] (PS3) Paladox: Up max_execution to 20 from 10 in phabricator/php.ini.erb [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) [22:43:05] twentyafterfour mutante ^^ :) [22:43:40] (PS9) Paladox: Allow us to change elasticsearch configs in phabricator [puppet] - https://gerrit.wikimedia.org/r/335703 [22:45:56] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#2994983 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['relforge1002.eqiad.wmnet'] ``` and were **ALL** successful. 
[22:46:14] (CR) Bmansurov: [C: 1] Update apple touch icon for Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/335689 (https://phabricator.wikimedia.org/T152538) (owner: Jdlrobson) [22:48:17] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [22:49:34] (Draft1) Paladox: Up post_max_size to 50M in phabricator's php.ini file [puppet] - https://gerrit.wikimedia.org/r/335717 [22:49:38] (PS2) Paladox: Up post_max_size to 50M in phabricator's php.ini file [puppet] - https://gerrit.wikimedia.org/r/335717 [22:50:35] twentyafterfour mutante ^^ :) [22:51:21] paladox: we don't allow large file uploads, no need to change the post size [22:51:46] (CR) Bmansurov: [C: 1] Related pages is shown to 90% of mobile users [mediawiki-config] - https://gerrit.wikimedia.org/r/335686 (https://phabricator.wikimedia.org/T154681) (owner: Jdlrobson) [22:51:48] Oh, we will need to customise it for labs :). [22:52:32] (CR) Paladox: [C: -1] "We won't need this higher for prod. will have to update this later to support customising it." [puppet] - https://gerrit.wikimedia.org/r/335717 (owner: Paladox) [22:52:46] twentyafterfour can we do https://gerrit.wikimedia.org/r/335714 ? [22:52:53] should we up it further to 30? [22:53:02] no [22:53:47] twentyafterfour no for 30 or no for https://gerrit.wikimedia.org/r/335714 ? 
[23:03:26] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:05:16] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [23:07:04] (CR) Paladox: [C: 1] "It passes i think https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/5322/console" [puppet] - https://gerrit.wikimedia.org/r/335703 (owner: Paladox) [23:08:02] Operations, Gerrit, Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2995039 (Dzahn) @demon So if we'd just put the role gerrit::server on this one as well, let's figure out which things need to be stopped or skipped when not on the "active" serve... [23:09:08] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5322/" [puppet] - https://gerrit.wikimedia.org/r/335703 (owner: Paladox) [23:11:39] !log Gerrit: we'll be flushing session caches momentarily, sorry for the inconvenience [23:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:24] a user banned for spam/vandalizing on phab has been interfering with gerrit patches [23:12:28] https://gerrit.wikimedia.org/r/#/c/332384/ [23:12:50] OH-: yes, on it [23:12:59] thanks :) [23:19:23] MaxSem: thanks for the review on https://gerrit.wikimedia.org/r/#/c/335707/ [23:19:32] :) [23:19:45] jouncebot: now [23:19:45] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [23:19:48] jouncebot: next [23:19:48] In 0 hour(s) and 40 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170203T0000) [23:20:02] guess swat can wait [23:20:03] * twentyafterfour is going to push the train forward now [23:21:08] doit [23:22:15] (PS1) 20after4: all wikis to 1.29.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/335729 [23:22:17] (CR) 20after4: [C: 2] all wikis to 1.29.0-wmf.10 
[mediawiki-config] - https://gerrit.wikimedia.org/r/335729 (owner: 20after4) [23:23:38] (Merged) jenkins-bot: all wikis to 1.29.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/335729 (owner: 20after4) [23:23:50] (CR) jenkins-bot: all wikis to 1.29.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/335729 (owner: 20after4) [23:25:29] (CR) 20after4: "20 seems a bit long to me. 10 is already a long time for a request to run." [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: Paladox) [23:25:50] (CR) Paladox: "What about 15?" [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: Paladox) [23:26:18] (PS4) Paladox: Up max_execution to 15 from 10 in phabricator/php.ini.erb [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) [23:26:21] (CR) 20after4: "15 seems more reasonable, as far as I can remember it used to be 5 seconds. 
:-/" [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: Paladox) [23:26:45] (CR) Paladox: "> 15 seems more reasonable, as far as I can remember it used to be 5" [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: Paladox) [23:27:03] It's not any better [23:27:12] w/w [23:33:59] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.10/includes/cache/MessageCache.php: deploy I5b84b1ae4a9c7b710ee452c61d7d9d6076ec9e6a refs T156996 (duration: 00m 45s) [23:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:02] T156996: Multiple PHP warnings in master [FormattedRCFeed] - https://phabricator.wikimedia.org/T156996 [23:38:29] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.10 [23:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:45] (PS1) Dzahn: lower TTL of apt.wikimedia.org to 300 [dns] - https://gerrit.wikimedia.org/r/335730 [23:39:23] (CR) Dzahn: [C: 2] lower TTL of apt.wikimedia.org to 300 [dns] - https://gerrit.wikimedia.org/r/335730 (owner: Dzahn) [23:40:33] (Abandoned) Dzahn: aptrepo: disable autoconfigured EUI64 addresses [puppet] - https://gerrit.wikimedia.org/r/335594 (https://phabricator.wikimedia.org/T84380) (owner: Dzahn) [23:42:22] twentyafterfour could you +1 or -1 https://gerrit.wikimedia.org/r/#/c/335703/ please? [23:43:13] (CR) Dzahn: "A query that takes over 10 seconds is probably too long and i would be surprised if it happens to be between 10 and 15 seconds and not mor" [puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: Paladox) [23:43:57] (CR) Paladox: "@Dzahn sometimes the query can be run in the 10 secs but sometimes it goes outside the 10 secs range and then it would fail." 
[puppet] - https://gerrit.wikimedia.org/r/335714 (https://phabricator.wikimedia.org/T125357) (owner: Paladox) [23:44:36] (CR) Dzahn: "@Juniosys Pyoungmeister is not active anymore, fyi. makes more sense to add the current DBA people. let me do that." [puppet] - https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys) [23:46:54] (CR) Dzahn: "i think 10M is plenty for file attachments on tickets. do we really have cases where larger files are needed? Also why does it say "not " [puppet] - https://gerrit.wikimedia.org/r/335717 (owner: Paladox) [23:47:35] (CR) Paladox: [C: -1] "Is this used by http (cloning through http)?" [puppet] - https://gerrit.wikimedia.org/r/335717 (owner: Paladox) [23:48:00] Notice: Undefined index: footer-site-heading-html in /srv/mediawiki/php-1.29.0-wmf.10/extensions/MobileFrontend/includes/skins/MinervaTemplate.php on line 98 [23:48:07] Notice: Undefined index: mobile-license in /srv/mediawiki/php-1.29.0-wmf.10/extensions/MobileFrontend/includes/skins/MinervaTemplate.php on line 99 [23:49:49] (PS6) Dzahn: ores/otrs/package_builder: Linting changes [puppet] - https://gerrit.wikimedia.org/r/334300 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys) [23:50:49] twentyafterfour https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/blob/master/includes/skins/MinervaTemplate.php#L98 [23:51:02] (PS1) Kaldari: Setting wgPageAssessmentsSubprojects to true on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/335732 [23:51:07] !log 1.29.0-wmf.10 appears to be stable. Train deployment complete. 
[23:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:26] (CR) 20after4: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/335703 (owner: Paladox) [23:53:39] twentyafterfour ^^ thanks [23:54:38] (CR) Dzahn: [C: 2] "no diff for the OTRS part http://puppet-compiler.wmflabs.org/5323/" [puppet] - https://gerrit.wikimedia.org/r/334300 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)