[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170127T0000). Please do the needful.
[00:00:30] no.
[00:00:42] No patches, good. No swat today
[00:00:54] ostriches: train done as well?
[00:00:59] Operations, ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2974746 (RobH) @dzahn seems to have pointed out the relevant T84160. So these should be fully decommissioned including wipe (since they may not have been properly wiped when shipped from Tampa).
[00:01:11] mobrovac: train rolled back
[00:01:17] oh ok
[00:01:19] thnx
[00:01:27] https://phabricator.wikimedia.org/T156364
[00:02:40] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2974748 (GWicke)
[00:03:09] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[00:04:38] ostriches: there is a throttle rule to fix for https://phabricator.wikimedia.org/T156278
[00:05:06] ostriches: better to do that Friday or now?
[00:05:13] (it's for Saturday)
[00:05:19] Ask twentyafterfour, he's the one who's been doing deploys today
[00:05:23] I'm dipping out a few mins early today
[00:05:25] (CR) Mobrovac: [C: 1] "Yup, good to go."
[puppet] - https://gerrit.wikimedia.org/r/334452 (https://phabricator.wikimedia.org/T156177) (owner: Volans)
[00:05:26] * ostriches waves
[00:05:52] Dereckson: we can deploy that if you'd like
[00:06:16] ok
[00:06:16] (PS2) Volans: Testreduce: allow to decide the state of the services [puppet] - https://gerrit.wikimedia.org/r/334452 (https://phabricator.wikimedia.org/T156177)
[00:07:26] (CR) Volans: [C: 2] Testreduce: allow to decide the state of the services [puppet] - https://gerrit.wikimedia.org/r/334452 (https://phabricator.wikimedia.org/T156177) (owner: Volans)
[00:07:29] Dereckson: which patch?
[00:07:48] https://gerrit.wikimedia.org/r/#/c/334156/
[00:07:50] ?
[00:09:52] An IP fix for this one, I'm checking something and I'll prepare it in a few minutes
[00:10:04] ok
[00:12:14] !log re-enabled puppet (with a temporary fix to keep parsoid-vd and parsoid-vd-client stopped) on ruthenium T156177
[00:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:18] T156177: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177
[00:13:09] PROBLEM - Druid broker on druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server broker
[00:13:19] PROBLEM - Check systemd state on druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:14:40] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2974787 (Paladox) @Marostegui and @jcrespo and @demon and @dzahn i found a new lib we can use it's https://mariadb.com/kb/en/ma...
[00:15:16] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2974788 (Paladox) Oh, wait, mysql connector supports this too.
[00:16:47] (CR) Krinkle: "Looks like some exim4 config might still reference this. https://github.com/search?utf8=%E2%9C%93&q=org%3Awikimedia+wiki-mail&type=Code&re" [dns] - https://gerrit.wikimedia.org/r/143762 (owner: Faidon Liambotis)
[00:17:56] ugh, wmf.8 dateformatter has problems almost as bad as wmf.9
[00:18:04] Notice: Undefined property: DateFormatter::$keys in /srv/mediawiki/php-1.29.0-wmf.8/includes/parser/DateFormatter.php on line 212
[00:18:11] Notice: Undefined property: DateFormatter::$targets in /srv/mediawiki/php-1.29.0-wmf.8/includes/parser/DateFormatter.php on line 229
[00:18:52] (CR) Dzahn: [C: 1] Update mwdeploy group sudo rights for jessie [puppet] - https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: EBernhardson)
[00:19:09] RECOVERY - Druid broker on druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server broker
[00:19:19] RECOVERY - Check systemd state on druid1001 is OK: OK - running: The system is fully operational
[00:19:58] Operations, Parsing-Team, Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#2974794 (Volans) @ssastry @mobrovac Puppet re-enabled with a temporary patch to allow this. Let us know once the issue is fixed to revert the patch and make Puppe...
[00:26:04] Operations: re-create install1001 as physical server ? - https://phabricator.wikimedia.org/T156440#2974839 (Dzahn)
[00:26:24] (PS1) Dereckson: Fix throttle rule for Her Girl Friday + Lenny Unconference [mediawiki-config] - https://gerrit.wikimedia.org/r/334456 (https://phabricator.wikimedia.org/T156278)
[00:26:40] Operations: re-create install1001 as physical server ?
- https://phabricator.wikimedia.org/T156440#2974819 (Dzahn) p:Triage>Normal
[00:26:58] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/334456 (https://phabricator.wikimedia.org/T156278) (owner: Dereckson)
[00:27:14] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2974849 (Paladox) Nope doesn't work on mysql's version of the connector. Works on mariadb's one though.
[00:27:18] Operations: re-create install1001 as physical server ? - https://phabricator.wikimedia.org/T156440#2974819 (Dzahn)
[00:27:20] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2974850 (Dzahn)
[00:28:34] (Merged) jenkins-bot: Fix throttle rule for Her Girl Friday + Lenny Unconference [mediawiki-config] - https://gerrit.wikimedia.org/r/334456 (https://phabricator.wikimedia.org/T156278) (owner: Dereckson)
[00:28:43] (CR) jenkins-bot: Fix throttle rule for Her Girl Friday + Lenny Unconference [mediawiki-config] - https://gerrit.wikimedia.org/r/334456 (https://phabricator.wikimedia.org/T156278) (owner: Dereckson)
[00:29:24] Live on mwdebug1002.eqiad.wmnet
[00:30:34] Operations, ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2974875 (Papaul) @RobH can those servers be unracked and added to the decommission servers list?
[00:30:44] (PS6) Madhuvishy: [WIP] toolschecker: Split each check into a separate uwsgi application [puppet] - https://gerrit.wikimedia.org/r/334433
[00:31:37] Operations, ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2974878 (RobH) Since they aren't in use by the DBA team, yep!
* unplug the network cable for production network * wipe disks * unrack for decom
[00:31:55] (PS7) Madhuvishy: toolschecker: Split each check into a separate uwsgi application [puppet] - https://gerrit.wikimedia.org/r/334433
[00:32:14] !log dereckson@tin Synchronized wmf-config/throttle.php: Fix throttle rule for Her Girl Friday + Lenny Unconference (T156278) (duration: 00m 53s)
[00:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:19] T156278: Her Girl Friday + Lenny Unconference / Editathon in NYC, 2017-01-28 - throttle rules - https://phabricator.wikimedia.org/T156278
[00:32:26] (CR) Madhuvishy: [V: 2 C: 2] toolschecker: Split each check into a separate uwsgi application [puppet] - https://gerrit.wikimedia.org/r/334433 (owner: Madhuvishy)
[00:32:45] (PS1) Krinkle: multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459
[00:32:47] (PS1) Krinkle: (no-op) Move comment about flow.dblist in settings to the dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334460
[00:33:45] Added to deployments table too.
[00:35:56] (PS2) Krinkle: multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459
[00:35:58] (PS2) Krinkle: (no-op) Move comment about flow.dblist in settings to the dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334460
[00:47:49] (PS1) Madhuvishy: toolschecker: Fix service dependencies [puppet] - https://gerrit.wikimedia.org/r/334461
[00:48:31] Operations, Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2975008 (RobH)
[00:48:34] Operations, Traffic: convert stream.wikimedia.org from GS to LE certificate - https://phabricator.wikimedia.org/T155524#2975007 (RobH) Open>declined
[00:48:35] (PS1) Krinkle: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - https://gerrit.wikimedia.org/r/334462
[00:48:47] Operations, Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (RobH)
[00:48:57] !log carbon - moved the 1.5TB /srv/"mirrors.off", which used to be mirrors but is now on sodium, into / to that /srv/ can be synced without this
[00:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:01] Operations, Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (RobH) This likely shouldn't close yet, and we should add in mx/mail systems.
[00:49:27] (CR) jerkins-bot: [V: -1] Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - https://gerrit.wikimedia.org/r/334462 (owner: Krinkle)
[00:50:56] Operations: re-create install1001 as physical server ? - https://phabricator.wikimedia.org/T156440#2975015 (Dzahn) ...or we could leave it a VM but add a second virtual disk (easy?) to just give it enough space and leave the rest as is. that would also work.
[00:53:07] (PS1) Krinkle: Remove unused top6-wikipedia.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334463
[00:54:14] (PS2) Krinkle: Remove unused top6-wikipedia.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334463
[00:54:33] (PS2) Dzahn: aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241
[00:54:39] (PS2) Krinkle: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - https://gerrit.wikimedia.org/r/334462
[00:54:52] (CR) jerkins-bot: [V: -1] aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241 (owner: Dzahn)
[00:55:12] (PS3) Dzahn: aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241
[00:57:12] (CR) jerkins-bot: [V: -1] aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241 (owner: Dzahn)
[00:57:44] (CR) Krinkle: "How is s4 related? Does this feature conflict with some undocumented db indexes on the s4 server?" [mediawiki-config] - https://gerrit.wikimedia.org/r/309087 (https://phabricator.wikimedia.org/T143345) (owner: Jdlrobson)
[01:01:10] (PS1) Dzahn: aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757)
[01:01:39] (PS2) Madhuvishy: toolschecker: Fix service dependencies [puppet] - https://gerrit.wikimedia.org/r/334461
[01:01:41] Operations, Parsing-Team, Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#2975028 (mobrovac) Thnx @Volans for taking care of this and keeping tabs on it :) @ssastry, please let us know once you rebuild the VD repos and redeploy them on...
[01:03:05] (CR) jerkins-bot: [V: -1] aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn)
[01:03:23] (PS4) Dzahn: aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241
[01:04:38] (PS2) Dzahn: aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757)
[01:06:08] (CR) jerkins-bot: [V: -1] aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241 (owner: Dzahn)
[01:06:12] (CR) 20after4: [C: 1] multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459 (owner: Krinkle)
[01:07:44] (CR) Jdlrobson: "We're likely to be using this again soon as it seems most of our rollouts will be excluding the top6 wikis." [mediawiki-config] - https://gerrit.wikimedia.org/r/334463 (owner: Krinkle)
[01:10:05] (PS1) Dzahn: aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - https://gerrit.wikimedia.org/r/334467
[01:10:58] (CR) jerkins-bot: [V: -1] aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - https://gerrit.wikimedia.org/r/334467 (owner: Dzahn)
[01:11:15] (PS5) Dzahn: aptrepo: rsync cron (WIP) [puppet] - https://gerrit.wikimedia.org/r/334241
[01:14:11] (CR) Madhuvishy: [C: 2] toolschecker: Fix service dependencies [puppet] - https://gerrit.wikimedia.org/r/334461 (owner: Madhuvishy)
[01:14:33] (PS2) Dzahn: aptrepo/rsync: flip the "if"-logic around instead of a negation [puppet] - https://gerrit.wikimedia.org/r/334467
[01:15:41] (CR) Krinkle: "I'd rather recommend to convert it to regular dblist that documents where it is enabled instead of the current double negative approach wh" [mediawiki-config] - https://gerrit.wikimedia.org/r/334463 (owner: Krinkle)
[01:20:19] !log deploying hotfix
for phabricator refs T154479
[01:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:23] T154479: Adjust custom panel code to upstream changes (ProfilePanels → ProfileMenuItems) - https://phabricator.wikimedia.org/T154479
[01:30:32] (PS1) Madhuvishy: toolchecker: Fix directory dependencies [puppet] - https://gerrit.wikimedia.org/r/334473
[01:31:53] (CR) Madhuvishy: [C: 2] toolchecker: Fix directory dependencies [puppet] - https://gerrit.wikimedia.org/r/334473 (owner: Madhuvishy)
[01:53:22] (PS1) Madhuvishy: toolschecker: Fix upstart conf files path [puppet] - https://gerrit.wikimedia.org/r/334482
[01:54:39] (CR) Madhuvishy: [C: 2] toolschecker: Fix upstart conf files path [puppet] - https://gerrit.wikimedia.org/r/334482 (owner: Madhuvishy)
[02:02:32] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 335 bytes in 0.007 second response time
[02:04:40] (PS1) Madhuvishy: toolchecker: Update path for k8s etcd check [puppet] - https://gerrit.wikimedia.org/r/334486
[02:05:04] ^ fixing
[02:06:40] (CR) Madhuvishy: [C: 2] toolchecker: Update path for k8s etcd check [puppet] - https://gerrit.wikimedia.org/r/334486 (owner: Madhuvishy)
[02:08:32] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.697 second response time
[02:08:53] yeah I guessed that and didn't even ping you ;)
[02:09:37] my downtime expired the exact moment i was fixing it :)
[02:13:54] of course!
:)
[02:21:04] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.8) (duration: 07m 57s)
[02:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.835 second response time
[02:26:11] (PS1) Papaul: DHCP: Add DHCP entries for mc2019-mc2036 Bug:T155755 [puppet] - https://gerrit.wikimedia.org/r/334492
[02:28:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.004 second response time
[02:29:22] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:32:27] (CR) Volans: "@godog: I'm not convinced this is the right solution, if the RAM usage is close to the limit the carbon-cache will start failing but will " [puppet] - https://gerrit.wikimedia.org/r/334364 (https://phabricator.wikimedia.org/T155876) (owner: Filippo Giunchedi)
[02:35:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.443 second response time
[02:36:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.855 second response time
[02:53:06] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.9) (duration: 14m 14s)
[02:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:18] (CR) Volans: [C: -1] "If I understand it correctly this will just log a more meaningful message, but is not fixing the issue that once the connection with etcd " (2 comments) [debs/pybal] (1.13) -
https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) (owner: Ema)
[02:57:22] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[02:58:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 27 02:58:52 UTC 2017 (duration 5m 46s)
[02:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:12] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1820.028858 Seconds
[03:06:12] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 42.799822 Seconds
[03:24:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.723 second response time
[03:26:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.121 second response time
[03:39:15] (PS1) Madhuvishy: toolchecker: Add script to manage toolchecker* services [puppet] - https://gerrit.wikimedia.org/r/334495
[03:41:42] (CR) Madhuvishy: [C: 2] toolchecker: Add script to manage toolchecker* services [puppet] - https://gerrit.wikimedia.org/r/334495 (owner: Madhuvishy)
[04:18:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.544 second response time
[04:19:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.959 second response time
[04:20:22] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[04:20:42] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:48:42] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[04:49:02] PROBLEM - puppet last run on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:49:22] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[04:49:52] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures
[04:56:39] Operations, Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2975266 (Dzahn) @hashar which CI systems had SSL certs again please
[05:35:02] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:41:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.473 second response time
[05:42:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.274 second response time
[05:45:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.847 second response time
[05:47:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.210 second response time
[06:03:02] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 18
seconds ago with 0 failures
[06:24:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.371 second response time
[06:25:08] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2975351 (Gilles)
[06:25:20] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (Gilles)
[06:25:35] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (Gilles)
[06:26:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.484 second response time
[06:33:52] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:35:02] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:38:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.566 second response time
[06:39:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.785 second response time
[06:44:02] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[07:02:52] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:04:12] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:08:54] Operations, ops-codfw: Codfw: Missing mgmt dns for db2025-db2027 - https://phabricator.wikimedia.org/T156342#2975372 (Marostegui) >>! In T156342#2974878, @RobH wrote: > Since they aren't in use by the DBA team, yep! > > * check and if needed remove production dns entries > * note the switch ports on thi...
[07:16:45] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975376 (Marostegui) Wow, so awesome to wake up and see so much progress has been done on this ti...
[07:23:55] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2975381 (Marostegui) >>! In T145885#2974787, @Paladox wrote: > @Marostegui and @jcrespo and @demon and @dzahn i found a new lib...
[07:32:12] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:10:58] Operations, DBA, Gerrit, Patch-For-Review, Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2975422 (Paladox) >>! In T145885#2975381, @Marostegui wrote: >>>! In T145885#2974787, @Paladox wrote: >> @Marostegui and @jcres...
[08:13:13] !log uploaded openssl 1.1.0d packages for jessie-wikimedia to carbon
[08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:58] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975428 (Paladox) >>! In T156373#2975376, @Marostegui wrote: > Wow, so awesome to wake up and see...
[08:22:04] (PS1) Elukey: Add a uptime threshold check to check_leaked_hhvm_threads [puppet] - https://gerrit.wikimedia.org/r/334505
[08:23:12] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975448 (Marostegui) MariaDB reopened the bug to fix it in 10.0 \o/ https://jira.mariadb.org/bro...
[08:24:58] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975451 (Paladox) @Marostegui I doint think they know about the bug @Jcrespo filled.
[08:28:23] (PS1) Muehlenhoff: Update to 1.0.2k [debs/openssl] - https://gerrit.wikimedia.org/r/334506
[08:29:50] (CR) Giuseppe Lavagetto: [C: 1] "Not pretty but it works in removing the alarms we determined to be false positives. And the general alert is still valuable."
[puppet] - https://gerrit.wikimedia.org/r/334505 (owner: Elukey)
[08:30:05] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975471 (Marostegui) >>! In T156373#2975451, @Paladox wrote: > @Marostegui I doint think they kno...
[08:31:38] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975472 (Marostegui) Looks like they fixed it and will be shipped with the 10.0.30: ``` Resoluti...
[08:35:19] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975487 (Paladox) @Marostegui this is the fix for 10.0 https://github.com/MariaDB/server/commit/7...
[08:36:00] (PS2) Elukey: Add a uptime threshold check to check_leaked_hhvm_threads [puppet] - https://gerrit.wikimedia.org/r/334505
[08:41:36] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975516 (Marostegui) Yes, let's wait for Jaime to build the package with that to see how it works...
[08:47:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.722 second response time
[08:48:06] (PS3) Elukey: Add a uptime threshold check to check_leaked_hhvm_threads [puppet] - https://gerrit.wikimedia.org/r/334505
[08:48:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.266 second response time
[08:50:18] (CR) Elukey: [C: 2] Add a uptime threshold check to check_leaked_hhvm_threads [puppet] - https://gerrit.wikimedia.org/r/334505 (owner: Elukey)
[08:51:51] the patch *should* be an attempt to fix the false positives of "Check HHVM threads for leakage" for the videoscalers
[08:52:03] not really pretty I know
[08:57:07] (CR) Filippo Giunchedi: "> @godog: I'm not convinced this is the right solution, if the RAM" [puppet] - https://gerrit.wikimedia.org/r/334364 (https://phabricator.wikimedia.org/T155876) (owner: Filippo Giunchedi)
[08:57:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.037 second response time
[08:59:08] godog: as you want, I don't have a strong opinion on that, but would be nice to know something is failing consistently and that might hide that
[08:59:46] (PS14) Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[09:00:01] (PS3) Juniorsys: Linting fixes (Multiple modules) [puppet] - https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645)
[09:00:13] (PS3) Juniorsys: deployment: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334278
(https://phabricator.wikimedia.org/T93645)
[09:00:19] (PS3) Juniorsys: dnsrecursor: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334279 (https://phabricator.wikimedia.org/T93645)
[09:00:29] (PS3) Juniorsys: elasticsearch: Lint fixes [puppet] - https://gerrit.wikimedia.org/r/334281 (https://phabricator.wikimedia.org/T93645)
[09:00:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.611 second response time
[09:00:41] (PS3) Juniorsys: etcd: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334282 (https://phabricator.wikimedia.org/T93645)
[09:00:51] (PS3) Juniorsys: eventlogging/eventstreams: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645)
[09:00:58] (PS3) Juniorsys: extdist: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334284
[09:01:05] (PS3) Juniorsys: jupterhub/keyholder: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334287 (https://phabricator.wikimedia.org/T93645)
[09:01:08] (PS3) Juniorsys: k8s: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334289 (https://phabricator.wikimedia.org/T93645)
[09:01:16] (PS3) Juniorsys: labs modules linting changes [puppet] - https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645)
[09:01:21] (PS3) Juniorsys: ldap: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334291 (https://phabricator.wikimedia.org/T93645)
[09:01:26] (PS3) Juniorsys: librenms/locales/logstash/lshell linting changes [puppet] - https://gerrit.wikimedia.org/r/334293 (https://phabricator.wikimedia.org/T93645)
[09:01:32] (PS3) Juniorsys: lvm/lvs: Linting changes [puppet] - https://gerrit.wikimedia.org/r/334294 (https://phabricator.wikimedia.org/T93645)
[09:01:38] (PS3) Juniorsys: Linting changes (multiple) [puppet] - https://gerrit.wikimedia.org/r/334295
(https://phabricator.wikimedia.org/T93645) [09:01:43] (03PS3) 10Juniorsys: mysql: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334298 (https://phabricator.wikimedia.org/T93645) [09:01:48] (03PS3) 10Juniorsys: Linting changes (multiple) [puppet] - 10https://gerrit.wikimedia.org/r/334299 (https://phabricator.wikimedia.org/T93645) [09:01:57] (03PS3) 10Juniorsys: ores/otrs/package_builder: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334300 (https://phabricator.wikimedia.org/T93645) [09:02:11] (03PS3) 10Juniorsys: openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) [09:02:17] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2975551 (10Paladox) Ok :) [09:02:18] (03PS3) 10Juniorsys: profile linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334303 (https://phabricator.wikimedia.org/T93645) [09:02:19] volans: ok thanks! 
I'll think a bit more about it [09:02:26] (03PS3) 10Juniorsys: prometheus: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334306 (https://phabricator.wikimedia.org/T93645) [09:02:34] :) [09:02:34] (03PS3) 10Juniorsys: puppet/puppet_compiler: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334307 (https://phabricator.wikimedia.org/T93645) [09:02:44] (03PS3) 10Juniorsys: planet/pmacct/programdashboard/pybal lint changes [puppet] - 10https://gerrit.wikimedia.org/r/334308 (https://phabricator.wikimedia.org/T93645) [09:02:52] (03PS3) 10Juniorsys: quarry: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) [09:03:04] (03PS3) 10Juniorsys: role: Linting changes (backup,bastionhost+others) [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) [09:03:10] (03PS3) 10Juniorsys: redis: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334311 (https://phabricator.wikimedia.org/T93645) [09:03:17] (03PS3) 10Juniorsys: Linting fixes (multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334317 (https://phabricator.wikimedia.org/T93645) [09:03:26] (03PS3) 10Juniorsys: graphoid/gridengine/grub/haproxy/hhvm lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334319 (https://phabricator.wikimedia.org/T93645) [09:03:34] (03PS3) 10Juniorsys: ifttt/imagemagick/initramfs/interface lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334320 (https://phabricator.wikimedia.org/T93645) [09:03:42] (03PS2) 10Juniorsys: Puppet style: Use one line per include/require [puppet] - 10https://gerrit.wikimedia.org/r/334322 [09:08:22] (03PS1) 10Urbanecm: Create namespace alias وگ for NS_PROJECT in fawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334510 (https://phabricator.wikimedia.org/T156451) [09:17:07] (03PS1) 10Urbanecm: Remove flaggedrevs-protect-review page protection from enwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/334511 (https://phabricator.wikimedia.org/T156448) [09:21:27] (03CR) 10Alexandros Kosiaris: [C: 032] Update to 1.6.0-2 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/321684 (owner: 10Alexandros Kosiaris) [09:22:30] (03CR) 10Muehlenhoff: [C: 032] Update to 1.0.2k [debs/openssl] - 10https://gerrit.wikimedia.org/r/334506 (owner: 10Muehlenhoff) [09:23:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.699 second response time [09:24:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.655 second response time [09:26:24] (03PS1) 10Muehlenhoff: Drop patch zero-pad-dhe.patch, merged in 1.0.2k [debs/openssl] - 10https://gerrit.wikimedia.org/r/334512 [09:37:06] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2975634 (10faidon) asw-c2-eqiad was replaced yesterday (Jan 26 17:50 UTC) with one of our spares. Total downtime was approximately 30 minutes mostly due to the recabling effort but... [09:40:25] (03PS2) 10Ema: varnish: stop ensuring libvmod-header is absent [puppet] - 10https://gerrit.wikimedia.org/r/333044 [09:40:30] (03CR) 10Ema: [V: 032 C: 032] varnish: stop ensuring libvmod-header is absent [puppet] - 10https://gerrit.wikimedia.org/r/333044 (owner: 10Ema) [09:40:40] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2975637 (10hashar) @Dzahn I should have written down somewhere following our conversation from last week or so. For the CI we have the following domains all serving HTTP being force redirecte... 
[09:43:06] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2975639 (10hashar) [09:43:40] (03CR) 10Ema: [C: 032] Pass config file name as a CLI argument [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334163 (owner: 10Ema) [09:43:53] (03CR) 10Ema: [V: 032 C: 032] Pass config file name as a CLI argument [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334163 (owner: 10Ema) [09:44:14] 06Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2975640 (10Legoktm) >>! In T149408#2928202, @Joe wrote: > Slides for the starting the discussion available here https://docs.googl... [09:44:17] (03CR) 10Muehlenhoff: [C: 032] Drop patch zero-pad-dhe.patch, merged in 1.0.2k [debs/openssl] - 10https://gerrit.wikimedia.org/r/334512 (owner: 10Muehlenhoff) [10:02:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.826 second response time [10:03:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.575 second response time [10:16:33] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2975775 (10akosiaris) 05Open>03Resolved And etherpad is now upgraded to 1.6.0-2 running on nodejs 6.9.1~dfsg-1+wmf1. 
Resolving the task once more [10:16:56] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2975777 (10akosiaris) [10:21:50] !log added addshore to labs-tools-wikibugs2 gerrit group [10:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:53] 06Operations, 10Graphite, 10Monitoring: graphite1003 short of available RAM - https://phabricator.wikimedia.org/T155872#2975820 (10fgiunchedi) I've tracked this down to expensive queries on graphite1003 making carbon-cache explode in memory. Namely cassandra-related 99percentile `SSTablesPerReadHistogram` fo... [10:25:22] 06Operations, 10Monitoring: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767#1757637 (10fgiunchedi) See also {T155872} for a case where heavy queries were not impacting uwsgi but carbon-cache instead using a lot of memory. [10:31:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.604 second response time [10:33:02] 06Operations, 10Monitoring: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767#2975863 (10fgiunchedi) Updated [[ https://wikitech.wikimedia.org/wiki/Graphite#Troubleshooting | wikitech Graphite troubleshooting ]] on how to identify such queries. 
[10:33:24] (03PS3) 10Faidon Liambotis: Setup & configure certspotter [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807) [10:33:43] 06Operations, 10Monitoring: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767#2975866 (10fgiunchedi) [10:33:46] 06Operations, 10Graphite, 10Monitoring: graphite1003 short of available RAM - https://phabricator.wikimedia.org/T155872#2975869 (10fgiunchedi) [10:33:48] 06Operations, 10Graphite, 10Monitoring, 13Patch-For-Review: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2975870 (10fgiunchedi) [10:34:01] (03PS4) 10Faidon Liambotis: Setup & configure certspotter [puppet] - 10https://gerrit.wikimedia.org/r/333231 (https://phabricator.wikimedia.org/T155807) [10:34:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.455 second response time [10:35:29] !log manually running certspotter -all_time as my user on einstenium (will take a few days to complete) [10:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:46] 06Operations, 10Graphite, 10Monitoring, 13Patch-For-Review: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957597 (10fgiunchedi) See also {T116767} to track heavy graphite queries, closing as its duplicate. [10:37:23] 06Operations, 10Graphite, 10Monitoring: graphite1003 short of available RAM - https://phabricator.wikimedia.org/T155872#2975880 (10fgiunchedi) Merging as T116767 duplicate, we can followup there as heavy queries were the root cause anyways. 
[10:44:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.266 second response time [10:45:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.442 second response time [10:54:07] !log uploaded openssl 1.0.2k for jessie-wikimedia to carbon [10:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:24] (03CR) 10Dereckson: "This patch is not a no op by the way, it also remove special wikis in addition to s4/commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309087 (https://phabricator.wikimedia.org/T143345) (owner: 10Jdlrobson) [11:11:11] !log initial installation of openssl bugfix/security updates [11:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:03] (03PS1) 10Urbanecm: Enable SandboxLink on tgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334551 (https://phabricator.wikimedia.org/T156473) [11:32:06] 06Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2976055 (10Joe) >>! In T149408#2975640, @Legoktm wrote: >>>! In T149408#2928202, @Joe wrote: >> Slides for the starting the discus... [11:35:16] 06Operations: re-create install1001 as physical server ? - https://phabricator.wikimedia.org/T156440#2976057 (10faidon) Why not go the other way and make both VMs? There's little point in having those be physical servers I think, they're really tiny machines. Also, yes, give it another disk or just extend its /... 
[11:45:33] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.33 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/334251 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [11:49:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.153 second response time [11:50:11] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334552 [11:50:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.586 second response time [11:52:09] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334552 (owner: 10Marostegui) [11:52:32] PROBLEM - zotero on sca1003 is CRITICAL: HTTP CRITICAL - No data received from host [11:53:22] RECOVERY - zotero on sca1003 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.006 second response time [11:53:25] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334552 (owner: 10Marostegui) [11:53:34] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334552 (owner: 10Marostegui) [11:55:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions for s2 in eqiad (duration: 00m 59s) [11:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:01] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add rack positions for s2 in codfw (duration: 00m 47s) [11:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 
SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.976 second response time [12:01:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.014 second response time [12:08:48] 06Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#2976095 (10fgiunchedi) [12:09:16] (03PS2) 10Ema: etcd.py: log a warning on empty responses from etcd [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) [12:09:52] 06Operations, 06Labs, 10netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2957555 (10fgiunchedi) >>! In T155875#2975634, @faidon wrote: > During the whole 30 minute window there was also an increased response time from the MediaWiki API, that cascaded in... [12:11:23] (03CR) 10Ema: "Thanks for the review!" [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [12:13:14] !log upgrading openjdk-7 packages (security updates) on wdqs cluster [12:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:15] (03CR) 10Ema: etcd.py: log a warning on empty responses from etcd (032 comments) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [12:16:18] 06Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#2976119 (10fgiunchedi) Logstash during that time period (` January 26th 2017, 17:56:23.220 to January 26th 2017, 18:25:00.000`): https://logstash.wikimedia.org/got... 
[12:16:22] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag not a slave [12:20:18] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334554 [12:23:31] (03PS4) 10Gehel: elasticsearch: Lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334281 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [12:24:17] !log upgrading mediawiki canaries to new openssl 1.1 package [12:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:40] (03CR) 10Gehel: [C: 032] elasticsearch: Lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334281 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [12:26:06] (03CR) 10Gehel: "This is a noop. Thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/334281 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [12:26:18] (03PS14) 10Elukey: Add JVM Heap usage alarms for basic Hadoop daemons [puppet] - 10https://gerrit.wikimedia.org/r/330154 (https://phabricator.wikimedia.org/T88640) [12:31:25] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2976141 (10elukey) @Cmjohnson would you have time next week to apply the thermal paste to a couple of analytics hosts to see if they impro... 
[12:41:44] 06Operations, 10ops-codfw, 10DBA: Change rack for servers in s1 - https://phabricator.wikimedia.org/T156478#2976188 (10Marostegui) [12:42:17] 06Operations, 10ops-codfw, 10DBA: Change rack for servers in s1 - https://phabricator.wikimedia.org/T156478#2976205 (10Marostegui) [12:46:12] (03PS2) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334554 [12:57:30] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334554 (owner: 10Marostegui) [12:57:35] (03PS1) 10Muehlenhoff: Update to 4.4.45 [debs/linux44] - 10https://gerrit.wikimedia.org/r/334558 [12:58:56] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334554 (owner: 10Marostegui) [12:59:04] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334554 (owner: 10Marostegui) [13:00:12] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add rack positions for s3 in codfw (duration: 00m 40s) [13:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions for s3 in eqiad (duration: 00m 40s) [13:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:52] PROBLEM - Disk space on analytics1032 is CRITICAL: DISK CRITICAL - free space: /boot 7 MB (2% inode=99%) [13:03:25] whaaaat [13:03:31] checking [13:04:08] ahh yes new kernels [13:04:13] and tiny boot [13:06:49] elukey: I've pruned a few, should recover soon [13:07:52] RECOVERY - Disk space on analytics1032 is OK: DISK OK [13:08:38] moritzm: ah yes I was about to remove them [13:08:40] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334561 
[13:09:52] elukey: hadoop is now fully upgraded [13:10:45] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334561 (owner: 10Marostegui) [13:11:08] !log starting db1048 until db1043-bin.001457:753455353, expect it to stop soon [13:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:32] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [13:11:36] s/starting/starting database replication/ [13:11:45] checking the history server.. [13:12:03] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334561 (owner: 10Marostegui) [13:12:13] 06Operations: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011#2976237 (10MoritzMuehlenhoff) These are fully rolled out: gnupg gnupg2 ruby2.1 wget [13:12:24] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334561 (owner: 10Marostegui) [13:12:32] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [13:12:52] (03PS1) 10Ema: Use caller function module name as default log prefix [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334567 [13:13:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions for s4 in eqiad (duration: 00m 43s) [13:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add rack positions for s4 in codfw (duration: 00m 41s) [13:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:32] PROBLEM - Start a job and verify on Trusty on 
checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 2.040 second response time [13:17:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.601 second response time [13:19:20] (03CR) 10Alexandros Kosiaris: [C: 032] geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:19:32] (03PS15) 10Alexandros Kosiaris: geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:19:36] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2976245 (10jcrespo) On the DBA side of things, I am ok with that proposal, asking for @demon opinion, or anyone at gerrit app lay... 
[13:19:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] geowiki module: Lint changes + modes/umask quoting [puppet] - 10https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:20:05] (03PS4) 10Alexandros Kosiaris: k8s: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334289 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:20:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s: Linting fixes [puppet] - 10https://gerrit.wikimedia.org/r/334289 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [13:25:13] (03CR) 10Alexandros Kosiaris: [C: 032] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) (owner: 10BryanDavis) [13:25:21] (03PS12) 10Alexandros Kosiaris: Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) (owner: 10BryanDavis) [13:25:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Provision MediaWiki-Vagrant on Jessie hosts [puppet] - 10https://gerrit.wikimedia.org/r/245920 (https://phabricator.wikimedia.org/T154340) (owner: 10BryanDavis) [13:26:29] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334568 [13:28:13] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334568 (owner: 10Marostegui) [13:29:30] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334568 (owner: 10Marostegui) [13:29:32] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:29:39] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334568 (owner: 10Marostegui) [13:30:37] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add rack positions for s5 in codfw (duration: 00m 40s) [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions for s5 in eqiad (duration: 00m 40s) [13:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:57] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334606 [13:41:47] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334606 (owner: 10Marostegui) [13:43:03] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334606 (owner: 10Marostegui) [13:43:20] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334606 (owner: 10Marostegui) [13:43:22] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:44:02] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add rack positions for s6 in codfw (duration: 00m 40s) [13:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions for s6 in eqiad (duration: 00m 40s) [13:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.968 second response time [13:48:32] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.504 second response time [13:51:50] (03PS1) 10Jcrespo: phabricator: Increase phabricator dbs buffer pool [puppet] - 10https://gerrit.wikimedia.org/r/334629 [13:52:47] (03CR) 10Marostegui: [C: 031] phabricator: Increase phabricator dbs buffer pool [puppet] - 10https://gerrit.wikimedia.org/r/334629 (owner: 10Jcrespo) [13:52:59] (03CR) 10Jcrespo: "What do you thing of this? Replication is going SOOOOO SLOW: https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad%20prometheus%2F" [puppet] - 10https://gerrit.wikimedia.org/r/334629 (owner: 10Jcrespo) [13:58:17] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.45 [debs/linux44] - 10https://gerrit.wikimedia.org/r/334558 (owner: 10Muehlenhoff) [13:58:32] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:58:38] (03PS2) 10Jcrespo: phabricator: Increase phabricator dbs buffer pool [puppet] - 10https://gerrit.wikimedia.org/r/334629 [14:00:35] (03CR) 10Marostegui: [C: 031] "It has picked up a nice pace now - a lot better than 3 minutes ago." 
[puppet] - 10https://gerrit.wikimedia.org/r/334629 (owner: 10Jcrespo) [14:01:57] (03CR) 10Jcrespo: [C: 032] phabricator: Increase phabricator dbs buffer pool [puppet] - 10https://gerrit.wikimedia.org/r/334629 (owner: 10Jcrespo) [14:02:55] (03PS1) 10Faidon Liambotis: labs: remove keystone alerts [puppet] - 10https://gerrit.wikimedia.org/r/334643 [14:03:55] (03CR) 10Faidon Liambotis: [C: 032] labs: remove keystone alerts [puppet] - 10https://gerrit.wikimedia.org/r/334643 (owner: 10Faidon Liambotis) [14:05:12] (03PS2) 10Faidon Liambotis: labs: remove keystone alerts [puppet] - 10https://gerrit.wikimedia.org/r/334643 [14:05:20] (03CR) 10Faidon Liambotis: [V: 032 C: 032] labs: remove keystone alerts [puppet] - 10https://gerrit.wikimedia.org/r/334643 (owner: 10Faidon Liambotis) [14:06:32] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [14:06:52] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [14:07:57] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Add rack positions for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334644 [14:07:59] ^that was me with a fast reboot [14:10:29] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Add rack positions for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334644 (owner: 10Marostegui) [14:11:22] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [14:11:50] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334644 (owner: 10Marostegui) [14:11:59] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Add rack positions for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334644 (owner: 10Marostegui) [14:12:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add rack positions for s7 in eqiad (duration: 00m 40s) [14:12:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:43] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add rack positions for s7 in codfw (duration: 00m 40s) [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:15] (03PS2) 10Dzahn: DHCP: Add DHCP entries for mc2019-mc2036 Bug:T155755 [puppet] - 10https://gerrit.wikimedia.org/r/334492 (owner: 10Papaul) [14:18:26] (03CR) 10Dzahn: [C: 032] DHCP: Add DHCP entries for mc2019-mc2036 Bug:T155755 [puppet] - 10https://gerrit.wikimedia.org/r/334492 (owner: 10Papaul) [14:21:02] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:28:00] (03PS1) 10Muehlenhoff: Add more email addresses and account expiration dates [puppet] - 10https://gerrit.wikimedia.org/r/334648 [14:29:39] (03PS2) 10Muehlenhoff: Add more email addresses and account expiration dates [puppet] - 10https://gerrit.wikimedia.org/r/334648 [14:30:05] 06Operations: re-create install1001 as physical server ? - https://phabricator.wikimedia.org/T156440#2976415 (10Dzahn) >>! In T156440#2976057, @faidon wrote: > Why not go the other way and make both VMs? Alright. will do that. > All of carbon's /srv is 31G, which is... peanuts :) Heh, yes, but only since: "... [14:32:12] (03CR) 10Muehlenhoff: [C: 032] Add more email addresses and account expiration dates [puppet] - 10https://gerrit.wikimedia.org/r/334648 (owner: 10Muehlenhoff) [14:34:03] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2976419 (10Dzahn) @hashar thank you for this very detailed reply. Since everything is already behind varnish i think it will not be relevant in the context of this ticket then because of "2)"... 
[14:34:48] 06Operations: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440#2976441 (10Dzahn) [14:34:53] 06Operations: re-create install2001 as a VM - https://phabricator.wikimedia.org/T156440#2974819 (10Dzahn) a:03Dzahn [14:38:16] 06Operations, 10Traffic: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2976449 (10hashar) I don't think we ever used self-signed certs for CI. Internal communications I can remember of are: * Varnish -> Apache on contint1001 (plain HTTP) * Apache on contint1001 t... [14:41:37] (03PS3) 10Ema: Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) [14:41:49] (03CR) 10Ema: [V: 032 C: 032] Text VCL: consolidate mobile hostname rewrite regex [puppet] - 10https://gerrit.wikimedia.org/r/333158 (https://phabricator.wikimedia.org/T155504) (owner: 10Ema) [14:43:14] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2976466 (10demon) >>! In T145885#2976245, @jcrespo wrote: > On the DBA side of things, I am ok with that proposal, asking for @de... 
[14:49:00] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:51:10] (03PS1) 10Muehlenhoff: Add more researchers (email and expiries) [puppet] - 10https://gerrit.wikimedia.org/r/334657 [14:52:57] (03PS2) 10Muehlenhoff: Add more researchers (email and expiries) [puppet] - 10https://gerrit.wikimedia.org/r/334657 [14:55:37] (03PS1) 10Andrew Bogott: Revert "labs: remove keystone alerts" [puppet] - 10https://gerrit.wikimedia.org/r/334658 [15:04:54] (03CR) 10Filippo Giunchedi: Enable Prometheus JMX exporter on Cassandra nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:06:26] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add more researchers (email and expiries) [puppet] - 10https://gerrit.wikimedia.org/r/334657 (owner: 10Muehlenhoff) [15:09:20] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2976550 (10Paladox) >>! In T145885#2976466, @demon wrote: >>>! In T145885#2976245, @jcrespo wrote: >> On the DBA side of things,... [15:11:26] (03CR) 10Andrew Bogott: [C: 04-1] "I can find one labs instance that is running 3.4.3-1~ubuntu12.04.1: limn1.analytics.eqiad.wmflabs. That instance is slated for deletion b" [puppet] - 10https://gerrit.wikimedia.org/r/334155 (owner: 10Alexandros Kosiaris) [15:14:26] (03CR) 10Filippo Giunchedi: "LGTM in principle, some naming bikeshedding!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334344 (owner: 10Elukey) [15:19:27] someone can update https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps/estimates#March_2014_estimate <- this? [15:19:47] I'd need to have a fresh estimate [15:20:34] 06Operations, 10ops-codfw, 10DBA: Change rack for servers in s1 - https://phabricator.wikimedia.org/T156478#2976587 (10Papaul) @Marostegui yes it is doable. 
[15:22:59] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Fine by me. -2 until limn1.analytics.eqiad.wmflabs is deleted" [puppet] - 10https://gerrit.wikimedia.org/r/334155 (owner: 10Alexandros Kosiaris) [15:28:49] (03PS1) 10Filippo Giunchedi: prometheus: add aggregation rules for apache and hhvm [puppet] - 10https://gerrit.wikimedia.org/r/334662 [15:28:55] (03CR) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:29:15] (03PS13) 10Eevans: Enable Prometheus JMX exporter on Cassandra nodes [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) [15:30:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/332535 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [15:32:16] 06Operations, 10ops-codfw, 10DBA: Change rack for servers in s1 - https://phabricator.wikimedia.org/T156478#2976660 (10Marostegui) Thanks, let's do it next week! 
[15:32:33] (03PS1) 10Alexandros Kosiaris: ores::redis: Enable diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/334663 [15:36:01] 06Operations: Upgrade nginx on notebook* servers - https://phabricator.wikimedia.org/T156495#2976670 (10MoritzMuehlenhoff) [15:36:07] 06Operations: Upgrade nginx on notebook* servers - https://phabricator.wikimedia.org/T156495#2976682 (10MoritzMuehlenhoff) p:05Triage>03Normal [15:37:15] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#2976687 (10BBlack) [15:40:10] (03PS1) 10Muehlenhoff: More email addresses [puppet] - 10https://gerrit.wikimedia.org/r/334665 [15:45:09] (03PS1) 10Muehlenhoff: Record CVE ID fixed in earlier 4.4.x kernel [debs/linux44] - 10https://gerrit.wikimedia.org/r/334666 [15:45:11] !log cache_text: ban req.url == "/apple-app-site-association" && obj.status == 404 (T155504) [15:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:16] T155504: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504 [15:48:23] (03CR) 10Muehlenhoff: [V: 032 C: 032] More email addresses [puppet] - 10https://gerrit.wikimedia.org/r/334665 (owner: 10Muehlenhoff) [15:48:28] !log Stop mysql and shutdown db1072 for maintenance - T156226 [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:33] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [15:49:54] (03PS5) 10Rush: Tools: Disable automatic backups of aptly repositories [puppet] - 10https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726) (owner: 10Tim Landscheidt) [15:54:20] (03PS1) 10Elukey: Add JMX port 9986 to the MapReduce History process [puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) [15:55:40] (03PS1) 10Muehlenhoff: Two special cases; 
account expiry dates for research fellows [puppet] - 10https://gerrit.wikimedia.org/r/334668 [15:56:00] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [15:56:11] (03CR) 10jerkins-bot: [V: 04-1] Add JMX port 9986 to the MapReduce History process [puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:56:16] (03PS1) 10Cmjohnson: Updating IP for db1072 to match rack change to B2. T156226 [dns] - 10https://gerrit.wikimedia.org/r/334669 [15:56:45] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db1072 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334670 (https://phabricator.wikimedia.org/T156226) [15:56:56] (03CR) 10Cmjohnson: [C: 032] Updating IP for db1072 to match rack change to B2. T156226 [dns] - 10https://gerrit.wikimedia.org/r/334669 (owner: 10Cmjohnson) [15:57:43] (03CR) 10Muehlenhoff: [C: 032] Two special cases; account expiry dates for research fellows [puppet] - 10https://gerrit.wikimedia.org/r/334668 (owner: 10Muehlenhoff) [15:58:09] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Links, 13Patch-For-Review: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2976752 (10ema) @Fjalapeno @JMinor @JoeWalsh the issue shoul... [15:58:37] ^ andrewbogott is that you? 
I'm about to head into a meeting fy [15:58:41] fyi even [15:58:41] (03PS2) 10Elukey: Add JMX port 9986 to the MapReduce History process [puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) [15:59:26] chasemp: it's me [15:59:43] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db1072 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334670 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [16:00:44] or, wait maybe it isn't... [16:00:45] * andrewbogott looks [16:01:13] !log submitted wmf-mariadb10_10.0.29-2 for T156373 fix [16:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:18] T156373: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373 [16:03:00] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:03:00] (03PS1) 10DCausse: [WIP] Configure A/B test for CrossProject search results sidebar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) [16:03:31] (03CR) 10DCausse: [C: 04-1] "not ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [16:05:30] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2976762 (10Paladox) +1 :) [16:06:12] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1072 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334670 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [16:07:25] !log marostegui@tin Synchronized wmf-config/db-codfw.php: db1072 change IP - T156226 (duration: 00m 40s) [16:07:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:29] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [16:08:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: db1072 change IP - T156226 (duration: 00m 40s) [16:08:16] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1072 IP and rack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334670 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [16:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:13] (03PS1) 10Muehlenhoff: Two more researchers [puppet] - 10https://gerrit.wikimedia.org/r/334674 [16:11:29] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2976777 (10Paladox) yep the mariadb plugin fixes it. sudo mysql -p Enter password: Welcome to the MariaDB monitor. Commands en... [16:11:53] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2976778 (10Paladox) that just proofs that the db is not actually utf8mb4. So the fix in jdbc works :) [16:15:55] (03CR) 10Muehlenhoff: [C: 032] Two more researchers [puppet] - 10https://gerrit.wikimedia.org/r/334674 (owner: 10Muehlenhoff) [16:16:01] (03PS3) 10Elukey: Add JMX port 9986 to the MapReduce History process [puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) [16:17:31] 06Operations, 10DBA: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#2976786 (10Marostegui) db1072 has been moved out (T156226) to another rack. So right now (well, once db1072 is recloned) we will have 1 of the API servers out from D1. I still think we should move db1073 out (right now we have... 
[16:19:53] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2976789 (10faidon) Ping! Early February is now a week away. [16:21:00] 06Operations, 10DBA, 13Patch-For-Review: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226#2976790 (10Marostegui) db1072 has been moved to B2 DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated Pending: reimage and reclone Thanks @Cmjohnson [16:28:14] (03PS1) 10Muehlenhoff: Two more email addresses for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/334676 [16:31:02] (03CR) 10EBernhardson: [WIP] Configure A/B test for CrossProject search results sidebar (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [16:32:07] (03CR) 10Muehlenhoff: [C: 032] Two more email addresses for ISI Foundation researchers [puppet] - 10https://gerrit.wikimedia.org/r/334676 (owner: 10Muehlenhoff) [16:33:15] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2976836 (10Anomie) > The simplest solution would be to just move all thumb accesses to thumb.php (or an api module) I note thumb.php could probably us... [16:37:34] (03CR) 10DCausse: [C: 04-1] [WIP] Configure A/B test for CrossProject search results sidebar (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334673 (https://phabricator.wikimedia.org/T149806) (owner: 10DCausse) [16:41:18] 06Operations, 06Commons, 10Traffic, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2976859 (10zhuyifei1999) 05Open>03Resolved Closing as resolved as it cannot be reproduced anymore. If the bug appears again feel free to reopen. 
[16:46:44] (03PS1) 10Thcipriani: Scap: Bump version to 3.5.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/334677 (https://phabricator.wikimedia.org/T127762) [16:56:43] 06Operations, 10netops: pfws not on librenms - https://phabricator.wikimedia.org/T156381#2976889 (10faidon) 05Open>03Resolved a:03faidon I reenabled (actually removed and readded) pfw-codfw yesterday and I haven't seen any ill effects in ~24 hours, so resolving this. By the way, note that pfws are easily... [16:57:52] 06Operations, 10DBA: Switchover s1 master db1057 -> db1052 - https://phabricator.wikimedia.org/T156008#2976902 (10faidon) [17:14:19] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:22] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2976990 (10jcrespo) MariaDB's regresion test does work on the new package: https://github.com/Maria... 
[17:19:16] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2976994 (10jcrespo) [17:21:41] (03PS1) 10Andrew Bogott: Keystone: Turn on caching of tokens and catalog [puppet] - 10https://gerrit.wikimedia.org/r/334680 (https://phabricator.wikimedia.org/T156337) [17:22:05] (03PS2) 10Andrew Bogott: Keystone: Turn on caching of tokens and catalog [puppet] - 10https://gerrit.wikimedia.org/r/334680 (https://phabricator.wikimedia.org/T156337) [17:23:11] (03CR) 10Andrew Bogott: [C: 032] Keystone: Turn on caching of tokens and catalog [puppet] - 10https://gerrit.wikimedia.org/r/334680 (https://phabricator.wikimedia.org/T156337) (owner: 10Andrew Bogott) [17:23:54] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977001 (10Paladox) :) [17:24:04] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2977003 (10madhuvishy) [17:27:24] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2977010 (10madhuvishy) [17:27:28] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: Migrate misc to secondary labstore HA cluster - https://phabricator.wikimedia.org/T154336#2977008 (10madhuvishy) 05Open>03Resolved Closing this now. Noting that https://wikitech.wikimedia.org/wiki/Incident_documentation/20170118-Labs happened during... [17:28:42] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977015 (10mmodell) Awesome, this is really impressive work everyone. 
Thanks for helping @paladox! [17:29:20] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977017 (10Paladox) Your welcome :) [17:30:28] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977032 (10mmodell) And of course mad props and thanks to @jcrespo, @Marostegui, and @epriestley [17:40:37] (03PS1) 10Aklapper: Block IPs for recent attempts to upload offtopic files to Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/334683 [17:42:19] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:43:16] (03CR) 10Aklapper: "IP range might be a bit massive but see https://phabricator.wikimedia.org/people/logs/query/advanced/ . Plus not sure if that's the exact " [puppet] - 10https://gerrit.wikimedia.org/r/334683 (owner: 10Aklapper) [17:45:39] 06Operations: Replace bast3001 - https://phabricator.wikimedia.org/T156506#2977122 (10faidon) [17:54:05] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2977163 (10yuvipanda) On further thought, I think I just want to use the aptly that we've setup for tools already. 1. We already use this for other package... [17:55:47] !log OS installation on mc2019-mc2036 [17:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:32] (03PS1) 10Halfak: Adds aspell-ro to ores base. [puppet] - 10https://gerrit.wikimedia.org/r/334690 [18:20:49] PROBLEM - puppet last run on serpens is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:51] (03PS2) 10Halfak: ores:Adds aspell-ro to ores base. 
[puppet] - 10https://gerrit.wikimedia.org/r/334690 [18:34:47] (03PS3) 10AndyRussG: CentralNotice config: make mediawiki its own CN project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334025 (https://phabricator.wikimedia.org/T155997) [18:37:38] (03PS1) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 [18:38:03] (03PS2) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [18:38:49] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:19] ^ I reached out to serpens to do a manual run and it seemed ok I see 'serpens puppet-agent[16212]: Could not retrieve catalog; skipping run' in the log [18:40:30] (03CR) 10jerkins-bot: [V: 04-1] nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [18:40:34] which suggests master side issues indeed that appear to be transient [18:43:24] (03PS3) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [18:44:22] (03CR) 10jerkins-bot: [V: 04-1] nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [18:48:21] chasemp: might be an instance of T153246 ? 
[18:48:22] T153246: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246 [18:48:34] and with that I'm logging off for the weekend [18:48:52] godog: good thought [18:48:56] have a good weekend [18:48:59] you too! [18:49:24] anyone deploying right now? If not I'm going to deploy wmf.9 real quick [18:50:27] going once... [18:50:31] shouldn't be :) [18:50:57] (03PS1) 1020after4: group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334693 [18:50:59] (03CR) 1020after4: [C: 032] group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334693 (owner: 1020after4) [18:52:06] (03Merged) 10jenkins-bot: group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334693 (owner: 1020after4) [18:52:15] (03CR) 10jenkins-bot: group2 wikis to 1.29.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334693 (owner: 1020after4) [18:52:18] !log Rolling forward with group2 to 1.29.0-wmf.9 refs T156364 T154683 [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:23] T154683: MW-1.29.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T154683 [18:52:24] T156364: Warning: Empty regular expression in /srv/mediawiki/php-1.29.0-wmf.9/includes/parser/DateFormatter.php on line 200 - https://phabricator.wikimedia.org/T156364 [18:52:47] (03PS4) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [18:52:49] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 wikis to 1.29.0-wmf.9 [18:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:58] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Integrating MediaWiki (and other services) 
with dynamic configuration - https://phabricator.wikimedia.org/T149617#2758050 (10Legoktm) @joe could you also upload the slides from https://docs.google.com/presenta... [19:15:21] mutante: do you think this is worth a blog post? https://admin.phacility.com/phame/post/view/7/autocomplete_now_with_emoji/ :P [19:16:25] greg-g: :) hehe, it's nice, but i think it should be a nice post on wikitech-l rather than blog [19:16:40] or maybe i'm misjudging where our phab users are reading [19:17:16] yea,i still think wikitech-l or some other list [19:18:31] ditto [19:35:20] (03PS3) 10Dzahn: ssl: delete stream.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/334207 (https://phabricator.wikimedia.org/T134361) [19:35:29] (03PS4) 10Dzahn: ssl: delete stream.wikimedia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/334207 (https://phabricator.wikimedia.org/T134361) [19:38:45] (03CR) 10Rush: [C: 031] nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [19:39:18] (03CR) 10Dzahn: [C: 032] "this is behind misc-web and the cert has expired" [puppet] - 10https://gerrit.wikimedia.org/r/334207 (https://phabricator.wikimedia.org/T134361) (owner: 10Dzahn) [19:45:40] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977533 (10Marostegui) >>! In T156373#2976990, @jcrespo wrote: > MariaDB's regresion test does work... 
[19:47:46] anomie: looks like you were right about the cached objects for T156364, at least the error rate is dropping off now [19:47:46] T156364: Warning: Empty regular expression in /srv/mediawiki/php-1.29.0-wmf.9/includes/parser/DateFormatter.php on line 200 - https://phabricator.wikimedia.org/T156364 [19:49:32] marostegui: re: db1019/db1042 decom - it seems like db1042 decided to die/shutdown itself on the day after i removed it from puppet :p [19:51:37] !log db1042 - i came to shut it down .. and noticed it had died (or somebody did it) about 3 hours ago .. there it goes (T149793) [19:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:41] T149793: Decommission db1042 - https://phabricator.wikimedia.org/T149793 [19:52:40] !log db1019 - shutdown -h now (T146265) [19:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:44] T146265: db1019: Decommission - https://phabricator.wikimedia.org/T146265 [19:52:59] PROBLEM - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused [20:05:46] got that ^^^ [20:05:52] cool [20:07:25] !log updated watchmouse checks for s5 (de wiki) because Main_Page was deleted, used the localized page instead [20:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:30] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2017-01-30 20:06:52. 
[20:14:26] !log db1029, db1042, analytics1015, analytics1026 - puppet node deactivate, remove from icinga, finish decom (T147313, T149793, T146265) [20:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:32] T149793: Decommission db1042 - https://phabricator.wikimedia.org/T149793 [20:14:32] T147313: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313 [20:14:33] T146265: db1019: Decommission - https://phabricator.wikimedia.org/T146265 [20:14:54] 1019, not 1029, that typo was only in log [20:23:01] good, you do not want to destroy wiki's notification and boad systems :-) [20:23:29] I think would had survived, in fact [20:24:01] !log restart and upgrade mariadb on db1048 [20:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:20] proxy will complain momentarily [20:24:39] 1003 and 8, I think [20:25:05] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2879938 (10scfc) Is there really much experience with `aptly`? :-) I stumble along with it quite a bit, and – if possible – I would much rather switch the... [20:26:42] jynus: yes, for sure :o also it's fixed in SAL in wiki now [20:26:54] mutante, I was joking [20:26:57] no problem [20:27:29] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [20:27:32] :) [20:27:49] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [20:35:29] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [20:35:49] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [20:41:01] (03CR) 10Volans: [C: 031] "LGTM, do we have a testing pybal environment somewhere?" 
(031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334369 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [20:42:14] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977727 (10jcrespo) The production test seemed ok- 10.0.29-2 not crashing on db1048 (not sure if it... [20:55:06] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Links, 13Patch-For-Review: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2977751 (10Fjalapeno) @ema thanks for the update! We will ge... [20:55:07] (03CR) 10Volans: [C: 04-1] "inspect.stack() has performance impacts, see inline." (032 comments) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/334567 (owner: 10Ema) [20:56:33] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Links, 13Patch-For-Review: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2977752 (10JoeWalsh) @ema thanks! it's working now [20:57:25] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Backlog, 10iOS-app-feature-Links, 13Patch-For-Review: Fix universal link support in iOS when the OS requests the site association file from m.wikipedia.org - https://phabricator.wikimedia.org/T155504#2977755 (10Fjalapeno) 05Open>03Resolved @Ema - just chec... 
[20:58:27] (03PS3) 10Dzahn: remove db1019, db1042, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) [21:03:29] (03CR) 10Dzahn: [C: 032] "these have been powered down and removed from icinga now" [dns] - 10https://gerrit.wikimedia.org/r/334014 (https://phabricator.wikimedia.org/T149793) (owner: 10Dzahn) [21:05:25] (03PS1) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:06:26] (03CR) 10jerkins-bot: [V: 04-1] Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 (owner: 10Andrew Bogott) [21:06:32] 06Operations, 10Continuous-Integration-Infrastructure: On Trusty and Jessie PHP yields: PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/20-xhprof.ini on line 2 - https://phabricator.wikimedia.org/T135338#2977791 (10hashar) Duplicate bug T156524 show how confusing that depreca... [21:08:56] (03PS2) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:09:51] (03CR) 10jerkins-bot: [V: 04-1] Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 (owner: 10Andrew Bogott) [21:12:39] (03PS3) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:13:26] (03CR) 10jerkins-bot: [V: 04-1] Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 (owner: 10Andrew Bogott) [21:16:25] (03PS4) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:17:14] (03CR) 10jerkins-bot: [V: 04-1] Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 (owner: 10Andrew Bogott) [21:22:49] PROBLEM - puppet last run on ms-be1024 is CRITICAL: CRITICAL: Catalog 
fetch fail. Either compilation failed or puppetmaster has issues [21:24:44] was there any change in SSL certs recently? my maven install can't talk to archiva.wikimedia.org anymore [21:25:56] (03PS5) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:26:21] 06Operations: fix log reading permissions for dc-ops admin group - https://phabricator.wikimedia.org/T156529#2977826 (10Dzahn) [21:27:12] !log mobrovac@tin Starting deploy [trending-edits/deploy@e0e32bb]: Restart the service to assess the load of replaying the last 6h T156411 [21:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:17] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411 [21:27:20] 06Operations: fix log reading permissions for dc-ops admin group - https://phabricator.wikimedia.org/T156529#2977840 (10Dzahn) [21:28:16] !log mobrovac@tin Finished deploy [trending-edits/deploy@e0e32bb]: Restart the service to assess the load of replaying the last 6h T156411 (duration: 01m 03s) [21:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:57] (03PS6) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:34:32] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team, 07Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2977856 (10Marostegui) >>! In T156373#2977727, @jcrespo wrote: > The production test seemed ok- 10.... 
[21:34:48] !log mobrovac@tin Starting deploy [trending-edits/deploy@0e79bec]: Bump max_age to 12h T156411 [21:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:52] T156411: Compute the trending articles over a period of 24h rather than 1h - https://phabricator.wikimedia.org/T156411 [21:36:18] (03PS7) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:36:46] !log mobrovac@tin Finished deploy [trending-edits/deploy@0e79bec]: Bump max_age to 12h T156411 (duration: 01m 58s) [21:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:07] (03CR) 10jerkins-bot: [V: 04-1] Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 (owner: 10Andrew Bogott) [21:38:04] (03PS8) 10Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/334714 [21:39:12] (03PS1) 10Dzahn: admin: fix log file perms for dc-ops on jessie [puppet] - 10https://gerrit.wikimedia.org/r/334719 (https://phabricator.wikimedia.org/T156529) [21:40:28] 06Operations, 10Mobile-Content-Service, 10Reading-Web-Trending-Service, 13Patch-For-Review, and 3 others: New Service Request for Trending Edits Service - https://phabricator.wikimedia.org/T150043#2977871 (10mobrovac) 05Open>03Resolved The service is fully in production, time to resolve this ticket. 
[21:41:01] !log restored watchmouse checks for s5 (de wiki), Main_Page redirect was restored [21:41:03] (PS9) Andrew Bogott: Keystone: use uwsgi::app instead of service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/334714 (https://phabricator.wikimedia.org/T156337) [21:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:28] (CR) Andrew Bogott: [C: 2] Keystone: use uwsgi::app instead of service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/334714 (https://phabricator.wikimedia.org/T156337) (owner: Andrew Bogott) [21:48:55] (PS2) Dzahn: admin: fix log file perms for dc-ops on jessie [puppet] - https://gerrit.wikimedia.org/r/334719 (https://phabricator.wikimedia.org/T156529) [21:49:37] (CR) Dzahn: "403 Forbidden" [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper) [21:49:53] (PS1) EBernhardson: Setup sister search prefix display types [mediawiki-config] - https://gerrit.wikimedia.org/r/334721 (https://phabricator.wikimedia.org/T149806) [21:52:49] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:53:59] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:02:19] (CR) Dzahn: "for the range i'd say let's use CIDR, (e.g.
'197.216.8.0/24'), the range thing can be buggy/tricky afair (https://stackoverflow.com/quest" [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper) [22:04:35] (CR) Dzahn: "i can't see those logs so don't know if a /24 is too large" [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper) [22:09:59] (CR) Dzahn: "ah, looks like simply "require not 197.216.8" would do it too (https://httpd.apache.org/docs/2.4/mod/mod_authz_core.html#require) but CIDR" [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper) [22:12:22] (CR) Alexandros Kosiaris: [C: 2] ores:Adds aspell-ro to ores base. [puppet] - https://gerrit.wikimedia.org/r/334690 (owner: Halfak) [22:12:28] (PS3) Alexandros Kosiaris: ores:Adds aspell-ro to ores base. [puppet] - https://gerrit.wikimedia.org/r/334690 (owner: Halfak) [22:12:36] (CR) Alexandros Kosiaris: [V: 2 C: 2] ores:Adds aspell-ro to ores base. [puppet] - https://gerrit.wikimedia.org/r/334690 (owner: Halfak) [22:13:05] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2972713 (hashar) Well done!!!
🥂 🍾 == [22:14:04] (PS3) Dzahn: aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757) [22:18:13] (PS4) Dzahn: aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757) [22:19:11] (CR) jerkins-bot: [V: -1] aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn) [22:19:51] (PS5) Dzahn: aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757) [22:23:44] (CR) Dzahn: [C: 2] aptrepo: add second rsync module for entire /srv/ [puppet] - https://gerrit.wikimedia.org/r/334465 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn) [22:29:03] (CR) Volans: "In much better state now, but needs rebase, there are some conflicts hence puppet compiler is failing." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: Mobrovac) [22:31:22] !log carbon: rsync entire /srv/ to install2001 (this is APT data but also misc things like junos, megacli, firmware, ipmi [22:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:08] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:45:08] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:45:27] !log install1001 - adding a second virtual hard disk, 80G [22:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:51] (PS12) Mobrovac: RESTBase-Cassandra: Add the topk reporter [puppet] - https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) [23:07:14] (CR) Mobrovac: RESTBase-Cassandra: Add the topk reporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/328660 (https://phabricator.wikimedia.org/T147366) (owner: Mobrovac) [23:07:47] volans: rebased and removed quotes on ensure ^ [23:08:51] (PS3) Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) [23:09:43] (PS9) Paladox: Gerrit: Set useUnicode=true, also change connectionCollation to utf8mb4_unicode_ci [puppet] - https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) [23:11:08] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:13:34] Hey... is there anyway in Mediawiki to get the equivalent mobile URL for a project or page, from within a desktop request? [23:14:08] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:14:15] AndyRussG: add the "m." in front of it? [23:29:11] mutante: yes, I mean the "right" way? [23:29:29] A DRY way? [23:29:49] that doesn't repeat logic in condig files?
[23:30:00] s/condig/config/ [23:30:51] (PS1) Tjones: Deploy TextCat Improvements [mediawiki-config] - https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) [23:38:45] Operations, DBA, Phabricator, Release-Engineering-Team, Upstream: During Phabricator upgrade on 2017-01-26, all m3 replica dbs crashed at the same time - https://phabricator.wikimedia.org/T156373#2978846 (Paladox) :) [23:43:08] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:55:59] ACKNOWLEDGEMENT - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Volans Andrew will fix it later - T156337 https://phabricator.wikimedia.org/T156337