[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150827T0000). Please do the needful. [00:00:14] (03CR) 10Ori.livneh: [C: 031] Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [00:02:15] 6operations, 10Wikimedia-Mailing-lists: Mailman error on wikimedia-de-by moderator interface - https://phabricator.wikimedia.org/T110427#1577889 (10JohnLewis) 5Open>3Resolved [00:07:16] (03Abandoned) 10Alex Monk: Enable banners on beta labs where Pagebanner is installed. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234184 (owner: 10Jdlrobson) [00:07:27] Thanks Krenair :) [00:07:43] yw [00:08:39] 6operations, 10Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1577914 (10JohnLewis) 3NEW a:3Dzahn [00:08:50] mutante: ^ [00:10:33] JohnFLewis: https://phabricator.wikimedia.org/T110439 [00:10:36] bot fail too [00:10:52] close one, you pick :) [00:10:57] ha :P [00:10:58] will do [00:11:09] 6operations, 10Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1577940 (10JohnLewis) [00:11:23] because wikidata; merge into the lower ID :) [00:14:56] JohnFLewis: :) except merge isn't merge [00:15:14] true :) [00:16:07] need definition of "large" [00:16:39] and something that counts them for lists but doesnt take 24 hours to finish :p [00:16:43] for all lists [00:19:27] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1577971 (10Dzahn) even after removing all this stuff, an rsync (second run) still took 60m, exactly an hour [00:20:08] mutante: large: greater than 1. we demand all emails be moderated immediately or have good regex rules! 
[00:20:20] !log taking phabricator offline for scheduled upgrade [00:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:20:40] but seriously; I'd say something statistical depending on how many emails are in moderation [00:21:58] nnoooo [00:22:11] phabricator upgrade time [00:22:23] (i know it's scheduled :) [00:22:41] !log aaron@tin Synchronized php-1.26wmf20/extensions/CentralAuth: 47e181adb2898977b146de7398eaa35aebb870e3 (duration: 01m 13s) [00:22:43] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1577975 (10Dzahn) "how long does rsync take" -> an hour or a little bit more, depending on how long we wait between rsync runs. actually tested this though, lowered time a bit by deleting a bun... [00:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:22:51] hits save as quick as possible [00:23:26] it won't be down long hopefully this one goes quickly [00:24:21] !log aaron@tin Synchronized php-1.26wmf19/extensions/CentralAuth: 47e181adb2898977b146de7398eaa35aebb870e3 (duration: 01m 13s) [00:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:26:46] (03PS2) 10Ori.livneh: base: Don't install command-not-found-data either [puppet] - 10https://gerrit.wikimedia.org/r/232867 (owner: 10Tim Landscheidt) [00:28:45] waits for "is phab down" questions on other channels [00:28:59] :) we might want a different error message for the scheduled downtimes [00:28:59] (03PS5) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 [00:29:20] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [00:29:40] ori: I guess I broke my near 4 month no-deploy streak [00:30:06] 3 months and 3 weeks or so [00:30:53] and phabricator is back 
[00:31:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [00:31:18] !log finished phabricator upgrade, everything appears to be working [00:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:15] mutante: yeah I've been meaning to look into switching the message, but to do that I need to reconfigure apache's docroot during the upgrade (instead of just stopping and then starting apache) [00:34:12] twentyafterfour: oh, yea, i understand, or a redirect each time. well, this upgrade was really fast indeed. thank you! [00:34:46] yeah it's just far worse if I run the paranoid database backup before applying the schema migrations [00:34:57] 6operations, 10Wikimedia-Mailing-lists: announce scheduled downtime - https://phabricator.wikimedia.org/T110133#1577987 (10Dzahn) also tell ops list [00:35:15] but if the schema changes break things, or if code needs to roll back, then without a dump I'm in a bad situation [00:35:25] still the schema changes are usually sorta minor [00:37:20] yep, changing the message or pointing misc-web somewhere else seems like too much extra work [00:37:53] (03PS6) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 [00:41:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:41:35] 6operations, 10Wikimedia-Mailing-lists: rsync exim spool directory - https://phabricator.wikimedia.org/T110440#1577996 (10Dzahn) 3NEW a:3Dzahn [00:43:28] 6operations, 10Wikimedia-Mailing-lists: test sending individual mails from fermium during migration - https://phabricator.wikimedia.org/T110441#1578005 (10Dzahn) 3NEW a:3Dzahn [00:43:41] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1578013 (10Dzahn) "also tell ops" -> added on T110133 "stop exim, copy spool directories" -> T110440, T110136 
"exim commands cheat sheet" -> T110441 [00:45:20] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1578016 (10Dzahn) [00:46:05] 6operations, 6Phabricator: apache on iridium segfaults (so far this has triggered two phabricator outages in 6 hours) - https://phabricator.wikimedia.org/T109941#1578029 (10mmodell) So far so good :) [00:46:34] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1578032 (10Dzahn) see updates from T108071#1562876 and following [00:47:04] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1578034 (10Dzahn) done. we have a copy of archives and configs over an fermium again [00:48:06] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1578045 (10Dzahn) [00:48:08] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1578043 (10Dzahn) 5Open>3Resolved a:3Dzahn [00:48:35] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1450894 (10Dzahn) [00:48:36] 6operations, 10Wikimedia-Mailing-lists: write migration plan for mailman - https://phabricator.wikimedia.org/T109467#1578046 (10Dzahn) 5Open>3Resolved resolving, all action items are done or have a follow-up ticket linked [00:54:51] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1578058 (10Dzahn) sizes after considerable cleanup (qfiles/bad, shunt, Gigabytes deleted): 17:54 91G archives 17:54 4.4G data 1... 
[00:55:49] (03PS5) 10Dzahn: mailman: also import held messages and qfiles [puppet] - 10https://gerrit.wikimedia.org/r/234138 (https://phabricator.wikimedia.org/T110131) [01:00:57] (03PS7) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 [01:01:05] (03CR) 10jenkins-bot: [V: 04-1] WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [01:12:22] (03PS1) 10Dzahn: mailman: fix path error in sync script [puppet] - 10https://gerrit.wikimedia.org/r/234199 [01:12:29] (03CR) 10jenkins-bot: [V: 04-1] mailman: fix path error in sync script [puppet] - 10https://gerrit.wikimedia.org/r/234199 (owner: 10Dzahn) [01:12:33] (03PS2) 10Dzahn: mailman: fix path error in sync script [puppet] - 10https://gerrit.wikimedia.org/r/234199 [01:12:57] (03CR) 10Dzahn: [C: 032] mailman: fix path error in sync script [puppet] - 10https://gerrit.wikimedia.org/r/234199 (owner: 10Dzahn) [01:15:07] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:16:08] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:17:07] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 504 bytes in 0.001 second response time [01:18:07] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10847 bytes in 0.097 second response time [01:20:47] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1578138 (10Dzahn) [01:20:49] 6operations, 10Wikimedia-Mailing-lists: rsync all configs and archives one more time - https://phabricator.wikimedia.org/T110129#1578136 (10Dzahn) 5Resolved>3Open we will have to do it yet another time, because: - we want to run a full import to see how long it takes - we don't have enough 
diskspace to ke... [01:24:47] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1578155 (10mmodell) So a canary is just a rolling deploy with a specified starting host (canary) followed by a modal "continue y/n" prompt... [01:40:22] (03PS1) 10Yurik: Tilerator config - add redis connection [puppet] - 10https://gerrit.wikimedia.org/r/234203 [01:47:05] (03PS1) 10Yurik: Relax REFERER restrictions for Maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/234205 [01:49:59] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2020_v6 [01:51:59] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:55:37] 6operations, 6Discovery, 10Maps: Determine limited maps deployment options - https://phabricator.wikimedia.org/T109159#1578189 (10Yurik) Update: per today's meeting between @bblack, @akosiaris, @maxsem, and myself, the current strategy is to allow low-traffic "trial mode" operations, with plenty of warnings... 
[01:55:53] (03PS2) 10Yurik: Relax REFERER restrictions for Maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/234205 (https://phabricator.wikimedia.org/T109159) [02:05:58] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [02:38:36] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 10m 44s) [02:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:44:05] 6operations, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1578453 (10BBlack) 3NEW [02:44:27] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:46:20] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10848 bytes in 0.141 second response time [02:49:22] 6operations, 10Traffic: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578472 (10BBlack) 3NEW [02:50:41] 6operations, 10Traffic: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578479 (10BBlack) [02:50:49] 6operations, 10Traffic: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578472 (10BBlack) [02:52:10] 6operations, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1578486 (10BBlack) 3NEW [02:53:31] 6operations, 10Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1578493 (10BBlack) 3NEW [02:54:05] 6operations, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1578500 (10BBlack) 3NEW [02:54:21] 6operations, 10Traffic: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1578507 (10BBlack) 3NEW [03:00:49] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: 
Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1578525 (10BBlack) 5Open>3Resolved This is reasonably-well resolved at this point, although we still have intermittent v6 dropouts that rarely make... [03:01:32] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1578528 (10BBlack) [03:02:01] 6operations, 10Analytics-Cluster, 10Traffic, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1115779 (10BBlack) [03:03:05] 6operations, 10ContentTranslation-Deployments, 10Parsoid, 10Traffic, 10VisualEditor: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578531 (10Krenair) wmf-config/CommonSettings-labs.php: $wgContentTranslationParsoid['url'] = 'http://parsoid-lb.eqiad.wikimedia... [03:05:14] (03PS1) 10Mattflaschen: Set $wgFlowMigrateReferenceWiki false on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) [03:05:29] 6operations, 5Interdatacenter-IPsec: IPsec: add firewall rules - https://phabricator.wikimedia.org/T85823#1578537 (10BBlack) 5Open>3declined At this point, I think even if we evaded the iptables perf issues, I'm not comfortable enough with the risk balance to try to enforce ipsec-only communication with ou... [03:05:47] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. [03:06:11] twentyafterfour, jynus ^? [03:06:17] aand working now [03:09:09] bd808, is there a phabricator project for iegreview? 
[03:10:13] Krenair: yes, https://phabricator.wikimedia.org/tag/wikimedia-ieg-grant-review/ [03:10:33] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578472 (10Krenair) [03:15:15] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578558 (10ssastry) We also use this for visual diff testing (but, we could potentially use RESTBase for it by f... [03:18:22] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1578559 (10Bugreporter) [03:18:38] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1575485 (10Bugreporter) Just resolve these three bugs. [03:29:11] 6operations, 10Traffic: Expand misc cluster into cache PoPs - https://phabricator.wikimedia.org/T101339#1578574 (10BBlack) [03:38:22] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578580 (10BBlack) Yeah all of that will have to be addressed too. I had assumed cxserver (et al) had their own... [03:40:20] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578581 (10BBlack) I just looked, cxserver does have it's own hostnames (mapped to the parsoid IP) at `cxserver.... 
[03:45:38] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [03:50:08] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [03:50:18] PROBLEM - puppet last run on mw2196 is CRITICAL puppet fail [03:51:37] RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 51.89 ms [04:02:07] 6operations, 6Phabricator, 7Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1578625 (10Dzahn) [04:04:42] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1578626 (10Dzahn) a full import with `./import_all_lists.sh` took: real 115m26.589s [04:05:04] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578627 (10BBlack) So, I took a 1 hour log of all traffic on the 2x varnish frontends for parsoidcache with any... [04:06:15] bblack, are all of the https redirects for normal wiki sites done at the varnish level these days? [04:06:52] I guess it depends on your definition of "normal wiki sites", but I think the answer is yes [04:07:31] the ultra-verbose answer is in the top description on https://phabricator.wikimedia.org/T104681 [04:07:33] Normal wiki sites in production that we run that aren't in frack and aren't wikitech. [04:07:54] if they're one of the main projects like wikipedia.org, wikiversity.org, etc. Yes [04:07:59] I ask because I noticed apache still has a bunch of rewrite rules for it. [04:08:20] yeah it does. we'll eventually clean that up, but I want to get the varnish/nginx stuff to a better place first. 
[04:08:34] so that there's less chance a regex screwup in one place drops all the redirects, which already happened once :/ [04:09:37] "a better place" meaning - not so many crazy little one-off exceptions for broken things making it complicated. nginx just universally redirects all HTTP requests without really looking at them. [04:09:57] okay [04:10:08] thanks bblack [04:13:48] 6operations, 10Traffic, 7HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1578635 (10greg) [04:14:29] I see what you did there g-g :p [04:14:34] :) :) [04:14:47] I was actually hoping the link would be easier [04:15:18] I couldn't figure out the native link to a file markup, if it exists [04:17:58] RECOVERY - puppet last run on mw2196 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [04:19:51] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1578638 (10Dzahn) `./bin/list_lists` says 556 matching mailing lists found on fermium. But the same command says 558 (!) on sodium. Figure out the diff! [04:20:07] PROBLEM - puppet last run on mw2114 is CRITICAL Puppet has 1 failures [04:24:26] 6operations: grafana.wikimedi.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1578639 (10greg) 3NEW [04:26:16] 6operations: grafana.wikimedia.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1578646 (10Krenair) [04:42:37] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1578657 (10Dzahn) it probably has to do with the way these have been disabled in the past. i see: `wlm-cn.disabled.rt8567` and `licom-l.disabled.rt7307`... 
[04:42:38] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:47:38] RECOVERY - puppet last run on mw2114 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [04:49:40] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1578658 (10Dzahn) ``` Importing fix_url... Running fix_url.fix_url()... Loading list foundation-l (locked) Unknown list: foundation-l Traceback (most recen... [04:55:53] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1578661 (10Dzahn) Lists to check for import issues (because there are still files left that should have been deleted if it was succesful). advocacy_adviso... [05:16:07] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [05:41:39] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:00:48] <_joe_> !log powercycling mw2140, not responding to ping, blank console [06:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:02:28] RECOVERY - Host mw2140 is UPING OK - Packet loss = 0%, RTA = 52.32 ms [06:14:37] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [06:17:30] <_joe_> \o/ [06:17:44] <_joe_> ^^ someone removed the readonly remote on tin [06:28:12] Readonly remote? 
[06:28:32] We should only be using https on tin...which is not read only :p [06:30:38] PROBLEM - puppet last run on eventlog2001 is CRITICAL puppet fail [06:30:39] PROBLEM - puppet last run on db1015 is CRITICAL Puppet has 1 failures [06:31:46] <_joe_> ostriches: there is a git remote in mediawiki-staging called "readonly" that I created for monitoring purposes [06:31:47] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on db1018 is CRITICAL Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures [06:32:05] <_joe_> it allows us to see if any change is waiting to be brought to production on tin [06:32:13] _joe_: It's probably the same remote since we don't use ssh anymore :p [06:32:27] PROBLEM - puppet last run on mw1215 is CRITICAL Puppet has 1 failures [06:32:28] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 2 failures [06:32:29] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures [06:32:32] <_joe_> you have "origin" and "gerrit" [06:32:49] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures [06:32:49] <_joe_> I want to have one remote that only the monitoring will touch directly [06:33:02] <_joe_> I mean the local remote, not the remote url [06:33:06] <_joe_> that can be identical [06:33:32] Why does it need a remote that's not origin? [06:35:26] <_joe_> separation of duties, I basically didn't want to mess with your work :) [06:35:55] yesterday, or* and more people were doing black magic on tin, _joe_ [06:36:17] _joe_: ...it wouldn't? The objects are the same regardless of which remote you fetch from [06:36:33] (assuming the remotes are pointing at the same repo, which they are) [06:37:03] <_joe_> jynus: that alarm was up since 1 month :P [06:37:09] oh [06:37:13] Also, those remote urls are /technically/ wrong, even if gerrit redirects them properly. 
[06:37:20] <_joe_> ahah ok [06:38:33] <_joe_> ostriches: suppose you the releaser do git fetch while the monitoring script is running, that would result in a race condition and fail [06:38:50] <_joe_> since the script runs every minute... [06:39:21] <_joe_> there is a non-zero probability of race conditions [06:39:31] !log tin: dropped useless "gerrit" remote from /srv/mediawiki-staging (uses ssh, lol), pointed {origin,readonly} at the actual repo instead of a redirect. [06:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:40:18] <_joe_> ostriches: thanks :) [06:43:37] _joe_: I've never heard of racing with fetching, but I suppose it's possible. [06:43:52] (it's more a checkout or something else affecting the index that I'd see racing, but ymmv) [06:55:48] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:55:58] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:18] RECOVERY - puppet last run on db1015 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:56:18] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:57:13] <_joe_> brb [06:57:18] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:19] RECOVERY - puppet last run on db1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:19] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:59] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:17] RECOVERY - puppet last run on eventlog2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:13:03] (03PS1) 10Muehlenhoff: Move the logstash ingestion rules into 
the role [puppet] - 10https://gerrit.wikimedia.org/r/234217 [07:25:08] PROBLEM - puppet last run on ms-be3001 is CRITICAL Puppet has 1 failures [07:36:10] 6operations, 10hardware-requests, 7Database, 5Patch-For-Review: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1578897 (10jcrespo) Arrangement: ``` ES1 === es1012 [A2] es1018 [D1] es1019 [D3] ES2 === es1011 [A2] MASTER es1013 [B1] es1015 [C2] ES3 === es1014 [B1] MASTER es101... [07:43:00] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1578920 (10Joe) @mmodell "canaries" usually serve a small amount of the service traffic, say 10%, so they're more than one server in gener... [07:51:08] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:53:17] good morning [07:59:19] (03PS1) 10Muehlenhoff: Exempt mediawiki/http from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/234221 [08:25:04] (03CR) 10Alexandros Kosiaris: [C: 032] Tilerator config - add redis connection [puppet] - 10https://gerrit.wikimedia.org/r/234203 (owner: 10Yurik) [08:25:37] (03PS1) 10KartikMistry: Enable 'newarticle' campaign in itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234223 (https://phabricator.wikimedia.org/T109709) [08:46:54] 6operations: mw2187 - read-only filesystem - https://phabricator.wikimedia.org/T109717#1579097 (10akosiaris) We 've seen this error before with scap (my memory says many months ago), albeit with a different host. It was a false alarm back then as well. I don't remember us having debugged this more thoroughly cau... 
[08:48:32] (03PS1) 10Jcrespo: Reorganization of new External Storage nodes [puppet] - 10https://gerrit.wikimedia.org/r/234225 (https://phabricator.wikimedia.org/T105843) [08:49:27] 6operations, 10Deployment-Systems, 5Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1579108 (10akosiaris) @ArielGlenn, what's the status? [08:53:38] (03CR) 10Alexandros Kosiaris: [C: 031] Move the logstash ingestion rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/234217 (owner: 10Muehlenhoff) [08:54:21] (03CR) 10Alexandros Kosiaris: "I think moritz did it in https://gerrit.wikimedia.org/r/234217 ?" [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [08:56:08] (03CR) 10Jcrespo: [C: 032] Reorganization of new External Storage nodes [puppet] - 10https://gerrit.wikimedia.org/r/234225 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [08:58:15] (03CR) 10Muehlenhoff: "Indeed. I made poor design decisions in Ic1b73d4 and I92fc0ca, but let's clean this up properly before it gets enabled in production. I'll" [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [08:59:25] (03PS2) 10Muehlenhoff: Move the logstash ingestion rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/234217 [08:59:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move the logstash ingestion rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/234217 (owner: 10Muehlenhoff) [09:02:17] (03PS2) 10Muehlenhoff: Logstash: make sure all input defines deal with ferm [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [09:09:49] 6operations, 6Services, 3Discovery-Maps-Sprint: Tilerator git deploy has 4/5 issue too - https://phabricator.wikimedia.org/T110434#1579165 (10akosiaris) 5Open>3Resolved a:3akosiaris It was the same issue as with kartotherian, namely tin was in the minions list. I 've removed it manually and it should b... 
[09:14:11] (03PS1) 10Yuvipanda: aptly: Fix typo in role [puppet] - 10https://gerrit.wikimedia.org/r/234231 [09:14:47] (03PS2) 10Yuvipanda: aptly: Fix typo in role [puppet] - 10https://gerrit.wikimedia.org/r/234231 [09:14:55] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Fix typo in role [puppet] - 10https://gerrit.wikimedia.org/r/234231 (owner: 10Yuvipanda) [09:19:06] (03PS1) 10Yuvipanda: aptly: s/url/url/ [puppet] - 10https://gerrit.wikimedia.org/r/234232 [09:19:20] (03PS2) 10Yuvipanda: aptly: s/url/url/ [puppet] - 10https://gerrit.wikimedia.org/r/234232 [09:19:27] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: s/url/url/ [puppet] - 10https://gerrit.wikimedia.org/r/234232 (owner: 10Yuvipanda) [09:20:36] <_joe_> YuviPanda: aha the "puppet walk of shame" when you find all your typos :) [09:20:49] _joe_: yes yes :) [09:21:29] <_joe_> s/url/url/ sounds strange [09:21:40] <_joe_> you're substituting a term with itself [09:21:48] <_joe_> maybe that's cool in perl6, dunno [09:21:51] hahahahaha [09:21:54] typo in commit message... [09:21:58] ./me facepalms [09:22:05] this is clearly too early for me to be writing anything critical [09:22:18] <_joe_> ./me: command-not-found [09:22:33] <_joe_> we could have a more informative message, but we love messing with customers [09:22:38] <_joe_> :D [09:22:43] <_joe_> sorry, couldn't resist [09:25:53] wheee, it's mostly working! [09:26:12] (03CR) 10Alexandros Kosiaris: "yeah, I think it's cool" [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [09:27:31] !log installing and configuring servers es1012-es1019 [09:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:29:14] (03CR) 10Giuseppe Lavagetto: "Can I ask why you didn't use apache::mpm directly for this?" [puppet] - 10https://gerrit.wikimedia.org/r/233906 (owner: 10Rush) [09:36:55] sorry about the alerts, sometimes icinga beats me in the install race (I cannot disable a check that doesn't exist!) 
[09:48:59] (03PS1) 10Hashar: gdash: fix parser cache metrics [puppet] - 10https://gerrit.wikimedia.org/r/234234 [09:50:10] (03PS2) 10Hashar: gdash: fix parser cache metrics [puppet] - 10https://gerrit.wikimedia.org/r/234234 [09:53:00] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1579261 (10Seb35) Both tasks have patches written by @Bawolff and tested by me, waiting for reviewers and +2. @BBlack: the g... [10:08:34] 6operations, 7Graphite, 7Monitoring: evaluate tessera dashboards - https://phabricator.wikimedia.org/T104366#1579280 (10hashar) //I have mostly played with Grafana// Grafana: * same interface as Kibana (logstash) so I feel at home * the Graphite queries builder with auto completion is very useful * variable... [10:13:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK Less than 1.00% above the threshold [1000000.0] [10:15:46] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [10:18:20] 6operations, 7Database: defragment db1015 - https://phabricator.wikimedia.org/T110504#1579303 (10jcrespo) 3NEW a:3jcrespo [10:23:35] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1579312 (10akosiaris) Noting that we 've been contacted by the hosting company. [10:29:30] 6operations, 7Monitoring: grafana.wikimedia.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1579323 (10akosiaris) p:5Triage>3Normal [10:32:17] 6operations, 7Monitoring: grafana.wikimedia.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1579327 (10akosiaris) I am assuming this has been going for a long time. Questions: * What kind of privacy issues does it create. * In case AWS goes down, how much functionality do we l... 
[10:34:29] 6operations, 7Monitoring: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#1579329 (10akosiaris) p:5Triage>3High Raised to high as this might save us from outages [10:34:56] 6operations, 10ops-eqiad: es1005 and es1006 have degraded RAIDs (failed disks each) - https://phabricator.wikimedia.org/T110008#1579333 (10akosiaris) p:5Triage>3Normal [10:37:00] 6operations, 5Patch-For-Review, 7Pybal: Configure pybal ulimits higher - https://phabricator.wikimedia.org/T110091#1579348 (10akosiaris) Can I assume this is resolved now ? [10:38:07] 6operations, 10Datasets-Archiving: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1579349 (10akosiaris) p:5Triage>3Low [10:38:39] 6operations, 7Monitoring: grafana.wikimedia.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1579353 (10yuvipanda) >>! In T110484#1579327, @akosiaris wrote: > I am assuming this has been going for a long time. Questions: > > * What kind of privacy issues does it create. Expose... 
[10:38:42] 6operations, 10Security-Reviews: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1579354 (10akosiaris) p:5Triage>3Normal [10:42:46] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:12] (03PS1) 10Hashar: contint: drop /data/project/debianrepo [puppet] - 10https://gerrit.wikimedia.org/r/234239 [10:45:26] (03PS1) 10Filippo Giunchedi: swift: lower conntrack TIME_WAIT timeout [puppet] - 10https://gerrit.wikimedia.org/r/234240 [10:45:27] 6operations, 6Discovery, 10Maps, 10Traffic: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1579377 (10akosiaris) p:5Triage>3Normal [10:46:12] (03PS1) 10Yuvipanda: aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 [10:46:15] godog: ^ is the design [10:46:20] well [10:46:25] the implementation of design. [10:46:58] (03CR) 10jenkins-bot: [V: 04-1] aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 (owner: 10Yuvipanda) [10:47:37] (03PS2) 10Yuvipanda: aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 [10:48:19] (03CR) 10jenkins-bot: [V: 04-1] aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 (owner: 10Yuvipanda) [10:48:57] (03PS3) 10Yuvipanda: aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 [10:48:59] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/234239 (owner: 10Hashar) [10:49:55] (03PS4) 10Yuvipanda: aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 [10:50:03] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Auto setup published repos by default [puppet] - 10https://gerrit.wikimedia.org/r/234241 (owner: 10Yuvipanda) [10:51:54] (03PS1) 10Yuvipanda: aptly: Switch class to define [puppet] - 10https://gerrit.wikimedia.org/r/234242 [10:52:00] (03CR) 10jenkins-bot: [V: 04-1] aptly: Switch class to define [puppet] - 10https://gerrit.wikimedia.org/r/234242 (owner: 10Yuvipanda) [10:52:05] (03PS2) 10Yuvipanda: aptly: Switch class to define [puppet] - 10https://gerrit.wikimedia.org/r/234242 [10:53:51] 6operations, 6Discovery, 10Maps, 10Traffic: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1579396 (10akosiaris) a:3BBlack I agree with the statements above. A couple of notes. * A botnet asking all high zoom level tiles would indeed cause cache eviction... 
[10:54:46] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Switch class to define [puppet] - 10https://gerrit.wikimedia.org/r/234242 (owner: 10Yuvipanda) [10:58:03] (03PS1) 10Jcrespo: Pool es1011, depool es1008 as storage nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234244 (https://phabricator.wikimedia.org/T105843) [10:58:31] (03PS2) 10Alexandros Kosiaris: maps: Grant redis stop/start/enable/disable sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/234034 (https://phabricator.wikimedia.org/T106637) [11:01:29] (03PS2) 10Jcrespo: Pool es1011, depool es1008 as storage nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234244 (https://phabricator.wikimedia.org/T105843) [11:02:10] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Grant redis stop/start/enable/disable sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/234034 (https://phabricator.wikimedia.org/T106637) (owner: 10Alexandros Kosiaris) [11:02:52] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1579427 (10akosiaris) [11:02:59] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1473834 (10akosiaris) Redis done as well [11:03:03] (03PS1) 10Yuvipanda: aptly: Specify arch explicitly for publishing [puppet] - 10https://gerrit.wikimedia.org/r/234246 [11:03:09] (03CR) 10jenkins-bot: [V: 04-1] aptly: Specify arch explicitly for publishing [puppet] - 10https://gerrit.wikimedia.org/r/234246 (owner: 10Yuvipanda) [11:03:14] (03PS2) 10Yuvipanda: aptly: Specify arch explicitly for publishing [puppet] - 10https://gerrit.wikimedia.org/r/234246 [11:05:50] (03CR) 10Yuvipanda: "Why a database? Why can't we just store that info in memory? If we lose that it's no big deal, no?" 
[software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [11:07:34] (03CR) 10Yuvipanda: "We should also probably just write to services.log, and maybe provide symlinks? Also let's not call it BigBrother - IMO that was a histori" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [11:07:47] (03CR) 10Jcrespo: [C: 032] Pool es1011, depool es1008 as storage nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234244 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [11:08:41] (03CR) 10Yuvipanda: [C: 04-1] "And I definitely don't think this should parse .bigbrotherrc files and use them! it should read from service.manifest only. When I convert" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [11:09:31] (03CR) 10Yuvipanda: "However, if your intention is to get rid of the current perl script and then 'fix' this later, that's cool too. I just hope that the origi" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [11:10:00] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool es1011 for the first time, depool es1008 (duration: 00m 12s) [11:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:15] (03CR) 10Yuvipanda: [C: 032] aptly: Specify arch explicitly for publishing [puppet] - 10https://gerrit.wikimedia.org/r/234246 (owner: 10Yuvipanda) [11:12:11] PROBLEM - HHVM rendering on mw1143 is CRITICAL - Socket timeout after 10 seconds [11:13:41] PROBLEM - Apache HTTP on mw1143 is CRITICAL - Socket timeout after 10 seconds [11:13:55] (03PS1) 10Yuvipanda: aptly: Fix typo in distribution name [puppet] - 10https://gerrit.wikimedia.org/r/234248 [11:13:57] (03PS1) 10Yuvipanda: aptly: Setup depnedencies properly [puppet] - 10https://gerrit.wikimedia.org/r/234249 [11:14:17] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Fix typo in distribution name 
[puppet] - 10https://gerrit.wikimedia.org/r/234248 (owner: 10Yuvipanda) [11:14:50] (03CR) 10jenkins-bot: [V: 04-1] aptly: Setup depnedencies properly [puppet] - 10https://gerrit.wikimedia.org/r/234249 (owner: 10Yuvipanda) [11:15:43] (03PS2) 10Yuvipanda: aptly: Setup depnedencies properly [puppet] - 10https://gerrit.wikimedia.org/r/234249 [11:16:02] (03CR) 10Yuvipanda: [C: 032 V: 032] aptly: Setup depnedencies properly [puppet] - 10https://gerrit.wikimedia.org/r/234249 (owner: 10Yuvipanda) [11:19:22] PROBLEM - HHVM busy threads on mw1143 is CRITICAL 100.00% of data above the critical threshold [86.4] [11:19:52] PROBLEM - HHVM queue size on mw1143 is CRITICAL 100.00% of data above the critical threshold [80.0] [11:25:03] 6operations, 10ops-eqiad: es1005 and es1006 have degraded RAIDs (failed disks each) - https://phabricator.wikimedia.org/T110008#1579488 (10jcrespo) 5Open>3declined a:3jcrespo New hardware replacement arrived on time, no need for this anymore. [11:25:52] (03PS1) 10Yuvipanda: aptly: Change naming schemes for repositories [puppet] - 10https://gerrit.wikimedia.org/r/234250 [11:26:23] godog: ^ have changed my mind about naming, that scheme made it quite different from what apt.wikimedia.org uses, so I've now switched to distribution being jessie-$projectname and component being just main [11:26:30] which seems ok [11:26:42] and more consistent with the wmf repo [11:27:17] (03CR) 10Yuvipanda: [C: 032] aptly: Change naming schemes for repositories [puppet] - 10https://gerrit.wikimedia.org/r/234250 (owner: 10Yuvipanda) [11:28:49] (03PS3) 10Yuvipanda: base: Don't install command-not-found-data either [puppet] - 10https://gerrit.wikimedia.org/r/232867 (owner: 10Tim Landscheidt) [11:29:07] (03CR) 10Yuvipanda: [C: 032 V: 032] base: Don't install command-not-found-data either [puppet] - 10https://gerrit.wikimedia.org/r/232867 (owner: 10Tim Landscheidt) [11:29:31] PROBLEM - HHVM busy threads on mw1143 is CRITICAL 100.00% of data above the critical threshold [86.4] 
[11:31:24] (03PS1) 10Yuvipanda: aptly: Get rid of 'main' in aptly repo name [puppet] - 10https://gerrit.wikimedia.org/r/234251 [11:31:38] (03PS2) 10Yuvipanda: aptly: Get rid of 'main' in aptly repo name [puppet] - 10https://gerrit.wikimedia.org/r/234251 [11:32:31] (03CR) 10Yuvipanda: [C: 032] aptly: Get rid of 'main' in aptly repo name [puppet] - 10https://gerrit.wikimedia.org/r/234251 (owner: 10Yuvipanda) [11:33:36] <_joe_> !log restarted hhvm on mw1143, locked in __lll_lock_wait for stat_cache deadlock [11:33:42] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.290 second response time [11:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:12] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 65095 bytes in 0.397 second response time [11:37:23] godog: https://phabricator.wikimedia.org/T104194#1579530 is the final setup [11:41:31] RECOVERY - HHVM busy threads on mw1143 is OK Less than 30.00% above the threshold [57.6] [11:41:53] RECOVERY - HHVM queue size on mw1143 is OK Less than 30.00% above the threshold [10.0] [11:44:29] (03CR) 10Glaisher: "https://phabricator.wikimedia.org/P1812" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [11:45:13] (03PS3) 10Glaisher: Lift of IP cap on ta.wikipedia for IP 218.248.16.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) (owner: 10Shanmugamp7) [11:46:01] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [11:46:09] (03CR) 10Glaisher: [C: 031] "Per my comment on the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) (owner: 10Shanmugamp7) [11:51:07] (03CR) 10Matthias Mullie: [C: 031] "LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/233100 (https://phabricator.wikimedia.org/T109816) (owner: 10Mjbmr) [11:58:44] (03PS1) 10Giuseppe Lavagetto: Add unit tests for HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234253 [11:59:04] (03CR) 10jenkins-bot: [V: 04-1] Add unit tests for HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234253 (owner: 10Giuseppe Lavagetto) [11:59:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds [12:02:36] (03PS1) 10Hashar: contint: upgrade setuptools from pypi [puppet] - 10https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) [12:02:54] <_joe_> hashar: update mock as well, while you're at it :P [12:03:01] <_joe_> (ref, ^^) [12:03:18] (03PS2) 10Giuseppe Lavagetto: Add unit tests for HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234253 [12:03:29] (03PS1) 10Yuvipanda: aptly: Name distribution correctly in clients [puppet] - 10https://gerrit.wikimedia.org/r/234255 [12:03:43] <_joe_> hashar: you know we have puppet-SWAT now right? [12:04:23] (03PS1) 10Chmarkine: Rewrite sitemap.wikimedia.org to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234256 (https://phabricator.wikimedia.org/T110511) [12:05:07] (03CR) 10Yuvipanda: "Can you explain / provide more details about 'not currently working'? Are there any tickets?" [puppet] - 10https://gerrit.wikimedia.org/r/234186 (owner: 10Alex Monk) [12:05:24] (03CR) 10Hashar: [C: 04-1] "Cherry picked on puppetmaster. Holding this change for a while in case it has unwanted side effects." 
[puppet] - 10https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) (owner: 10Hashar) [12:05:27] (03PS2) 10Yuvipanda: aptly: Name distribution correctly in clients [puppet] - 10https://gerrit.wikimedia.org/r/234255 [12:05:49] (03PS1) 10Chmarkine: Point sitemap.wikimedia.org to text-lb. [dns] - 10https://gerrit.wikimedia.org/r/234257 (https://phabricator.wikimedia.org/T110511) [12:05:56] (03CR) 10Yuvipanda: [C: 032] aptly: Name distribution correctly in clients [puppet] - 10https://gerrit.wikimedia.org/r/234255 (owner: 10Yuvipanda) [12:06:07] (03CR) 10Yuvipanda: [V: 032] aptly: Name distribution correctly in clients [puppet] - 10https://gerrit.wikimedia.org/r/234255 (owner: 10Yuvipanda) [12:06:24] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me and confirmed by our tests on ms-be2001." [puppet] - 10https://gerrit.wikimedia.org/r/234240 (owner: 10Filippo Giunchedi) [12:11:53] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:29:36] (03PS1) 10BBlack: codfw LVS installer -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/234260 (https://phabricator.wikimedia.org/T96375) [12:31:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 8 below the confidence bounds [12:33:01] (03CR) 10BBlack: [C: 032] codfw LVS installer -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/234260 (https://phabricator.wikimedia.org/T96375) (owner: 10BBlack) [12:34:28] (03PS3) 10Andrew Bogott: Install wmf salt version rather than setting up the upstream repo. [puppet] - 10https://gerrit.wikimedia.org/r/233403 (https://phabricator.wikimedia.org/T110032) [12:37:49] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1579655 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [12:37:55] (03CR) 10Andrew Bogott: [C: 032] Install wmf salt version rather than setting up the upstream repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/233403 (https://phabricator.wikimedia.org/T110032) (owner: 10Andrew Bogott) [12:39:20] (03PS1) 10Andrew Bogott: Switch the nic for labnet1001 install. [puppet] - 10https://gerrit.wikimedia.org/r/234261 [12:40:59] (03PS2) 10Andrew Bogott: Switch the nic for labnet1001 install. [puppet] - 10https://gerrit.wikimedia.org/r/234261 [12:42:22] (03CR) 10Andrew Bogott: [C: 032] Switch the nic for labnet1001 install. [puppet] - 10https://gerrit.wikimedia.org/r/234261 (owner: 10Andrew Bogott) [12:45:40] !log re-imaging labnet1001 (I hope) [12:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:46:18] (03PS3) 10Muehlenhoff: Logstash: make sure all input defines deal with ferm [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [12:46:41] !log re-imaging lvs2006 [12:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:50:02] 7Puppet, 6Labs: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903#1579701 (10valhallasw) Or, from ldap: ```ldapsearch -h ldap-eqiad.wikimedia.org -p 389 -D "cn=proxyagent,ou=profile,dc=wikimedia,dc=org" -w (...) "(dc=208.80.155.130)" -b ou=hosts,dc=wikimedia,dc=org associate... [12:52:25] YuviPanda: nice! yeah that'll be okay as a default, worst case it can be revisited [12:52:42] PROBLEM - Host labnet1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:50] godog: yeah. [12:54:46] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1579712 (10MoritzMuehlenhoff) Fixing the default config will only limit the window; there's still a window between loading /etc/ferm/conf.d/00_main (which sets up the DROP policy) and the later rules which enable the permi... 
[12:55:23] (03PS2) 10Chmarkine: Rewrite sitemap.wikimedia.org to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234256 (https://phabricator.wikimedia.org/T110511) [12:55:41] (03PS3) 10Chmarkine: Rewrite sitemap.wikimedia.org to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234256 (https://phabricator.wikimedia.org/T110511) [12:55:46] 6operations, 10Datasets-Archiving: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1579717 (10Nemo_bis) As T106112 was oddly chosen ([[https://meta.wikimedia.org/wiki/Wikimania_Handbook#Upload|contrary to the Wikimania handbook]]), that server will need to be contacted from terbium ove... [12:55:51] (03PS2) 10Chmarkine: Point sitemap.wikimedia.org to text-lb. [dns] - 10https://gerrit.wikimedia.org/r/234257 (https://phabricator.wikimedia.org/T110511) [12:57:21] 6operations, 7HTTPS, 5Patch-For-Review: sitemap.wikimedia.org uses invalid SSL certificate - https://phabricator.wikimedia.org/T110511#1579722 (10Chmarkine) [12:57:51] (03PS2) 10Filippo Giunchedi: swift: lower conntrack TIME_WAIT timeout [puppet] - 10https://gerrit.wikimedia.org/r/234240 [12:57:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: lower conntrack TIME_WAIT timeout [puppet] - 10https://gerrit.wikimedia.org/r/234240 (owner: 10Filippo Giunchedi) [13:01:19] 6operations, 7HTTPS, 5Patch-For-Review: sitemap.wikimedia.org uses invalid SSL certificate - https://phabricator.wikimedia.org/T110511#1579734 (10Chmarkine) [13:01:21] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1579733 (10Chmarkine) [13:01:36] (03PS1) 10Andrew Bogott: Revert "Switch the nic for labnet1001 install." [puppet] - 10https://gerrit.wikimedia.org/r/234263 [13:01:37] 6operations, 7HTTPS, 5Patch-For-Review: sitemap.wikimedia.org uses invalid SSL certificate - https://phabricator.wikimedia.org/T110511#1579560 (10Chmarkine) I copied the CC list of T107575 to this one. 
[13:04:13] (03PS2) 10Andrew Bogott: Revert "Switch the nic for labnet1001 install." [puppet] - 10https://gerrit.wikimedia.org/r/234263 [13:04:31] !log running leader election now that all topics and partitions are rebalanced across new kafka nodes [13:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:04:41] (03PS4) 10Chmarkine: Rewrite sitemap.wikimedia.org to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234256 (https://phabricator.wikimedia.org/T110511) [13:05:10] (03CR) 10Andrew Bogott: [C: 032] Revert "Switch the nic for labnet1001 install." [puppet] - 10https://gerrit.wikimedia.org/r/234263 (owner: 10Andrew Bogott) [13:06:46] 6operations, 5Patch-For-Review, 7Pybal: Configure pybal ulimits higher - https://phabricator.wikimedia.org/T110091#1579745 (10BBlack) p:5Triage>3Normal [13:07:24] 6operations, 5Patch-For-Review, 7Pybal: Configure pybal ulimits higher - https://phabricator.wikimedia.org/T110091#1579747 (10BBlack) a:3BBlack Keeping it open as a reminder to fix for LVS on jessie w/ systemd properly, since the current fix is a hack. I'm booting the first LVS jessie box now. 
[13:09:01] (03PS1) 10Muehlenhoff: Enable base::firewall for swift storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/234264 [13:09:03] !log cloning es1008 into es1014 [13:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:09:13] !log disable puppet on ms-be2* in preparation for firewall changes [13:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:09:48] Krenair: I commented on https://gerrit.wikimedia.org/r/#/c/234186/ [13:09:57] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::firewall for swift storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/234264 (owner: 10Muehlenhoff) [13:10:39] (03PS2) 10Filippo Giunchedi: Enable base::firewall for swift storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/234264 (owner: 10Muehlenhoff) [13:10:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Enable base::firewall for swift storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/234264 (owner: 10Muehlenhoff) [13:12:25] (03PS1) 10Ottomata: Decom analytics1021 as a Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/234265 (https://phabricator.wikimedia.org/T106581) [13:14:03] (03PS1) 10Ottomata: Re-enable auto create topics for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/234266 [13:14:25] (03CR) 10Ottomata: [C: 032 V: 032] Decom analytics1021 as a Kafka broker [puppet] - 10https://gerrit.wikimedia.org/r/234265 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [13:14:51] (03CR) 10Ottomata: [C: 032 V: 032] Re-enable auto create topics for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/234266 (owner: 10Ottomata) [13:14:54] RECOVERY - Host labnet1001 is UPING OK - Packet loss = 0%, RTA = 1.10 ms [13:19:03] PROBLEM - salt-minion processes on labnet1001 is CRITICAL: Connection refused by host [13:19:14] PROBLEM - DPKG on labnet1001 is CRITICAL: Connection refused by host [13:19:44] PROBLEM - Disk space on labnet1001 is CRITICAL: 
Timeout while attempting connection [13:20:04] PROBLEM - RAID on labnet1001 is CRITICAL: Timeout while attempting connection [13:20:29] PROBLEM - dhclient process on labnet1001 is CRITICAL: Timeout while attempting connection [13:21:15] !log stopping kafka on analytics1021, it is no longer a kafka broker. [13:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:55] !log enable puppet on ms-be2* [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:15] 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch, 5Patch-For-Review: elasticsearch 1.6.0 fails to start after reboot - https://phabricator.wikimedia.org/T109497#1579828 (10JanZerebecki) This is fixed in the elasticsearch package 1.7.1 that is in the wikimedia repos for tru... [13:22:39] PROBLEM - configured eth on labnet1001 is CRITICAL: Timeout while attempting connection [13:23:09] (03PS3) 10Filippo Giunchedi: gdash: fix parser cache metrics [puppet] - 10https://gerrit.wikimedia.org/r/234234 (owner: 10Hashar) [13:23:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: fix parser cache metrics [puppet] - 10https://gerrit.wikimedia.org/r/234234 (owner: 10Hashar) [13:28:01] (03PS1) 10Filippo Giunchedi: swift: enable base::firewall on ms-fe2* [puppet] - 10https://gerrit.wikimedia.org/r/234268 [13:28:17] (03PS1) 10BBlack: Require ethtool package for ethtool execs [puppet] - 10https://gerrit.wikimedia.org/r/234269 [13:28:58] PROBLEM - nova-api process on labnet1001 is CRITICAL: Connection refused by host [13:29:08] !log doing rolling restart of kafka brokers to apply auto_create_topics change [13:29:11] (03CR) 10BBlack: [C: 032] Require ethtool package for ethtool execs [puppet] - 10https://gerrit.wikimedia.org/r/234269 (owner: 10BBlack) [13:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:59] RECOVERY - DPKG on labnet1001 is OK: All packages OK 
[13:30:09] RECOVERY - Disk space on labnet1001 is OK: DISK OK [13:30:38] RECOVERY - configured eth on labnet1001 is OK - interfaces up [13:31:09] RECOVERY - RAID on labnet1001 is OK [13:32:09] RECOVERY - nova-api process on labnet1001 is OK: PROCS OK: 37 processes with regex args ^/usr/bin/python /usr/bin/nova-api [13:33:10] RECOVERY - dhclient process on labnet1001 is OK: PROCS OK: 0 processes with command name dhclient [13:33:54] (03PS2) 10Muehlenhoff: swift: enable base::firewall on ms-fe2* [puppet] - 10https://gerrit.wikimedia.org/r/234268 (owner: 10Filippo Giunchedi) [13:34:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] swift: enable base::firewall on ms-fe2* [puppet] - 10https://gerrit.wikimedia.org/r/234268 (owner: 10Filippo Giunchedi) [13:35:13] (03CR) 10Giuseppe Lavagetto: [C: 032] Add unit tests for FileConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234008 (owner: 10Giuseppe Lavagetto) [13:35:25] 6operations, 10Analytics-Cluster, 10Traffic: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1579863 (10Ottomata) FYI, our Kafka upgrade and expansion is complete. All Kafka brokers are now Jessie, so I think this can proceed. [13:35:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Add unit tests for HttpConfigurationObserver [debs/pybal] - 10https://gerrit.wikimedia.org/r/234253 (owner: 10Giuseppe Lavagetto) [13:36:28] RECOVERY - salt-minion processes on labnet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:37:03] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1579873 (10BBlack) lvs2006 reinstall went fine. They need reboots after successful puppetization for bnx2x params, but that aside this should go pretty smoothly I think. 
[13:45:55] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1579907 (10jcrespo) 5Resolved>3Open [13:47:16] 6operations, 6Phabricator, 7Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1579908 (10jcrespo) [13:47:19] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1544629 (10jcrespo) [13:47:22] !log re-imaging lvs2004 + lvs2005 [13:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:47:33] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1544629 (10jcrespo) 5Open>3Resolved [13:50:12] 6operations, 6Phabricator, 7Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1579935 (10jcrespo) Merged into T109279. Please reopen that one if it happens again. [13:51:15] (03CR) 10BBlack: [C: 04-1] "Holding this a bit on LVS reimagining in codfw, ongoing this morning..." [puppet] - 10https://gerrit.wikimedia.org/r/234205 (https://phabricator.wikimedia.org/T109159) (owner: 10Yurik) [13:51:36] lol "remaginging" [13:52:01] at least I spelled the wrong word correctly in the gerrit message :P [13:53:09] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1579954 (10daniel) [13:54:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 38.46% of data above the critical threshold [500.0] [13:57:50] <_joe_> bblack: is this ^^ you? 
[13:57:53] nope [13:58:02] at least, I don't think it is [13:58:07] <_joe_> ok so I'll keep one eye on it [14:04:55] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1579991 (10jcrespo) I do not think this is fixed. This is is a graph of the number of concurrent active connections of our busiest enwiki production node: {F246... [14:08:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:12:45] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: Timeout while attempting connection [14:14:17] ^ ms-fe2004 is a side effect on ongoing work [14:14:37] RECOVERY - puppet last run on ms-fe2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:05] !log reenable puppet on ms-fe2* [14:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:17:32] (03PS1) 10Alexandros Kosiaris: maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 [14:18:36] (03PS1) 10Muehlenhoff: Enable base::firewall on eqiad storage servers [puppet] - 10https://gerrit.wikimedia.org/r/234274 [14:19:41] (03PS2) 10Muehlenhoff: Enable base::firewall on eqiad storage servers [puppet] - 10https://gerrit.wikimedia.org/r/234274 [14:20:41] (03PS1) 10Muehlenhoff: Enable base::firewall on eqiad proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/234276 [14:21:55] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1580016 (10akosiaris) p:5Normal>3Low [14:22:31] 6operations, 6Discovery, 10Maps, 10Traffic: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#1580018 (10Yurik) >>! 
In T109162#1579396, @akosiaris wrote: > * A botnet asking all high zoom level tiles would indeed cause cache eviction but high zoom tiles get gen... [14:22:42] !log disable puppet on ms-fe1 / ms-be1 in prepration for puppet work [14:22:43] 6operations, 7HTTPS, 5Patch-For-Review: sitemap.wikimedia.org uses invalid SSL certificate - https://phabricator.wikimedia.org/T110511#1580020 (10Dzahn) Also see T101486. Question is what sitemap.wm.org is even here for or if we should simply remove it. Unless we'd actually do T23765 and use it for that. [14:22:43] (03PS2) 10Alexandros Kosiaris: maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 (https://phabricator.wikimedia.org/T106637) [14:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Enable base::firewall on eqiad storage servers [puppet] - 10https://gerrit.wikimedia.org/r/234274 (owner: 10Muehlenhoff) [14:23:47] (03PS1) 10Ottomata: Increasing num.replica.fetchers all around the cluster [puppet] - 10https://gerrit.wikimedia.org/r/234278 [14:33:01] (03PS2) 10Ottomata: Increasing num.replica.fetchers all around the cluster [puppet] - 10https://gerrit.wikimedia.org/r/234278 [14:34:08] (03PS2) 10Dzahn: Exempt mediawiki/http from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/234221 (owner: 10Muehlenhoff) [14:34:53] (03CR) 10Dzahn: [C: 031] "PS2 just aligned arrows for lint check" [puppet] - 10https://gerrit.wikimedia.org/r/234221 (owner: 10Muehlenhoff) [14:35:01] (03CR) 10Ottomata: [C: 032] Increasing num.replica.fetchers all around the cluster [puppet] - 10https://gerrit.wikimedia.org/r/234278 (owner: 10Ottomata) [14:36:58] (03CR) 10Yuvipanda: [C: 04-1] "(-1ing for puppet SWAT)" [puppet] - 10https://gerrit.wikimedia.org/r/234186 (owner: 10Alex Monk) [14:37:45] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - 
https://phabricator.wikimedia.org/T109947#1580040 (10Dzahn) I thought somebody moved adminbot away from using .deb packages to using phabric? I wasn't really for it but thought that happenened anyways. [14:41:16] 6operations, 10Traffic: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1580041 (10BBlack) 3NEW [14:47:35] !log reenable puppet on ms-be1* [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:02] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1580062 (10Dzahn) Maybe this is another option? 3) On any test host, the puppet compiler or something, run the puppet class that generates our ferm rules before applying it on the actual production host. Then go to the... [14:52:43] !log re-imaging lvs200[123] [14:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:15] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1580071 (10Ironholds_backup) That's nice. I haven't. Could you CC me in? [14:53:50] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1580077 (10Ironholds_backup) Wait, you have. Doh, checking emails from top to bottom ;p [14:54:33] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1580082 (10fgiunchedi) another observed effect, not disruptive per se but can be unexpected. After enabling ferm (and thus conntrack) already established tcp sessions will get broken pipe upon receiving the next packet that... 
[14:55:39] 6operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1580086 (10Aklapper) [14:56:09] 6operations, 10Traffic, 7network: Requests from a specific network are blocked - https://phabricator.wikimedia.org/T110208#1580093 (10BBlack) @Ironholds_backup - The contact is detailed at the top of this ticket, it came through OTRS. My unconfirmed suspicion at this point is that the traffic's coming from... [14:57:14] (03PS1) 10DCausse: Cirrus: add language detector plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/234283 (https://phabricator.wikimedia.org/T110077) [14:57:20] (03CR) 10Filippo Giunchedi: [C: 031] Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [14:58:07] (03CR) 10DCausse: [C: 04-1] "We will test it on beta first." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/234283 (https://phabricator.wikimedia.org/T110077) (owner: 10DCausse) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150827T1500). [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:15] jouncebot: yes Sir! [15:00:24] Who is SWAT'ng? [15:00:28] Wow, the bot got upgraded, cool. [15:00:45] kart_: I can SWAT [15:00:52] hello thcipriani [15:02:26] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234223 (https://phabricator.wikimedia.org/T109709) (owner: 10KartikMistry) [15:02:32] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - https://phabricator.wikimedia.org/T109947#1580115 (10akosiaris) I am not really aware. 
If that's the case we should schedule the removal of the packages from the repo, if not we should update the trusty package. The request is new though (Aug... [15:02:34] (03Merged) 10jenkins-bot: Enable 'newarticle' campaign in itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234223 (https://phabricator.wikimedia.org/T109709) (owner: 10KartikMistry) [15:02:44] 7Puppet, 6Labs: Expose public hostname as Fact in puppet - https://phabricator.wikimedia.org/T101903#1580118 (10Andrew) Please don't pull it from ldap -- that whole dns-backed-by-ldap enterprise is slated for removal. It may be that we'll get proper reverse dns with the new setup; I'm not sure. In general, t... [15:07:27] (03PS1) 10Filippo Giunchedi: WIP: xenon additional instances [dns] - 10https://gerrit.wikimedia.org/r/234286 (https://phabricator.wikimedia.org/T95253) [15:10:47] (03PS1) 10Ottomata: Increase num.replica.fetchers to 12 across the Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/234287 [15:11:15] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable newarticle campaign in itwiki [[gerrit:234223]] (duration: 01m 52s) [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:23] ^ kart_ check please [15:11:37] Testing. [15:11:51] (03CR) 10Ottomata: [C: 032] Increase num.replica.fetchers to 12 across the Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/234287 (owner: 10Ottomata) [15:12:26] thcipriani: Thanks. Working! [15:12:39] kart_: awesome—thanks for testing! 
[15:15:46] (03PS3) 10Alexandros Kosiaris: maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 (https://phabricator.wikimedia.org/T106637) [15:16:17] (03PS1) 10BBlack: add various text backend defs to mobile [puppet] - 10https://gerrit.wikimedia.org/r/234289 (https://phabricator.wikimedia.org/T109286) [15:16:18] (03PS1) 10BBlack: Align mobile VCL much closer to text VCL [puppet] - 10https://gerrit.wikimedia.org/r/234290 (https://phabricator.wikimedia.org/T109286) [15:17:48] godog, shouldn't the xenon hosts be xenon100[12] instead of xenon-[ab]? [15:18:43] Krenair: I agree it should, it is "WIP" and there's more info on the related ticket [15:18:44] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1580190 (10akosiaris) Well, that's a worthy goal, but having to execute a command wouldn't exactly destroy it. In any case, it can be made conditional in a variety of ways. The sanest is probably a flag looked... 
[15:19:00] ok :) [15:21:10] (03PS4) 10Alexandros Kosiaris: maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 (https://phabricator.wikimedia.org/T106637) [15:21:16] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1580206 (10BBlack) [15:21:24] (03PS5) 10Alexandros Kosiaris: maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 (https://phabricator.wikimedia.org/T106637) [15:22:00] 6operations, 10Traffic: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1580209 (10BBlack) [15:22:02] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1215460 (10BBlack) [15:23:25] (03Abandoned) 10Alex Monk: Remove mw2140 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/234186 (owner: 10Alex Monk) [15:24:10] (03PS1) 10Filippo Giunchedi: WIP: xenon additional instances [puppet] - 10https://gerrit.wikimedia.org/r/234292 (https://phabricator.wikimedia.org/T95253) [15:27:09] (03CR) 10BryanDavis: Logstash: make sure all input defines deal with ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [15:30:17] 6operations: Initial ferm setup is disruptive - https://phabricator.wikimedia.org/T110514#1580243 (10chasemp) Worth noting this was so noticeable in the elasticsearch case because we have a low tolerance for node loss at the moment. 
There are some issues open to look into it :) [15:30:38] !log Disabled puppet on logstash100[1-3] prior to trying to enable ferm [15:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:31] moritzm: I'll get a script ready to apply the changes now [15:32:12] bd808: ok, I'm currently preparing a patch to enable base::firewall for logstash100[1-3] [15:33:19] (03PS1) 10Muehlenhoff: Enable ferm on logstash100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/234293 [15:38:00] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1580280 (10matej_suchanek) [15:38:58] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1580286 (10RobH) a:5Slaporte>3RobH [15:40:08] PROBLEM - puppet last run on logstash1001 is CRITICAL: Timeout while attempting connection [15:42:23] (03PS1) 10RobH: policy.wikimedia.org dns record change [dns] - 10https://gerrit.wikimedia.org/r/234296 [15:42:31] (03CR) 10jenkins-bot: [V: 04-1] policy.wikimedia.org dns record change [dns] - 10https://gerrit.wikimedia.org/r/234296 (owner: 10RobH) [15:42:38] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:33] (03CR) 10Alex Monk: "The first time I ran it on tin, it returned "$lang". Every other time it returned null. It did the same on terbium." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [15:43:48] bah, wrong comment delimiter used. 
[15:43:50] ; not # [15:44:25] (03PS2) 10RobH: policy.wikimedia.org dns record change [dns] - 10https://gerrit.wikimedia.org/r/234296 [15:45:08] (03PS3) 10RobH: policy.wikimedia.org dns record change [dns] - 10https://gerrit.wikimedia.org/r/234296 [15:45:54] !log logstash1001 not responding over ssh following ferm rules application; moritzm investigating [15:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:57] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1580308 (10RobH) https://gerrit.wikimedia.org/r/#/c/234296/ is the DNS change to move policy.wikimedia.org from a CNAME to our misc-web cluster to an A record on Wordpress side. I double-checked w... [15:47:22] 6operations, 10Citoid, 10Graphoid, 6Mobile-Apps, and 2 others: SCA services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#1580310 (10bearND) [15:47:49] (03CR) 10RobH: [C: 04-2] "While I think this change is +2 already, it cannot be submitted until we have the migration coordinated with Wordpress." [dns] - 10https://gerrit.wikimedia.org/r/234296 (owner: 10RobH) [15:47:55] !log killed hung ubuntu mirror rsync commands on carbon, from Jul 10 [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:05] (03CR) 10Yuvipanda: [C: 031] Remove apache vhost for non-existent beta incubator wiki [puppet] - 10https://gerrit.wikimedia.org/r/234165 (owner: 10Alex Monk) [15:50:30] (03CR) 10Yuvipanda: [C: 031] Remove apache vhost for *.labs.wikimedia.org from beta [puppet] - 10https://gerrit.wikimedia.org/r/234167 (owner: 10Alex Monk) [15:51:05] (03CR) 10Tim Landscheidt: "The Perl script uses a flat-file database (/data/project/.system/bigbrother.scoreboard) to keep track. 
This makes it possible (in theory," [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [15:51:07] (03CR) 10Yuvipanda: [C: 031] Remove other *.wikimedia.org stuff from beta apache config [puppet] - 10https://gerrit.wikimedia.org/r/234169 (owner: 10Alex Monk) [15:53:50] (03PS1) 10DCausse: Cirrus: set /langdetect/short-text/ the default langdetect profile [puppet] - 10https://gerrit.wikimedia.org/r/234297 (https://phabricator.wikimedia.org/T110077) [15:54:53] (03CR) 10DCausse: [C: 04-1] "We will test it on beta first" [puppet] - 10https://gerrit.wikimedia.org/r/234297 (https://phabricator.wikimedia.org/T110077) (owner: 10DCausse) [15:56:56] (03CR) 10Yuvipanda: "Fair enough if you don't want to do the unification (I can do that later) - but I definitely don't think we should be introducing a depend" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [15:58:17] !log rebooting logstash1001.mgmt.eqiad.wmnet for moritz as it is having issues [15:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:48] RECOVERY - Host logstash1001 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [16:00:04] YuviPanda _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150827T1600). Please do the needful. [16:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:31] Hey krenair [16:00:56] hi [16:01:08] No one else put anything up? [16:01:16] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1580368 (10RobH) The certificate has been ordered, and the key put into our private repo. 
[16:01:22] PROBLEM - NTP on logstash1001 is CRITICAL: NTP CRITICAL: Offset unknown [16:01:31] Krenair: looks like not [16:01:56] Krenair: I've reviewed most of your changes, they seem straightforward. [16:02:08] So I didn't actually end up rewriting all of the apache config yet :) [16:02:20] I did chuck out a bunch of production stuff from beta though [16:02:25] Krenair: there might be a bit of a late start - 10-15 mins. Both of us are stuck somewhere.. [16:02:31] k [16:02:34] Sorry about that. [16:02:56] did you review the first change? I noticed all the others are +1 already [16:04:03] Krenair: ya am looking at it now [16:04:03] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:37] It's a simple copy+paste of the vhost above with the ServerName and a RewriteRule change [16:05:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on logstash100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/234293 (owner: 10Muehlenhoff) [16:06:03] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 52.64 ms [16:06:30] (03PS1) 10RobH: policy.wikimdia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/234301 [16:06:34] !log logstash1001 back up after system reboot; we applied a default drop rule without applying the other iptables changes; will try again [16:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:52] (03CR) 10RobH: [C: 032] policy.wikimdia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/234301 (owner: 10RobH) [16:08:34] (03PS2) 10RobH: policy.wikimdia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/234301 [16:08:42] (03CR) 10RobH: [V: 032] policy.wikimdia.org cert [puppet] - 10https://gerrit.wikimedia.org/r/234301 (owner: 10RobH) [16:10:02] 6operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#1580409 (10RobH) 5Open>3Resolved cert purchased and committed into our repo, private key in our private repo. 
T110203 deals with the migration and getting the key to wordpress. [16:11:41] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1580413 (10BBlack) [16:12:09] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1215460 (10BBlack) All codfw LVS are on jessie. Should investigate eth params (blocking task) before doing the other DCs or putting huge loads of traffic here. [16:12:27] (03PS1) 10Milimetric: Make statsd host configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) [16:13:13] (03PS3) 10BBlack: Relax REFERER restrictions for Maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/234205 (https://phabricator.wikimedia.org/T109159) (owner: 10Yurik) [16:13:58] (03CR) 10BBlack: [C: 032 V: 032] Relax REFERER restrictions for Maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/234205 (https://phabricator.wikimedia.org/T109159) (owner: 10Yurik) [16:14:23] thx [16:15:58] (03CR) 10Madhuvishy: Make statsd host configurable via hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) (owner: 10Milimetric) [16:16:10] !log ferm enabled on logstash1001 [16:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:07] !log ferm enabled on logstash1002 [16:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:19] (03PS2) 10Milimetric: Make statsd host configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) [16:19:55] (03CR) 10Milimetric: Make statsd host configurable via hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) (owner: 10Milimetric) [16:20:19] RECOVERY - NTP on logstash1001 is OK: NTP OK: 
Offset -0.00645172596 secs [16:22:08] RECOVERY - Disk space on labstore1002 is OK: DISK OK [16:22:12] !log ferm enabled on logstash1003 [16:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:54] Krenair: back! [16:23:04] ok [16:23:06] Krenair: i'm going to merge your 3 patches I +1'd, _joe_ is gonna look at the first one [16:23:20] (03PS2) 10Yuvipanda: Remove apache vhost for non-existent beta incubator wiki [puppet] - 10https://gerrit.wikimedia.org/r/234165 (owner: 10Alex Monk) [16:23:28] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove apache vhost for non-existent beta incubator wiki [puppet] - 10https://gerrit.wikimedia.org/r/234165 (owner: 10Alex Monk) [16:23:39] (03PS2) 10Yuvipanda: Remove apache vhost for *.labs.wikimedia.org from beta [puppet] - 10https://gerrit.wikimedia.org/r/234167 (owner: 10Alex Monk) [16:23:59] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove apache vhost for *.labs.wikimedia.org from beta [puppet] - 10https://gerrit.wikimedia.org/r/234167 (owner: 10Alex Monk) [16:24:13] (03PS2) 10Yuvipanda: Remove other *.wikimedia.org stuff from beta apache config [puppet] - 10https://gerrit.wikimedia.org/r/234169 (owner: 10Alex Monk) [16:24:22] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove other *.wikimedia.org stuff from beta apache config [puppet] - 10https://gerrit.wikimedia.org/r/234169 (owner: 10Alex Monk) [16:24:35] <_joe_> Krenair: I'm looking at the apache patch, and it can be improved, should I? Or do you want me to comment and figure it out by yourself? [16:25:00] _joe_, comment and let's see [16:25:30] Krenair: all other 3 patches been merged. [16:26:47] 6operations: puppet compiler - 404s when compilation fails - https://phabricator.wikimedia.org/T110546#1580452 (10Dzahn) [16:27:50] (03CR) 10Madhuvishy: [C: 031] "LGTM. Will leave it to Andrew to approve." 
[puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) (owner: 10Milimetric) [16:28:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Seems correct, but needs some rework." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/233972 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [16:32:19] 6operations: puppet compiler - 404s when compilation fails - https://phabricator.wikimedia.org/T110546#1580465 (10Joe) from the jenkins console log: ``` [ 2015-08-27T15:56:56 ] ERROR: Unable to find facts for host fermium.wikimedia.org, skipping ``` Fermium is too new compared to when I last imported the facts... [16:32:41] 6operations: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546#1580467 (10Joe) [16:32:56] 6operations: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546#1580446 (10Joe) p:5Triage>3Normal a:3Joe [16:33:03] (03PS2) 10Alex Monk: Add affcom wiki domain to apache config [puppet] - 10https://gerrit.wikimedia.org/r/233972 (https://phabricator.wikimedia.org/T41482) [16:33:06] godog: Hi! I'm working on https://phabricator.wikimedia.org/T109547, and would like to experiment with sending hourly counts directly to graphite from Spark. I was gonna do it via statsd - but it doesn't allow for logging historical data - so I was wondering if I could send stats to graphite directly. I tested it and it seems like I can, but wondering if [16:33:06] there are any other concerns around that [16:33:53] 6operations, 10Adminbot: Upload new release of adminbot for Trusty - https://phabricator.wikimedia.org/T109947#1580474 (10scfc) My understanding is that `morebots` (still) lives on #Tool-Labs as tool `morebots`. And at least `~tools.morebots/README` and `~tools.morebots/labs.sh` still refer to: ``` jstart -N... [16:34:33] <_joe_> madhuvishy: oh you're using apache spark in analitycs? 
nice to know :) [16:35:09] (03PS3) 10Giuseppe Lavagetto: Add affcom wiki domain to apache config [puppet] - 10https://gerrit.wikimedia.org/r/233972 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [16:35:12] <_joe_> Krenair: cool! [16:35:23] (03CR) 10Giuseppe Lavagetto: [C: 032] Add affcom wiki domain to apache config [puppet] - 10https://gerrit.wikimedia.org/r/233972 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [16:35:28] great, thanks _joe_ [16:35:42] <_joe_> Krenair: wait for me to test it :P [16:39:09] !log new ferm rules on logstash100[1-3] are blocking grafana from reading dashboard configs. [16:39:12] _joe_: Yes :) [16:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:31] madhuvishy: sure, sending to graphite-in.eqiad.wmnet port 2003 will do that, my concerns would be around sending too many metrics, hourly doesn't seem to be a problem [16:40:04] (03PS1) 10Muehlenhoff: Add ferm rule for grafana traffic [puppet] - 10https://gerrit.wikimedia.org/r/234309 [16:40:14] 6operations, 10Traffic, 5Patch-For-Review: Switch codfw caches to tier2, begin pushing some traffic through them to test - https://phabricator.wikimedia.org/T110065#1580512 (10BBlack) Jessie LVS upgrades @ codfw successful, and we should be good to go, I think, for e.g. things like: https://gerrit.wikimedia.... [16:41:07] (03CR) 10BBlack: "Other issues are addressed now. Should take a little time to validate manually that everything's functioning as expected, kafka traffic i" [dns] - 10https://gerrit.wikimedia.org/r/231772 (owner: 10Faidon Liambotis) [16:41:56] godog: cool, yeah it would only be hourly, if I'm backfilling historical data for say a month - would that be okay? 
that'd be one time of course [16:43:12] <_joe_> Krenair: change is good, it's gonna be completely live in ~ 20 mins or so [16:43:40] (03PS2) 10Muehlenhoff: Add ferm rule for grafana traffic [puppet] - 10https://gerrit.wikimedia.org/r/234309 [16:46:53] madhuvishy: yup that's fine too, make sure to double check metric names, we are allowing arbitrary metric creation but deleting metrics is manual [16:47:21] (03CR) 10Dzahn: [C: 031] "yep, those are being dropped on logstash1002, and 9200 is opened by java and this is from the grafana role:" [puppet] - 10https://gerrit.wikimedia.org/r/234309 (owner: 10Muehlenhoff) [16:47:55] (03PS3) 10Muehlenhoff: Add ferm rule for grafana traffic [puppet] - 10https://gerrit.wikimedia.org/r/234309 [16:48:05] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rule for grafana traffic [puppet] - 10https://gerrit.wikimedia.org/r/234309 (owner: 10Muehlenhoff) [16:48:32] (03PS3) 10Ottomata: Make statsd host configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) (owner: 10Milimetric) [16:48:39] (03CR) 10Ottomata: [C: 032 V: 032] Make statsd host configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/234305 (https://phabricator.wikimedia.org/T110462) (owner: 10Milimetric) [16:48:52] godog: okay sure. I actually accidentally created couple of test-eventlogging ones on prod recently and would like to drop them too. 
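(Note on the Graphite exchange above: what godog describes, sending to port 2003, is Graphite's plaintext listener, which accepts one `path value timestamp` line per datapoint. Because each line carries its own timestamp, hourly backfill is possible, which statsd does not offer. A minimal sketch in Python, assuming that plaintext format; the metric name `test.spark.hourly_count` and the helper names are hypothetical and not taken from the conversation.)

```python
import socket


def format_metric(path, value, timestamp):
    """Build one Graphite plaintext line: '<path> <value> <unix_ts>\\n'."""
    return f"{path} {value} {int(timestamp)}\n"


def hourly_backfill(path, counts, start_ts):
    """Turn a list of hourly counts into lines, one per hour from start_ts."""
    return [
        format_metric(path, count, start_ts + hour * 3600)
        for hour, count in enumerate(counts)
    ]


def send_metrics(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted lines over one TCP connection.

    The host is whatever carbon endpoint applies; the log mentions
    graphite-in.eqiad.wmnet on port 2003.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric name; print instead of sending so nothing is created.
    for line in hourly_backfill("test.spark.hourly_count", [10, 12, 9], 1440633600):
        print(line, end="")
```

(As godog points out, metric creation is arbitrary but deletion is manual, so printing the lines and checking the names before calling `send_metrics` is the cautious order of operations.)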
[16:51:45] !log ferm rules on logstash100[1-3] have been amended to allow grafana from reading dashboard configs [16:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:52:27] madhuvishy: sure, sadly that requires a ticket at the moment, https://phabricator.wikimedia.org/tag/graphite/ [16:52:45] godog: no problem, will file one, thanks :) [16:54:44] 6operations, 6Services, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1580602 (10mobrovac) [16:56:12] 6operations, 7Monitoring: grafana.wikimedia.org calls out to AWS for JS assests - https://phabricator.wikimedia.org/T110484#1580606 (10greg) It violates our privacy policy for *.wikimedia.org domains, afaik (IANAL). [16:57:21] bblack, hi, could you add *.mediawiki.org to the referer check - makes documentation and references so much easier? [16:58:41] 6operations, 10Graphoid, 6Services, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1580626 (10mobrovac) There is work in progress to put Graphoid completely behind #RESTBase (ETA next few weeks), so in theory that should be enough. However, given that the subdomain... [17:02:16] 6operations, 10Citoid, 6Services, 10Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1580647 (10mobrovac) {T108646} aims at putting #Citoid behind #RESTBase , but as noted on the ticket there, the domain should stay. [17:03:24] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1580658 (10mobrovac) [17:05:07] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1580668 (10mobrovac) We have started discussing the move behind #RESTBase, but I wouldn't classify it as //imminent//, so let's just move it to //text-lb//... 
[17:06:44] (03PS1) 10Dzahn: exim: temp hack to stop exim when on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) [17:08:05] (03PS2) 10Dzahn: exim: temp hack to stop exim when on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) [17:08:09] !log bouncing Cassandra on restbase1001 to apply temporary GC settings [17:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:35] (03CR) 10Dzahn: [C: 04-1] "needs https://gerrit.wikimedia.org/r/#/c/234318/ or similar before it can be applied" [puppet] - 10https://gerrit.wikimedia.org/r/233873 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [17:09:23] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580677 (10mobrovac) Just FYI that `parsoid-lb.eqiad.wm.org` is used for #RESTBase testing (both from localhost... [17:12:16] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580686 (10mobrovac) >>! In T110474#1578627, @BBlack wrote: > So, I took a 1 hour log of all traffic on the 2x v... 
[17:12:57] _joe_, yay, I think it works [17:13:17] 6operations, 6Services, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1580692 (10BBlack) [17:13:19] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [17:15:50] (03PS1) 10Dzahn: fermium: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/234319 [17:15:59] (03PS1) 10Alex Monk: Move chapcom.wikimedia.org to affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234320 (https://phabricator.wikimedia.org/T41482) [17:16:26] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580717 (10BBlack) I think it's typically that low, but yeah we can run some longer checks. Note that's filtere... [17:17:21] (03PS2) 10Alex Monk: Move chapcom.wikimedia.org to affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234320 (https://phabricator.wikimedia.org/T41482) [17:18:28] yurik1: put it in a ticket somewhere, but I'd prefer we don't do that, unless it's restricted to specific hostnames there that make sense for docs and such (phab, wikitech?) [17:19:01] bblack, we are documenting everything on www.mediawiki.org [17:19:34] phab is for issues, and we haven't placed much on wikitech yet [17:19:43] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580735 (10cscott) Why are we decommissioning this? This is very useful as a public parsoid endpoint. We annou... 
[17:19:50] mediawiki tends to have everything generic for the technology [17:19:52] MaxSem, ^ [17:20:39] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580736 (10ssastry) Scott, at this point, kiwix, and everyone else can probably use the restbase api to access c... [17:20:46] Krenair: \o/ [17:20:47] (03CR) 10Alex Monk: [C: 032] Move chapcom.wikimedia.org to affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234320 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [17:20:53] (03Merged) 10jenkins-bot: Move chapcom.wikimedia.org to affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234320 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [17:21:01] 6operations, 6Discovery, 7Elasticsearch, 7Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1580741 (10Deskana) [17:22:31] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1580746 (10Krenair) a:3Krenair [17:22:34] yurik1: mediawiki.org doesn't really seem appropriate IMHO. This (maps.wm.o) is not about providing a map service for random MediaWiki installations to use, it's about hosting a service at WikiMedia for WikiMedia sites to use. I get that there's some MW software involved that others can potentially re-use... 
[17:22:37] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1580747 (10Krenair) 5stalled>3Open [17:22:43] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1580748 (10Krenair) [17:22:52] but for the foreseeable future, they're on their own for actually building out their own map servers too [17:22:56] 6operations, 6Labs, 10Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1558240 (10scfc) Are the instances that are on this virtual host gone for good? [17:23:43] (also, the earlier confusion is because I read "wikimedia.org" when you said "mediawiki.org", but that confusion is because of the assumptions above) [17:24:00] !log krenair@tin Synchronized multiversion/MWMultiVersion.php: https://gerrit.wikimedia.org/r/#/c/234320/ (duration: 00m 12s) [17:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:17] bblack, you do have a point, but we are hoping that kartotherian will be adapted by OSM and other systems - so even though it is not exactly part of mediawiki, it is the software WMF has built for the world to reuse. [17:24:34] this is mostly documentation about the development of our software [17:24:56] not the specifics of the WMF service, but it will have a number of links to the maps in order to explain what it does [17:25:24] https://www.mediawiki.org/wiki/Maps :P [17:25:25] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580765 (10cscott) Sure, I'm just pointing out that we've been telling people to use `parsoid-lb.eqiad.wikimedia... 
[17:25:31] wikitech will have some info about this as well, but probably much less - as it is only related to the specific settings we use [17:25:35] bblack, exactly :) [17:26:03] (03CR) 10Dzahn: [C: 031] "puppet compiler won't work for this - "invalid secret tendril key"" [puppet] - 10https://gerrit.wikimedia.org/r/234139 (owner: 10Rush) [17:26:09] IMHO it's as much about the service as the software, but whatever :P [17:27:01] yurik1, just because WMF has built something to be reused, doesn't mean it's considered in the scope of mediawiki.org [17:27:05] it may get deleted from there [17:27:31] Krenair, true, but so far we have placed all the docs there, and it seemed ok [17:27:40] you are the very first to object :) [17:28:11] bblack, guess we can add both - all these minor domains are ok to have it for documentation purposes [17:28:29] godog, are you sure this has reached all apaches? [17:29:29] ACKNOWLEDGEMENT - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon eevans Not down, extra arguments applied for T106619 have pushed the class name outside the limit. - The acknowledgement expires at: 2015-08-28 17:27:35. [17:29:35] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1580787 (10ssastry) For sure, announcements, etc. should happen. But, one thought is that this is a potential DO... [17:32:57] 6operations, 6Discovery, 7Elasticsearch, 7Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1580791 (10Deskana) This would be a wonderful list of tasks for the Operations Engineer that Discovery is thinking of hiring. :-) [17:34:45] Krenair: this == ? 
[17:34:56] oh, sorry [17:34:56] !log krenair@tin Synchronized multiversion/MWMultiVersion.php: (no message) (duration: 00m 12s) [17:34:59] (03PS2) 10Dzahn: fermium: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/234319 [17:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:48] YuviPanda, _joe_: so... something's still not quite there [17:36:46] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1449758 (10Deskana) >>! In T105707#1537144, @Deskana wrote: > Noting here that we are aware that there is some work required for us to... [17:37:27] 6operations, 6Discovery, 5codfw-rollout: Cirrus search in codfw - https://phabricator.wikimedia.org/T105703#1580831 (10Deskana) [17:37:35] 6operations, 6Labs, 10Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1580833 (10Andrew) As far as I know, very little was permanently damaged by this. Certainly a random sampling of instances on labvirt1007 look fine to me. Hashar, any objection to my closing this? [17:37:52] !log ack'd Cassandra process alert on restbase1001; temporary command args have pushed the class name beyond the limit [17:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:57] <_joe_> urandom: that log line will be used by all the opsens of the future to mock java [17:39:04] <_joe_> you know that right? 
[17:39:06] <_joe_> :P [17:39:25] _joe_: don't be ridiculous [17:39:26] <_joe_> (like we needed a new argument, ofc) [17:39:32] _joe_: you didn't need that reason [17:39:33] :) [17:40:11] <_joe_> urandom: it's just new fuel in a perpetual-mock engine [17:41:43] _joe_: in this case, it looks like there are some args/properties that repeated, so i'd say there is some buggy shell [17:41:55] _joe_: but yeah, still unwieldy [17:42:51] _joe_: most troubling is that you say that like i'm the bearer of the Java torch! [17:43:06] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234320/2 (duration: 00m 13s) [17:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:25] <_joe_> urandom: no I did the same with manybubbles, he can testify [17:43:34] [[what? [17:43:38] <_joe_> and I never thought either of you is a bearer of the java torch [17:43:45] _joe_: must i wear the scarlet J? [17:43:45] fucking java [17:43:47] <_joe_> manybubbles: not lose an occasion to mock java :) [17:43:52] do I have to read scrollback? [17:43:57] oh never [17:44:02] do it every time you can [17:44:02] <_joe_> :) [17:44:07] <_joe_> I do! [17:44:10] I try not to forget to [17:44:18] but I just have so many opportunties! 
[17:44:22] I have to be more choosy [17:45:59] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [17:46:32] Okay, SquidUpdate::purge fixed the weird redirect I seemed to be getting [17:47:13] <_joe_> manybubbles: "at least it's not springframework" [17:47:15] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/234320/ (duration: 00m 13s) [17:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:23] lol [17:47:29] I've done my fair share of spring [17:47:34] 6operations, 10ops-eqiad, 10Analytics-Cluster, 5Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1580863 (10Cmjohnson) Received 1 of 2 replacement servers [17:47:45] I eventually wrapped my head around it. but its not for the feight of heart [17:47:51] <_joe_> manybubbles: I've run apps written in spring, which is definitely worse than writing those [17:48:22] elasticsearch actually uses guice for dependency injection pretty deep. some folks want to remove it one day [17:48:30] I don't really want it gone. its not that complex [17:48:32] <_joe_> think of all the hacks I had to put toghether to make connections drain before I restarted an appserver - a thing that took, depending on the app, between 40s and 3 mins [17:48:39] <_joe_> because springframework [17:48:45] ah life [17:49:02] _joe_: you'll like this: https://github.com/elastic/elasticsearch/issues/13156 [17:49:46] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1580869 (10Krenair) https://affcom.wikimedia.org/wiki/Main_Page works, other pages should too. I guess we should add a redirect for chapcom -> affcom next, and remove the... 
[17:50:04] manybubbles: i like guice [17:50:28] <_joe_> manybubbles: ahahaha [17:51:09] _joe_: reload [17:53:04] (03CR) 10Dzahn: [C: 032] fermium: add mapped IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/234319 (owner: 10Dzahn) [17:56:11] <_joe_> manybubbles: default-to-verbose is the java way, too [17:56:26] fuck that - the api is name _cat [17:56:30] <_joe_> I learned to use multiline grep because stacktraces [17:56:40] lol. I've not done that yet [17:57:03] I want verbose error messages when I get error messages. at least in my log files [17:57:12] I like that behavior [17:57:18] because its too stupid to know what to filter out [17:58:38] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1580903 (10RobH) So things kept referring to a space I couldn't join. I overrode the options for S2 and added myself into the group view rights, once I had a single task to test loading, i went ahea... [17:59:27] (03PS1) 10Rush: elasticseach: ferm for 27-31 [puppet] - 10https://gerrit.wikimedia.org/r/234327 [17:59:33] (03CR) 10jenkins-bot: [V: 04-1] elasticseach: ferm for 27-31 [puppet] - 10https://gerrit.wikimedia.org/r/234327 (owner: 10Rush) [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150827T1800). [18:00:31] (03CR) 10Dzahn: "+ up ip addr add 2620:0:861:3:208:80:154:74/64 dev eth0" [puppet] - 10https://gerrit.wikimedia.org/r/234319 (owner: 10Dzahn) [18:00:41] (03PS2) 10Rush: elasticseach: ferm for 27-31 [puppet] - 10https://gerrit.wikimedia.org/r/234327 [18:04:57] (03CR) 10Rush: [C: 032] elasticseach: ferm for 27-31 [puppet] - 10https://gerrit.wikimedia.org/r/234327 (owner: 10Rush) [18:06:13] What does "svc" stand for in some of our host names? It's related to Service IPs, right? 
[18:06:31] service [18:06:32] It's not listed on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [18:06:34] I think [18:07:09] RECOVERY - RAID on analytics1004 is OK Active: 2, Working: 2, Failed: 0, Spare: 0 [18:08:02] .svc. would not be real hosts [18:08:12] but service IP addresses [18:09:13] paravoid ^^ :) [18:09:29] (03CR) 10Rush: [C: 031] "as I understand it this hack won't live another day and is a shim for the migration process itself. seems like a necessary evil." [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [18:09:38] <_joe_> Krinkle: it's logical IPs, yes [18:09:48] so they are clusters that have actual servers as members [18:10:35] Yeah [18:10:39] They point to an LVS? [18:10:54] (the svc IPs) [18:11:34] yes generally an LVS service would be the norm [18:11:50] I think it would probably valid to use that namespace for a non-LVS virtual service IP that floats or something too, though [18:12:09] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:13:15] bblack: Yeah, services that are not load balanced you mean? [18:13:18] to me the distinction is that it's not a hostname belonging to a particular host, it's a logical service hostname [18:13:48] PROBLEM - check_puppetrun on mintaka is CRITICAL Puppet has 1 failures [18:14:00] Krinkle: no, I meant more in the case where, for instance, we might assign an IP for foosvc.svc.eqiad.wmnet which could potentially live on host1.eqiad.wmnet or host2.eqiad.wmnet, but fails over via some mechanism like heartbeat [18:14:11] Right [18:14:17] (03CR) 10GWicke: "We already have a per-service user. How will this interact with that setup?" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [18:14:23] bblack: Like s1-master ? [18:14:50] I donno [18:15:11] there's lots of edge cases to explore there. 
commonly today, we don't use it for cases like that it seems. [18:15:19] (it being the svc subdomain) [18:15:26] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1580986 (10Tgr) I think a role is nicer than a command in that I don't have to google up pgsql syntax and figure out exactly what to put in the unless parameter, but I don't have strong feelings about it. (By p... [18:15:52] I don't know if anyone ever spelled out a standard. [18:16:21] !log setting up ferm on elastic1027-31 [18:16:25] it seems to be exclusive to LVS in practice [18:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:01] (03PS1) 10Jdlrobson: Disable section collapsing on h1s in Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234330 (https://phabricator.wikimedia.org/T110436) [18:17:20] (03CR) 10Gergő Tisza: Basic role for Sentry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [18:17:32] ganeti01.svc.codfw.wmnet - i believe that is not LVS related [18:17:41] 6operations, 10Wikimedia-Site-Requests: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1580997 (10Krenair) [18:17:43] by a broader definition, e.g. udplog.eqiad.wmnet could/should be udplog.svc.eqiad.wmnet, but it isn't. maybe that's a useful distinction in need of better documentation :) [18:18:39] there are cases like "carbon" where that's both the name of a host and a name that applies to (several) types of software, so the hostname alone within eqiad.wmnet isn't a distinguisher between obvious service-vs-host [18:18:41] I don't think I should deploy the train with "574 Parser cannot be freed while it is parsing. in /srv/mediawiki/php-1.26wmf20/includes/media/XMP.php on line 167" [18:18:48] PROBLEM - check_puppetrun on mintaka is CRITICAL Puppet has 1 failures [18:19:05] mintaka? 
[18:19:26] ^^^ that's the top error in fatalmonitor and it's not even out to group2 yet. https://phabricator.wikimedia.org/T89532 [18:22:03] anyone installing that "mintaka" host right now? [18:22:11] bblack: Hm.. Yeah, something like foo1001 when there is no other foo means it often ends up used directly, but I guess it'd be useful to have a foo.svc alias so that we can easily replace that host without needing to update anything [18:22:19] since we don't reuse hostnames after decomission [18:22:37] e.g. imagine eventlog1001 crashes, or is gradually replaced. [18:22:53] well, ideally any foo1001 should be behind LVS, and LVS should be handling foo.svc [18:22:59] right [18:23:22] another handy side-effect of that, is that the LVS service IPs are able to float across rows [18:23:27] we have resused a hostname before, but they were "named" hosts, not numbered ones, stuff like "carbon" vs. "foo1001" [18:23:34] but something like mysql master (e.g. s1-master) or eventlogging probably doesn't make sense behind LVS [18:23:41] whereas an arbitrary non-svc-subdomain address on a "regular" IP can't easily be moved to new hardware in a different row [18:23:48] PROBLEM - check_puppetrun on mintaka is CRITICAL Puppet has 1 failures [18:24:04] (03PS1) 10Chad: Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 [18:24:05] bblack: define row? [18:24:13] (without changing the IP, I mean) [18:24:23] physical rows in the datacenter are on distinct IP addressing subnets [18:24:28] Jeff_Green: mintaka issue known? [18:24:30] 6operations, 6Labs, 10Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1581043 (10hashar) 5Open>3Resolved a:3hashar Yup it is fine. File written too while the disk was full might end up with 0 bytes written too. For the CI slaves I just rebuild them to be safe. [18:24:33] LVS service IPs are not bound to a particular row [18:24:37] Ah, right. 
svc is typically not just the canonical domain name, it's also a canonical IP [18:24:38] mutante: yeah, I was just acking it [18:24:51] Jeff_Green: ok:) [18:24:56] (03CR) 10jenkins-bot: [V: 04-1] Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [18:25:16] we've run into this before, where a hostname like "lists.wikimedia.org" maps to an IP that's in a particular row, and so we can't move that IP to a new machine in a different one. [18:25:22] we can move the hostname, but the IP has to switch too [18:26:12] (03PS2) 10Chad: Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 [18:26:13] ACKNOWLEDGEMENT - check_puppetrun on mintaka is CRITICAL Puppet has 1 failures Jeff_Green fixing - The acknowledgement expires at: 2015-08-28 19:25:18. [18:26:21] bblack: is LVS the only way we put distance between that (so that we can "reuse" IPs) or is it also possible to do it directly? e.g. assign multiple IPs to a server. E.g. assign lists.svc to 10.0.2.1 and assign list1001 both 10.0.2.1 and 10.whatever internally. [18:26:24] the svc IPs have their own subnet that's row-independent, and the routers and various LVS magic can handle servicing it from backends in any row [18:26:48] Krinkle: it requires magic that's best handled by LVS [18:26:53] Right [18:26:55] it's nt impossible, but just saying [18:26:57] But that means extra hardware, right? [18:27:03] no [18:27:22] there's only one set of LVS clusters in a DC, that handle everything [18:27:26] it's just a matter of logical config [18:27:26] or an extra node at least to host LVS [18:27:31] Interesting [18:27:49] so a typical LVS server has multiple of those service IPs assigned to itself [18:27:51] in the tier-1 DCs we use 6x total LVS machines and in the tier-2 we use 4x. 
They're set up in pairs by LVS-role, but the LVS-roles are very broad [18:28:02] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581067 (10Krenair) >>! In T93760#1580903, @RobH wrote: > Access to the space will be restricted to WMF Staff. As there are no volunteers involved within the financial or capex areas of our tasking,... [18:28:03] yeah, lots [18:28:15] and then from there dispatch to one of one or more backends. [18:28:16] Nice [18:28:35] e.g. https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/lvs/balancer.pp#L28 [18:28:43] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581068 (10RobH) Nope. There is an ongoing discussion and vendor quotes and communication have to remain staff only for now. [18:28:44] so it is possible to assign multiple internal IPs to a node, it's just better to have it be handled on an LVS server rather than a service node directly. [18:28:45] cool [18:28:54] ^ shows the service IPs handled by the low-traffic LVS cluster in eqiad [18:29:22] Ah, I had no idea the various LVSs weren't all separate servers. [18:29:27] Interesting [18:30:03] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581076 (10Krenair) >>! In T93760#1581068, @RobH wrote: > Nope. There is an ongoing discussion and vendor quotes and communication have to remain staff only for now. That contradicts what you said... [18:30:05] I guess there is a limit at some point related to the amount of traffic per node, but other than that it's not much more expensive to host more LVSes on the same node. [18:30:24] yeah the limits are total traffic, and IP space assigned to those special subnets [18:30:40] What do you mean by rows by the way? [18:30:50] but total traffic isn't a pragmatic concern. the same setup is handling all our public traffic. 
your small new service won't kill it, probably. [18:30:53] <_joe_> Krinkle: datacenter rows [18:30:56] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581079 (10RobH) It does indeed, I was corrected in private about the policy regarding vendor communications. So we have to keep them for wmf employees only at this time. [18:31:02] (also, it only handles the inbound side, not return traffic, in capacity terms) [18:31:18] Yeah, the socket is transferred to the backend server, right? [18:31:19] Krinkle: physical rows of hardware, that you can walk inbetween. [18:31:22] so it responds directly [18:31:39] LVS-DR: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-DR.html [18:32:02] Krenair: sorry i dont think i understand ya [18:32:11] are you agreeing that it should be restircted or saying it needs to be more open? [18:32:14] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581083 (10Aklapper) >>! In T93760#1580903, @RobH wrote: > So things kept referring to a space I couldn't join. Are "things" the quotes above in this task? > I overrode the options for S2 and added... [18:32:17] bblack: Oh, I guess we have conventions about internal IPs matching the physical layout somewhat. Though I imagine that's a social convention, not a technical limitation. [18:32:32] no, it's a physical thing [18:32:43] our network switch topology is layer out along physical rows [18:32:50] s/layer/laid out/ [18:33:04] Ah, right. 
We don't have a single router handle the entire 10.* subnet [18:33:10] That makes sense [18:33:13] cross-row traffic doesn't scale as well, and requires lots of longer messier cables, etc [18:33:18] so the switch for one row can only give out Ips for that row [18:33:24] makes more sense now [18:33:37] yeah [18:33:48] RECOVERY - check_puppetrun on mintaka is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:33:50] well s/router/switch/ in either direction in several statements above, but roughly yes [18:34:08] the LVS cluster machines are "special" in that they have direct ethernet connections cross-row into the switches of all rows [18:34:09] Yeah, I knew that, but didn't realise in this context. [18:34:11] robh, so having a space for vendor communications private to ops who are full employees seems reasonable to me. [18:34:30] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581087 (10RobH) @Aklapper: Indeed, I misunderstood which were testing spaces and which were not. I thought S2 was a privacy testing space, not an actual team working space. So once I joined it (vi... [18:34:42] yea, i just was wishywashy about the wmf nda thing [18:34:43] Something in your previous comment had suggested that theoretically it could be opened to other people with NDAs, so I left that as a possibility in my comment. [18:34:45] so i was confused =] [18:34:55] yep, and i went and added wmf nda [18:34:57] I'm not bothered about it either way. [18:35:00] and then was told it was not ok =] [18:35:09] no worries, i get you just want things clarified on tasks, this is why we get along =] [18:35:14] bblack: Thanks, this is helping me with my mental graph of our layout. One day I'm gonna make a visualisation about it. 
[18:35:21] What I don't want is a space that gets marked "WMF Staff Only" and ends up either including most staff, or including things which aren't vendor info [18:35:44] indeed [18:36:16] Which is why I said the "individual basis" bit (because inevitably there will be someone who proposes 'add this whole team!' and since it could be labelled as 'WMF Staff' it may be difficult to say no) [18:36:17] any task in the S4 space should be directly involved with invoicing, quotes, or pricing. [18:36:22] yea [18:37:57] The general concern is any space that is open to all staff by default may quickly devolve into a backchannel ticketing space for things that have nothing to do with vendor communication or pricing. This can be offset by ensuring that S4 remains only accessible by those directly involved in vendor communication and pricing. [18:38:07] I didn't know whether you'd open it to all foundation employees with ops access or even only a subset of that, so left that as a question too [18:38:27] right [18:38:28] ahh, the access will be #operations group which is ONLY the actual ops group in phabricator at present [18:38:28] sounds fibe [18:38:30] fine* [18:38:31] and then one by one [18:38:35] Uhm. [18:38:42] That would include non-employees, I think [18:38:44] unless that changed/ [18:38:46] when? [18:38:53] And probably contractors [18:39:08] RECOVERY - Router interfaces on cr1-eqdfw is OK host 208.80.153.198, interfaces up: 33, down: 0, dormant: 0, excluded: 0, unused: 0 [18:39:22] Krenair: no it doesnt... [18:39:26] https://phabricator.wikimedia.org/project/profile/29/ [18:39:33] ori [18:39:37] is the only non op =] [18:39:49] Jgreen and Ottomata? 
[18:39:55] they're opsen [18:40:05] fundraising ops and analytics ops, but ops [18:40:18] plus they have to approve specifications and quotes, so they would need to be there [18:40:29] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1581110 (10BBlack) Really a name like `parsoid-lb.eqiad.wikimedia.org` should never have been announced, but it... [18:40:30] basically, i'll end up adding in everyone that requests hardware over time for quote reviews, heh [18:40:54] Although looking at the current member list it doesn't include all people with ops access, only the foundation employees+contractors. But I'd expect it to include them, and therefore be separate from the space access [18:41:25] "This group should reflect the 'ops' group in admin.yaml." [18:41:38] Which you can only allow a subset of into your new space [18:42:06] (03CR) 10Mobrovac: [C: 04-1] cassandra: WIP support for multiple instances (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [18:42:18] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581128 (10RobH) So I misunderstood Krenair but IRC chat cleared it up. The general concern is any space that is open to all staff by default may quickly devolve into a backchannel ticketing space f... [18:42:33] Krenair: define ops access? [18:42:40] cuz ops access typically means in ops group [18:42:44] so im not sure what you mean. [18:42:52] robh, ops group means including volunteers with root [18:43:17] 2 of them ? =] [18:43:25] Yup. [18:43:25] domas and ryan i think is all.... [18:43:38] =) [18:43:49] so we should technically add them to the phab group [18:43:55] You should. [18:44:02] But you can't give them access to the space. 
[18:44:05] except ryan wanted out [18:44:21] so im not gonna ad him back [18:44:21] and i dont think domas uses phab? domas? =] [18:44:39] Krenair: i removed ryan a few weeks ago he was tired of the emails from being a member of the group [18:44:44] It's no use basing this on current membership. [18:44:57] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1581145 (10BBlack) Also, one request per hour, the bulk of which are internal service-checks of the /_version UR... [18:45:05] You are dealing with technically separate groups of people which can change. [18:45:49] you lost me [18:45:56] i dont get why i cannot use the #operations group [18:46:02] since its exactly the right folks i need to access [18:46:15] the fact that the two volunteers who have ops access arent in there doesnt really matter for htis [18:46:30] and we dont allow non ops in the group, except for employee exceptions like ori [18:46:46] Because any new volunteer going into that group can't get access to your space. [18:47:11] Because the non employee accessing quotes/ [18:47:15] Apparently [18:47:37] None of this considers T82799, of course [18:49:45] argh [18:49:49] im stuck in a nasty phabricator security loop [18:50:11] i cannot edit operations as admin, as they arent a member, and then i cannot set the descript as me as it prevents memvers leaving in options [18:50:13] good times =] [18:50:19] work arounddd [18:51:33] Krenair: I've applied the shortest term of repairs. [18:51:35] DO NOT ADD VOLUNTEERS to this group without clearing with @RobH. This group is used for access controls to the Operations Vendors S4 space. [18:51:40] appended to opserations description ;] [18:51:47] typos... cannot type. [18:51:48] That's backwards. [18:51:57] ? [18:52:12] Right now thats the closest we have to a valid ops group in phabricator. 
[18:52:15] You should have left the project alone and used a separate group for space access. [18:52:25] I'm not maintaining two access lists. [18:52:30] thats just asking for one to get neglected. [18:52:37] heh [18:52:48] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1581172 (10mobrovac) >>! In T110474#1580736, @ssastry wrote: > Scott, at this point, kiwix, and everyone else ca... [18:52:52] anyhow, this is initial space testing, we can change it to an acl*vendors group later if needed [18:52:57] Then you shouldn't restrict one to employees? [18:53:16] then i shouldnt restrict one what? [18:53:40] You need two access lists. [18:53:48] Ok, noted [18:53:55] I'm not going to bother keepign two during my testing. [18:53:56] You have one criteria for some access, and one criteria for another. [18:55:07] It's ridiculous to restrict the larger group because you want only a subset for something specific. [18:56:28] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1581191 (10RobH) My understanding of @krenair's issue is I am using #operations project which says it uses the ops group (which can include volunteers) for access to a space that is for the ops group... [18:56:42] i get what you are saying [18:57:00] but im testing, so im not about to maintain a list for it, ive noted that it would be an issue for proper final implementation though [18:57:04] Thank you. [18:57:23] it makes me wish we had phab poll ldap for the staff flag though =] [18:57:37] then maintain its own acl*wmfstaff group [18:57:41] To include all staff+contractors? [18:57:47] That's not wanted either. [18:58:03] its not unwanted. [18:58:10] if you are staff, you can see the quotes. 
[18:58:32] as all staff and contractors signed an NDA that is a bit more binding and broader in scope than the volunteer one [18:58:39] thats my understanding though, i could be wrong. [19:00:23] i'm having this conversation in one form or another in three different windows =] [19:00:24] it would be good if that NDA was also on phab, just like L2, then we'd know [19:00:28] but we don't [19:02:02] I think legal likes having a physical signature for the employee ones ;D [19:02:13] them and their love of dead trees. [19:02:30] (though i submitted mine via preview electronic signature anyhow) [19:02:45] for the staff handbook signature it was enough that i moved my mouse to "sign" with pixels :p [19:02:49] i dunno if they would allow a phab signature to stand in. [19:03:01] anyhow, back to testing. [19:03:13] i'd be surprised if L2 is legally binding but another is not [19:03:16] if anyone not in operations can see https://phabricator.wikimedia.org/T110566 then i did it wrong. [19:03:21] (or phab admins) [19:03:29] i dont get the difference between volunteer NDA process and employee NDA process [19:03:38] either legalpad is ok or not? [19:03:44] I actually get 404 Not Found instead of a permissions error there, robh. [19:03:52] huh.... [19:04:05] Krenair: technically thats more secure i suppose [19:04:11] as it doenst allow you to confirm task existence [19:04:11] Um. [19:04:24] You realise you can add one to the ticket number and get a valid ticket, right? [19:04:33] yep [19:04:39] the task #s are shared across the spaces [19:04:42] Therefore you can be pretty sure the '404 Not Found' is nonsense [19:04:58] it sounds like a phab bug [19:05:03] for it to 404 then, but meh. [19:05:03] (03CR) 10Hashar: "Same question as Gabriel. What will we do for other services such as Parsoid or OCG ? 
If we create a new user and user group per service," [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [19:05:41] noted on task though [19:06:07] at least you cannot see it, which means space isnt totally broken ;] [19:07:22] Not through Phabricator, no. [19:08:40] robh: what would you think about using spamassassin on our inbound mail to @wikimedia addresses? [19:08:51] ? [19:08:57] I assume you've set it up before? [19:09:05] scoring system to help identify spam [19:09:15] we've run it before [19:09:24] but i thought you guys moved to google cuz they did that stuff [19:09:29] http://spamassassin.apache.org/ [19:09:35] 'you guys' [19:09:47] OIT. [19:10:12] Google does do some of that stuff, but has no hooks to feed it spam vs ham. [19:10:36] if we had a spamassassin on the mx'es we could be more explicit and then add regexes that look for spamasssisn headers in google. [19:10:41] fwiw [19:10:56] I thought you might be excited about this idea.. :) [19:11:01] (03CR) 10Hashar: [C: 031] Replace Package['git-core'] with Package['git'] [puppet] - 10https://gerrit.wikimedia.org/r/233853 (owner: 10Faidon Liambotis) [19:11:07] maybe it's not a problem.. [19:11:14] Doesn't SpamAssassin filter OTRS mail? [19:11:49] it scores list mail too [19:12:09] PROBLEM - puppet last run on elastic1029 is CRITICAL: Timeout while attempting connection [19:13:03] On some spam in my @wikimedia.org google account: [19:13:14] X-Spam-Report: Spam detection software, running on the system "polonium.wikimedia.org", has identified this incoming email as possible spam. [19:13:21] X-Spam-Score: 7.3 (+++++++) [19:13:54] mail that went through mailman has spamassasin headers [19:13:58] It all goes to the spam box though, I rarely see spam in my actual inbox [19:14:00] cajoel: We used to do that when we ran our own mail servers [19:14:01] Krenair: ahah -- you are correct. 
[19:14:04] X-Spam-Score: 1.6 (+) [19:14:07] RECOVERY - puppet last run on elastic1029 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:14:20] looks like we might be hitting all inbound mail... doing some more looking [19:14:25] cajoel: i dunno what the overhaul plan is for the primary mail systems though, ive not been involved in that [19:14:46] I know we wanted ot migrate the few folks using imap still off and then depreciate some of the mail systems [19:15:01] you'll want to follow up with faidon though, as he is heading that particular project iirc [19:15:17] odd, not all my mail looks like that. [19:15:23] we run it on listmail [19:15:30] individual list admins can set filters for it [19:15:34] I'm seeing it on some things which are NOT listmail [19:15:40] i would not attempt to delete mail for the users though [19:15:44] (03CR) 10Mobrovac: "Agreeing with Gabriel and Antoine. We have the choice of either:" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [19:15:52] mutante: right, I don't want to delete, just hint [19:15:55] flag it, but let them decide what to delete [19:15:57] yep [19:16:05] that's what the idea is on lists too [19:16:40] Google inbound mail can check a regex and then mark spam based on a level, so we could work with X-Spam-Score. [19:16:40] even though there is a (really high) score where it gets held [19:17:25] yea, i almost assumed it already does that and considers X-Spam-Score headers to make the decision what to label as spam [19:17:33] should I open a phab asking that all @wikimedia.org email gets an X-Spam-Score? [19:17:54] mutante: I think it's doing it's own analysis, but we can have it also leverage our headers. [19:18:01] it being google. [19:19:00] probably yes for the ticket. 
i don't know why it would be on some but not all mail [19:19:06] can just speak for lists [19:28:07] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1581360 (10RobH) Long ongoing discussion in IRC. Joel insists that someone in ops should assist in the writing and maintaining of the script. I maintain that this is OIT, and that discus... [19:31:13] (03CR) 10Thcipriani: "The idea here was to create a deploy-service user that is used for deploy of all services eventually (beginning with RESTBase here). The i" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [19:35:37] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1581385 (10JKrauska) We are a small organization. I feel we need cross team ownership of this process and procedure. I feel it's reasonable to have a developer from IT and Ops both under... [19:36:18] what's a C-level? [19:37:07] a not-quite-impressive grade point average? [19:37:50] Platonides: a level in an organization followed by two other random capital letters. (I also had to learn that.) [19:37:59] I don't think so if it needs to sign off things ;) [19:38:43] c-level means Corporate-*-officer I believe - top management [19:39:15] denoting the executive level of a corporation. [19:39:17] "a c-level corporate officer" [19:39:19] Origin [19:39:21] early 2000s: from the fact that initialisms for jobs at this level begin with C (for chief ). 
[19:40:23] i always wondered if "chief talent officer" counts [19:40:48] aka "the other CTO" [19:41:01] hehe [19:41:49] well, if the name only needs to begin with C, it could be done by Chris or Chase [19:42:08] as well as anyone from Community :) [19:42:11] so let's see if we have any "C"s on the staff page [19:42:36] i see an ED, VPs, and a Chief Advancement Officer [19:42:38] (03PS1) 10Alex Monk: Redirect chapcom.wikimedia.org to affcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234404 (https://phabricator.wikimedia.org/T41482) [19:42:43] it's the people at the left, I guess [19:42:44] (03CR) 10jenkins-bot: [V: 04-1] Redirect chapcom.wikimedia.org to affcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234404 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [19:43:16] i guess it needs the word "Chief" [19:43:42] Silly jenkins [19:43:55] (03PS2) 10Alex Monk: Redirect chapcom.wikimedia.org to affcom.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/234404 (https://phabricator.wikimedia.org/T41482) [19:44:03] Platonides: https://meta.wikimedia.org/wiki/Wikimedia_Forum/Archives/2015-07#C-level_staff [19:44:16] Platonides: sorry about that whole process, i'm trying [19:44:25] the c-level requirement is kind of new [19:44:27] (All I did was "git pull --rebase origin production") [19:44:31] i also don't think it can scale very well [19:45:47] well, better to show that it doesn't work with a veteran, rather than someone new to wm-world [19:46:22] Krenair: it's the first time jenkins-bot actually voted, PS1 was a different issue about rebasing [19:46:59] it feels new though that the bot says it on a non-voting result [19:49:55] are security tasks not included in wmf-nda? [19:50:00] Nope. [19:50:11] Those are even more restricted. [19:50:26] I shouldn't have used a negative :P [19:50:42] Sorry, yes, security tasks are not included in wmf-nda.
[19:51:21] (03PS1) 10Rush: elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 [19:51:27] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 (owner: 10Rush) [19:51:30] (03PS2) 10Rush: elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 [19:52:17] perhaps I should ask csteipp about applying for it at the same time, then [19:52:19] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 (owner: 10Rush) [19:54:13] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1581469 (10Krenair) Oh and, I guess you'll want the favicon changed too? [19:54:47] Platonides, hmm. I think that once you get into #wmf-nda the only extra thing needed would be his approval, yeah [19:56:24] from an old mail he sent, that shouldn't be a problem [19:56:53] (03PS3) 10Rush: elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 [19:56:59] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 (owner: 10Rush) [19:57:24] !log twentyafterfour@tin Synchronized php-1.26wmf20/includes/media/XMP.php: deploy fix for T89532 on 1.26wmf20 (duration: 00m 13s) [19:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:59] (03PS4) 10Rush: elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 [20:00:09] (03PS3) 10Dzahn: exim: temp hack to stop exim when on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) [20:01:38] Platonides: Yeah, I'll add you to security once you have an nda [20:01:54] he already signed the NDA [20:01:56] hi cscott [20:02:00] *csteipp [20:02:58] * twentyafterfour is ready to deploy wmf20 [20:03:01] James_F: ^ [20:03:11] Yay. 
[20:03:14] Thanks. [20:04:02] (03PS5) 10Rush: elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 [20:04:08] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 (owner: 10Rush) [20:04:11] cajoel: we run spamassassin on incoming mail already [20:04:15] have been so for many years [20:04:29] I'm not seeing it on some mails, maybe I'm not looking at the right things... [20:04:30] 1s [20:04:51] Does it leave a score header on all mail? or just mail it identifies as spam? [20:05:00] (03PS1) 1020after4: wikipedia wikis to 1.26wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234409 [20:05:14] (03CR) 1020after4: [C: 032] wikipedia wikis to 1.26wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234409 (owner: 1020after4) [20:05:20] (03Merged) 10jenkins-bot: wikipedia wikis to 1.26wmf20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234409 (owner: 1020after4) [20:05:21] paravoid: maybe we whitelist wmf->wmf mail? [20:06:17] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.26wmf20 [20:06:23] I've seen the header missing on non-wikimedia-internal mail [20:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:27] paravoid: are we using bayes? could I collect and feed the bayes filter? [20:06:29] it's blocked on the bureaucratic process, csteipp [20:06:38] yes, no [20:06:39] X-Spam-Report: Spam detection software, running on the system "sodium.wikimedia.org", has identified this incoming email as possible spam. (mail to pywikipedia-l-owners, classified as spam, in february) [20:06:52] no need to go so far bacj [20:06:56] *back [20:07:04] I'm in an all-day meeting atm [20:07:08] paravoid: no not using bayes? [20:07:15] happy to talk things through tomorrow or another day [20:07:25] paravoid: great..
later then [20:07:25] internal mail will go through polonium, not sodium, though [20:07:37] we are using bayes, in autolearn mode [20:07:45] (03PS6) 10Rush: elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 [20:08:45] cajoel: https://git.wikimedia.org/blob/operations%2Fpuppet.git/0608f9ea8b38b00c1f6aecad36400a86d367b48a/manifests%2Frole%2Fmail.pp#L26 [20:10:11] paravoid: beautiful -- I have a large sample of very spammy stuff getting past, that I'd like to look at a little closer and perhaps feed the filter if it looks helpful -- will sync with you. [20:10:15] (03CR) 10Rush: [C: 032] elasticsearch: ferm for 18-23 [puppet] - 10https://gerrit.wikimedia.org/r/234406 (owner: 10Rush) [20:10:37] great, let's talk at some point, meeting break is over now :) [20:11:27] !log ferm setup on elasticsearch10(1[8-9|2[0-3]) [20:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:00] (03CR) 10BryanDavis: [C: 031] "talked with Moritz and we decided that puppet reports won't be high volume." [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [20:15:16] Herp derp, what happened to ssh access to gerrit? [20:15:16] Unable to negotiate with 208.80.154.81: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1 [20:15:16] PROBLEM - puppet last run on elastic1022 is CRITICAL: Timeout while attempting connection [20:15:17] ostriches, are you using an ancient ssh not supporting modern ciphers? [20:15:17] ostriches: ssh 7? [20:15:17] Gerrit has a batshit sshd anyway [20:15:17] We've been over this. [20:15:35] (03CR) 10GWicke: "We should maintain basic isolation between services, so I think services reading each other's configs would be a no-go.
Services being abl" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [20:16:25] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:21:05] ostriches: wfm, do you need a quick specific command? [20:21:27] i dont see recent changes either [20:21:42] ostriches, add some -v flags to see what ciphers are offered by you and the server respectively [20:22:14] i think everyone is missing who ostriches is... [20:22:35] It's a demon [20:23:28] so it isn't an ostrich? [20:23:40] nor several :P [20:24:44] it's a demonic ostrich https://40.media.tumblr.com/cc41fb4aa07f045377f7ceecdaa7948c/tumblr_mkplpovKx91rdfytpo1_500.png [20:25:15] (03PS1) 10Ottomata: Puppetize multiple kafka eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234415 (https://phabricator.wikimedia.org/T104228) [20:28:41] https://phabricator.wikimedia.org/P1939 [20:31:34] xDD [20:32:38] diffie-hellman-group1-sha1 isn't new though [20:32:57] same happens to me [20:32:57] 6operations: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1581684 (10Jgreen) 3NEW a:3Jgreen [20:32:59] it's almost the opposite, it's kind of old, because as you said, gerrit and the java sshd [20:33:15] it's probably related to cipher deprecation in openssh [20:33:19] maybe your client just stopped supporting it [20:33:30] 6operations: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1581694 (10Jgreen) [20:33:32] i see people talking about disabling it on very old posts [20:33:53] the last openssh release disabled some ciphers by default [20:34:00] I wonder if I need to explicitly enable it for that host? [20:34:08] that's a "solution" [20:35:09] yea, that or the other solution is probably upgrading gerrit?
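On the idea of explicitly enabling the method for that one host: the client error quoted at 20:15:16 names the only key exchange method the server offers. A minimal sketch that turns such an error into a per-host ssh_config stanza, assuming an OpenSSH client recent enough to accept the "+" prefix on KexAlgorithms; the helper name kex_stanza is hypothetical, and 29418 is simply the port Gerrit conventionally uses for SSH.

```python
import re

# Shape of the client error quoted in the log.
OFFER_RE = re.compile(
    r"no matching key exchange method found\. Their offer: (\S+)"
)


def kex_stanza(error_text, host, port=None):
    """Build an ssh_config Host block re-enabling the server's offered methods.

    Returns None when the error text does not have the expected shape.
    """
    match = OFFER_RE.search(error_text)
    if match is None:
        return None
    lines = ["Host " + host]
    if port is not None:
        lines.append("    Port " + str(port))
    # The leading "+" appends to the client's default list instead of
    # replacing it, so stronger methods stay available for other hosts.
    lines.append("    KexAlgorithms +" + match.group(1))
    return "\n".join(lines)


error = (
    "Unable to negotiate with 208.80.154.81: no matching key exchange "
    "method found. Their offer: diffie-hellman-group1-sha1"
)
print(kex_stanza(error, "gerrit.wikimedia.org", 29418))
```

Scoping the setting to a single Host block keeps the weaker method off for every other server, which matters because re-enabling it is a workaround rather than a fix on the server side.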
[20:35:22] probably [20:35:24] an older ssh that connects is agreeing on aes128-cbc hmac-md5 [20:35:43] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1581705 (10Jgreen) 3NEW a:3Jgreen [20:36:04] no, the problem is the key exchange [20:36:09] what's the key size on gerrit? [20:36:19] it's the key exchange algorithm [20:36:55] hmm, yes [20:37:01] that they have to agree on, so yea a recent upgrade on the client side probably disabled it. MAC OS X security upgrade [20:37:40] I don't use the bundled `ssh`, but yeah, that probably did it. [20:37:57] I see no mention in openssh 7 release notes [20:38:20] but note that next release will: [20:38:21] * Refusing all RSA keys smaller than 1024 bits (the current minimum [20:38:21] is 768 bits) [20:38:21] * Several ciphers will be disabled by default: blowfish-cbc, [20:38:21] cast128-cbc, all arcfour variants and the rijndael-cbc aliases [20:38:21] for AES. [20:38:21] * MD5-based HMAC algorithms will be disabled by default. [20:38:37] aha! [20:38:45] Support for the 1024-bit diffie-hellman-group1-sha1 key exchange [20:38:45] is disabled by default at run-time. It may be re-enabled using [20:38:45] the instructions at http://www.openssh.com/legacy.html [20:39:04] "OpenSSH supports this method, but does not enable it by default because is weak and within theoretical range of the so-called Logjam attack. 
" [20:39:38] that's it [20:39:54] ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 gerrit.wikimedia.org -p 29418 does work [20:40:09] so, add to .ssh/config under your Host gerrit.wikimedia.org line [20:40:11] Yeah [20:40:14] I did that [20:40:15] KexAlgorithms +diffie-hellman-group1-sha1 [20:40:31] Host gerrit.wikimedia.org [20:40:31] Port 29418 [20:40:31] KexAlgorithms +diffie-hellman-group1-sha1 [20:40:40] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1581723 (10Jgreen) [20:41:30] Thx guys [20:46:59] (03PS2) 10Ottomata: Puppetize multiple kafka eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234415 (https://phabricator.wikimedia.org/T104228) [20:48:18] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1581776 (10RobH) The certificate has been provided to Simon with WordPress support. Additionally, the public key he provided today appears to be an SSH key, not a GPG key. I've requested that he... [20:51:47] robh: was it the public or private version! :P [20:52:00] *? not ! [20:52:13] he sent me a public ssh key [20:52:18] heh, at least he didnt send me a private one. [20:52:36] i went to push into gpg and got error, then looked at contents and started laughing [20:52:51] then i checked it a few more times before asking if he got it wrong ;] [20:53:34] it would have been fun to use that [20:53:42] after all, there's a rsa key inside [20:53:51] i encrypted it, good luck getting it decrypted ever.... 
[20:53:53] heh [20:53:55] he would be mad trying to convert his private key for decrypting xDD [20:54:17] 6operations, 10Wikimedia-Site-Requests: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1581808 (10Krenair) a:3Krenair Approval from https://phabricator.wikimedia.org/T26928#280223 [20:54:21] breaking tier 1 tech support for fun and profit [20:54:46] well, if tier 1 is not able to create a gpg key [20:54:57] it seems reasonable to escalate to tier 2 [20:55:05] at least they should protect the key better [20:55:55] (03PS1) 10BryanDavis: Enable logging of XMP debug log channel for severity >=warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234425 (https://phabricator.wikimedia.org/T89532) [20:56:43] and then we wonder again why the policy has to be on a wordpress in the first place [20:56:46] Then when you escalate to management the response would be 'gpg key? Why would need that to encrypt an email? What is gpg anyway?' [20:57:37] (03PS1) 10Alex Monk: Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) [21:01:32] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1581837 (10GWicke) @krenair, @ssastry: I think both VE and rt testing should be able to use http://parsoid.svc.e... 
[21:03:01] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1581844 (10Krenair) However @BBlack intends to remove that as well (see description) [21:03:09] hehehe [21:03:23] mutante, I completely agree [21:03:37] maybe because "legal isn't able to edit anything else" [21:03:52] which would be an appropriate argument for other companies [21:03:59] but really sad for wikimedia [21:04:34] I preferred the static html approach [21:04:50] Random thing I discovered in the wikimedia-chapter apache vhost: [21:04:53] RewriteRule ^/stats(/(.*$)|$) %{ENV:RW_PROTO}://www.wikimedia.org/stats/%{SERVER_NAME}/$1 [R=301,L] [21:05:00] I had no idea this was a thing? [21:06:35] Platonides: yes. it could either be "because we pay a contractor to design it and they want to buy a template instead of writing a skin", or it could be "because we have to keep editing it on a regular basis". if it's the former i would say "why don't you use wordpress to create it and when done save the resulting HTML and give it to us". but that ship has already sailed because "due date this week" [21:07:31] (03PS1) 10Alex Monk: Add ee.wikimedia.org to apache config for chapters [puppet] - 10https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) [21:07:44] install local wordpress inside office, [21:07:51] setup wget -m + cron job [21:07:59] :) [21:08:34] something tells me the person making it is not at office but also 3rd party [21:08:39] (cron job with { wget -m + rsync }) [21:08:41] but i dont know [21:09:00] so a third party will be writing our policy? uh.. oh.. [21:09:16] we have not been told what happened between "it's going to be static HTML" and "it will be on wordpress" [21:09:38] * cscott pops up [21:09:43] of course, otherwise you could argue back!
[21:09:56] Platonides: hi [21:10:12] we can also delete the puppet work again [21:10:20] hi cscott [21:10:29] sorry to ping you when expecting to autocomplete csteipp [21:11:23] oh i see [21:13:46] (03CR) 1020after4: [C: 031] Enable logging of XMP debug log channel for severity >=warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234425 (https://phabricator.wikimedia.org/T89532) (owner: 10BryanDavis) [21:13:47] At least there's a different number of letters in your names. [21:14:28] * MatmaRex hugs Krenair [21:15:05] i can relate. just look how many people whose nicks start with 'ma' are in here. [21:15:47] hehe [21:16:30] solution is to have 26 bots, a-bot to z-bot and have them say "Did you mean?" :p [21:17:15] they would need context [21:17:27] if we create such AI, why not add it as a irc client plugin? [21:17:45] (although my problem is the reverse of Krenai.r's; he gets pings that he shouldn't, i don't get pings that i should) [21:17:56] that's worse [21:18:16] you could rename to RexMatma [21:18:21] but many res, too [21:18:35] my last name has a pretty unique prefix, i guess. [21:19:01] I have a different problem. I get no pings full stop :( [21:19:10] Which isn't a problem actually, more of a godsend :) [21:19:30] that's because nobody likes you ;) [21:21:49] (03PS1) 10Alex Monk: Add be-tarask for renaming from be-x-old [dns] - 10https://gerrit.wikimedia.org/r/234429 (https://phabricator.wikimedia.org/T11823) [21:22:02] (03CR) 1020after4: [C: 031] "@GWicke: I'm not sure what you are advocating? One user or one per service?" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [21:22:04] MatmaRex, how to debug a skin refusing to be enabled? [21:22:07] (yep, many tabs) [21:22:16] Platonides: they do :( [21:22:28] Platonides: refusing how? 
[21:22:47] it is listed, with "(disabled)" [21:22:58] even though localsettings has the require_once [21:23:23] I must have made a silly mistake when creating the file [21:23:40] Platonides: the skin should do $wgValidSkinNames['foo'] = 'Foo'; in its .php file [21:24:09] where 'foo' is the internal name you'd use in ?useskin=foo, and 'Foo' is the Foo as in FooSkin, where FooSkin is the main PHP class of the skin [21:24:13] yep, that was it [21:24:27] it used an extra dash in ValidSkinNames [21:24:33] now it 500s :) [21:24:38] hah [21:24:47] (03CR) 1020after4: [C: 04-1] Setup Gerrit role account for Phabricator actions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [21:25:44] it works now [21:25:50] (03CR) 1020after4: "Is this ready to go? Are we really fully committed to staticarray now?" [tools/scap] - 10https://gerrit.wikimedia.org/r/224629 (owner: 10Ori.livneh) [21:27:44] (03CR) 1020after4: Add service deploy via scap (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 (owner: 10Thcipriani) [21:28:10] JohnFLewis: hi [21:36:42] (03CR) 10John F. Lewis: [C: 031] "A necessary evil as Chase said." [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [21:37:36] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1582029 (10mmodell) Phabricator uses a ton of individual databases. Each request to phabricator can generate 10s of mysql connections. For example, loading this m...
[21:37:42] legoktm: hi (probably why I get no pings; I ignore them :) ) [21:37:59] :P [21:39:15] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1582035 (10mmodell) Every time I type in the comment box, phabricator makes an async connection to render the comment preview, each one of these preview-renders c... [21:45:35] (03PS1) 10Tim Landscheidt: gridengine: Fix status check for gridengine-exec [puppet] - 10https://gerrit.wikimedia.org/r/234432 (https://phabricator.wikimedia.org/T110532) [21:48:09] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta which in this case doesn't give me any confidence because I tested that the previous patch solved the issue with the e" [puppet] - 10https://gerrit.wikimedia.org/r/234432 (https://phabricator.wikimedia.org/T110532) (owner: 10Tim Landscheidt) [21:48:17] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1582076 (10jcrespo) Can we at least hack it so that those connections go to a separate slave. We //actually have the resources// ready for it, now in a passive role. [21:53:38] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1582096 (10Joe) @jcrespo it seems like we do have that possibility: https://secure.phabricator.com/rP4a2981252f51538f1e8abbf8d499253f58408659 [21:56:30] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: import all lists with the script we wrote for that - https://phabricator.wikimedia.org/T110131#1582107 (10JohnLewis) Glancing over the above list and recalling memories related to them (but not tickets right now; perhaps will track these tomorrow if... 
[22:02:24] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1582125 (10mmodell) >>! In T109279#1582096, @Joe wrote: > @jcrespo it seems like we do have that possibility: > > https://secure.phabricator.com/rP4a2981252f5153... [22:13:41] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1582177 (10GWicke) >>! In T110474#1581844, @Krenair wrote: > However @BBlack intends to remove that as well (see... [22:14:52] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [22:16:20] (03CR) 10John F. Lewis: [C: 031] "Lgtm. Will bear the weight and add it to SWAT as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) (owner: 10Shanmugamp7) [22:17:32] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 7 failures [22:18:54] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1582189 (10Krenair) parsoid vs. parsoidcache, right, sorry. 
[22:27:39] (03PS2) 10Tim Landscheidt: Tools: Migrate from bigbrother to bigbrothermonitor [puppet] - 10https://gerrit.wikimedia.org/r/234051 [22:27:56] 6operations, 10Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1582229 (10Dzahn) 3NEW [22:28:02] 6operations, 10Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1582236 (10Dzahn) a:3Dzahn [22:28:38] 6operations, 10Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1582238 (10Dzahn) T110609 will help with this [22:29:09] (03CR) 10Tim Landscheidt: [C: 04-1] "Still not to be merged yet, as depending on I00cd7a90273e0d745699855eb671710afb4e85a7 + package building + package deployment." [puppet] - 10https://gerrit.wikimedia.org/r/234051 (owner: 10Tim Landscheidt) [22:29:17] 6operations, 10Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1582248 (10Krenair) [22:29:18] 6operations, 10Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1582247 (10Krenair) [22:31:02] (03PS4) 10Dzahn: exim: temp hack to stop exim when on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) [22:35:04] PROBLEM - Disk space on sodium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=46%): /var/lib/ureadahead/debugfs 0 MB (0% inode=46%) [22:38:52] mutante, that sounds bad? ^ [22:43:03] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:43:54] Krenair: arg, it does. 
it's a bacula restore, deleting [22:44:23] it's strange too because that data existed there before [22:44:53] RECOVERY - Disk space on sodium is OK: DISK OK [22:45:32] yea, that was way bigger than it was before, wth [22:46:43] .. because restore is to / and it came from a separate logical volume [22:48:36] (03CR) 10Chad: Setup Gerrit role account for Phabricator actions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [22:49:02] (03PS3) 10Chad: Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 [22:49:16] twentyafterfour: I redid it to use api tokens. [22:49:19] One string to deal with instead of 2. [22:54:25] I guess the redirection bit should probably go where the existing redirect stuff is too [22:57:10] 6operations, 10Wikimedia-Mailing-lists: test sending individual mails from fermium during migration - https://phabricator.wikimedia.org/T110441#1582333 (10Dzahn) https://wikitech.wikimedia.org/wiki/Exim [23:00:05] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150827T2300). [23:00:05] matt_flaschen legoktm jdlrobson JohnFLewis: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] hey [23:00:28] Hey [23:00:30] hi [23:00:30] where did cscott go? [23:00:38] (cur | prev) 22:33, 27 August 2015‎ Cscott (Talk | contribs)‎ . . (18,804 bytes) (-75)‎ . . (→‎Week of August 24th: Removed myself from evening SWAT since I won't be able to be online during the SWAT window.) [23:00:41] meh, ok [23:00:55] Krenair: yeah, i have to hop offline RSN [23:01:04] Krenair: unless you'd like to take responsibility for the patch? 
[23:01:59] 6operations, 10Wikimedia-Mailing-lists: test sending individual mails from fermium during migration - https://phabricator.wikimedia.org/T110441#1582346 (10Dzahn) http://www.exim.org/exim-html-current/doc/html/spec_html/ch-the_exim_command_line.html [23:02:00] I'll do it [23:02:26] (03CR) 10Alex Monk: [C: 032] Re-enable Flow for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233100 (https://phabricator.wikimedia.org/T109816) (owner: 10Mjbmr) [23:02:45] Krenair: ok, cool. if VE breaks just speedy-revert and I'll figure out what went wrong later. [23:02:52] (03Merged) 10jenkins-bot: Re-enable Flow for fawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233100 (https://phabricator.wikimedia.org/T109816) (owner: 10Mjbmr) [23:03:02] Krenair: i'll try to keep on IRC via my phone as long as possible. [23:03:08] cscott, that's the plan, don't worry [23:03:21] (usually something breaks in swat = speedy revert) [23:04:00] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/233100/ (duration: 00m 12s) [23:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:04:33] Krenair: testing VE in prod (enwiki), beta, and on a private wiki (like officewiki) should cover most of the bases. maybe wikitech, too, since it's weird. [23:04:41] matt_flaschen, please test [23:04:50] cscott, yup [23:07:20] 6operations, 10Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1582369 (10Dzahn) 71729 ./heldmsg-wikiru 66819 ./heldmsg-wikinews 43642 ./heldmsg-maps 40495 ./heldmsg-wikimedia 38696 ./heldmsg-wikifa 37172 ./heldmsg-wiktionary 26928 ./heldmsg-wi... [23:08:06] (03PS3) 10Chad: Assign swift roles via ENC [puppet] - 10https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: 10Thcipriani) [23:08:21] Krenair, yeah, it looks good. 
[23:08:24] great [23:08:31] 6operations, 10Wikimedia-Mailing-lists: Identify lists with *large* moderation queues - https://phabricator.wikimedia.org/T110438#1582375 (10JohnLewis) Needs clarify above though. Suffix is missing for some lists like wikiru-l and then a few Wikimedia's are listed. [23:09:56] (03CR) 10Chad: Move web::sites to web::prod_sites; begin unification in new class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [23:10:31] legoktm, so I know that Tim approved your commit etc. [23:10:31] Krenair: thanks again. [23:11:02] legoktm, but what about ExtensionProcessor::$mergeStrategies['wgExtraNamespaces'] ? [23:11:29] sorry i'm here btw [23:11:32] Krenair: oh, we can just remove that [23:11:54] it won't do anything bad [23:12:02] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:12:14] (03PS1) 10Yuvipanda: k8s: Don't turn on all the admit controllers [puppet] - 10https://gerrit.wikimedia.org/r/234436 [23:12:16] (03PS1) 10Yuvipanda: k8s: Increase verbosity of logging for kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/234437 [23:12:20] legoktm, you upload, I approve? 
[23:12:35] on it :) [23:12:48] meanwhile we need to wait for jenkins :) [23:12:49] (03PS2) 10Yuvipanda: k8s: Don't turn on all the admit controllers [puppet] - 10https://gerrit.wikimedia.org/r/234436 [23:12:57] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Don't turn on all the admit controllers [puppet] - 10https://gerrit.wikimedia.org/r/234436 (owner: 10Yuvipanda) [23:14:22] (03CR) 10Alex Monk: [C: 032] Lift of IP cap on ta.wikipedia for IP 218.248.16.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) (owner: 10Shanmugamp7) [23:14:28] (03Merged) 10jenkins-bot: Lift of IP cap on ta.wikipedia for IP 218.248.16.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234009 (https://phabricator.wikimedia.org/T110352) (owner: 10Shanmugamp7) [23:14:56] Krenair: thanks (on Shan's behalf) :) [23:15:01] :) [23:15:41] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/234009/ (duration: 00m 13s) [23:15:41] Now can the WMF cover the expense to India to test that change? ;) [23:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:21] JohnFLewis, lol [23:18:24] (03PS5) 10Dzahn: exim: temp hack to stop exim when on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) [23:21:28] Krenair: it merged! [23:22:11] (03CR) 10Dzahn: [C: 032] exim: temp hack to stop exim when on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234318 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [23:22:19] yeah, I saw [23:22:28] just figuring out which way is the safe way to sync it [23:22:32] YuviPanda: pending change on master [23:22:45] mutante: bah, sorry. can you merge? [23:22:57] YuviPanda: yep, doing [23:23:01] mutante: thanks! 
[23:23:07] looks like MWNamespace first, then ExtensionProcessor [23:23:19] done, np, just wanted to make sure you know it's applied [23:23:40] !log krenair@tin Synchronized php-1.26wmf20/includes/MWNamespace.php: https://gerrit.wikimedia.org/r/#/c/234328/ (duration: 00m 13s) [23:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:05] !log krenair@tin Synchronized php-1.26wmf20/includes/registration/ExtensionProcessor.php: https://gerrit.wikimedia.org/r/#/c/234328/ (duration: 00m 12s) [23:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:40] !log krenair@tin Synchronized php-1.26wmf20/includes/DefaultSettings.php: https://gerrit.wikimedia.org/r/#/c/234328/ (duration: 00m 12s) [23:24:46] legoktm, please test [23:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:06] mutante: so labs puppetmaster doesn't care for puppetmerge. it's updated on a quick running cron... [23:26:16] so sometimes I forget since it *does* get applied on labs even if I don't do the merge [23:26:35] YuviPanda: oh, interesting. ok [23:27:04] Whoops, just noticed a warning occurring in production that's partially my fault [23:27:25] Glaisher, Aug 27 22:18:25 mw1033: #012Warning: array_merge() expects at least 1 parameter, 0 given in /srv/mediawiki/php-1.26wmf20/extensions/CentralAuth/includes/specials/SpecialGlobalUsers.php on line 169 [23:27:53] That's what I get for suggesting call_user_func_array( 'array_merge', ... ) I guess [23:27:58] * ebernhardson thinks array_merge should just return the empty array :P [23:28:14] Krenair: its annoying, but you need a 3 level conditional for that kind of merge :( written several times [23:28:26] Un(?)fortunately I am not a PHP developer. 
:) [23:28:44] Krenair: hmm, not working: https://fa.wikivoyage.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&uselang=en [23:28:55] if ( $len === 0 ) { return array(); } elseif ( $len == 1 ) { return reset($arr); } else { return call_user_func_array( 'array_merge', $arr ); } [23:29:05] oh wait [23:29:15] Krenair: can you touch Gadgets/extension.json and sync it? [23:29:19] sure [23:29:59] !log krenair@tin Synchronized php-1.26wmf20/extensions/Gadgets/extension.json: touch (duration: 00m 13s) [23:30:00] Krenair: generally i don't accept the answer "i am not an XXX developer" :P every hiring thread on the internet complains about how places assume developers can't learn new things, so i assume everyone can learn to write anything :P [23:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:13] Krenair: working now :D woot [23:30:15] legoktm, \o/ [23:30:18] looks good to me [23:30:27] ebernhardson, hehe, fair enough [23:30:28] Mjbmr: ^^ [23:33:02] (PS1) Yuvipanda: k8s: Remove unnecessary etcd dependency [puppet] - https://gerrit.wikimedia.org/r/234439 [23:33:42] Krenair: ping cscott_phone when you get to the VRS stuff [23:33:45] jdlrobson, you there? [23:33:46] cscott_phone, k [23:33:54] yup [23:34:34] Hopefully my Android IRC client will beep or something [23:34:46] Krenair: any questions/problems about the patches?
[23:34:53] nope, looks good to me [23:35:00] cool let me know when you need me to test [23:35:35] If this was going into master I'd be bikeshedding about the docs, but it's too late for that [23:35:53] sent to jenkins [23:36:16] (CR) Alex Monk: [C: 2] Always use VRS to configure Visual Editor [mediawiki-config] - https://gerrit.wikimedia.org/r/233439 (owner: Cscott) [23:36:42] (Merged) jenkins-bot: Always use VRS to configure Visual Editor [mediawiki-config] - https://gerrit.wikimedia.org/r/233439 (owner: Cscott) [23:37:33] (PS2) Yuvipanda: k8s: Increase verbosity of logging for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/234437 [23:37:53] (CR) Yuvipanda: [C: 2 V: 2] k8s: Increase verbosity of logging for kube-proxy [puppet] - https://gerrit.wikimedia.org/r/234437 (owner: Yuvipanda) [23:38:03] (PS2) Yuvipanda: k8s: Remove unnecessary etcd dependency [puppet] - https://gerrit.wikimedia.org/r/234439 [23:38:12] (CR) Yuvipanda: [C: 2 V: 2] k8s: Remove unnecessary etcd dependency [puppet] - https://gerrit.wikimedia.org/r/234439 (owner: Yuvipanda) [23:39:38] (PS1) Dzahn: Revert "exim: temp hack to stop exim when on fermium" [puppet] - https://gerrit.wikimedia.org/r/234440 [23:40:15] (CR) Dzahn: [C: -2] "this revert is here as a reminder for me - but only for the actual migration day" [puppet] - https://gerrit.wikimedia.org/r/234440 (owner: Dzahn) [23:40:51] cscott_phone, fyi, this looks good on beta [23:41:37] oh... hmm [23:41:39] now not so much [23:42:06] * Krenair tries testwiki only [23:44:03] seems fine [23:45:23] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [23:46:19] Ugh, I know what's wrong with beta. [23:46:20] HTTPS. [23:48:24] (Abandoned) Tim Landscheidt: Move role::labs::tools::* to module role [puppet] - https://gerrit.wikimedia.org/r/231866 (owner: Tim Landscheidt) [23:49:16] Krenair: Ha.
[23:50:52] Yeah, my live hack on deployment-bastion (partially reverting the CommonSettings-labs changes) fixes it there [23:51:15] will test on other domains in prod via mw1017 more before rolling it out further though [23:52:09] enwiki works [23:54:49] some private wikis I tested work [23:55:26] cscott_phone, doing it now [23:55:39] Krenair: are we going to run out of time? I need to head off soon :/ [23:55:40] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/233439/ (duration: 00m 12s) [23:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:08] seems good [23:56:08] (but i can test from phone if necessary) [23:56:11] jdlrobson, doing yours now [23:56:29] thanks alex! :) [23:56:35] Krenair: caught up on phone [23:56:50] Looks like there was an HTTPS issue but you fixed it? [23:57:31] I had a patch a month ago to remove some lies in the beta configuration about protocol relative stuff [23:57:52] Guess there's more of that? 
[23:57:53] !log krenair@tin Synchronized php-1.26wmf20/extensions/MobileFrontend/includes/config/Experimental.php: https://gerrit.wikimedia.org/r/#/c/234331/1 (duration: 00m 14s) [23:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:58:01] cscott_phone, beta issues, will follow up the -labs part [23:58:38] !log krenair@tin Synchronized php-1.26wmf20/extensions/MobileFrontend/includes/MobileFormatter.php: https://gerrit.wikimedia.org/r/#/c/234331/1 (duration: 00m 12s) [23:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:41] (PS2) Alex Monk: Disable section collapsing on h1s in Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/234330 (https://phabricator.wikimedia.org/T110436) (owner: Jdlrobson) [23:59:51] (CR) Alex Monk: [C: 2] Disable section collapsing on h1s in Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/234330 (https://phabricator.wikimedia.org/T110436) (owner: Jdlrobson)