[00:00:04] while ping & http to other sites seems to be okay
[00:00:05] RoanKattouw, ^d, Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150220T0000).
[00:00:13] <^d> not it
[00:00:18] Ooh crap I need to schedule my thing
[00:00:22] I'll take it
[00:00:30] And forgive myself for my scheduling problems
[00:00:38] eh.. sshd[2646]: pam_unix(sshd:session): session opened for user aaron ?
[00:02:20] WTF
[00:02:24] Did Ori already deploy my code?!
[00:02:44] He kind of did
[00:02:46] AaronS: confusing, it looks like it already works
[00:02:48] RoanKattouw: ..?
[00:03:10] Connection to 208.80.154.149 timed out while waiting to read
[00:03:10] ori: I forgot I had lined up another commit for the SWAT, and you ended up accidentally deploying it
[00:03:12] no change
[00:03:15] i was about to suggest something like ssh -c aes128-cbc
[00:03:34] but that sounds different.. it already claims it opens the session
[00:04:01] ori: So surprise, our bug fix for backspace is live a few hours early :D
[00:04:50] yurik: MaxSem: legoktm: Ping for SWAT
[00:04:57] RoanKattouw: pong
[00:05:01] RoanKattouw, here
[00:05:09] pong
[00:05:13] (CR) Catrope: [C: 2] Temporarily remove 'm' from metawiki's $wgLocalInterwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/191812 (https://phabricator.wikimedia.org/T89916) (owner: Legoktm)
[00:05:19] (Merged) jenkins-bot: Temporarily remove 'm' from metawiki's $wgLocalInterwikis [mediawiki-config] - https://gerrit.wikimedia.org/r/191812 (https://phabricator.wikimedia.org/T89916) (owner: Legoktm)
[00:06:25] (CR) Catrope: [C: 2] Enable WikiGrok in repo mode on wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/191670 (owner: MaxSem)
[00:06:31] (Merged) jenkins-bot: Enable WikiGrok in repo mode on wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/191670 (owner: MaxSem)
[00:07:27] !log catrope Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 08s)
[00:07:30] Logged the message, Master
[00:07:40] !log catrope Synchronized wmf-config/mobile.php: SWAT (duration: 00m 06s)
[00:07:43] Logged the message, Master
[00:07:55] MaxSem, legoktm; Your config changes are done, please verify
[00:08:11] * legoktm does
[00:08:20] https://meta.wikimedia.org/w/api.php?action=parse&text=%5B%5B:m:Foo%5D%5D looks good, thanks!
[00:10:37] RoanKattouw, looks good so far
[00:13:52] mutante: do you understand ganglia enough to account for why the virt1xxx nodes are suddenly no longer included in the eqiad virt cluster page?
[00:17:52] andrewbogott: no. just that i would be suspicious of changes that do something with the $cluster_name variable
[00:18:06] maybe it got moved from puppet to hiera?
[00:18:10] Yeah, I bet I broke it myself
[00:18:46] maybe a role class set the $cluster_name and now that role isn't used anymore on those nodes
[00:20:32] ops-eqiad, operations: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#1051878 (Dzahn) worked around it by "mkdir /etc/php5/conf.d" then running puppet twice re-added to dsh groups by reverting the change that removed it ran sync-common to get mw code up-to-date re-added to pybal
[00:21:09] !log tstarling Synchronized php-1.25wmf17/extensions/Collection/Collection.body.php: (no message) (duration: 00m 05s)
[00:21:12] Logged the message, Master
[00:21:32] mutante: totally right, there’s a recent patch with title ‘virt: use role, hiera’
[00:24:31] andrewbogott: yes, this sounds like it https://gerrit.wikimedia.org/r/#/c/185153/4/hieradata/role/common/nova/compute.yaml
[00:24:51] maybe they are just virt now but not eqiad-virt or so ?
[00:24:53] mutante: thanks, now I will figure out why it broke…
[00:25:28] !log tstarling Synchronized php-1.25wmf17/extensions/Collection/Collection.body.php: (no message) (duration: 00m 07s)
[00:25:32] Logged the message, Master
[00:26:13] wikidatawiki WikiGrok\QuestionStore::store 10.64.32.28 1146 Table 'wikidatawiki.wikigrok_questions' doesn't exist
[00:26:21] * AaronS looks at MaxSem
[00:26:23] hm, it was just ‘virt’ before
[00:26:38] TimStarling: ?
[00:26:39] grmbl
[00:27:34] hmm, why I can't see exceptions in logstash fatalmonitor anymore? :P
[00:27:36] yes?
[00:27:56] !log catrope Synchronized php-1.25wmf17/extensions/ZeroBanner: SWAT (duration: 00m 06s)
[00:27:59] Logged the message, Master
[00:28:01] andrewbogott: i would comment that on T86774
[00:28:02] !log catrope Synchronized php-1.25wmf18/extensions/ZeroBanner: SWAT (duration: 00m 06s)
[00:28:05] Logged the message, Master
[00:28:07] MaxSem: because mediawiki to logstash has been turned off since the outage
[00:28:13] TimStarling: Was that Collection.body.php change just a debugging thing and then a revert, or something?
[00:28:18] morebots: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150205-SiteOutage
[00:28:18] I am a logbot running on tools-exec-11.
[00:28:18] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[00:28:18] To log a message, type !log .
[00:28:21] yes, debugging and then revert
[00:28:25] gah
[00:28:27] MaxSem: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150205-SiteOutage
[00:28:29] OK, cool
[00:28:32] for https://phabricator.wikimedia.org/T89918
[00:28:33] mutante: it looks like that patch would have broken virt1001-1009 and left 1010-1012 working, but in fact none of them are working
[00:28:44] bd808, at least fatals/warnings are up to date
[00:28:56] andrewbogott: neptunium, virt1000 and silver are still there somehow
[00:29:02] It's just that I was on tin and didn't see a trace of what you'd done, so I was a little confused at first
[00:29:24] mutante: they still have it set in site.pp
[00:29:55] andrewbogott: at least that's kind of proof it's a hiera issue
[00:30:07] yeah
[00:31:25] they are " role nova::compute
[00:31:51] that's like it should work.. but yea..
[00:32:02] (PS1) MaxSem: Pull WG on WD for now [mediawiki-config] - https://gerrit.wikimedia.org/r/191821
[00:32:38] (CR) Catrope: [C: 2] Pull WG on WD for now [mediawiki-config] - https://gerrit.wikimedia.org/r/191821 (owner: MaxSem)
[00:32:42] (Merged) jenkins-bot: Pull WG on WD for now [mediawiki-config] - https://gerrit.wikimedia.org/r/191821 (owner: MaxSem)
[00:33:16] !log catrope Synchronized wmf-config/InitialiseSettings.php: Pull WikiGrok from wikidata for now (duration: 00m 05s)
[00:33:22] Logged the message, Master
[00:33:35] !log maxsem Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s)
[00:33:38] Logged the message, Master
[00:33:44] heh
[00:36:23] operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1051895 (Andrew) NEW a:Joe
[00:37:31] operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1051904 (Dzahn) seems like this is from https://gerrit.wikimedia.org/r/#/c/185153/4
[00:38:30] operations: restructure site.pp to use roles, hiera. - https://phabricator.wikimedia.org/T86774#1051911 (Dzahn) seems like there is an issue with ganglia and the $cluster_name variable. see T90035
[00:41:42] ops-eqiad, operations: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#1051915 (Dzahn) everything seems like back to normal and this should be resolved, except: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=Application+servers+eqiad&h=mw1062.eqiad.wmnet&jr=&js=&v=...
[00:41:48] operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1051916 (Andrew) Hm, due to the fact that that patch specifies cluster: virt for virt1012 but not for anything else, I'd expect to see that host and only that one in ganglia. But i...
[00:48:02] operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1051921 (Dzahn) it looks odd indeed that it is only specified on virt1012 and the other 2, but it also sets it to 'virt' in common/nova/compute.yaml which makes you think all in rol...
[00:51:58] (CR) Dzahn: "merge me tomorrow (Friday)" [puppet] - https://gerrit.wikimedia.org/r/191218 (https://phabricator.wikimedia.org/T89739) (owner: Dzahn)
[00:53:38] Phabricator, operations: re-use server 'radon' as phab failover - https://phabricator.wikimedia.org/T88818#1051922 (Dzahn) a:chasemp hey @chasemp this can be reused anytime. just assigning it to you to let you know, because you asked about radon in icinga. i'm happy to also take it back and either reinstal...
[01:24:06] (PS1) Dzahn: add internal LVS service IP for zotero [dns] - https://gerrit.wikimedia.org/r/191824 (https://phabricator.wikimedia.org/T89870)
[01:25:23] Continuous-Integration, Wikimedia-Labs-Infrastructure: Fix "Error: /Stage[main]/Contint::Packages/File[/etc/php5/conf.d/apc.ini]/ensure: change from absent to file failed" - https://phabricator.wikimedia.org/T90039#1051961 (Krinkle) NEW
[01:25:47] (PS2) Dzahn: add internal LVS service IP for zotero [dns] - https://gerrit.wikimedia.org/r/191824 (https://phabricator.wikimedia.org/T89870)
[01:33:34] operations: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1051981 (Dzahn) so we really just need the internal one, right? not like the shared public one like here: https://gerrit.wikimedia.org/r/#/c/188537/2/templates/wikimedia.org
[01:34:55] operations: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1051982 (Dzahn)
[01:37:14] RoanKattouw, did you guys deploy my swat patch?
[01:37:24] Yes
[01:37:31] Sorry forgot to notify you
[01:37:36] And forgot to mark it as done
[01:37:40] yep )
[01:37:42] thx ))
[01:51:46] operations: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1051990 (Dzahn) a:Dzahn
[01:58:29] operations, Wikimedia-Bugzilla, Phabricator: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1051992 (Dzahn) What John said. We want to be able to close this task so we can remove old-bugzilla and only keep static-bugzilla, and making those attachments available seems a blo...
[01:59:56] operations, Wikimedia-Bugzilla, Phabricator: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1051993 (Dzahn) fwiw, getting the database sanitized (without even an existing schema from Mozilla) and getting somebody to review that it is properly sanitized to release it seems...
[02:01:21] operations, Wikimedia-Bugzilla, Phabricator: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1051996 (Dzahn) what i don't want is being stuck with this ticket forever and support _2_ old BZ services, the whole point of static- is to kill old-
[02:05:32] operations: Wikimedia mailing lsits for Wikimania - https://phabricator.wikimedia.org/T90042#1052010 (emailbot)
[02:13:14] operations: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1052180 (Catrope) @Mvolz Does Citoid access the internet at all? I thought only Zotero did? Cause that task is T89874
[02:16:54] operations, Wikimedia-Bugzilla, Phabricator: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1052187 (chasemp) What is the use case for obsoleted attachments?
[02:18:13] andrewbogott: are there cron/maintenance scripts for labswiki that do not sudo as www-data or something?
[02:18:25] 2015-02-18 08:28:54 silver labswiki: file_put_contents(/srv/org/wikimedia/controller/wikis/images/thumb/a/a8/Gnome-user-trash-full.svg/20px-Gnome-user-trash-full.svg.png): failed to open stream: Permission denied
[02:18:56] AaronS: maybe. They probably just run as root, though…
[02:19:01] or as apache, heh
[02:20:08] hm, that file is certainly owned by www-data
[02:20:39] there are backup scripts that run as root
[02:21:52] btw how do jobs get run for labswiki?
[02:22:46] !log l10nupdate Synchronized php-1.25wmf17/cache/l10n: (no message) (duration: 00m 01s)
[02:22:50] Logged the message, Master
[02:23:54] !log LocalisationUpdate completed (1.25wmf17) at 2015-02-20 02:22:50+00:00
[02:23:57] Logged the message, Master
[02:24:49] !log l10nupdate Synchronized php-1.25wmf18/cache/l10n: (no message) (duration: 00m 01s)
[02:24:52] Logged the message, Master
[02:24:58] I see, a cron in puppet
[02:25:11] runs as :mediawiki::users::web
[02:25:14] * AaronS looks that up
[02:25:22] AaronS: here’s everything on silver: https://dpaste.de/rrBh
[02:25:54] www-data...checks out
[02:25:56] !log LocalisationUpdate completed (1.25wmf18) at 2015-02-20 02:24:52+00:00
[02:25:59] Logged the message, Master
[02:27:03] * AaronS was looking at the wrong yaml file
[02:27:07] so it is apache
[02:27:26] I can probably move those to www-data.
[02:27:27] * andrewbogott looks
[02:27:27] [18:19] AaronS or as apache, heh
[02:30:19] Services, operations, RESTBase: setup internal LVS for restbase eqiad servers - https://phabricator.wikimedia.org/T89636#1052191 (GWicke) Open>Resolved Working just fine. Thank you, @fgiunchedi!
[02:30:21] Scrum-of-Scrums, operations, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1052193 (GWicke)
[02:31:23] Deployment-Systems, operations: trebuchet puppet provider broken on systems without upstart - https://phabricator.wikimedia.org/T89461#1052194 (GWicke) Open>Resolved a:GWicke This looks fixed now, at least on the three nodes that are live so far.
[02:31:24] Scrum-of-Scrums, operations, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#795710 (GWicke)
[02:31:38] * AaronS wonders whats with all the HttpError spam
[02:32:31] mutante, so static/old bz...
[02:32:56] what's the setup for hidden bugs with static?
[02:39:24] Scrum-of-Scrums, operations, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1052209 (GWicke) We received access to the cluster today, and got three nodes up and running. One node is out of rotation to investigate a hardware issue with one of the SSDs (T89639), and two others...
[02:40:21] Scrum-of-Scrums, operations, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1052211 (GWicke) Open>Resolved
[02:41:05] (PS1) Andrew Bogott: Move "mediawiki::users::web" to www-data on silver. [puppet] - https://gerrit.wikimedia.org/r/191842
[02:41:18] AaronS: will that help?
[02:41:23] hardware-requests, Scrum-of-Scrums, operations, RESTBase: RESTBase production hardware - 4 of 6 ready - https://phabricator.wikimedia.org/T76986#1052215 (GWicke)
[02:41:25] Services, RESTBase, operations: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1041147 (GWicke)
[02:45:04] (CR) Aaron Schulz: [C: 1] Move "mediawiki::users::web" to www-data on silver. [puppet] - https://gerrit.wikimedia.org/r/191842 (owner: Andrew Bogott)
[02:45:41] (CR) Andrew Bogott: [C: 2] Move "mediawiki::users::web" to www-data on silver. [puppet] - https://gerrit.wikimedia.org/r/191842 (owner: Andrew Bogott)
[02:53:24] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:04:41] AaronS: did that quiet down the logs, or is it too soon to tell?
[03:08:37] I'd wait a bit longer
[03:09:44] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:26:04] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail
[03:40:00] Pages are loading very slowly
[03:40:04] :|
[03:42:03] https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/Crystal_Project_colors.png/50px-Crystal_Project_colors.png took 2 minutes to load
[03:44:01] (PS1) Tim Landscheidt: Add txt2yaml [software] - https://gerrit.wikimedia.org/r/191846
[03:44:34] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:46:00] i'm not able to replicate this slowness, using the above url, different sized thumbs, or the site in general. anyone else?
[03:46:18] Bsadowski1 are you confident this is not a network or dns issue?
[03:46:35] Not sure...
[03:46:47] Seems to be fine now
[03:47:13] My internet is fine, jgage
[03:47:24] And I am not using my ISP's DNS server.
[03:47:42] ok, thanks for reporting. if you see a recurrence or pattern please let us know.
[03:48:17] i don't see anything anomalous in ganglia so far
[03:48:22] It's just that bits.wikimedia.org was being weird
[03:48:36] what continent are you on?
[03:48:59] North America
[03:49:02] k
[03:49:27] It was loading scripts at the time
[03:49:44] I was watching in Firebug's network activity tab
[04:34:07] operations: Enable TRIM for SSDs? - https://phabricator.wikimedia.org/T89584#1052294 (GWicke)
[04:36:17] operations: Enable TRIM for SSDs? - https://phabricator.wikimedia.org/T89584#1052296 (GWicke) @faidon, it seems that linux sw raid [gained that capability in 2012](https://lkml.org/lkml/2012/3/11/261), so nowadays it might be worth even for raids.
[04:39:03] Ops-Access-Requests, operations: Requesting access to analytics-privatedata-users for jamesur - https://phabricator.wikimedia.org/T89739#1052298 (Krenair) Is this still valid? Will @Jalexander continue to handle legal-related requests given the Community Engagement restructure?
[04:40:42] (CR) Alex Monk: "Make sure the question on the task gets answered before merging" [puppet] - https://gerrit.wikimedia.org/r/191218 (https://phabricator.wikimedia.org/T89739) (owner: Dzahn)
[04:43:10] Ops-Access-Requests, operations: Requesting access to analytics-privatedata-users for jamesur - https://phabricator.wikimedia.org/T89739#1052300 (Tnegrin) yes -- this access is needed for a current issue that we need to address.
[04:44:19] Ops-Access-Requests, operations: Requesting access to analytics-privatedata-users for jamesur - https://phabricator.wikimedia.org/T89739#1052301 (Jalexander) >>! In T89739#1052298, @Krenair wrote: > Is this still valid? Will @Jalexander continue to handle legal-related requests given the Community Engagement...
[05:33:42] operations, Wikimedia-Bugzilla, Phabricator: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1052325 (Krenair) We'd need better search in phabricator before old- can be turned off (or a BZ DB dump). Also we should keep static copies of the hidden bugs so that they can be ad...
[05:36:06] (PS1) KartikMistry: Beta: CX: Enable more language pairs [puppet] - https://gerrit.wikimedia.org/r/191849
[06:07:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Feb 20 06:06:39 UTC 2015 (duration 6m 38s)
[06:07:51] Logged the message, Master
[06:27:44] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail
[06:28:24] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:28:34] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail
[06:28:35] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:35] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:04] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:54] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:24] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[06:47:44] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:48:55] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:49:14] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:49:24] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:50:14] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:50:35] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:51:04] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:51:24] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:19:39] (PS2) KartikMistry: WIP: Do not use registry and fallback to config.default.js [puppet] - https://gerrit.wikimedia.org/r/191263
[07:30:26] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1052435 (Joe) For the record: yesterday I inserted willingly a server in the nutcracker prod config while it was still rebooting. On 80% of the cl...
[07:53:51] (PS1) Giuseppe Lavagetto: dhcp: Adding entries for mc2002-2006 [puppet] - https://gerrit.wikimedia.org/r/191851
[08:22:52] operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1052455 (Joe) After spending almost one day on it: - MAC addresses for all hosts are missing in the dhcp configuration (I am fixing it slowly, not easy as there is apparently no way of getting the intel NIC mac address fr...
[08:23:13] operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1052456 (Joe) p:Normal>High
[08:35:55] (CR) Giuseppe Lavagetto: [C: 2] dhcp: Adding entries for mc2002-2006 [puppet] - https://gerrit.wikimedia.org/r/191851 (owner: Giuseppe Lavagetto)
[08:51:31] (CR) 20after4: [C: 2] $wgTranslateBlacklist: "en" conditional to the wiki being in English [mediawiki-config] - https://gerrit.wikimedia.org/r/190783 (owner: Nemo bis)
[08:51:43] (Merged) jenkins-bot: $wgTranslateBlacklist: "en" conditional to the wiki being in English [mediawiki-config] - https://gerrit.wikimedia.org/r/190783 (owner: Nemo bis)
[09:02:04] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[09:04:18] twentyafterfour: Don't forget to actually deploy those changes :P
[09:04:33] operations: Enable TRIM for SSDs? - https://phabricator.wikimedia.org/T89584#1052496 (faidon) I said **HW** RAID above, didn't I? :)
[09:05:16] hoo: can't it just go out with the next deployment window?
[09:05:22] twentyafterfour: No
[09:05:29] You merged it, so you need to deploy it
[09:06:09] Imagine someone needs to push an emergency fix now: They'll be confused by those changes and a) lose time checking them and b) sync unrelated changes
[09:08:08] <_joe_> twentyafterfour: or say a database fails
[09:08:28] <_joe_> well, in that case we do sync-file, true that
[09:08:56] _joe_: But they'd still pull in those changes
[09:09:35] <_joe_> yes
[09:10:28] twentyafterfour: I'd suggest you to either push them or to revert them, merge that and pull that on tin
[09:10:33] Also it's Friday
[09:11:26] by push you mean sync file?
[09:11:48] yeah
[09:13:04] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[09:16:08] !log twentyafterfour Synchronized wmf-config/CommonSettings.php: wgTranslateBlacklist (duration: 00m 07s)
[09:16:14] Logged the message, Master
[09:21:18] operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1052506 (Joe) p:Triage>High
[09:22:21] (CR) Filippo Giunchedi: [C: -1] add internal LVS service IP for zotero (1 comment) [dns] - https://gerrit.wikimedia.org/r/191824 (https://phabricator.wikimedia.org/T89870) (owner: Dzahn)
[09:28:27] Labs, ops-eqiad, operations: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1052512 (faidon) @Jgreen, it has a MegaRAID SAS controller, so you need `megacli`, not `mpt-status`.
[09:33:03] RESTBase, operations: Investigate apparent restbase request rate under-reporting in graphite: statsd issue? - https://phabricator.wikimedia.org/T89846#1046847 (fgiunchedi)
[10:02:06] !log reboot restbase1006 after disk reseat
[10:02:13] Logged the message, Master
[10:03:35] paravoid: curious I thought all ciscos had mpt? plus the weird physical/logical mapping
[10:03:43] Services, RESTBase, operations: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1052569 (fgiunchedi) disk has reappeared as sdd, thus rebooting, however this looks more like the controller :( ``` [543994.227854] Result: hostbyte=DID_NO_CONNECT d...
[10:04:43] oh, maybe I'm wrong
[10:05:05] yeah I think I am
[10:05:13] I'll delete that :)
[10:05:47] heh, sadly I remember that from trying to figure out what's the disk to pull with Chris
[10:05:54] RECOVERY - RAID on restbase1006 is OK: OK: Active: 7, Working: 7, Failed: 0, Spare: 0
[10:06:46] operations, MediaWiki-Core-Team: Unexpected N4HPHP13DataBlockFullE - https://phabricator.wikimedia.org/T89958#1052584 (hashar) Should we start monitoring the cold cache usage (maybe via diamond) and alarm on it?
[10:11:33] Services, RESTBase, operations: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1052595 (fgiunchedi) same error when just booting up again, let's go with controller DOA, not sure what's the next step for replacement @cmjohnson ``` [ 81.334942]...
[10:12:05] Services, RESTBase, operations: restbase1006 faulty disk controller - https://phabricator.wikimedia.org/T89639#1052596 (fgiunchedi)
[10:17:45] Labs, ops-eqiad, operations: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1052604 (fgiunchedi) @jgreen I have come across this before, see T84981 (basically mpt-status doesn't work on the cisco but lsiutil does)
[10:31:37] godog: hello
[10:31:55] have the varnish vcl rules been applied to the parsoid varnishes?
[10:35:53] hey mobrovac, nope, the relevant change is https://gerrit.wikimedia.org/r/#/c/191061/
[10:37:14] ah ok, that explains why i hit parsoid when trying a restbase-like url
[10:37:15] :)
[10:37:29] hehe indeed
[10:38:40] (CR) Alexandros Kosiaris: [C: 2] Beta: CX: Enable more language pairs [puppet] - https://gerrit.wikimedia.org/r/191849 (owner: KartikMistry)
[10:39:22] another question, i'm trying to go to graphite.wikimedia.org to see some restbase stats, but can
[10:39:29] can't seem to log in there
[10:39:47] tried my labs login, wm mail login, but no luck
[10:40:32] operations: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1052701 (akosiaris) Yeah Daniel, internal one. Just fill in the gap before starting to use .32 and above as per @fgiunchedi's comment
[10:41:04] mobrovac: labs login is what you should use, I think that has to do with ldap group membership, checking
[10:41:04] (PS1) Giuseppe Lavagetto: mediawiki: install extension in the correct path [puppet] - https://gerrit.wikimedia.org/r/191863
[10:41:18] cool thnx
[10:42:07] (PS2) Filippo Giunchedi: set permissions on cassandra files [puppet/cassandra] - https://gerrit.wikimedia.org/r/191651
[10:42:27] (CR) Filippo Giunchedi: [C: 2 V: 2] set permissions on cassandra files [puppet/cassandra] - https://gerrit.wikimedia.org/r/191651 (owner: Filippo Giunchedi)
[10:52:57] Ops-Access-Requests, operations: wmf ldap group membership for mobrovac - https://phabricator.wikimedia.org/T90108#1052742 (fgiunchedi) NEW
[10:53:55] (Abandoned) Giuseppe Lavagetto: contint: install Java 8 on Trusty servers [puppet] - https://gerrit.wikimedia.org/r/183222 (https://phabricator.wikimedia.org/T85964) (owner: Hashar)
[10:54:02] Ops-Access-Requests, operations: wmf ldap group membership for mobrovac - https://phabricator.wikimedia.org/T90108#1052758 (fgiunchedi) Open>Resolved a:fgiunchedi ``` root@neptunium:~# modify-ldap-group --addmembers=mobrovac wmf root@neptunium:~# ldaplist -l group wmf | grep -i mobro member: uid=mob...
[10:54:29] mobrovac: you should have been in wmf group as part of your onboarding, for some reason you were not
[10:54:52] hehe, i guess so
[10:54:55] thanks godog
[10:55:10] you're welcome mobrovac :)
[10:55:22] I'm in!
[10:55:22] yuhu
[10:59:07] operations: scale statsd reporting/aggregation (plan) - https://phabricator.wikimedia.org/T89857#1052779 (fgiunchedi)
[11:00:02] operations: replace txstatsd - https://phabricator.wikimedia.org/T90111#1052785 (fgiunchedi) NEW a:fgiunchedi
[11:00:13] operations: diamond network collector loss not accurate - https://phabricator.wikimedia.org/T89858#1052795 (fgiunchedi) the reason for this is that the network collector is separate from ip/tcp/udp collectors which would have been the right thing in this case, since enabling those collector will push a signif...
[11:08:04] (CR) Alexandros Kosiaris: [C: -1] "The trebuchet deployed package also needs an entry in manifests/role/deployment.pp, role::deployment::config" (2 comments) [puppet/cassandra] - https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514) (owner: Filippo Giunchedi)
[11:16:03] Services, Parsoid, operations: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1052834 (mobrovac)
[11:19:13] Services, Parsoid, operations: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1052852 (mobrovac) A basic, preliminary v0 of the service template is available at https://github.com/d00rman/service-template-node. We plan to switch to swagger for dr...
[11:22:14] (CR) Alexandros Kosiaris: [C: 2] Ignore .gitreview when building source [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/167759 (owner: Alexandros Kosiaris)
[11:22:27] (CR) Alexandros Kosiaris: [C: 2] Ignore .gitreview when building source [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/167763 (owner: Alexandros Kosiaris)
[11:22:53] (CR) Alexandros Kosiaris: [C: 2 V: 2] Ignore .gitreview when building source [debs/contenttranslation/apertium-lex-tools] - https://gerrit.wikimedia.org/r/167764 (owner: Alexandros Kosiaris)
[11:23:13] (Abandoned) Alexandros Kosiaris: Introduce rack/rackrow facts based on LLDP facts [puppet] - https://gerrit.wikimedia.org/r/167645 (https://phabricator.wikimedia.org/T84518) (owner: Alexandros Kosiaris)
[11:23:39] (PS3) Alexandros Kosiaris: txstatsd: ensure $init_file attributes [puppet] - https://gerrit.wikimedia.org/r/185424
[11:27:47] (CR) Giuseppe Lavagetto: "I have a doubt about the regex for the rack." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/167645 (https://phabricator.wikimedia.org/T84518) (owner: Alexandros Kosiaris)
[11:28:58] urgh puppet line wrap is ugly :( akosiaris re: https://gerrit.wikimedia.org/r/#/c/191652/1/manifests/metrics.pp
[11:30:49] (PS2) Filippo Giunchedi: report cassandra metrics with metrics-graphite [puppet/cassandra] - https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514)
[11:31:40] (CR) Filippo Giunchedi: report cassandra metrics with metrics-graphite (2 comments) [puppet/cassandra] - https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514) (owner: Filippo Giunchedi)
[11:31:52] operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1052876 (Joe) https://gerrit.wikimedia.org/r/#/c/191863/ has a patch. Reviews welcome!
[11:32:10] (03CR) 10Filippo Giunchedi: "trebuchet repo is part of https://gerrit.wikimedia.org/r/#/c/191654/ (this is a submodule, sadly)" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [11:32:10] 3operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1052884 (10Joe) [11:34:45] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1052894 (10fgiunchedi) also pending is backfill of metrics from tungsten via carbonate, but see https://github.com/jssjr/carbonate/issues/47 on why we can't do it straight away (or without shutting carbon-cache down anyway) [11:36:38] 3operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1052896 (10Joe) @DZahn you are correct - I will take a look to what's wrong here. [11:46:36] (03CR) 10Alexandros Kosiaris: [C: 032] Use network::constants to populate url_downloader ACLs [puppet] - 10https://gerrit.wikimedia.org/r/191385 (owner: 10Alexandros Kosiaris) [11:47:34] 3operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1052906 (10Joe) Ok so, the puppet code is correct, which can be verified by looking at the ganglia config on any of the hosts: cluster { name = "Virtualization cluster eqiad"... [11:48:13] 3operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1052907 (10Joe) a:5Joe>3None [11:48:30] 3operations: restructure site.pp to use roles, hiera. 
- https://phabricator.wikimedia.org/T86774#1052910 (10Joe) [11:48:31] 3operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1051895 (10Joe) [11:50:38] (03CR) 10Alexandros Kosiaris: Reuse parsoid varnish for restbase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/191061 (https://phabricator.wikimedia.org/T78194) (owner: 10Alexandros Kosiaris) [11:53:40] 3operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1052940 (10Joe) Upon inspection: virt1000 has correct firewall rules apparently, so maybe this has to do with something else (maybe the nova network rules?). but the config in puppet... [11:56:52] (03PS2) 10Alexandros Kosiaris: Reuse parsoid varnish for restbase [puppet] - 10https://gerrit.wikimedia.org/r/191061 (https://phabricator.wikimedia.org/T78194) [11:58:11] (03CR) 10Alexandros Kosiaris: "I have doubts about the approach in general, but yeah the rack regexp could be better. The one you point out is definitely better" [puppet] - 10https://gerrit.wikimedia.org/r/167645 (https://phabricator.wikimedia.org/T84518) (owner: 10Alexandros Kosiaris) [11:59:35] (03CR) 10Mobrovac: [C: 031] "Even better with 'or' :)" [puppet] - 10https://gerrit.wikimedia.org/r/191061 (https://phabricator.wikimedia.org/T78194) (owner: 10Alexandros Kosiaris) [12:28:03] 3operations: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1052981 (10Aklapper) [12:45:42] (03PS1) 10Alexandros Kosiaris: Restrict url_downloader's proxy to WMF [puppet] - 10https://gerrit.wikimedia.org/r/191872 [12:55:33] (03PS2) 10Alexandros Kosiaris: Restrict url_downloader's proxy to WMF [puppet] - 10https://gerrit.wikimedia.org/r/191872 [12:59:26] (03PS3) 10Alexandros Kosiaris: Restrict url_downloader's proxy to WMF [puppet] - 10https://gerrit.wikimedia.org/r/191872 [13:14:37] jenkins down? 
[13:18:28] hoo: seems like it, lemme see what I can do [13:19:37] Thank you :) [13:19:45] hoo: actually not... I see zuul is running 6 jobs right now [13:19:55] 26 completed, 4 queued [13:20:01] maybe recheck ? [13:20:32] it just posted a +2 on my change. It took it something like 20 minutes though [13:21:02] I have a change in front of me where it didn't comment in an hour [13:21:13] which one ? [13:21:39] https://gerrit.wikimedia.org/r/191870 [13:21:47] I rebased it now [13:22:12] MEDIAWIKI/CORE191870,2ETA: 2 min, queued 4 min ago [13:22:21] that is what https://integration.wikimedia.org/zuul/ says [13:22:49] it did have a hiccup though according to the graphs in that same page [13:22:53] ok, maybe the first one just fell between the cracks [13:23:11] no, there seems to have been an actual problem [13:23:19] but it also seems to be ok now [13:26:10] (03CR) 10Alexandros Kosiaris: [C: 032] Restrict url_downloader's proxy to WMF [puppet] - 10https://gerrit.wikimedia.org/r/191872 (owner: 10Alexandros Kosiaris) [13:27:10] Yes, seems to work again [14:03:11] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Fix "Error: /Stage[main]/Contint::Packages/File[/etc/php5/conf.d/apc.ini]/ensure: change from absent to file failed" - https://phabricator.wikimedia.org/T90039#1053166 (10Krinkle) [14:03:12] 3operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1053168 (10Krinkle) [14:03:48] 3Continuous-Integration, operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1051351 (10Krinkle) [14:04:28] 3Continuous-Integration, Wikimedia-Labs-Infrastructure: Fix "Error: /Stage[main]/Contint::Packages/File[/etc/php5/conf.d/apc.ini]/ensure: change from absent to file failed" - https://phabricator.wikimedia.org/T90039#1051961 (10Krinkle) [14:10:33] (03CR) 10Krinkle: "Could this fix contint/manifests/packages.pp at the same time as well? 
I'm curious why it only started failing now. I've recently created " [puppet] - 10https://gerrit.wikimedia.org/r/191863 (owner: 10Giuseppe Lavagetto) [14:21:54] (03CR) 10Giuseppe Lavagetto: "It has started failing because we've moved our php trusty package to correctly handle the ini file and put it in /etc/php5/mods-available," [puppet] - 10https://gerrit.wikimedia.org/r/191863 (owner: 10Giuseppe Lavagetto) [14:45:20] (03PS2) 10Ottomata: add jamesur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/191218 (https://phabricator.wikimedia.org/T89739) (owner: 10Dzahn) [14:45:32] (03CR) 10Ottomata: [C: 032 V: 032] add jamesur to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/191218 (https://phabricator.wikimedia.org/T89739) (owner: 10Dzahn) [14:48:46] 3Ops-Access-Requests, operations: Requesting access to analytics-privatedata-users for jamesur - https://phabricator.wikimedia.org/T89739#1053313 (10Ottomata) 5Open>3Resolved K you should be good to go. To check: ``` ssh stat1002.eqiad.wmnet hive --database wmf show tables; describe webrequest; ``` :) [14:59:24] 3Ops-Access-Requests, operations: Requesting access to analytics-privatedata-users for jamesur - https://phabricator.wikimedia.org/T89739#1053348 (10Nemo_bis) > increasing need for LCA private data pulls Is this need/process documented somewhere? [15:20:37] (03PS1) 10Giuseppe Lavagetto: admin: move to hiera, use roles where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/191890 [15:20:39] (03PS1) 10Giuseppe Lavagetto: admin: move to hiera, use roles/2 [puppet] - 10https://gerrit.wikimedia.org/r/191891 [15:22:05] 3ops-eqiad, operations: cr1-eqiad Control Board error - https://phabricator.wikimedia.org/T89999#1053431 (10Cmjohnson) This is the response I received from Juniper regarding the CB1 error. His suggestion. Can you set up a MW to switch the CB mastership? The SCB contains circuitry that controls the PEMs. Onl... 
[15:27:54] (PS2) Giuseppe Lavagetto: mediawiki: install extension config files in the correct path [puppet] - https://gerrit.wikimedia.org/r/191863 [15:28:04] Continuous-Integration, operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1053454 (Joe) Added a fix for the contint::packages class as well. [15:28:15] <_joe_> Krinkle: this should fix it [15:28:24] <_joe_> Krinkle: apc is already disabled on trusty [15:31:14] _joe_: disabled but Required? [15:31:23] _joe_: or would our contint needlessly install it [15:31:32] We disabled it before trusty [15:31:38] on precise [15:31:50] Which we still have some nodes of [15:31:57] that should be re-creatable [15:32:15] <_joe_> Krinkle: it shouldn't need to be installed on trusty, no (the disabling config I mean) [15:32:24] <_joe_> and yes, my last PS is wrong :P [15:32:31] OK, just checking :) [15:32:52] okay...so I don't see the failed drive...coren ejected it [15:33:21] cmjohnson1: Dude, that was a joke. You can't eject a nonremovable drive. :-) [15:33:28] Sorry, I thought the joke was obvious. [15:33:30] (PS3) Giuseppe Lavagetto: mediawiki: install extension config files in the correct path [puppet] - https://gerrit.wikimedia.org/r/191863 [15:34:15] coren: i didn't get the joke I assumed you removed it [15:34:27] so I am not seeing it now [15:34:36] (CR) Hashar: [C: 1] "-1 -> +1" [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm) [15:34:44] (CR) Hashar: "check experimental" [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm) [15:35:14] cmjohnson1: I'm surprised that, the drive having failed basic comm, it shows any activity on the LED.
[15:36:14] cmjohnson1: (And I'm really sorry my joke wasn't obvious) [15:36:25] oh, it just came in the middle of several things going on yesterday [15:36:47] i just figured you meant something else [15:38:25] _joe_: Hm.. so mediawiki::packages (includes packages/php5.pp) ensures=>present on php-apc [15:38:31] which contint also includes [15:38:38] both unconditionally [15:38:53] <_joe_> Krinkle: but on trusty you will install apcu [15:39:25] <_joe_> which will not do bytecode caching [15:39:32] !log virt1002 removing disk 0 which should be /dev/sda [15:39:38] Logged the message, Master [15:40:33] _joe_: Hm.. [15:40:42] _joe_: it's aliased from php5-apc? [15:40:58] <_joe_> Krinkle: 99.99% sure, but lemme check [15:41:18] http://packages.ubuntu.com/trusty/php-apc [15:41:38] <_joe_> Krinkle: php-apc is only a stub package [15:41:43] cmjohnson1: Tell me once it's gone, I'll be able to tell you if it was the wrong one at least. [15:41:45] Right [15:43:00] https://github.com/krakjoe/apcu#apcu [15:43:03] ambitious :) [15:43:13] "" [15:43:13] When O+ takes over, many will be tempted to use 3rd party solutions to userland caching, possibly even distributed solutions; this would be a grave error. The tried and tested APC codebase provides far superior support for local storage of PHP variables. [15:43:14] "" [15:43:49] https://github.com/krakjoe/apcu/issues/108 [15:44:01] Hm.. that looks worrying. But I don't know what I'm talking about [15:44:20] <_joe_> Krinkle: but we don't /use/ it [15:44:36] _joe_: Right. [15:44:50] _joe_: it doesn't hook into PHP? [15:45:01] MediaWiki looks for presence of APC and uses it [15:45:05] for objectcache [15:45:08] <_joe_> oh, ok [15:45:13] cmjohnson1: U can haz suksess. A wild disk appears! [15:45:19] <_joe_> ok, let's remove that [15:45:27] <_joe_> but in another patchset [15:45:31] Sure : [15:45:33] :) [15:45:51] I reckon prod will bypass those checks, but vanilla mediawiki as used by Jenkins might not.
[15:46:20] coren: okay cool [15:46:23] Though the object cache is harmless, it's not useful. It would get filled with duplicate and random stuff from 100s of different temporary mediawiki installs. Could go wrong. [15:46:44] I'd rather it stay disabled on trusty [15:47:19] _joe_: In case of the CI instance, the apc file exist error was not fatal. [15:47:40] <_joe_> Krinkle: ok [15:47:44] _joe_: Do you know what state that leaves the instance in? E.g. installed or not, enabled or disabled? [15:48:05] Perhaps I should depool that new slave. [15:48:07] <_joe_> Krinkle: that file installation from puppet had no functional use [15:48:19] <_joe_> so it's just puppet failing, nothing else [15:48:32] <_joe_> but, lemme merge the patch [15:48:41] But if it installed apc and was then unable to place contint/apc-disable.ini, it would leave it enabled, no? [15:48:57] file apc-disable.ini require[php-apc] [15:49:45] contint has its own manually synced puppetmaster so don't worry about ci for now, go ahead if it's good for prod :) [15:49:46] <_joe_> the apc package is empty, so... [15:50:02] Ah, I thought the apc package forwards to apcu [15:50:25] like an alias or wrapper [15:52:17] cmjohnson1: Array in resync. 100% win. [15:53:11] Labs, ops-eqiad, operations: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1053653 (coren) Open>Resolved New disk is in place, and resyncing. [15:53:22] great news coren! thanks sorry bout the misunderstanding. thought it was a French Canadian thing :-P [15:59:07] <_joe_> Krinkle: no it depends on php5-apcu [15:59:24] <_joe_> Krinkle: I'll do another patchset [15:59:40] <_joe_> but first.. coffeee [16:00:39] It says on the ubuntu page that php5-apcu "provides this package" (php-apc). Which would suggest the package is empty (installing directly does nothing) but installing php5-apc would also implement php-apc.
[16:00:54] I'm new to the Debian/ubuntu lingo [16:04:35] (03CR) 10GWicke: [C: 031] "One thing we should check is whether the standard no-cache behavior for POSTs still applies with the early return. We ran into that when w" [puppet] - 10https://gerrit.wikimedia.org/r/191061 (https://phabricator.wikimedia.org/T78194) (owner: 10Alexandros Kosiaris) [16:06:21] (03CR) 10GWicke: "Although, is the hit_for_pass behavior taking care of making all rest.wikimedia.org requests uncacheable?" [puppet] - 10https://gerrit.wikimedia.org/r/191061 (https://phabricator.wikimedia.org/T78194) (owner: 10Alexandros Kosiaris) [16:19:24] (03CR) 10GWicke: "In the longer term encoding the row / rack in the ip address would be great I think. Cassandra for example has built-in support for such a" [puppet] - 10https://gerrit.wikimedia.org/r/167645 (https://phabricator.wikimedia.org/T84518) (owner: 10Alexandros Kosiaris) [16:21:27] 3Phabricator, operations: re-use server 'radon' as phab failover - https://phabricator.wikimedia.org/T88818#1053804 (10chasemp) wondering what the hardware specs are on it in comparison and if it makes a suitable secondary. I would think at this point it is becoming clear that phab is in the critical path for a... [16:21:34] (03CR) 10Alexandros Kosiaris: [C: 032] "Yeah, hit for pass will do exactly that. 
Let's check the no-cache behavior of POSTs though indeed" [puppet] - 10https://gerrit.wikimedia.org/r/191061 (https://phabricator.wikimedia.org/T78194) (owner: 10Alexandros Kosiaris) [16:25:13] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 2 failures [16:28:50] (03PS4) 10Giuseppe Lavagetto: mediawiki: install extension config files in the correct path [puppet] - 10https://gerrit.wikimedia.org/r/191863 [16:29:07] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: install extension config files in the correct path [puppet] - 10https://gerrit.wikimedia.org/r/191863 (owner: 10Giuseppe Lavagetto) [16:29:20] 3operations: replace txstatsd - https://phabricator.wikimedia.org/T90111#1053893 (10fgiunchedi) one candidate to replace txstatsd is https://github.com/armon/statsite . Looking at what txstatsd types we are using it seems statsite doesn't support `meter` type in https://github.com/armon/statsite/issues/77 so we'... [16:30:10] <_joe_> come on jenkins... [16:34:49] (03PS1) 10Alexandros Kosiaris: Fix for 97d7998 [puppet] - 10https://gerrit.wikimedia.org/r/191900 [16:35:44] (03CR) 10Alexandros Kosiaris: [C: 032] Fix for 97d7998 [puppet] - 10https://gerrit.wikimedia.org/r/191900 (owner: 10Alexandros Kosiaris) [16:37:13] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:56:35] 3MediaWiki-General-or-Unknown, operations, Services, Analytics, Wikidata, wikidata-query-service: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1054081 (10bd808) [16:59:43] (03PS1) 10Cmjohnson: adding mac address to dhcp file for restbase1001/2 [puppet] - 10https://gerrit.wikimedia.org/r/191904 [17:01:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: do not install php-apc on newer hosts [puppet] - 10https://gerrit.wikimedia.org/r/191906 [17:04:47] (03CR) 10Cmjohnson: [C: 032] adding mac address to dhcp file for restbase1001/2 [puppet] - 10https://gerrit.wikimedia.org/r/191904 
(owner: Cmjohnson) [17:10:28] ops-codfw, operations: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1054116 (Papaul) The last mw server is mw2214 and not mw2215. [17:10:35] Staging, operations: Package geoipupdate for jessie - https://phabricator.wikimedia.org/T90229#1054117 (Chad) NEW [17:13:57] ops-eqiad, operations: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1054139 (Cmjohnson) restbase1001 and restbase1002 have been set up and are ready to be installed. dhcp file was updated with MAC addresses from 10G NIC. [17:15:04] ops-eqiad, operations: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1054140 (Cmjohnson) Once @fgiunchedi finishes install please close ticket. Worth noting is the access switch still needs redundant power which will take place on Monday 2/23. [17:33:48] operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1054217 (fgiunchedi) NEW [17:33:53] _joe_: ^ [17:34:51] <_joe_> godog: I'd say it's our work for monday [17:35:02] indeedly [17:36:24] (PS1) Ottomata: Create apache site template for forcing https and doing transparent reverse proxy, use this for hue [puppet] - https://gerrit.wikimedia.org/r/191911 (https://phabricator.wikimedia.org/T85834) [17:36:26] mutante: ^ thoughts?
[17:36:45] (03PS1) 10Papaul: removed mw2215 from list, last mw server is mw2214 [dns] - 10https://gerrit.wikimedia.org/r/191912 [17:36:46] i made it kinda generic, but i could make it specific if you think i should [17:47:06] gwicke: restbase is spamming txstatsd with this, 2015-02-20 17:45:43+0000 [-] Bad line: 'restbase.sys-key_rev_value-bucket-key-revision-tid.GET:4xx,ALL|ms' [17:47:45] godog: hmm [17:48:04] thanks for the heads-up, was wondering why the stats had stopped [17:48:59] looks like it's using the regular statsd client, which supports batches [17:49:20] but txstatsd doesn't [17:50:02] godog: http://rest.wikimedia.org/en.wikipedia.org/v1/?doc [17:50:10] ;) [17:50:42] gwicke: \o/ [17:50:47] pretty! [17:50:54] you should send an email to wikitech showing it off [17:51:12] yeah, just tweaking the wording slightly to avoid offense [17:52:01] meh, call it the standard content api [17:52:23] we're doing it, it might as well be successful [17:52:34] gwicke: nice! [17:52:46] i am going to continue resenting the choice of node for the rest of time, but that can't be helped [17:52:53] gwicke: awesome [17:53:01] (03CR) 10Alexandros Kosiaris: "So, whenever a box has to move for some reason in another rack we have to renumber it as well ? 
And so that in case we forget (and we will" [puppet] - 10https://gerrit.wikimedia.org/r/167645 (https://phabricator.wikimedia.org/T84518) (owner: 10Alexandros Kosiaris) [17:53:06] ori: yeah, but Wikimedia REST API sounds good too [17:53:51] ori: if you can convince folks that we should speed things up with Go or Rust, I'd be game [17:54:08] * ori takes screenshot [17:54:12] perf-wise it wouldn't help much shough [17:54:16] *thoug [17:54:17] h [17:54:19] * gwicke can't type [17:55:00] gwicke: I don't think it is the batching, that 4xx,ALL value for a timer is not legal [17:56:12] godog: let me double-check that code, was recently refactoring that area heavily & might have broken something [17:56:13] godog: it took *all* the milliseconds to complete [17:56:51] all 2^32 of them! [17:57:32] gwicke: thanks! [17:57:57] godog: thanks as well! [18:00:43] could somebody here perhaps add me to the services group in gerrit? [18:01:00] ^d: ^^ [18:01:04] sure [18:01:26] smooth [18:01:29] or mediawiki in general [18:01:37] cheers ori [18:01:41] s/or/and/ [18:01:42] :) [18:01:59] I *think* services inherits from mediawiki [18:02:29] done [18:02:37] for both [18:02:40] thx! [18:03:14] !log added mobrovac to mediawiki and services gerrit groups [18:03:18] Logged the message, Master [18:03:26] cheers! [18:06:14] Is there any way to use the novaclient for wikitech/labs instead of the web interface? Or is keystone locked down to just wikitech? [18:07:43] andrewbogott: Coren ^ [18:08:10] thcipriani: novaclient isn’t open to volunteers at the moment. [18:08:19] We’d like to open it but don’t have a clear technical plan at the moment. [18:08:20] what about staff? [18:08:35] um… just ops so far. [18:08:41] what about RelEng? :) :) [18:08:45] It might be possible to open it to deployers [18:08:49] but would be a bit of work [18:09:00] What andrew said. 
[18:09:08] sorry, I interjected myself here, I'll go back to debating budgets for next year [18:09:46] (but, I'd just like us to be prepared for use-cases that we (releng) might have that would make that access useful) [18:11:04] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail [18:11:31] Definitely would be a nice-to-have. Setting up security groups for staging project now. Daydreaming about novaclient :) [18:11:43] thcipriani: are you really, though? [18:12:24] ori: sadly, yes. Yes I am. [18:13:04] andrewbogott, so... deployers but not volunteers? how is that going to work? [18:13:28] Krenair: deployers can just log into a production server that’s set up to use novaclient. [18:13:50] In the long run, though, we want users on labs instances to have a procedure for keystone auth. That way they can use swift, among other things. [18:13:56] (some deployers are non-staff) [18:14:24] greg-g: well, if you’re in the ‘deployment’ group in puppet then you have a login on silver. [18:14:31] And silver already chats with nova [18:14:45] gotcha, just be pedantic :) [18:14:46] (03PS1) 10Ottomata: Update ganglia aggregators for analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/191917 [18:14:49] But, I’m not proposing this, just saying it could be possible if someone made a convincing case [18:17:05] (03CR) 10Ottomata: [C: 032] "Analytics cluster is currently missing from ganglia. I don't think this will fix, but it is an improvement" [puppet] - 10https://gerrit.wikimedia.org/r/191917 (owner: 10Ottomata) [18:17:48] _joe_: That sounds familiar ^ [18:18:04] ottomata: did ganglia/analytics regress, or was it never there to begin with? [18:18:05] greg-g, some ops are also non-staff :) [18:18:18] regress [18:18:21] dunno what's going on [18:18:28] ori just pointed me to it today [18:18:41] happened for virt cluster as well. 
https://phabricator.wikimedia.org/T90035 [18:19:39] <_joe_> ottomata: check the ganglia config files [18:19:41] i have suggested before having a human do a visual scan of the weekly view in ganglia once a week, but no one agreed, so i do it myself when i remember. i don't think i've ever looked and not found something significant. [18:20:10] operations: Virt nodes missing from the 'virtualization cluster eqiad' ganglia report - https://phabricator.wikimedia.org/T90035#1054377 (Ottomata) Analytics hosts are missing from ganglia as of a couple of days ago too! [18:20:34] _joe_: not much has changed, afaik [18:20:39] i am checking them though [18:20:48] and, i can query one of the aggregators from uranium and see the metrics [18:20:54] so the cluster has the metrics [18:21:14] <_joe_> so you can see the metrics on the aggregator but not in the gui? [18:21:19] right [18:21:27] <_joe_> I was suggesting, if you suspect puppet issues [18:22:00] <_joe_> look at the config files on the hosts that do not appear [18:22:10] Krenair: touche :) [18:22:41] _joe_, afaict the nodes are fine, and it is the whole analytics cluster that is missing [18:22:50] <_joe_> ok [18:23:01] <_joe_> so maybe you're missing a config on uranium? [18:23:33] rrd files on uranium for an analytics host haven't been updated since feb 18 [18:23:44] <^d> ori, mobrovac: membership in 'mediawiki' group is based on a vote :) [18:23:46] is gmetad not reaching out to some aggregators? [18:24:01] <^d> services is fine, but the core group really should go through process [18:24:03] ah [18:24:14] ^d: not for staff, iirc?
[18:24:41] _joe_: if i query the gmetad xml port, i do not see analytics cluster hosts [18:24:52] i do see analytics kafka cluster hosts [18:24:55] which are defined to be in their own cluster [18:25:19] <^d> ori: By extension of being in "wmf" [18:25:35] <^d> But explicit membership in "mediawiki" has a process :) [18:28:12] am looking at traffic for port 8649 | grep analytics on uranium [18:28:24] and i see it talking to other analytics hosts, just not ones in the analytics cluster... [18:28:25] hm [18:28:34] restarted an aggrators gmond... [18:28:44] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [18:30:03] ah, whoa, analytics cluster has even disappeared from GUI dropdown! [18:30:04] :p [18:30:08] ^d: ok, i'll fix [18:31:23] PROBLEM - RAID on restbase1006 is CRITICAL: CRITICAL: Active: 8, Working: 8, Failed: 1, Spare: 0 [18:32:18] gonna restart gmetad ... :/ [18:35:57] 3Phabricator, operations: test ops-request post 2/18 update direct to rt - https://phabricator.wikimedia.org/T89832#1054443 (10Krenair) [18:39:28] !log killing icinga-admin.w.o url support [18:39:33] Logged the message, Master [18:39:40] !log killing icinga-admin.w.o url support per T90002 [18:39:42] Logged the message, Master [18:46:24] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [18:46:36] (03CR) 10RobH: [C: 032] remove support for icinga-admin.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/191699 (owner: 10RobH) [18:46:53] (03CR) 10RobH: [C: 032] remove support for icinga-admin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/191700 (owner: 10RobH) [18:50:42] (03CR) 10Dzahn: add internal LVS service IP for zotero (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/191824 (https://phabricator.wikimedia.org/T89870) (owner: 10Dzahn) [18:51:18] (03PS3) 10Dzahn: add internal LVS service IP for zotero [dns] - 10https://gerrit.wikimedia.org/r/191824 
(https://phabricator.wikimedia.org/T89870) [18:52:42] 6operations, 5Patch-For-Review: reclaim dysprosium for spare (was: server status) - https://phabricator.wikimedia.org/T83070#1054589 (10Cmjohnson) [18:52:43] 6operations, 10ops-eqiad: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1054587 (10Cmjohnson) 5Open>3Resolved Fixed the idrac license issue. [18:52:44] oh neon, why you so slowwww [18:53:09] * robh started the puppet run and went to get a drink and snack, has finished the snack, and is still watching puppet run [18:54:08] yep, slowest run of all nodes i believe [18:54:11] argh [18:54:17] and i missed the install cert line someplace [18:54:28] icinga is still up, its just breaking the puppet run [18:54:29] =P [18:54:45] !log i broke puppet on neon, workign to fix [18:54:48] if config check fails it wont restart it [18:54:48] Logged the message, Master [18:54:58] which is good compared to before [18:55:03] its the apache stack not icinga config so yay [18:55:06] ah, ok [18:55:07] much easier [18:57:52] (03PS1) 10Mobrovac: Update RESTBase's configuration file [puppet] - 10https://gerrit.wikimedia.org/r/191922 [18:58:37] (03PS1) 10RobH: remove support for icinga-admin.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/191923 [18:59:08] (03CR) 10RobH: [C: 032] remove support for icinga-admin.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/191923 (owner: 10RobH) [19:01:26] (03CR) 10GWicke: [C: 031] Update RESTBase's configuration file [puppet] - 10https://gerrit.wikimedia.org/r/191922 (owner: 10Mobrovac) [19:01:39] (03PS2) 10Ori.livneh: Update RESTBase's configuration file [puppet] - 10https://gerrit.wikimedia.org/r/191922 (owner: 10Mobrovac) [19:01:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Update RESTBase's configuration file [puppet] - 10https://gerrit.wikimedia.org/r/191922 (owner: 10Mobrovac) [19:03:12] (03PS1) 10Dzahn: add zotero role class skeleton [puppet] - 10https://gerrit.wikimedia.org/r/191925 
(https://phabricator.wikimedia.org/T89867) [19:05:17] !log neon runs puppet fine, back to full service [19:05:21] Logged the message, Master [19:06:15] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:07:37] 6operations: icinga-admin certificate expires 2015-02-26 - replace or depreciate? - https://phabricator.wikimedia.org/T90002#1054740 (10RobH) 5Open>3Resolved I neglected to pull the install cert, done via: https://gerrit.wikimedia.org/r/#/c/191923/ This is now all merged and complete, icinga-admin.wikimed... [19:09:28] 6operations: icinga-admin certificate expires 2015-02-26 - replace or deprecate? - https://phabricator.wikimedia.org/T90002#1054745 (10faidon) [19:10:21] (03PS1) 10Chad: Tidy up SpecialVersionUrl hook usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191927 (https://phabricator.wikimedia.org/T75759) [19:19:02] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to analytics-privatedata-users for jamesur - https://phabricator.wikimedia.org/T89739#1054823 (10Jalexander) >>! In T89739#1053348, @Nemo_bis wrote: >> increasing need for LCA private data pulls > > Is this need/process documented so... [19:23:13] (03PS2) 10Chad: es-tool: support IPv6 addresses in (un)ban-node [puppet] - 10https://gerrit.wikimedia.org/r/191357 [19:23:58] 6operations, 10Wikimedia-Logstash, 10hardware-requests: purchase 3 additional logstash nodes - https://phabricator.wikimedia.org/T89402#1054858 (10RobH) The latest quote back looks correct but they forgot to make it 3, not 1 system. The disks are SATA, so they are also preparing a SAS quote for comparison. [19:30:00] 6operations, 10Staging: Package geoipupdate for jessie - https://phabricator.wikimedia.org/T90229#1054901 (10faidon) Yeah, we have packages. I made them for Debian specifically (c.f. [[ https://bugs.debian.org/768979 | Intent-to-Package bug ]]) and backported them to Ubuntu. 
I've yet to upload them in Debian... [19:41:30] Coren: due to some disappointment with designate I’m investigating an older dns implementation in openstack (which I wrote ages ago…) here’s what it does: https://phabricator.wikimedia.org/P316 [19:42:04] It’s not very flexible, but I think I can work around the issue of name collision by adding another feature (which I also wrote ages ago) that explicitly prevents duplicate instance names. [19:42:07] What do you think? [19:42:44] I defer to your judgment in the matter; but what caused such disappointment in designate? [19:44:11] Coren: Regarding the above, my main concern is the loss of things like associatedDomain: i-000008ab.eqiad.wmflabs [19:44:19] I’m pretty sure that wasn’t good for anything, but not positive [19:44:44] Coren: Designate supports pdns, and the docs imply that it supports the ldap backend but when I asked the actual designate developers they were like, wah? [19:45:03] Ah, that /is/ disappointing. :-) [19:45:10] So I would have to code a new backend driver to use designate. Which I don’t mind, but presumably that code would happen on the tip which wouldn’t work with the version of nova we’re running... [19:45:11] etc. etc. [19:45:23] It’s still the right solution, using the old built-in dns would be a stopgap. [19:45:25] Hm. The i-* aren't used from within the instances, but doesn't puppet use those for certs? [19:46:00] It uses those ids for certs but I don’t know that that has anything to do with ldap. [19:47:26] Coren: Another (ambitious) option is… stop using pdns/ldap entirely. That + hiera means we wouldn’t have to keep instance info in ldap at all. [19:48:55] That's... I'd *love* that, but that seems a bit of an overreach. At least, for the short term. [19:49:27] Yeah [19:49:51] So, it doesn’t sound like you have immediate revulsion towards the proposed change, so let me experiment a bit more and make sure it can work.
[19:51:16] LDAP is one of our more brittle SPOFs; reducing our dependency on it can only lead to good things. :-) [19:52:01] Yeah. Although whenever I think about it I decide that we can’t eliminate it entirely. [19:52:11] Getting instances out of ldap would be a good start though. [19:52:33] I think when Yuvi gets back I will see if he wants to do the work of moving puppet config out of ldap [19:53:13] (03PS1) 1001tonythomas: Added BounceHandler extension to special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191937 [19:54:52] (03PS1) 10Dzahn: LVS configuration for zotero service [puppet] - 10https://gerrit.wikimedia.org/r/191938 (https://phabricator.wikimedia.org/T89867) [20:00:35] springle: seems like lots of QPS values are missing from https://noc.wikimedia.org/dbtree/ [20:04:01] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 5Patch-For-Review: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1055057 (10Dzahn) well.. can we add JohnLewis to the BZ security group then so he can also dump the static copies of the hidden bugs ?:)) [20:09:45] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 5Patch-For-Review: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1055098 (10Dzahn) If this ticket turns out to have a dependency on "better search in phab" (which i'm not convinced of yet), i would politely give it back... 
[20:10:17] 6operations, 3wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1055108 (10RobH) 3NEW a:3RobH [20:10:30] 6operations, 3wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1055116 (10RobH) [20:10:31] 6operations, 10ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1055117 (10RobH) [20:10:57] (03PS1) 10RobH: setting install entries for codfw wtp systems [puppet] - 10https://gerrit.wikimedia.org/r/191940 [20:11:48] (03CR) 10RobH: [C: 032] setting install entries for codfw wtp systems [puppet] - 10https://gerrit.wikimedia.org/r/191940 (owner: 10RobH) [20:15:17] 6operations: create mgmt dns entries for asset tags of wtp2001-2020 - https://phabricator.wikimedia.org/T90274#1055139 (10RobH) 3NEW a:3Papaul [20:15:33] 6operations: create mgmt dns entries for asset tags of wtp2001-2020 - https://phabricator.wikimedia.org/T90274#1055147 (10RobH) [20:15:34] 6operations, 10ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1055148 (10RobH) [20:15:52] 6operations: create mgmt dns entries for asset tags of wtp2001-2020 - https://phabricator.wikimedia.org/T90274#1055149 (10RobH) p:5Triage>3Low [20:16:14] 6operations, 10ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1055154 (10RobH) [20:16:15] 6operations, 3wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1055153 (10RobH) [20:16:16] 6operations: create mgmt dns entries for asset tags of wtp2001-2020 - https://phabricator.wikimedia.org/T90274#1055139 (10RobH) [20:16:17] 6operations, 10ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1055151 (10RobH) 5Open>3Resolved [20:22:45] (03PS1) 10Andrew Bogott: Enforce unique instance names. 
[puppet] - 10https://gerrit.wikimedia.org/r/191943 [20:24:08] (03PS1) 10RobH: setting codfw wtp production dns entries [dns] - 10https://gerrit.wikimedia.org/r/191944 [20:25:41] 6operations, 3wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1055183 (10RobH) p:5Triage>3Normal install params https://gerrit.wikimedia.org/r/#/c/191940/ dns https://gerrit.wikimedia.org/r/#/c/191944/ still need to provision the network ports from the mappings on linked ti... [20:30:23] (03PS2) 10Andrew Bogott: Enforce unique instance names. [puppet] - 10https://gerrit.wikimedia.org/r/191943 [20:30:25] (03PS1) 10Andrew Bogott: Remove the isolated_hosts section [puppet] - 10https://gerrit.wikimedia.org/r/191946 [20:30:29] 6operations, 5Patch-For-Review: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1055199 (10Dzahn) Yep, amended. using .16 now. Also uploaded changes for adding a skeleton role class and LVS config. Linked to the "puppetize zotero" bug. [20:31:23] 6operations, 5Patch-For-Review: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1055201 (10Dzahn) p:5Triage>3Normal [20:34:09] 6operations, 5Patch-For-Review: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1055205 (10Dzahn) fwiw, i asked bblack and he checked and said we can use the entire /24 range of IPs here even though it's not a real /24 network as such but all indidivual /32 but we can u... [20:34:46] (03CR) 10Dzahn: [C: 032] "10.2.2.16 it is - Destination Net Unreachable" [dns] - 10https://gerrit.wikimedia.org/r/191824 (https://phabricator.wikimedia.org/T89870) (owner: 10Dzahn) [20:37:51] 6operations, 5Patch-For-Review: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1055210 (10Dzahn) 5Open>3Resolved merged. 10.2.2.16 is the new service IP. 
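(Editor's sketch, not from the log: picking a service IP out of a pool of individual addresses, as discussed for zotero above, can be illustrated with the standard-library ipaddress module. The pool 10.2.2.0/24 and the list of addresses already in use are assumptions made for the example; only 10.2.2.16 itself appears in the log.)

```python
import ipaddress


def next_free_service_ip(pool_cidr, in_use):
    """Return the first address in pool_cidr that is not in in_use, or None.

    Iterates over every address in the range, including .0 and .255,
    because the log describes the pool as individual /32s rather than
    a real /24 network with network and broadcast addresses.
    """
    pool = ipaddress.ip_network(pool_cidr)
    used = {ipaddress.ip_address(a) for a in in_use}
    for addr in pool:
        if addr not in used:
            return addr
    return None


# Hypothetical usage: pretend .0 through .15 are already assigned.
assigned = [f"10.2.2.{i}" for i in range(16)]
candidate = next_free_service_ip("10.2.2.0/24", assigned)
```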
[20:39:48] 6operations, 10Citoid, 6Services: Provide service monitoring for the citoid and zotero services - https://phabricator.wikimedia.org/T87496#1055213 (10Dzahn) [20:42:00] 6operations, 10Citoid, 6Services: Provide service monitoring for the citoid and zotero services - https://phabricator.wikimedia.org/T87496#1055215 (10Dzahn) There are different ways to monitor it. High-level by putting metrics in ganglia and then checking for changes there or a bit lower-level with Icinga t... [20:42:50] gwicke: do you think each service should have a matching phabricator project tag? [20:43:15] mutante: are you thinking about zotero? [20:43:42] yea, because there is 'citoid' and just 'services' and we have quite a few bugs related to zotero that could be grouped that way [20:44:04] in general I'd say yes, but zotero is a bit of a special case as it's not really its own service & we all agree that it should go away asap [20:44:30] can i argue that for me it's a service because it has a "zotero.svc" DNS name :?) [20:44:38] it's something that's used internally by citoid [20:44:53] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [20:44:55] and should not be used by anything else [20:44:57] wait, we agree that it should go away?? [20:45:04] we have all those bugs about puppetizing it :p [20:45:11] and i literally just added a service IP [20:45:13] yes, see the discussion [20:45:20] where? [20:45:32] then why have the bugs.. sigh? [20:45:44] in the short term we need to use it because it's the quickest way to get the zotero scraper code to run [20:46:07] mutante: https://phabricator.wikimedia.org/T76308 [20:46:48] the scraper code is just JS, but it depends on some framework code that's currently somewhat specific to xulrunner [20:47:13] "pragmatic in the short term" .. 
umpfs [20:47:25] if we can find a way to replicate that framework code on node, then we can do all the scraping directly in citoid, without xulrunner [20:47:43] which is good from a security pov, as xulrunner is hard to maintain [20:47:45] i thought that is about it not being puppetized [20:47:54] it packages an old libssl for example [20:48:25] 10Ops-Access-Requests, 6operations: Requesting access to contint-admins for legoktm - https://phabricator.wikimedia.org/T90275#1055222 (10Legoktm) 3NEW [20:48:37] yet, we have new stuff like "Provide service monitoring for the citoid and zotero services" [20:49:34] it's a bit odd to me to say we don't need to add puppet but we need monitoring [20:49:50] for now we need a way to deploy it [20:50:35] Alex prefers to make zotero its own service [20:50:47] rather than treating it more like an internal citoid thing [20:51:09] and that is "Update the citoid/deploy branch to not contain zotero deploy" i guess [20:51:11] 10Ops-Access-Requests, 6operations: Requesting access to contint-admins for legoktm - https://phabricator.wikimedia.org/T90275#1055234 (10hashar) Kunal has been extremely helpful for CI configuration and related work. I fully endorse him for CI administration access. [20:51:16] either way works [20:51:34] ok [20:51:57] well then, there is a service IP for it to use now [20:52:08] it should definitely not be public [20:52:10] window 43 [20:52:13] gah [20:52:27] it's not, just eqiad.wmnet [20:52:32] mutante: kk, thx [20:53:22] (03CR) 10Andrew Bogott: [C: 032] Enforce unique instance names. [puppet] - 10https://gerrit.wikimedia.org/r/191943 (owner: 10Andrew Bogott) [20:54:01] (03CR) 10Andrew Bogott: [C: 032] Remove the isolated_hosts section [puppet] - 10https://gerrit.wikimedia.org/r/191946 (owner: 10Andrew Bogott) [20:58:58] (03CR) 10Dzahn: "can that Apache template also live in the Apache module? like modules/apache/templates? 
would like to get rid of the global ./templates/ap" [puppet] - 10https://gerrit.wikimedia.org/r/191911 (https://phabricator.wikimedia.org/T85834) (owner: 10Ottomata) [20:59:42] (03PS1) 10Hashar: Add legoktm to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/191954 (https://phabricator.wikimedia.org/T90275) [21:00:11] (03CR) 10Hashar: [C: 031] Add legoktm to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/191954 (https://phabricator.wikimedia.org/T90275) (owner: 10Hashar) [21:02:00] (03CR) 10Dzahn: [C: 031] Add legoktm to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/191954 (https://phabricator.wikimedia.org/T90275) (owner: 10Hashar) [21:02:29] 6operations, 10Citoid: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1055263 (10Jdforrester-WMF) [21:02:30] 6operations, 10Citoid: Configure zotero to use an outbound proxy - https://phabricator.wikimedia.org/T89874#1055262 (10Jdforrester-WMF) [21:02:31] 6operations, 10Citoid: Backport and using zotero-standalone for the zotero service - https://phabricator.wikimedia.org/T89866#1055264 (10Jdforrester-WMF) [21:02:32] 6operations, 10Citoid: Puppetize zotero - https://phabricator.wikimedia.org/T89867#1055261 (10Jdforrester-WMF) [21:02:33] 6operations, 10Citoid: Assign hardware for the zotero service - https://phabricator.wikimedia.org/T89869#1055265 (10Jdforrester-WMF) [21:02:35] 6operations, 10Citoid: Configure citoid to use the new zotero service - https://phabricator.wikimedia.org/T89873#1055267 (10Jdforrester-WMF) [21:02:36] 6operations, 10Citoid: Update the citoid/deploy branch to not contain zotero deploy - https://phabricator.wikimedia.org/T89872#1055266 (10Jdforrester-WMF) [21:05:11] (03PS1) 10John F. 
Lewis: bz: load image via https and static- [puppet] - 10https://gerrit.wikimedia.org/r/191955 [21:05:30] mutante: it could, and i could even create a define abstraction for it too [21:05:42] but, i didn't put it there, because i assumed we wanted the apache module simple [21:05:57] (03CR) 10Aude: [C: 031] "would be super helpful to have legoktm in content-admins" [puppet] - 10https://gerrit.wikimedia.org/r/191954 (https://phabricator.wikimedia.org/T90275) (owner: 10Hashar) [21:06:06] apache::site::proxy::https [21:06:08] or whatever [21:06:15] thoughts? [21:09:11] 10Ops-Access-Requests, 6operations: Requesting access to contint-admins for legoktm - https://phabricator.wikimedia.org/T90275#1055280 (10Dzahn) Legoktm already signed L3. +1 , will just have to wait for 3 business days. [21:09:57] ottomata: so if "hue" was its own module, i would probably say /modules/hue/templates/apache/ [21:11:17] i'm not sure, it was really an open question. i would just like to reduce the stuff that is in global ./templates/ [21:12:00] i think it's cool that you made it generic though [21:14:15] (03CR) 10Dzahn: [C: 032] "oh,thank you, absolutely, i wanted to use the local image" [puppet] - 10https://gerrit.wikimedia.org/r/191955 (owner: 10John F. Lewis) [21:19:09] mutante: hue is part of cdh [21:19:11] i could put this in there [21:19:17] thought maybe this would be useful in more places [21:19:19] but maybe not, eh? 
[21:27:24] 6operations, 10Citoid, 10VisualEditor, 4§ VisualEditor Q3 Blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1055318 (10Jdforrester-WMF) 3NEW [21:28:16] 6operations, 10Citoid: Configure citoid to use the new zotero service - https://phabricator.wikimedia.org/T89873#1055334 (10Jdforrester-WMF) [21:28:17] 6operations, 10Citoid: Update the citoid/deploy branch to not contain zotero deploy - https://phabricator.wikimedia.org/T89872#1055333 (10Jdforrester-WMF) [21:28:18] 6operations, 10Citoid: Backport and using zotero-standalone for the zotero service - https://phabricator.wikimedia.org/T89866#1055335 (10Jdforrester-WMF) [21:28:19] 6operations, 10Citoid: Assign hardware for the zotero service - https://phabricator.wikimedia.org/T89869#1055336 (10Jdforrester-WMF) [21:28:20] 6operations, 10Citoid: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1055337 (10Jdforrester-WMF) [21:28:21] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1055340 (10Jdforrester-WMF) [21:28:24] 6operations, 10Citoid, 6Services: Provide service monitoring for the citoid and zotero services - https://phabricator.wikimedia.org/T87496#1055338 (10Jdforrester-WMF) [21:28:26] 6operations, 10Citoid, 10VisualEditor, 4§ VisualEditor Q3 Blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1055332 (10Jdforrester-WMF) [21:28:28] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1055344 (10ori) Note that xulrunner used to be packaged for Ubuntu, but it was dropped in the Oneiric release to make it easier to keep pace with Mozil... 
[21:28:33] 6operations, 10Citoid, 10VisualEditor, 4§ VisualEditor Q3 Blockers: Improve citoid production service - https://phabricator.wikimedia.org/T90281#1055318 (10Jdforrester-WMF) [21:28:35] 6operations: create mgmt dns entries for asset tags of wtp2001-2020 - https://phabricator.wikimedia.org/T90274#1055347 (10Papaul) Thanks, will be glad to do that. [21:30:11] MatmaRex: yeah, it's pretty client-dependent [21:30:28] the colors match pretty well in irssi with putty, but otherwise it varies [21:42:49] 10Ops-Access-Requests, 6operations: Requesting access to contint-admins for legoktm - https://phabricator.wikimedia.org/T90275#1055395 (10hashar) The reason for the 3 business days waiting period is described at https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access . Ba... [21:43:06] (03CR) 10Hashar: "On hold for 3 days pending reviews per https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access" [puppet] - 10https://gerrit.wikimedia.org/r/191954 (https://phabricator.wikimedia.org/T90275) (owner: 10Hashar) [22:02:03] (03PS3) 10Rush: phab update security extensions for access-request [puppet] - 10https://gerrit.wikimedia.org/r/191387 [22:02:08] (03PS20) 10Gage: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 [22:02:33] (03CR) 10Rush: [C: 032 V: 032] phab update security extensions for access-request [puppet] - 10https://gerrit.wikimedia.org/r/191387 (owner: 10Rush) [22:02:49] 7Blocked-on-Operations, 6operations, 6Phabricator, 5Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1055473 (10chasemp) [22:04:14] 10Ops-Access-Requests, 6operations: Requesting access to contint-admins for legoktm - https://phabricator.wikimedia.org/T90275#1055476 (10Dzahn) p:5Triage>3Normal Legoktm already signed L3. +1 , will just have to wait for 3 business days. 
[22:04:18] 6operations: Enable TRIM for SSDs? - https://phabricator.wikimedia.org/T89584#1055478 (10GWicke) > I said HW RAID above, didn't I? :) Oh, sorry. I blame the lack of reading comprehension on me posting late at night ;) So, that restricts this to SW RAIDs like the Cassandra boxes. [22:05:50] jamesofur: [22:05:56] ? [22:07:35] jamesofur: ah, sorry, i was just wondering if T89904 still has something we can do or if it's done [22:07:50] see my last comments [22:08:07] * jamesofur nods [22:08:30] Probably not at this point, though I do think we should think about some kind of rate limiting option if there is anyway to do that [22:08:39] so that someone can't use our system to mass flood anyone [22:09:12] ok, it might be exim config rather than mailman [22:09:42] yeah, and that likely belongs in a separate task regardless [22:09:53] yea [22:10:11] I'll close this one out [22:11:50] thanks [22:13:07] 6operations, 10RESTBase, 5Patch-For-Review, 7RESTBase-API: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1055503 (10GWicke) 5Open>3Resolved This is now live: https://rest.wikimedia.org/en.wikipedia.org/v1/?doc Thank you, @akosiaris and @fgiunchedi! Lets track further imp... 
[22:37:59] (03CR) 10Gage: "PS20:" [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [22:44:34] (03PS1) 10Rush: Revert "phab update security extensions for access-request" [puppet] - 10https://gerrit.wikimedia.org/r/192009 [22:45:15] (03CR) 10Rush: [C: 032 V: 032] Revert "phab update security extensions for access-request" [puppet] - 10https://gerrit.wikimedia.org/r/192009 (owner: 10Rush) [22:46:04] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:47:07] (03PS1) 10RobH: wtp2001-2020 entries had bad quoting [puppet] - 10https://gerrit.wikimedia.org/r/192010 [22:49:32] (03CR) 10RobH: [C: 032] wtp2001-2020 entries had bad quoting [puppet] - 10https://gerrit.wikimedia.org/r/192010 (owner: 10RobH) [22:49:56] paravoid: the 35ms for eqiad<=>codfw is RTT not TT right? [22:50:23] fwiw checkout puppet on ms-fe1001 and no error.... [22:50:24] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:50:35] gotta love transient alerts [22:51:44] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:53:59] AaronSchulz: something like that, yes [22:54:01] 32 maybe? 
[22:54:17] it will improve soon by itself and we may get a new wave that's even better [22:54:58] although we have two paths and the slower one may be active at times too [22:55:19] Expected RTD [22:55:19] North route: 51 ms (future routing will bring this down to 42ms) [22:55:19] South route: 32.7ms [22:55:35] 64 bytes from bast2001.wikimedia.org (208.80.153.5): icmp_req=60 ttl=62 time=42.9 ms [22:55:58] so, northern route right now, already shorter [22:57:02] ACKNOWLEDGEMENT - RAID on restbase1006 is CRITICAL: CRITICAL: Active: 8, Working: 8, Failed: 1, Spare: 0 daniel_zahn T89639 - faulty disk controller [23:06:46] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1055620 (10ori) I used LD_DEBUG=files on citoid.wmflabs.org to see which libraries zotero depends on but does not bundle, and then dpkg to figure out w... [23:08:52] (03PS1) 10Ori.livneh: zotero: require 'firefox'; get rid of shell wrapper [puppet] - 10https://gerrit.wikimedia.org/r/192016 [23:11:46] !log ori Synchronized php-1.25wmf18/extensions/VisualEditor: 5c4457a555: Update VisualEditor for cherry-picks (duration: 00m 05s) [23:11:51] Logged the message, Master [23:12:35] !log ori Synchronized php-1.25wmf17/extensions/VisualEditor: f14dc93302: Update VisualEditor for cherry-picks (duration: 00m 06s) [23:12:38] Logged the message, Master [23:14:06] (03PS2) 10Ori.livneh: zotero: require 'firefox'; get rid of shell wrapper [puppet] - 10https://gerrit.wikimedia.org/r/192016 [23:28:23] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1055705 (10Dzahn) 12:35 https://phabricator.wikimedia.org/P317 12:35 should work [23:48:24] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1055784 (10Dzahn) tried with the newer version that John found and after he hacked 
it a bit to skip the missing longdesc thing. now this though: ``` @zirconium:/srv/org/wikimedia/... [23:55:32] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1055800 (10Dzahn) the diff here was: https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-PSTE-ta6karxzgl6nfwb/ -- John, here's the longdescs table: ``` mysql:bugs@db2... [23:57:07] Bugzilla.. that sucks. you change your db schema in every version but no docs