[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T0000). [00:02:35] (03CR) 10Dzahn: [C: 032] "- has manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/232658 (https://phabricator.wikimedia.org/T108696) (owner: 10Dzahn) [00:02:38] 6operations: Run assert check to verify the existence of certain texts in the footer - https://phabricator.wikimedia.org/T108081#1555654 (10ZhouZ) Thanks Chase - this is great. And I guess automation has already done its job. I will have to dig into this a bit deeper (I just came back from vacation) but the re... [00:15:29] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1555671 (10Dzahn) @tjones done, you have access to fluorine now :) [fluorine:~] $ sudo -u tjones tail /a/mw-log/api.log works for me now since you already have s... [00:16:00] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1555679 (10Dzahn) 5Open>3Resolved [00:16:11] 10Ops-Access-Requests, 6operations: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1527694 (10Dzahn) [00:16:38] 10Ops-Access-Requests, 6operations: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1527694 (10Dzahn) [00:17:37] 10Ops-Access-Requests, 6operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1555693 (10Dzahn) a:3Muehlenhoff [00:18:28] 10Ops-Access-Requests, 6operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1555015 (10Dzahn) a:5Muehlenhoff>3MoritzMuehlenhoff [00:22:30] 6operations, 6Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1555712 (10Dzahn) or create a new mount /var or /var/log outside of / (take space away 
from /srv or there might be free extents)? [00:23:12] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1555713 (10Dzahn) a:3MoritzMuehlenhoff [00:25:32] 6operations, 7Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1555720 (10Dzahn) also: T93710 [00:27:27] 6operations: package and puppetize ishmael - https://phabricator.wikimedia.org/T82225#1555735 (10Dzahn) @jcrespo do you ever use ishmael.wikimedia.org ? [00:31:28] 6operations: rename gerrit2 account in LDAP - https://phabricator.wikimedia.org/T80648#1555739 (10Dzahn) it's 2.8 now. should this just be closed as rejected then? [00:33:11] 6operations, 7Monitoring: create ganglia aggregator hosts - https://phabricator.wikimedia.org/T80459#1555742 (10Dzahn) meanwhile we use ganglia_new (which was renamed to ganglia and ganglia_old is gone) and we have a specific aggregator per DC, not the random hosts anymore. so i think we can call it resolved.... [00:36:27] 6operations, 7Monitoring: create ganglia aggregator hosts - https://phabricator.wikimedia.org/T80459#1555749 (10Dzahn) ULSFO: bast4001 EQIAD: carbon, netmon1001 CODFW: install2001, netmon1001 ESAMS: hooft [00:36:34] 6operations, 7Monitoring: create ganglia aggregator hosts - https://phabricator.wikimedia.org/T80459#1555750 (10Dzahn) 5Open>3Resolved a:3Dzahn [00:38:30] 6operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1555760 (10Dzahn) It does not seem like we are planning to use dsh anymore in the future. We deleted most groups except: mediawiki-installation, parsoid and "scap-test" anyways. Does it even make sense to keep this open? [00:49:53] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1555784 (10CCogdill_WMF) I agree this is not a great solution, and have made that argument in the past. However, changing the domain w... 
[00:52:53] 6operations, 7Monitoring: revamp apaches ganglia grouping - https://phabricator.wikimedia.org/T79947#1555792 (10Dzahn) 5Open>3Resolved a:3Dzahn [00:55:02] 6operations, 10Beta-Cluster, 7HHVM: hhvm apache fills /var/log/apache2 with access logs - https://phabricator.wikimedia.org/T75262#1555799 (10Dzahn) Why "externally blocked" ? [00:55:23] PROBLEM - puppet last run on mw2186 is CRITICAL puppet fail [00:56:24] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1555805 (10Dzahn) yet another duplicate of T83702 etc [00:57:25] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1555818 (10Dzahn) @Melos please see progress on T83702 , added you there too so we can close this one as a duplicate [00:57:39] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1555819 (10Dzahn) 5Open>3declined a:3Dzahn [00:59:14] RECOVERY - puppet last run on stat1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [01:00:18] mutante, netmon1001 is in codfw? 
:) [01:02:09] Krenair: no, but: [01:02:14] 2012 class { 'ganglia::monitor::aggregator': [01:02:14] 2013 sites => ['eqiad', 'codfw'], [01:02:18] huh [01:02:52] i don't know, i just know about the other aggregators [01:05:01] 6operations, 7Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1555831 (10Dzahn) see T56713#610707 for the reasoning [01:05:45] 6operations, 10Security-Reviews, 10Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#1555837 (10Dzahn) also see T108546 [01:11:04] PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail [01:11:30] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1555845 (10Dzahn) a:3Dzahn [01:12:53] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1555849 (10Dzahn) Anyone wants to take a shot at installing the OS here? [01:13:24] 6operations: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1555852 (10Dzahn) a:3Dzahn [01:14:50] 6operations, 10Traffic, 7Browser-Support-Internet-Explorer, 7HTTPS: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#1555856 (10Dzahn) Did Microsoft make a change yet? [01:15:56] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1555857 (10Dzahn) just wondering, let's say the subtask was resolved, for which services is John requesting permissions? mailman? [01:17:13] 6operations, 10Traffic, 7HTTPS: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1555859 (10Dzahn) >>! In T103919#1406876, @Chmarkine wrote: > stats.wikimedia.org doesn't redirect http to https. It has mixed content (T93702). Do we need to fix that first?... 
[01:21:37] (PS1) Ori.livneh: WIP: Add mwgrep-web [puppet] - https://gerrit.wikimedia.org/r/232668 (https://phabricator.wikimedia.org/T71489)
[01:22:20] mutante, re https://phabricator.wikimedia.org/T101213#1555862 - that script is in operations/puppet
[01:23:27] Puppet, operations: Puppetize ircyall & set up instance appropriately - https://phabricator.wikimedia.org/T1357#1555880 (Dzahn) Somehow i doubt this will be picked up by others because "fix config issues" is vague and the logrotate thing is not part of this ticket?
[01:24:34] RECOVERY - puppet last run on mw2186 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:24:59] operations, Wikimedia-Mailing-lists: mailman: centralize logging or create a mailman admin group - https://phabricator.wikimedia.org/T99734#1555882 (Dzahn) Sounds like this is actually an access request.
[01:26:23] ori, nice, but did you write on this a while ago, before the private wikis change got merged?
[01:26:34] Krenair: yea? but all maintenance scripts are and still not written by operations
[01:28:07] Krenair: no -- I just whipped this up today. I may have misunderstood the purpose of the private wikis change. I replaced that code because I wanted it to be runnable from other hosts (which may not have a copy of private.dblist), and I made it simply exclude private/closed/fishbowl wikis since I figured no one would care about those.
[01:28:54] operations, Wikimedia-Mailing-lists: mailman: centralize logging or create a mailman admin group - https://phabricator.wikimedia.org/T99734#1555885 (Dzahn) we have that mailman-admins group meanwhile. it also includes being able to use "journalctl" to read logs it is applied on fermium. we agreed it won'...
[01:28:59] also, if the idea is to make this public, providing results from private wikis is a no-go
[01:29:08] ah that's the "if 'fishbowl' in site or 'closed' in site or 'private' in site", right
[01:29:09] okay
[01:29:11] makes sense
[01:29:29] operations, Wikimedia-Mailing-lists: mailman: centralize logging or create a mailman admin group - https://phabricator.wikimedia.org/T99734#1555886 (Dzahn) Open>Resolved a:Dzahn mailman-admins: gid: 757 description: Admins for mailman members: [johnflewis] privileges: ['ALL = (l...
[01:29:40] * ori nods
[01:32:37] (PS1) Dzahn: add CNAME videoserver.wm.org -> archive.org [dns] - https://gerrit.wikimedia.org/r/232669 (https://phabricator.wikimedia.org/T99216)
[01:33:12] fishbowl and closed don't need to be filtered for security
[01:33:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[01:33:25] might be something to leave to the user
[01:33:28] (PS2) Dzahn: add CNAME videoserver.wm.org -> archive.org [dns] - https://gerrit.wikimedia.org/r/232669 (https://phabricator.wikimedia.org/T99216)
[01:35:13] Krenair: I'm going to run off soon; comment on the patch?
[01:35:17] ok
[01:35:32] operations, Services: SCA: Move logs to /srv/ - https://phabricator.wikimedia.org/T107900#1555912 (GWicke) @mobrovac, keeping the log level at `info` should be fine for relatively low-traffic services. The logstash cluster was beefed up a bit recently, so should be able to sustain a couple dozen log lines...
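(Editor's sketch, not part of the log.) The filter quoted at 01:29:08 and the follow-up at 01:33:12 ("fishbowl and closed don't need to be filtered for security") could look roughly like the Python below. Everything here is an assumption made for illustration: the idea that each wiki is described by a set of tag strings, the names `ALWAYS_EXCLUDED`, `OPTIONALLY_EXCLUDED` and `searchable_wikis`, and the `include_optional` switch. None of it is taken from the mwgrep-web patch itself.

```python
from typing import Dict, List, Set

# Tag names come from the condition quoted in the chat; how they are stored
# per wiki is a guess made for this example.
ALWAYS_EXCLUDED: Set[str] = {"private"}               # never expose publicly
OPTIONALLY_EXCLUDED: Set[str] = {"fishbowl", "closed"}  # not a security concern


def searchable_wikis(sites: Dict[str, Set[str]],
                     include_optional: bool = False) -> List[str]:
    """Return the wiki names that a public search may cover.

    Private wikis are always dropped. Fishbowl and closed wikis are dropped
    by default, but the caller may opt in to them.
    """
    excluded = set(ALWAYS_EXCLUDED)
    if not include_optional:
        excluded |= OPTIONALLY_EXCLUDED
    # Keep a wiki only if none of its tags is in the excluded set.
    return sorted(name for name, tags in sites.items() if not (tags & excluded))
```

Keeping `private` in a separate, non-overridable set mirrors the point made in the channel: private results are a no-go, while the other two categories might be left to the user.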
[01:36:32] 6operations, 7HTTPS, 7LDAP: SSL certificates on LDAP servers expiring 2015-09-20 - https://phabricator.wikimedia.org/T103590#1555913 (10Dzahn) p:5Low>3Normal raising up to normal since the date is getting closer [01:38:24] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:39:21] 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1555920 (10Dzahn) Aren't all the public archives on gmane.org anyways and get indexed there? [01:40:09] 7Puppet, 6operations: removing admin::groups from hiera doesn't revoke permissions - https://phabricator.wikimedia.org/T89961#1555922 (10Dzahn) so rejected then? [01:41:20] (03CR) 10Alex Monk: WIP: Add mwgrep-web (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/232668 (https://phabricator.wikimedia.org/T71489) (owner: 10Ori.livneh) [01:41:33] 6operations: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#1555925 (10Dzahn) So if this is not in a deb anymore what does that mean for the user it runs as? [01:41:56] 6operations: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#1555928 (10Dzahn) [01:41:57] 6operations, 7Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#1555927 (10Dzahn) [01:45:55] PROBLEM - puppet last run on mw1081 is CRITICAL Puppet has 1 failures [01:52:33] PROBLEM - puppet last run on mw1011 is CRITICAL Puppet has 1 failures [01:56:57] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1555937 (10BBlack) >>! In T102827#1555784, @CCogdill_WMF wrote: > I agree this is not a great solution, and have made that argument in... 
[02:04:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [02:04:34] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [02:04:52] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#1555960 (10Negative24) [02:04:55] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org is unstable - https://phabricator.wikimedia.org/T83702#1555961 (10Negative24) [02:05:04] (03CR) 10Legoktm: "This doesn't really fix T71489...but if this is ok as a web service, can it just be a MW special page?" [puppet] - 10https://gerrit.wikimedia.org/r/232668 (https://phabricator.wikimedia.org/T71489) (owner: 10Ori.livneh) [02:05:34] 6operations, 10Gitblit-Deprecate, 10Wikimedia-Git-or-Gerrit: Git.wikimedia.org keeps going down - https://phabricator.wikimedia.org/T73974#755836 (10Negative24) @Dzahn, Isn't this the correct way to merge dups? 
[02:11:14] RECOVERY - puppet last run on mw1081 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [02:19:43] RECOVERY - puppet last run on mw1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:20:05] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:21:45] (03PS1) 10Mattflaschen: Set wgFlowMigrateReferenceWiki to true to start ref_src_wiki population [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232671 (https://phabricator.wikimedia.org/T107204) [02:22:03] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18672 bytes in 1.011 second response time [02:38:55] !log l10nupdate@tin Synchronized php-1.26wmf18/cache/l10n: l10nupdate for 1.26wmf18 (duration: 10m 41s) [02:45:14] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf18) at 2015-08-20 02:45:14+00:00 [02:47:44] PROBLEM - HHVM rendering on mw2180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.212 second response time [02:49:54] PROBLEM - HHVM processes on mw2180 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [02:50:44] PROBLEM - Apache HTTP on mw2180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.210 second response time [02:53:59] (03PS1) 10Alex Monk: Allow dblist files containing +/- to be used in dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232672 [02:55:33] (03PS2) 10Alex Monk: Allow dblist filenames containing +/- to be used in dblist expressions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/232672 [03:00:57] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 27s) [03:04:09] (03PS1) 10Alex Monk: Remove link to old static.wikipedia.org from dumps download page [puppet] - 10https://gerrit.wikimedia.org/r/232674 [03:15:41] (03PS1) 10Alex Monk: Make foreachwiki accept dblist 
expressions [puppet] - 10https://gerrit.wikimedia.org/r/232675 (https://phabricator.wikimedia.org/T101213) [03:20:38] (03CR) 10Alex Monk: [C: 04-1] "I completely missed 32ca6947" [puppet] - 10https://gerrit.wikimedia.org/r/232675 (https://phabricator.wikimedia.org/T101213) (owner: 10Alex Monk) [03:48:28] (03PS2) 10Alex Monk: Make foreachwiki accept dblist expressions [puppet] - 10https://gerrit.wikimedia.org/r/232675 (https://phabricator.wikimedia.org/T101213) [04:01:31] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1556078 (10mmodell) @gwicke: Where is the error rate for services logged? I'd like to try my hand at building a monitoring task that watch... [04:13:51] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1556086 (10GWicke) @mmodell, the set of metrics and logs to look at depends on the service. For RESTBase, we could for example look at the... [04:20:10] (03PS6) 10GWicke: Lower the InitiatingHeapOccupancyPercent from 45% to 35% [puppet] - 10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) [04:21:04] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 336 MB (3% inode=84%) [05:02:03] PROBLEM - Disk space on mw1114 is CRITICAL: DISK CRITICAL - free space: / 8184 MB (3% inode=92%) [05:22:08] <_joe_> uhm mw1114 again? [05:22:22] <_joe_> "good morning" [05:26:04] RECOVERY - Disk space on iridium is OK: DISK OK [05:26:51] <_joe_> !log compacted pacct.0 on iridium, now wondering why we have process accounting turned on there [05:27:15] morebots went missing. 
[05:45:35] <_joe_> seems like it
[05:48:03] RECOVERY - Disk space on mw1114 is OK: DISK OK
[05:50:21] <_joe_> !log removed a few old apache files on mw1114
[05:50:43] (PS2) Giuseppe Lavagetto: mediawiki: retain only 18 weeks of logs [puppet] - https://gerrit.wikimedia.org/r/232467
[05:52:48] (PS3) Giuseppe Lavagetto: mediawiki: retain only 12 weeks of logs [puppet] - https://gerrit.wikimedia.org/r/232467
[05:53:09] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: retain only 12 weeks of logs [puppet] - https://gerrit.wikimedia.org/r/232467 (owner: Giuseppe Lavagetto)
[05:56:06] operations, Deployment-Systems, Release-Engineering, Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1556189 (mmodell) >>! In T93428#1556086, @GWicke wrote: > @mmodell, the set of metrics and logs to look at depends on the service. For R...
[05:57:20] <_joe_> twentyafterfour: I think that is a horrible idea ^^
[05:57:31] <_joe_> having monitoring to see if a release is good is ok
[05:57:43] <_joe_> but tying monitoring directly in the deploy tool is foolish
[05:57:55] _joe_: why is that foolish?
[05:57:55] <_joe_> or at least should be possible to disable it
[05:58:11] it wouldn't be on-by-default
[05:58:26] <_joe_> I think what is useful is to have a canary pool
[05:58:45] <_joe_> where you deploy and then you can decide to promote your deploy to the whole cluster or not
[05:58:49] yes for sure, but we still need a way to monitor the canary
[05:59:03] <_joe_> yes I'm not saying not to use monitoring
[05:59:29] <_joe_> I say any external monitoring (e.g. not local to the server) should be decoupled from the deploy tool
[05:59:31] something that can detect an increase in errors and bring it to the attention of the deployer would be very helpful, IMO
[05:59:40] <_joe_> yes I agree
[05:59:45] this would be decoupled
[05:59:51] <_joe_> that's not what gabriel was saying :)
[05:59:54] a standalone command
[06:00:29] but the deployment process could fire up a real-time monitor process that runs while the deployment is happening
[06:00:51] and flash some kind of alert when an anomaly is detected
[06:00:55] <_joe_> ok, I agree completely
[06:01:21] <_joe_> that is reasonable. The idea that the rolling deploy could be stopped by logstash being slow terrified me :P
[06:01:41] <_joe_> (just an example, but you get that)
[06:02:02] well, I think it should be up to the person controlling the deployment to decide
[06:02:11] <_joe_> OTOH, we do have local functional monitoring that you could use to test that a service is "working"
[06:02:13] but more automated feedback would be good
[06:02:24] <_joe_> and that you can use directly
[06:03:13] <_joe_> I wrote a small script that does auto-monitoring based on the swagger specs, and that concept could be extended to mediawiki too, I guess
[06:03:27] oh? that sounds interesting
[06:03:36] <_joe_> so you could just use it to declare a deploy done on a single server
[06:04:28] <_joe_> https://gerrit.wikimedia.org/r/#/c/231790/5/modules/service/templates/deployment_script.sh.erb at line 110 we do that exactly
[06:04:50] <_joe_> (this is a smallish script to help deploys of rb/services while we don't get something serious)
[06:06:09] <_joe_> where /usr/local/lib/nagios/plugins/service_checker is this https://github.com/wikimedia/operations-puppet/blob/production/modules/service/files/checker.py
[06:06:54] cool, that takes care of one of the checks, one wheel I won't have to reinvent
[06:07:41] I will make notes of this on the monitoring ticket (https://phabricator.wikimedia.org/T109515)
[06:09:07] <_joe_> yeah we actually use that to monitor the health of apps while they run
[06:09:59] <_joe_> bbiab
[06:15:53] noted.
[06:30:33] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures
[06:30:44] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on mw1172 is CRITICAL puppet fail
[06:31:26] PROBLEM - puppet last run on db1045 is CRITICAL Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on mc2015 is CRITICAL Puppet has 1 failures
[06:31:34] PROBLEM - puppet last run on mw2077 is CRITICAL puppet fail
[06:31:43] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:31:44] PROBLEM - puppet last run on db2058 is CRITICAL Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures
[06:32:14] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 2 failures
[06:32:24] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:32:43] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:55:43] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:24] RECOVERY - puppet last run on db1045 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on mc2015 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:56:35] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on db2058 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:53] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:53] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:14] RECOVERY - puppet last run on mw1172 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:58:35] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:58:41] operations, ops-codfw: mw2180 has a faulty disk - https://phabricator.wikimedia.org/T109687#1556269 (Joe) NEW
[07:00:09] operations, Labs, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1556276 (jcrespo) I will add the user on puppet. Just for the record- on our configuration, users with hosts using dns entries are i...
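(Editor's sketch, not part of the log.) The 05:57 to 06:02 exchange settles on a standalone check that watches the canary pool and alerts the deployer, without the deploy tool depending on it. A minimal Python illustration of that idea follows. The names `ErrorSample` and `canary_warning`, the thresholds, and the assumption that request and error counts are already available from some metrics source are all hypothetical; the log does not say how such a monitor would be built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorSample:
    """Request and error counts over one observation window (hypothetical)."""
    requests: int
    errors: int

    @property
    def rate(self) -> float:
        # Avoid dividing by zero when no traffic was seen.
        return self.errors / self.requests if self.requests else 0.0


def canary_warning(baseline: ErrorSample, canary: ErrorSample,
                   max_ratio: float = 2.0, min_requests: int = 100,
                   floor: float = 0.001) -> Optional[str]:
    """Return a warning message if the canary looks worse than the baseline.

    The result is advisory only: the function never stops a deploy, it just
    gives the person running it something to act on.
    """
    if canary.requests < min_requests:
        return None  # too little canary traffic to judge
    # Use a small floor so a baseline of zero errors does not flag everything.
    threshold = max(baseline.rate, floor) * max_ratio
    if canary.rate > threshold:
        return (f"canary error rate {canary.rate:.2%} "
                f"exceeds baseline {baseline.rate:.2%}")
    return None
```

Returning a message instead of raising or exiting keeps the decision with the deployer, and a slow or missing metrics source can simply yield no warning, so it cannot block a rolling deploy.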
[07:06:27] 6operations: package and puppetize ishmael - https://phabricator.wikimedia.org/T82225#1556293 (10jcrespo) @Dzahn I don't, mainly because Springle didn't like it, so currently I am running `pt-query-digest` manually while I was trying to implement a replacement: T99485. [07:23:35] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1556326 (10Joe) @mmodell you could be interested in the check_graphite nagios script we use - it has threshold alerts and alerts based on... [08:24:54] PROBLEM - dhclient process on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:05] PROBLEM - salt-minion processes on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:23] PROBLEM - Hadoop DataNode on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:34] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:28:45] RECOVERY - Disk space on labvirt1007 is OK: DISK OK [08:29:25] (03PS2) 10Giuseppe Lavagetto: labstore: ignore the replication snapshots with check_disk [puppet] - 10https://gerrit.wikimedia.org/r/231043 (owner: 10coren) [08:31:02] (03CR) 10Giuseppe Lavagetto: [C: 032] labstore: ignore the replication snapshots with check_disk [puppet] - 10https://gerrit.wikimedia.org/r/231043 (owner: 10coren) [09:21:26] 6operations, 10RESTBase, 10RESTBase-Cassandra: Cassandra internode TLS encryption - https://phabricator.wikimedia.org/T108953#1556508 (10fgiunchedi) >>! In T108953#1539824, @BBlack wrote: > The latter is what I'd like to do for the client auth and varnish<->varnish parts of T108580 as well, but one of the ou... 
[09:22:15] PROBLEM - puppet last run on analytics1052 is CRITICAL puppet fail [09:25:43] (03PS2) 10Alexandros Kosiaris: labs: Set Vagrant Puppet environment in mwvagrant wrapper [puppet] - 10https://gerrit.wikimedia.org/r/232532 (owner: 10BryanDavis) [09:25:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/232532 (owner: 10BryanDavis) [09:31:16] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1556518 (10jcrespo) es1009 and es1006, both read/write masters are down to 5%-175G. Any news about the order? [09:34:00] 7Puppet, 6operations: more verbose hiera messages on failures - https://phabricator.wikimedia.org/T109692#1556522 (10fgiunchedi) 3NEW [09:38:57] (03PS1) 10Filippo Giunchedi: swift_new: bump max_connections to match swift [puppet] - 10https://gerrit.wikimedia.org/r/232700 [09:38:59] (03PS1) 10Filippo Giunchedi: hiera: fix trailing comma for swift_new in esams [puppet] - 10https://gerrit.wikimedia.org/r/232701 (https://phabricator.wikimedia.org/T109692) [09:39:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift_new: bump max_connections to match swift [puppet] - 10https://gerrit.wikimedia.org/r/232700 (owner: 10Filippo Giunchedi) [09:39:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] hiera: fix trailing comma for swift_new in esams [puppet] - 10https://gerrit.wikimedia.org/r/232701 (https://phabricator.wikimedia.org/T109692) (owner: 10Filippo Giunchedi) [09:42:23] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1556550 (10Nemo_bis) > Are there environments where curl is not available? (Shared hosting?) Yes. (According to users.) 
[09:43:03] RECOVERY - puppet last run on ms-fe3001 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:50:45] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [10:04:48] (03CR) 10Filippo Giunchedi: "couple more comments, also +Tyler as he might be interested too" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/231790 (owner: 10Giuseppe Lavagetto) [10:24:30] 6operations, 10ContentTranslation-cxserver, 6Language-Engineering, 10MediaWiki-extensions-ContentTranslation, and 3 others: Apertium leaves a ton of stale processes, consumes all the available memory - https://phabricator.wikimedia.org/T107270#1556596 (10akosiaris) 5Open>3Resolved Seems like the proble... [10:35:16] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1556635 (10Joe) [10:35:20] 6operations, 6Services, 3Mobile-Content-Service, 5Patch-For-Review, 7service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1556633 (10Joe) 5Open>3Resolved a:5Joe>3akosiaris [10:37:29] PROBLEM - Disk space on labvirt1007 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 90388 MB (3% inode=99%) [10:37:54] 6operations, 6Labs: bastion-02.bastion.eqiad.wmflabs not restricted_from=(ops) like bastion-01 is - https://phabricator.wikimedia.org/T109641#1556640 (10yuvipanda) 5Open>3Resolved Done! Thanks for spotting! 
[10:43:19] (PS2) Alexandros Kosiaris: cxserver: Use registry from cxserver repository [puppet] - https://gerrit.wikimedia.org/r/232018 (https://phabricator.wikimedia.org/T103856) (owner: KartikMistry)
[10:43:35] (CR) Alexandros Kosiaris: [C: 2 V: 2] cxserver: Use registry from cxserver repository [puppet] - https://gerrit.wikimedia.org/r/232018 (https://phabricator.wikimedia.org/T103856) (owner: KartikMistry)
[10:56:10] operations, Mobile-Apps, Services, Mobile-Content-Service: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1556681 (mobrovac)
[10:56:43] operations, Mobile-Apps, Services, Mobile-Content-Service: Deployment of Mobile App's service on the SCB cluster - https://phabricator.wikimedia.org/T92627#1556683 (mobrovac) Open>Resolved a:bearND>akosiaris
[11:02:18] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 561723 msg: ocg_render_job_queue 4464 msg (=3000 critical)
[11:02:58] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 564230 msg: ocg_render_job_queue 6508 msg (=3000 critical)
[11:03:10] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 564688 msg: ocg_render_job_queue 6846 msg (=3000 critical)
[11:07:54] (CR) Mobrovac: "A couple more comments from me as well :)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/231790 (owner: Giuseppe Lavagetto)
[11:08:00] !logmsgbot Updated cxserver to e221462
[11:17:28] PROBLEM - Disk space on ms-be2009 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error
[11:17:49] PROBLEM - RAID on ms-be2009 is CRITICAL 1 failed LD(s) (Offline)
[11:22:59] PROBLEM - puppet last run on ms-be2009 is CRITICAL Puppet has 1 failures
[11:23:33] (PS1) Filippo Giunchedi: swift: move 'role swift::proxy' on top of node declaration [puppet] - https://gerrit.wikimedia.org/r/232707
[11:27:51] (PS2) Filippo Giunchedi: swift: move 'role swift::proxy' on top of node declaration [puppet] - https://gerrit.wikimedia.org/r/232707
[11:27:56] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: move 'role swift::proxy' on top of node declaration [puppet] - https://gerrit.wikimedia.org/r/232707 (owner: Filippo Giunchedi)
[11:28:17] I think I crashed megacli
[11:33:04] easy to imagine, did you get a commandline option with the wrong upper/lower casing?
[11:33:15] operations, ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1556735 (jcrespo) Resolved>Open ``` Device Present ================ Virtual Drives : 14 Degraded : 0 Offline : 1 Physical Devices : 16...
[11:35:26] (PS2) Filippo Giunchedi: Switch ms-fe/ms-be eqiad to swift_new [puppet] - https://gerrit.wikimedia.org/r/231237 (owner: Faidon Liambotis)
[11:35:37] no, I literally crashed it
[11:35:48] see ^
[11:37:16] I am not sure that pulling the drive out and putting it back again is _that_ magical
[11:37:39] operations, ops-codfw: ms-be2009 - RAID degraded / failed disk - https://phabricator.wikimedia.org/T107877#1556739 (fgiunchedi) @papaul, see above, please order/replace drive, thanks!
[11:37:43] *nod*
[11:39:01] the "problem" is that RAIDs are very reliable, so they will not fail unless the disk is impossible to write to
[11:39:21] hw controllers, I mean there
[11:39:50] and with fail, I mean, marked a disk as failed
[11:43:07] indeed, so it is usually the case that the disk is really gone
[11:43:13] (CR) Alexandros Kosiaris: [C: -1] "Not a bad idea. Some minor comments inline" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/231790 (owner: Giuseppe Lavagetto)
[11:44:47] (PS1) Giuseppe Lavagetto: admin: Add relevant aliases for mobrovac [puppet] - https://gerrit.wikimedia.org/r/232711
[11:44:55] mobrovac: ^^
[11:45:14] !log disable puppet on ms-fe/be1 in preparation to apply https://gerrit.wikimedia.org/r/#/c/231237
[11:45:39] joePanda: lol
[11:45:49] sorry couldn't resist
[11:45:58] (CR) Filippo Giunchedi: [C: 2 V: 2] Switch ms-fe/ms-be eqiad to swift_new [puppet] - https://gerrit.wikimedia.org/r/231237 (owner: Faidon Liambotis)
[11:45:59] yeah, i kind of askes for a troll
[11:46:05] s/askes/asked/
[11:46:11] godog: \o/
[11:46:24] (CR) Matthias Mullie: [C: 1] "Good to go!" [mediawiki-config] - https://gerrit.wikimedia.org/r/232671 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen)
[11:46:33] godog: \o/ \o/
[11:46:45] joePanda: thnx, i'll amend it with some other good aliases :)
[11:46:50] heheh not yet applied fully :D
[11:47:09] (CR) Alex Monk: [C: 1] admin: Add relevant aliases for mobrovac [puppet] - https://gerrit.wikimedia.org/r/232711 (owner: Giuseppe Lavagetto)
[11:47:19] loool
[11:47:39] (PS2) Yuvipanda: admin: Add relevant aliases for mobrovac [puppet] - https://gerrit.wikimedia.org/r/232711 (owner: Giuseppe Lavagetto)
[11:47:49] (CR) Yuvipanda: [C: 2] ":P" [puppet] - https://gerrit.wikimedia.org/r/232711 (owner: Giuseppe Lavagetto)
[11:47:50] oh come on now
[11:48:03] <_jovi_> :D
[11:52:37] (CR) Giuseppe Lavagetto: [C: -2] "I have two fundamental doubts about this:" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[11:53:13] !log depool ms-fe1001 to test a reboot
[11:56:26] (Abandoned) Giuseppe Lavagetto: admin: Add relevant aliases for mobrovac [puppet] - https://gerrit.wikimedia.org/r/232711 (owner: Giuseppe Lavagetto)
[12:00:43] (CR) Alexandros Kosiaris: [C: -1] "I am gonna echo Giuseppe here. Why chain the 2 restbases ? I see no reason for that. Not only that, but it sounds like a messy architectur" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[12:01:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[12:01:59] wait what
[12:02:23] RECOVERY - Disk space on ms-be2009 is OK: DISK OK
[12:04:58] !log repool ms-fe1001
[12:06:15] (CR) BBlack: [C: -1] "See ticket" [dns] - https://gerrit.wikimedia.org/r/232669 (https://phabricator.wikimedia.org/T99216) (owner: Dzahn)
[12:06:35] operations, Wikimedia-DNS, Wikimedia-Video, Patch-For-Review: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1556807 (BBlack) Aside from @dzahn proposing the patch above, I don't think any ops have even been on the CC for this...
[12:08:33] _jovi_, joePanda: I added 4 things to puppet swat
[12:08:35] !log reenable puppet on ms-fe1/ms-be1
[12:08:42] Krenair: ok!
[12:09:50] any obvious issues with those ones?
[12:10:49] I excluded some from my open patch list because they needed wider discussion etc.
[12:16:33] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0
[12:30:16] can someone check if nutcracker is running correctly on mw1142? Getting API errors indicating that it isn't
[12:37:56] sitic, icinga says it is running
[12:37:56] Ops-Access-Requests, operations: add papaul to ops LDAP group - https://phabricator.wikimedia.org/T109640#1556900 (akosiaris) He is a member of the Technical Operations team, it's OK to add him to the group.
[12:39:03] what errors are you getting, sitic? I was detecting some api issues before
[12:39:59] jynus: I'm getting "Nonce already used" OAuth errors from mw1142, which should be some memcache failures (https://phabricator.wikimedia.org/T106066)
[12:40:53] ok, then unrelated to what I was tracking
[12:42:54] akosiaris, around? we had access rights meltdown the other day, pgsql rights got messed up for the water table - https://phabricator.wikimedia.org/T109530
[12:43:17] had to manually fix it, but it might any second upon auto-table update (at least that's what i think caused it)
[12:43:36] yurik: read my comment in that ticket
[12:43:45] mmm, maybe not unrelated
[12:44:10] I was getting lots of mysql api connections, so consistent with memcached misses
[12:44:40] (PS1) Alexandros Kosiaris: Set log_dir for SCA, SCB [puppet] - https://gerrit.wikimedia.org/r/232716 (https://phabricator.wikimedia.org/T107900)
[12:48:27] (PS1) Mobrovac: Citoid: Add security config parameters [puppet] - https://gerrit.wikimedia.org/r/232717 (https://phabricator.wikimedia.org/T98533)
[12:48:47] !log restarted nutcracker on mw1142
[12:49:13] jynus: yes, nutcracher on mw1142 had problems
[12:49:13] (CR) Mobrovac: [C: 1] Set log_dir for SCA, SCB [puppet] - https://gerrit.wikimedia.org/r/232716 (https://phabricator.wikimedia.org/T107900) (owner: Alexandros Kosiaris)
[12:49:16] errors gone
[12:49:26] but it would have been better to depool it
[12:49:29] and let ori gdb it
[12:49:37] akosiaris: https://gerrit.wikimedia.org/r/232717 when you've got a moment
[12:49:50] at least, that's what bd808 and ori have requested
[12:50:39] (CR) Alexandros Kosiaris: [C: 2] Citoid: Add security config parameters [puppet] - https://gerrit.wikimedia.org/r/232717 (https://phabricator.wikimedia.org/T98533) (owner: Mobrovac)
[12:50:51] thnx akosiaris!
[12:51:08] mobrovac: yw
[12:51:08] ok, now I know, akosiaris
[12:52:41] it is interesting the chain of events, memcache fails, so my API servers got saturated
[12:53:47] funny is that mysql stands stable except on the initial 5000 connection peaks, and they only fail because of the 3 second timout
[12:54:11] on the 3-year old servers
[12:55:00] and only on en-, the rest of the langs didn't notice it
[12:57:43] operations, HTTPS: Chrome on OS X 10.11 ("El Capitan") does not trust Wikimedia certificates - https://phabricator.wikimedia.org/T109029#1556954 (BBlack) FYI, as of El Capitan Beta 5 release + Chrome 44.0.2403.155, the issue seems to be resolved, although I'm not sure which update fixed it :)
[13:05:31] PROBLEM - puppet last run on nescio is CRITICAL puppet fail
[13:07:05] https://commons.wikimedia.org/wiki/Special:Contributions/127.0.0.1
[13:07:07] o_O
[13:07:26] there are edits by localhost
[13:07:41] PROBLEM - puppet last run on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:07:48] probably the new categorization code
[13:08:32] it just does User::newFromId( 0 ) if it doesn't get a revision id
[13:08:53] * Krenair loads url
[13:09:05] Steinsplitter, oh, from 2013/2014?
[13:09:13] huh O_O true
[13:09:14] 04:48, 20 August 2015 (UTC) 127. 0.0.1
[13:09:14] (talk) (blocklist)
[13:09:15] ?? possibly open proxy 17‑6‑2015
[13:09:23] wondring why this is triggering the open proxy filter
[13:09:31] (PS2) Alexandros Kosiaris: Set log_dir for SCA, SCB [puppet] - https://gerrit.wikimedia.org/r/232716 (https://phabricator.wikimedia.org/T107900)
[13:09:38] (CR) Alexandros Kosiaris: [C: 2 V: 2] Set log_dir for SCA, SCB [puppet] - https://gerrit.wikimedia.org/r/232716 (https://phabricator.wikimedia.org/T107900) (owner: Alexandros Kosiaris)
[13:09:45] and maybe the existing edits can be assigned to the user who made it :)
[13:10:21] operations, Labs, wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1556979 (jcrespo) mw1142 had the same problem tonight: https://logstash.wikimedia.org/#dashboard/temp/AU9LNp9HOkQDz4dSqpM2 Sorry I restarted instead of depool it.
[13:11:10] PROBLEM - puppet last run on scb1001 is CRITICAL Puppet has 1 failures
[13:13:10] RECOVERY - puppet last run on scb1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:31:20] RECOVERY - puppet last run on nescio is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[13:32:46] (PS1) Alexandros Kosiaris: service::node: change logrotate parameters [puppet] - https://gerrit.wikimedia.org/r/232722
[13:34:32] operations, Deployment-Systems, Release-Engineering, Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1557006 (mmodell) @joe: thanks, that seems a lot more powerful than the ideas I had come up with so far. So we can call the nagios chec...
[13:36:02] akosiaris: for the logrotate patch, which signal is sent by invokerc.d reload?
[13:37:46] bah, init script's reload action
[13:37:47] duh
[13:37:56] which is not defined
[13:38:28] akosiaris: we'll need a restart though, not a relaod
[13:40:51] mobrovac: hmm I was hoping to avoid a restart
[13:41:28] well, its upstart and systemd so initscript is rather unimportant
[13:41:54] akosiaris: normally, a reload (SIGHUP) would re-init the workers, but not the master process
[13:42:32] but this is something we should probably try to correct
[13:45:22] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 652318 msg: ocg_render_job_queue 0 msg
[13:45:41] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 652359 msg: ocg_render_job_queue 0 msg
[13:45:51] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 652378 msg: ocg_render_job_queue 0 msg
[13:46:33] mobrovac: yes, the master process should also honour SIGHUP, not ignore it
[13:48:17] akosiaris: i agree, but it's a current limitation of the clustering lib we use for forking and monitoring workers
[13:48:36] that said, there must be a way to force the master to close its fd on SIGHUP as well
[13:49:08] mobrovac: can't you just handle SIGHUP ?
[13:49:17] is the clustering lib in the way ?
[13:49:23] what's that clustering lib btw ?
[13:50:08] we do, it's just that the interaction between the clustering lib and the logging lib is a bit dark in that corner case
[13:51:07] akosiaris: i'll file a task for that
[13:53:12] mobrovac: hmm that change also needs systemd unit file updated as well
[13:53:22] (PS3) BBlack: vcl_cookies: reduce text-vs-mobile cookie variance [puppet] - https://gerrit.wikimedia.org/r/232638 (https://phabricator.wikimedia.org/T109286)
[13:53:30] akosiaris: for the reload?
[13:54:04] (PS4) BBlack: vcl_cookies: use common pass_auth in mobile [puppet] - https://gerrit.wikimedia.org/r/232518 (https://phabricator.wikimedia.org/T109286)
[13:54:06] (PS4) BBlack: vcl_cookies: re-arrange mobile recv order a bit to match text [puppet] - https://gerrit.wikimedia.org/r/232517 (https://phabricator.wikimedia.org/T109286)
[13:54:08] (PS4) BBlack: vcl_cookies: switch mobile to text-common code, mostly [puppet] - https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286)
[13:54:10] (PS4) BBlack: vcl_cookies: reduce text-vs-mobile cookie variance [puppet] - https://gerrit.wikimedia.org/r/232638 (https://phabricator.wikimedia.org/T109286)
[13:54:24] jouncebot: next
[13:54:24] In 1 hour(s) and 5 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T1500)
[13:54:29] (CR) Alexandros Kosiaris: [C: -1] "Needs systemd unit ExecReload support (upstart already defaults to sending a SIGHUP) as well as support from service-runner" [puppet] - https://gerrit.wikimedia.org/r/232722 (owner: Alexandros Kosiaris)
[13:54:43] mobrovac: yes
[13:54:45] sudo systemctl reload mobileapps.service
[13:54:45] Failed to reload mobileapps.service: Job type reload is not applicable for unit mobileapps.service.
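A minimal sketch of the reload behaviour under discussion, assuming a process that should reopen its log file when it receives SIGHUP so that writes land in a fresh file after logrotate has renamed the old one. It is written in Python purely for illustration and is not the service-runner code; `ReopenableLog` and `read_lines` are hypothetical names.

```python
import os
import signal
import tempfile


class ReopenableLog:
    """Hypothetical log writer that can reopen its file after rotation."""

    def __init__(self, path):
        self.path = path
        self.reopen_count = 0
        self._fh = open(path, "a")

    def write(self, message):
        self._fh.write(message + "\n")
        self._fh.flush()

    def reopen(self, signum=None, frame=None):
        # Signature matches what signal.signal() passes to a handler.
        # Closing and reopening by path picks up the new file that
        # replaces the rotated one.
        self._fh.close()
        self._fh = open(self.path, "a")
        self.reopen_count += 1

    def close(self):
        self._fh.close()


def read_lines(path):
    with open(path) as fh:
        return fh.read().splitlines()


if __name__ == "__main__":
    # Unix only: SIGHUP is not available on Windows.
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "service.log")
    log = ReopenableLog(path)
    signal.signal(signal.SIGHUP, log.reopen)

    log.write("before rotation")
    os.rename(path, path + ".1")          # roughly what logrotate does
    os.kill(os.getpid(), signal.SIGHUP)   # roughly what kill -HUP $MAINPID does
    log.write("after rotation")
    log.close()

    print(read_lines(path + ".1"), read_lines(path))
```

Without the handler, the second write would follow the open file descriptor into the renamed file, which is the "keeps logging into the rotated file" problem. A real service would more likely set a flag in the handler and reopen from its main loop.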
[13:55:05] so we need ExecReload = kill -HUP $MAINPID as I see it
[13:55:13] which is easy
[13:55:29] support for SIGHUP seems to be the difficult part though
[13:56:21] ACKNOWLEDGEMENT - RAID on ms-be2009 is CRITICAL 1 failed LD(s) (Offline) Filippo Giunchedi T107877
[13:56:23] (CR) BBlack: [C: 2] vcl_cookies: switch mobile to text-common code, mostly [puppet] - https://gerrit.wikimedia.org/r/232516 (https://phabricator.wikimedia.org/T109286) (owner: BBlack)
[13:56:30] ACKNOWLEDGEMENT - puppet last run on ms-be2009 is CRITICAL Puppet has 1 failures Filippo Giunchedi T107877
[13:56:34] akosiaris: that will keep the master proc logging into the rotated file, not sure about workers at this point (since i am not sure whether the cluster lib is passing the fd over the pipe or not)
[13:57:00] (CR) BBlack: [C: 2] vcl_cookies: re-arrange mobile recv order a bit to match text [puppet] - https://gerrit.wikimedia.org/r/232517 (https://phabricator.wikimedia.org/T109286) (owner: BBlack)
[13:57:20] (CR) BBlack: [C: 2] vcl_cookies: use common pass_auth in mobile [puppet] - https://gerrit.wikimedia.org/r/232518 (https://phabricator.wikimedia.org/T109286) (owner: BBlack)
[13:58:04] (CR) BBlack: [C: 2] vcl_cookies: reduce text-vs-mobile cookie variance [puppet] - https://gerrit.wikimedia.org/r/232638 (https://phabricator.wikimedia.org/T109286) (owner: BBlack)
[14:10:25] (CR) Alex Monk: "Hashar: ping?" [dumps] - https://gerrit.wikimedia.org/r/207699 (owner: Dereckson)
[14:23:34] (PS1) Andrew Bogott: Increase nf_conntrack_max and nf_conntrack_buckets on labnet1002. [puppet] - https://gerrit.wikimedia.org/r/232727
[14:24:09] akosiaris, thx for the reply - should we alter the import script to grant perms? I thought postgres should automatically grant corrent perms to all new tables by default
[14:28:52] yurik: no, it does not. It actually exactly the opposite. That's why that import script is wrong
[14:29:05] yurik: and yes that script needs fixing
[14:29:59] akosiaris, i am about to change it to add index (we were running 1/8th the speed because that script didn't add indexes)
[14:30:26] i fixed it manually, but need to add it to the script. Should I add the GRANT lines as well?
[14:30:33] yurik: another reason that script is flawed
[14:30:49] any changes done to the table are lost, thanks for pointing that out
[14:31:06] akosiaris, are you proposing to abandon the script or do incremental improvements? :)
[14:31:17] i understand its broken, hence will add a few things now
[14:32:21] if you do many improvements at once and not incremental in order to fix the big architectural problems it has, I am fine with keeping it
[14:33:22] (Abandoned) Andrew Bogott: Increase nf_conntrack_max and nf_conntrack_buckets on labnet1002. [puppet] - https://gerrit.wikimedia.org/r/232727 (owner: Andrew Bogott)
[14:34:45] (PS2) Filippo Giunchedi: Kill role::swift::labs [puppet] - https://gerrit.wikimedia.org/r/231238 (owner: Faidon Liambotis)
[14:34:56] (CR) Filippo Giunchedi: [C: 2 V: 2] Kill role::swift::labs [puppet] - https://gerrit.wikimedia.org/r/231238 (owner: Faidon Liambotis)
[14:36:04] (PS1) Yurik: Maps: Add geo-index to the water_polygons table [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710)
[14:36:10] akosiaris, ^ first
[14:37:47] akosiaris, i'm not sure about architectural improvements esp the validation - i suspect that shp2pgsql does data validation, but obviously as with any binary data formats, you have to either rely on the validator built into the tool, or find another validator which is different from that tool, that you also trust
[14:38:29] yurik: I was actually referring to a GPG or at least MD5/SHAsomething signature
[14:39:35] ah, missed that point
[14:39:41] also, that DROP TABLE, ALTER TABLE name process is architecturally wrong
[14:40:17] Invalid signing cert error on mediawiki.org, just me or...
[14:40:25] akosiaris, i looked at the tool - it only has "append" command
[14:40:30] http://www.bostongis.com/pgsql2shp_shp2pgsql_quickguide.bqg
[14:40:46] Oh, good, every site
[14:41:12] yurik: does append mean also update in that case ?
[14:41:34] Works fine in Chrome, Firefox not a fan...
[14:41:34] (PS2) Alex Monk: Enable RandomRootPage on remaining sites [mediawiki-config] - https://gerrit.wikimedia.org/r/206480 (https://phabricator.wikimedia.org/T18655) (owner: Nemo bis)
[14:41:36] akosiaris, no idea - i am not sure there are IDs inside the file
[14:42:00] (CR) Alex Monk: [C: 1] "I intend to do this within the next hour or so." [mediawiki-config] - https://gerrit.wikimedia.org/r/206480 (https://phabricator.wikimedia.org/T18655) (owner: Nemo bis)
[14:42:45] akosiaris, i suspect that the drop and recreate is actually an ok path in this case, because this way the tool does not need to do all the import in a transaction
[14:42:54] * Nemo_bis begins to suspect Christmas was moved to midsummer
[14:43:04] otherwise we basically have to stop the whole system during the upgrade
[14:43:13] Anyone interested in no Firefox users being able to access the cluster?
[14:43:30] who cares about FF now that we have Edge?
[14:43:32] marktraceur, someone else just reported this in -releng
[14:43:37] And in #mediawiki
[14:43:40] marktraceur: works fine for me
[14:43:45] wfm
[14:43:50] Weird.
[14:43:58] wfm
[14:44:02] DANGER DANGER
[14:44:09] Luke081515: Firefox?
[14:44:15] yes
[14:44:19] a friend complained that he gets https errors
[14:44:21] yurik: we can test it, but dropping the table sounds like a bad idea
[14:44:25] and now I also get https errors
[14:44:29] when I try to read he.wikipedia
[14:44:31] aharoni: Yes, Firefox users are seeing them
[14:44:34] All across the cluster
[14:44:44] de.wikipedia works, but I can't see pictures
[14:44:48] I guess not Nemo_bis or yurik though. Weird.
[14:44:57] firefox versions ?
[14:44:59] I just confirmed it with firefox
[14:45:04] 39 here
[14:45:06] FF40
[14:45:08] and phab is text only, no style elements or so on
[14:45:11] FF40
[14:45:13] too
[14:45:18] An error occurred during a connection to upload.wikimedia.org. Invalid OCSP signing certificate in OCSP response. (Error code: sec_error_ocsp_invalid_signing_cert)
[14:45:24] FF 40 linux
[14:45:27] Phabricator, thankfully, is still working for me
[14:45:37] tried a number of different langs
[14:45:37] just got the error for https://upload.wikimedia.org/wikipedia/commons/archive/8/83/20070823204156!David_face.png
[14:45:40] Yes, I got this error to, Nemo_bis
[14:45:44] another report of it in -tech
[14:45:47] yep, i see the error
[14:45:56] for that link
[14:46:13] marktraceur, Luke081515 Nemo_bis: OS?
[14:46:17] Nemo_bis: I can't see the because I get an error :/
[14:46:18] GNU/Linux
[14:46:22] could it be related to bblack's recent changes to caching
[14:46:29] 4.0.6-200.fc21.x86_64
[14:46:39] Currently only happening for upload.wikimedia.org to me
[14:46:45] ok, I got this too
[14:47:06] Fedora 21, Nemo_bis?
[14:47:10] yes
[14:47:15] phabricator an upload.wikimedia.org affected at my browser
[14:47:15] I've got it on Ubuntu
[14:47:15] he.wikipedia.org. Invalid OCSP signing certificate in OCSP response * sec_error_ocsp_invalid_signing_cert
[14:47:24] report in -tech is from os x
[14:47:36] ok, chrome on ubuntu works, FF40 fails
[14:47:36] yeah it looks like something has changed or broken upstream at globalsign
[14:47:38] Fedora 21, Firefox 32
[14:47:39] hewiki works for me
[14:47:43] operations: sysctl::parameters don't take effect until next reboot (on Trusty at least) - https://phabricator.wikimedia.org/T109711#1557138 (Andrew) NEW
[14:47:52] yurik: and no, it's not a recent change in caching :P
[14:47:53] Luke081515, try the image https://upload.wikimedia.org/wikipedia/commons/archive/8/83/20070823204156!David_face.png
[14:47:57] sorry, Firefox 39
[14:48:01] but works on Chome
[14:48:06] bblack, just trying to be helpful :-P
[14:48:09] sec_error_ocsp_invalid_signing_cert
[14:48:29] At my side, only phab and upload.wikimedia is affected
[14:48:42] operations, Deployment-Systems, Release-Engineering, Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1557148 (GWicke) > What kind of configuration parameters would be useful? The very first thing would be a way to configure which metri...
[14:48:44] phab works now
[14:49:00] en.wikipedia.org does not work for me
[14:49:22] Krenair: enwiki works for me
[14:49:59] Maybe it has to do with location or caching or something, anyway, bblack seems to be on it
[14:50:38] enwiki works for me, BUT not showing images
[14:50:38] they all use the same basic stuff for OCSP. it has nothing to do with caching or specific URLs
[14:50:48] hm, upload.wikimedia seems to work not for all affected people
[14:50:53] it's just random which servers have gotten bad OCSP updates
[14:53:27] I just got the same, anything I can help with?
[14:53:31] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[14:54:11] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[14:54:48] taht was me ^ fixed
[14:55:19] "An error occurred during a connection to en.wikipedia.org. Invalid OCSP signing certificate in OCSP response. (Error code: sec_error_ocsp_invalid_signing_cert) "
[14:55:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[14:55:39] AndyRussG: folks are working on that in another channel, thanks for reporting
[14:55:44] andy, yeah, I got it too
[14:55:50] you mean this other channel?
[14:55:52] great
[14:56:11] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[14:56:13] well, ok, multiple channels
[14:56:56] andrewbogott: thx! I'm here and pingable if I can be of any help
[14:58:22] Which other channel? Is it a seeeeecret channel?
[14:58:43] That was my assumption.
[14:59:13] fwiw, chrome has some special sauce for OCSP so it behaves differently
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T1500).
[15:00:05] James_F kart_ Krenair: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:13] Heya.
[15:00:14] Yeah... Let's not do that yet. :)
[15:00:16] jouncebot: Now is not a good time
[15:00:27] !log rebooting labvirt1008
[15:00:40] (PS1) BBlack: temporarily disable OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/232734
[15:00:55] (CR) BBlack: [C: 2 V: 2] temporarily disable OCSP stapling [puppet] - https://gerrit.wikimedia.org/r/232734 (owner: BBlack)
[15:01:05] ah yeah I'm getting it on Iceweasl not on Chrome. Yesterday I got it on Iceweasel on my phone but I assumed it was an issue with my unupdated phone setup
[15:01:05] * yurik is sure its some global conspiracy govt hack that's messing with the servers
[15:01:08] sec_error_ocsp_invalid_signing_cert on upload.wikimedia
[15:01:10] whats up?
[15:01:12] known
[15:01:13] sorry, I mean Firefox on my phone
[15:01:21] don't swat for a sec, pls!
[15:01:39] bblack: i was about to say, just got OCSP error on phab
[15:02:23] !log performing online schema change on wikidata
[15:02:55] jynus, ^^^
[15:03:30] Krenair: SWATing?
[15:03:36] not now
[15:03:37] Not right this second.
[15:03:55] When things calm down, maybe.
[15:04:02] :)
[15:04:16] OK. What's with The Thing? :)
[15:04:28] yurik, what?
[15:04:37] kart_, lots of FF users can't access the site
[15:04:47] Okay.
[15:04:48] OCSP issues
[15:04:55] we have some OCSP Stapling related issue, and it looks a little complicated, so I'm disabling it in hopes that fixes it
[15:05:10] which is using salt, so I don't want swat-related things slowing down salt :)
[15:06:31] bblack: ack. Thanks for updates!
[15:06:38] Guys, we've had a few OTRS tickets sent in about this
[15:06:54] any particular response you guys want me to send out, or just a standard "try again later" one?
[15:06:56] (PS1) Alexandros Kosiaris: etherpad: Move apache configuration around [puppet] - https://gerrit.wikimedia.org/r/232735
[15:06:58] (PS1) Alexandros Kosiaris: etherpad: set proxy-initial-not-pooled [puppet] - https://gerrit.wikimedia.org/r/232736
[15:07:59] Cookies52: a "we are aware about the issue and working on it" should suffice I say
[15:08:05] akosiaris, ok, will do
[15:08:36] !log restarting restbase1001 to apply temporary heap size of 12G
[15:09:40] this is where i expect to be called "master"
[15:10:01] urandom: That fixed it?
[15:10:14] (CR) GWicke: "Giuseppe and Alex, several (most, afaik) of these statistics will be be page-related, so naturally fit into the /page/ hierarchy in the RE" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[15:10:20] WFM now.
[15:10:39] marktraceur: what does, the bot?
[15:10:50] Oh
[15:11:07] I thought you were fixing the OCSP issue
[15:11:15] no :)
[15:12:07] marktraceur: if it's fixed, i must have fixed it :)
[15:12:12] Obviously!
[15:12:13] whatever it is
[15:12:39] I'm guessing bblack fixed it with his patch above, but I'd like confirmation that it's not just randomly working for me now
[15:12:48] wfm as well
[15:12:54] wfm
[15:13:44] I'm still investigating
[15:13:54] Did icinga alert when nutcracker went nuts on mw1142? I'm not spotting it in my backscroll
[15:13:58] I don't think my patch is all of it, I think GlobalSign had issues and they just fixed those as well
[15:14:07] OK
[15:14:19] bd808, no
[15:14:30] it said 1 process running
[15:14:32] as usual
[15:14:34] boo. so my fancy check doesn't really work :(
[15:14:49] (PS6) Andrew Bogott: Exclude yet more kernels that break nova-compute. [puppet] - https://gerrit.wikimedia.org/r/232524
[15:15:10] what's at the root of all of this is that globalsign was sending OCSP responses with an expired signer on them, whether it was us querying them (for stapling) or the browser directly.
[15:15:10] bd808, sorry, I didn't leave it as is, it seemed natural to me to retarted as it was dead
[15:15:27] I will do it next time
[15:15:31] they seem to no longer be doing that, but it's good that our stapling is disabled because it would take a while for it to stop caching the bad responses, too
[15:16:05] jynus: no worries. It will happen again and we'll try to figure out why
[15:16:27] in my current tests, I see good responses from globalsign directly now, and I think our update scripts will fix the caches and we can turn stapling back on
[15:16:36] but I need to verify all of those things and chase down some related corner issues as I go
[15:16:59] Should I tentatively announce that we're back up?
[15:17:42] (CR) Chad: "I'm not sure why this is in the scap module, has nothing to do with that." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/232668 (https://phabricator.wikimedia.org/T71489) (owner: Ori.livneh)
[15:18:19] (CR) Andrew Bogott: [C: 2] Exclude yet more kernels that break nova-compute. [puppet] - https://gerrit.wikimedia.org/r/232524 (owner: Andrew Bogott)
[15:18:20] wfm now too! \o/ congrats folks :)
[15:19:57] <_jovi_> greg-g: Krenair are you guys going to do SWAT now?
[15:20:14] Is bblack done with salt?
[15:20:15] it's up to bblack
[15:20:21] (CR) BryanDavis: "The nutcracker process on mw1142 went nuts around 2015-08-20T03:02 with the error rate shooting up from ~150/m to ~20K/m. It stayed elevat" [puppet] - https://gerrit.wikimedia.org/r/231704 (https://phabricator.wikimedia.org/T69817) (owner: BryanDavis)
[15:20:21] <_jovi_> kk
[15:20:24] bblack: SWAT OK or wait still?
[15:20:26] I'm sure he'll need slat for most of his life
[15:20:53] bblack: thanks
[15:20:58] greg-g: I think so, if it's nothing horribly risky!
[15:21:13] Krenair: ^ :)
[15:21:20] okay
[15:21:50] (PS2) Filippo Giunchedi: Kill swift.pp [puppet] - https://gerrit.wikimedia.org/r/231239 (owner: Faidon Liambotis)
[15:22:20] kart_, have sent your first couple of changes to jenkins
[15:23:09] James_F, good to go with VE for 50% of new accounts?
[15:23:12] Krenair: cool. Thanks.
[15:23:15] bd808: if there is a consolation, the database API host stayed strong, but couldn't handle peak enwiki thoughput
[15:23:15] Krenair: Yup.
[15:23:40] (CR) Alex Monk: [C: 2] Enable VisualEditor for 50% of new accounts on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/231464 (owner: Jforrester)
[15:24:09] (Merged) jenkins-bot: Enable VisualEditor for 50% of new accounts on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/231464 (owner: Jforrester)
[15:24:16] other than the fact that we're not stapling responses, I think we're in the clear on FF throwing broken OCSP errors
[15:24:54] in the extremely-short term, I think all I have to do now is make sure all the caches get good updates from GlobalSign and then I can turn stapling back on
[15:25:11] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/231464/ (duration: 00m 13s)
[15:25:15] but in the slightly-less-short term, I think there must have been a related bug in our updater that it didn't prevent the bad updates, too
[15:25:15] James_F, ^
[15:25:25] Thanks Krenair.
[15:26:06] oh, hang on a sec
[15:26:22] ?
[15:26:25] errors from one server
[15:26:34] mw2187
[15:26:38] Yay production.
[15:26:46] So… codfw?
[15:26:53] yeah
[15:26:54] rsync: failed to set times on "/srv/mediawiki/wmf-config": Read-only file system (30)
[15:26:56] yeah, doesn't serve traffic
[15:27:00] Meh.
[15:27:20] nothing for you to worry about anyway James_F
[15:27:22] !log on mw2187: rsync: failed to set times on "/srv/mediawiki/wmf-config": Read-only file system (30)
[15:27:30] oh right, no logbot
[15:27:47] but bd808's SAL works :) https://tools.wmflabs.org/sal/production
[15:28:18] which is a great advertisement for today's tech talk on ELK :)
[15:28:19] defense in depth
[15:28:22] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:28:30] PROBLEM - Hadoop DataNode on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
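A rough sketch of the kind of sanity check a stapling updater could run before replacing its cached OCSP response, given the report above that the responder was sending responses with an expired signer. This is an assumption about what such a check might look like, not the actual updater script. `OcspSummary` and `rejection_reason` are hypothetical names, and the sketch assumes the response has already been parsed and its signature verified elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class OcspSummary:
    """Hypothetical, already-parsed view of one OCSP response."""
    this_update: datetime
    next_update: datetime
    signer_not_before: datetime
    signer_not_after: datetime


def rejection_reason(resp: OcspSummary, now: datetime) -> Optional[str]:
    """Return why the response should not be stapled, or None if it looks usable."""
    if not (resp.signer_not_before <= now <= resp.signer_not_after):
        return "signer certificate is not valid at the current time"
    if resp.this_update > now:
        return "thisUpdate is in the future"
    if resp.next_update <= now:
        return "response has expired (nextUpdate has passed)"
    return None


if __name__ == "__main__":
    now = datetime(2015, 8, 20, 15, 0, tzinfo=timezone.utc)
    day = timedelta(days=1)

    # Fresh response, but signed by a certificate that expired yesterday.
    bad = OcspSummary(now - day, now + 3 * day, now - 90 * day, now - day)
    # Fresh response signed by a currently valid certificate.
    good = OcspSummary(now - day, now + 3 * day, now - 90 * day, now + 30 * day)

    for name, resp in (("bad", bad), ("good", good)):
        # Keep serving the previous cached response when a reason is returned.
        print(name, rejection_reason(resp, now))
```

The idea is that an updater keeps the previously cached response whenever a new one is rejected, so a bad upstream response is never stapled.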
[15:28:43] (CR) Filippo Giunchedi: [C: 2 V: 2] Kill swift.pp [puppet] - https://gerrit.wikimedia.org/r/231239 (owner: Faidon Liambotis)
[15:28:45] (PS1) Dzahn: dsh: remove mw2187 from dsh - ro filesystem [puppet] - https://gerrit.wikimedia.org/r/232745
[15:28:45] I need to throw in a slide or two about my tool toys
[15:28:50] PROBLEM - salt-minion processes on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:29:27] greg-g: :-)
[15:29:57] (PS2) Dzahn: dsh: remove mw2187 from dsh - ro filesystem [puppet] - https://gerrit.wikimedia.org/r/232745
[15:30:10] PROBLEM - dhclient process on analytics1056 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:30:12] (CR) Dzahn: [C: 2] "< greg-g> !log on mw2187: rsync: failed to set times on "/srv/mediawiki/wmf-config": Read-only file system (30)" [puppet] - https://gerrit.wikimedia.org/r/232745 (owner: Dzahn)
[15:30:13] kart_, ready?
[15:31:04] (CR) Dzahn: [V: 2] dsh: remove mw2187 from dsh - ro filesystem [puppet] - https://gerrit.wikimedia.org/r/232745 (owner: Dzahn)
[15:31:10] Yes
[15:31:29] !log krenair@tin Synchronized php-1.26wmf19/extensions/ContentTranslation/api/ApiContentTranslationPublish.php: https://gerrit.wikimedia.org/r/#/c/232688/ (duration: 00m 11s)
[15:32:05] !log krenair@tin Synchronized php-1.26wmf18/extensions/ContentTranslation/api/ApiContentTranslationPublish.php: https://gerrit.wikimedia.org/r/#/c/232687/ (duration: 00m 13s)
[15:32:15] kart_, please test
[15:32:19] Okay
[15:32:24] (PS2) Alexandros Kosiaris: etherpad: Move apache configuration around [puppet] - https://gerrit.wikimedia.org/r/232735
[15:32:30] (CR) Alexandros Kosiaris: [C: 2 V: 2] etherpad: Move apache configuration around [puppet] - https://gerrit.wikimedia.org/r/232735 (owner: Alexandros Kosiaris)
[15:32:46] (PS2) Alexandros Kosiaris: etherpad: set proxy-initial-not-pooled [puppet] - https://gerrit.wikimedia.org/r/232736
[15:32:48] (CR) Dzahn: "https://phabricator.wikimedia.org/T109717" [puppet] - https://gerrit.wikimedia.org/r/232745 (owner: Dzahn)
[15:32:53] (CR) Alexandros Kosiaris: [C: 2 V: 2] etherpad: set proxy-initial-not-pooled [puppet] - https://gerrit.wikimedia.org/r/232736 (owner: Alexandros Kosiaris)
[15:32:58] greg-g: removed from dsh, ticket created
[15:36:01] Krenair: looks good. Thanks!
[15:36:51] (PS1) Filippo Giunchedi: swift: replace swift_check_http_host variable [puppet] - https://gerrit.wikimedia.org/r/232746
[15:37:04] (CR) Alex Monk: [C: 2] Enable RandomRootPage on remaining sites [mediawiki-config] - https://gerrit.wikimedia.org/r/206480 (https://phabricator.wikimedia.org/T18655) (owner: Nemo bis)
[15:37:12] (Merged) jenkins-bot: Enable RandomRootPage on remaining sites [mediawiki-config] - https://gerrit.wikimedia.org/r/206480 (https://phabricator.wikimedia.org/T18655) (owner: Nemo bis)
[15:38:19] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/206480/ (duration: 00m 13s)
[15:38:21] (CR) Mobrovac: "The problem here seems to be the usage of a second RESTBase cluster. While we could make do with the principal one, the problem here is th" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[15:38:26] did you run puppet on tin, mutante?
[15:39:38] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/206480/ (duration: 00m 13s)
[15:41:11] PROBLEM - puppet last run on rhodium is CRITICAL puppet fail
[15:41:35] Krenair: i did not. i can now
[15:42:54] !log krenair@tin Synchronized php-1.26wmf19/extensions/ContentTranslation/modules/tools/ext.cx.tools.reference.js: https://gerrit.wikimedia.org/r/#/c/232730/ (duration: 00m 13s)
[15:43:11] RECOVERY - puppet last run on rhodium is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[15:43:18] kart_, ^
[15:43:26] !log krenair@tin Synchronized php-1.26wmf18/extensions/ContentTranslation/modules/tools/ext.cx.tools.reference.js: https://gerrit.wikimedia.org/r/#/c/232729/ (duration: 00m 12s)
[15:43:43] mutante: ty
[15:43:45] Krenair: done
[15:43:53] thanks mutante
[15:44:06] !log uploaded to apt.wikimedia.org jessie-wikimedia: etherpad-lite_1.5.7-1
[15:44:23] Krenair: testing.
[15:44:33] now here's the weird part, 2187 does not look read-only to me now :p
[15:45:09] and that puppet fail on rhodium was one of those glitches that are just gone on next run
[15:46:31] yeah
[15:46:46] I can edit things there
[15:49:13] mutante, https://phabricator.wikimedia.org/P1906 was the full output
[15:50:01] RECOVERY - Router interfaces on cr1-codfw is OK host 208.80.153.192, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0
[15:51:12] (PS2) Filippo Giunchedi: swift: replace swift_check_http_host variable [puppet] - https://gerrit.wikimedia.org/r/232746
[15:52:50] RECOVERY - Router interfaces on cr2-codfw is OK host 208.80.153.193, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0
[15:53:47] Krenair: added that to the ticket
[15:55:09] bblack, https://phabricator.wikimedia.org/T109712
[15:55:33] are ops waiting for ocsp stapling to be okay to turn back on before closing that?
[15:55:44] or can it just be marked resolved now?
[15:58:45] I don't know. I'm writing up an incident response report now
[15:58:50] ok
[15:59:09] I'd be hestitant to just resolve that.
we don't know that related issues won't recur in the short term just yet, depending on how things go at GlobalSign [15:59:59] heh [16:00:01] jouncebot: is dead? [16:00:04] YuviPanda _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T1600). [16:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:07] hah! [16:00:17] haha! [16:00:18] nice [16:00:28] Krenair: I'll do yours first [16:00:30] nope, you're just too quick :p [16:01:08] (03CR) 10Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/830/" [puppet] - 10https://gerrit.wikimedia.org/r/232746 (owner: 10Filippo Giunchedi) [16:01:31] (03PS6) 10Yuvipanda: sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) (owner: 10Alex Monk) [16:01:49] (03CR) 10Yuvipanda: [C: 032 V: 032] sql command: use slave server unless '--write' provided as an option before DB [puppet] - 10https://gerrit.wikimedia.org/r/223365 (https://phabricator.wikimedia.org/T105046) (owner: 10Alex Monk) [16:02:15] Krenair: which host should I run puppet on for you to test this? [16:02:16] tin? [16:02:40] one of tin, terbium, silver... [16:02:50] actually not silver in this case [16:03:08] Krenair: right. running puppet on tin now [16:03:36] <_joe_> YuviPanda: use the compiler :) [16:03:50] _joe_: not for that change no :P it was to a bash script in one host... [16:04:02] <_joe_> ok, fair enough [16:04:27] (03PS3) 10Yuvipanda: Include /api/ rewrite on wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/231976 (owner: 10Alex Monk) [16:04:30] not just one host, but yeah [16:04:34] <_joe_> bd808: I looked at all your patches, I noticed they are missing the rubber duck stamp!!1! 
[16:04:40] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1557475 (10fgiunchedi) >>! In T95253#1536539, @Eevans wrote: > The discovery approach seems much cleaner to me. We would need to puppetize an add... [16:04:51] Krenair: try now [16:05:24] _joe_: all 3 are live in beta via cherry-pick [16:05:26] 6operations, 10Security-Reviews, 10Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#1557487 (10csteipp) [16:05:40] (03PS2) 10Yuvipanda: Copy rest of tin's domain_search to terbium [puppet] - 10https://gerrit.wikimedia.org/r/232087 (owner: 10Alex Monk) [16:05:56] (03CR) 10Yuvipanda: [C: 032 V: 032] "Matches tin" [puppet] - 10https://gerrit.wikimedia.org/r/232087 (owner: 10Alex Monk) [16:06:12] <_joe_> bd808: I'm a bureaucrat now, I don't care if things /work/ [16:06:19] <_joe_> they just need the rubber duck stamp!!! [16:06:28] heh [16:06:48] looks like I'll do Krenair's and _joe_ is gonna do bd808's right after he's done petting the cat :D [16:06:49] YuviPanda, looks good [16:07:18] (03PS4) 10Yuvipanda: Include /api/ rewrite on wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/231976 (owner: 10Alex Monk) [16:07:28] (03CR) 10Yuvipanda: [C: 032 V: 032] Include /api/ rewrite on wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/231976 (owner: 10Alex Monk) [16:07:45] which one is the canary app server? [16:07:52] 🐥 [16:08:00] the test server? [16:08:03] mw1017 [16:08:04] yeah [16:08:39] serves test.wikipedia.org and most other mw sites if you set the special header [16:08:51] right [16:09:27] It serves pretty much everything if you use https://addons.mozilla.org/en-us/firefox/addon/wikimedia-debug-header/ [16:10:07] bd808: Does DNS support real Unicode? Can we make 🐥.eqiad.wmnet alias? 
;-) [16:11:01] James_F: yes, it would be xn--to8h.eqiad.wmnet [16:11:03] James_F: yes, [[w:Punycode]] we have domain names like [16:11:06] xn--80adaxaliyuf0k.xn--p1ai [16:11:11] xn--to8h.eqiad.wmnet [16:12:08] (03PS2) 10Yuvipanda: Remove link to old static.wikipedia.org from dumps download page [puppet] - 10https://gerrit.wikimedia.org/r/232674 (owner: 10Alex Monk) [16:12:12] James_F: not real unicode, but kinda-unicode. IDN is awful :/ [16:12:13] Clearly we must come up with pun Unicode glyph sequences for all our servers. [16:12:20] (03CR) 10Milimetric: "I think there's some confusion here over the overloaded word "domain". There's no proposal to set up an analytics.wikimedia.org domain, t" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [16:12:27] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove link to old static.wikipedia.org from dumps download page [puppet] - 10https://gerrit.wikimedia.org/r/232674 (owner: 10Alex Monk) [16:12:36] https://en.wikipedia.org/wiki/Internationalized_domain_name [16:12:53] Krenair: ok, verified the wikimediafoundation one [16:12:59] Krenair: I think you're all done? [16:13:11] 💩 can be dumps.wikimedia.org; ♖ can be bastion. [16:13:29] (03PS3) 10Filippo Giunchedi: swift: replace swift_check_http_host variable [puppet] - 10https://gerrit.wikimedia.org/r/232746 [16:13:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: replace swift_check_http_host variable [puppet] - 10https://gerrit.wikimedia.org/r/232746 (owner: 10Filippo Giunchedi) [16:13:45] I'm not even sure if IDN allows for characters like those. maybe? 
[16:14:00] YuviPanda, well, the foundationwiki /api/ change still needs to be rolled out to the other servers [16:14:04] I know they have a lot of rules in place to limit look-alike characters and such to help with phishing [16:14:12] Krenair: sure, that'll happen over the next 20mins [16:14:42] https://en.wikipedia.org/wiki/IDN_homograph_attack [16:14:46] <_joe_> bd808: up for your patches? [16:14:46] YuviPanda, okay. did you run puppet on terbium? [16:14:57] (03PS3) 10Giuseppe Lavagetto: Followup Ibd58670e: Also set auto_create_index for beta logstash [puppet] - 10https://gerrit.wikimedia.org/r/231049 (owner: 10Chad) [16:15:00] Krenair: ah, nope. [16:15:03] _joe_: sure. [16:15:29] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Followup Ibd58670e: Also set auto_create_index for beta logstash [puppet] - 10https://gerrit.wikimedia.org/r/231049 (owner: 10Chad) [16:15:34] and on the hosts behind dumps.wikimedia.org (snapshot* I guess?) [16:15:37] (03PS3) 10Giuseppe Lavagetto: beta: Disable authentication for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/231179 (https://phabricator.wikimedia.org/T76784) (owner: 10BryanDavis) [16:15:58] Krenair: hmm, I guess we'll wait the 20mins to verify those :) [16:16:03] the dumps one looked simple enough of course [16:16:55] <_joe_> bd808: the apache change has been tested? [16:17:05] <_joe_> I see it changed quite a lot of things there [16:17:16] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1557554 (10GWicke) Regarding metrics collection: Do we actually still need a separate collector? The regular collector worked fine in staging, and... [16:17:19] _joe_: the bits one? yes. 
it's live now [16:17:30] <_joe_> ok I'll just merge it then :) [16:17:33] it was copy pasta drift [16:17:35] (03PS1) 10Jcrespo: Add extra grants for labswiki from tin [puppet] - 10https://gerrit.wikimedia.org/r/232750 (https://phabricator.wikimedia.org/T98682) [16:17:42] (03PS2) 10Giuseppe Lavagetto: beta: copy prod bits apache config [puppet] - 10https://gerrit.wikimedia.org/r/231583 (owner: 10BryanDavis) [16:17:50] (03CR) 10Giuseppe Lavagetto: [C: 032] beta: copy prod bits apache config [puppet] - 10https://gerrit.wikimedia.org/r/231583 (owner: 10BryanDavis) [16:18:01] ostriches is starting to work on making that drift less likely in the future [16:18:26] https://gerrit.wikimedia.org/r/#/c/197655/ [16:19:23] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database, 5Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1557558 (10jcrespo) There are several databases on silver. I hope I am granting to the right one... [16:19:24] <_joe_> bd808: {{done}} [16:20:01] _joe_: thanks! beta cluster is down to 3 cherry-picks now :) [16:20:37] James_F: on the punycode host names thing -- https://╯‵д′╯彡┻━┻.wmflabs.org/wiki/Main_Page [16:20:41] That patch of mine is actually not all that complicated (mostly moving things about in prep of cleanup) and already works in beta. [16:20:56] bd808: haha nice :) [16:20:57] i wonder if Unicode 7 would be supported with the new emojis. it has stuff like "Reversed Hand With Middle Finger Extended" :) [16:21:03] ostriches: put it in the puppet swat? :) [16:21:05] bd808: I take it some browsers show that as Unicode? [16:21:17] firefox does [16:21:22] (In the address bar.) [16:21:22] oh [16:21:24] I think chrome would too [16:21:27] Kk. [16:21:28] Nope. 
[16:21:48] xn--d1a644lha820cjib27ad0264k :p [16:21:50] Do you see https://xn--d1a644lha820cjib27ad0264k.wmflabs.org [16:22:04] bd808: http://cl.ly/image/0J0r112d1e1t [16:22:05] PROBLEM - puppet last run on analytics1056 is CRITICAL puppet fail [16:22:07] set network.IDN_show_punycode=false in firefox [16:22:11] yep, that works on IRC and in browser [16:22:26] true actually [16:22:36] bd808: Works in Firefox though. [16:22:41] bd808: it does work for me, though it's an abomination :) [16:23:07] that's a bug [16:23:10] same host is also https://ಠ-ಠ.wmflabs.org/wiki/Main_Page [16:23:12] (03PS2) 10Jcrespo: Add extra grants for labswiki from tin [puppet] - 10https://gerrit.wikimedia.org/r/232750 (https://phabricator.wikimedia.org/T98682) [16:23:18] can anyone confirm whether the OCSP error is back for them again or not? [16:23:24] (in Firefox) [16:23:38] I have a test browser open, and it's doing it again out of the blue :/ [16:23:54] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database, 5Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1557559 (10Krenair) Yep, "labswiki" is the exact name of the correct database. Nothing else is necessary. [16:23:58] i don't get it again so far [16:24:04] bblack: https://en.wikipedia.org/wiki/Main_Page is working for me in FF 40.0 [16:24:11] I don't see it now in FF [16:24:21] ah, i just did literally now on en.wikipedia [16:24:25] yurik: Added [16:24:28] but not on phab [16:24:29] Bleh, YuviPanda [16:24:36] it's probably intermittent now.... [16:24:38] indeed it's kind of random [16:25:40] _joe_: https://gerrit.wikimedia.org/r/#/q/197655,n,z [16:25:45] I think I can fix it for good with some manual hacks now, since I have a live correct OCSP we can staple that's good for 8h [16:26:13] <_joe_> ostriches: taking a look [16:26:21] (03CR) 10Andrew Bogott: [C: 032] Move nova-api to labnet1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/232652 (https://phabricator.wikimedia.org/T109653) (owner: 10Andrew Bogott) [16:26:25] (03PS2) 10Andrew Bogott: Move nova-api to labnet1002. [puppet] - 10https://gerrit.wikimedia.org/r/232652 (https://phabricator.wikimedia.org/T109653) [16:28:06] (03CR) 10Andrew Bogott: [C: 032] Move nova-api to labnet1002. [puppet] - 10https://gerrit.wikimedia.org/r/232652 (https://phabricator.wikimedia.org/T109653) (owner: 10Andrew Bogott) [16:28:13] (03PS1) 10BBlack: disable ocsp updater cron for now [puppet] - 10https://gerrit.wikimedia.org/r/232752 [16:28:26] (03PS2) 10BBlack: disable ocsp updater cron for now [puppet] - 10https://gerrit.wikimedia.org/r/232752 [16:28:33] (03CR) 10BBlack: [C: 032 V: 032] disable ocsp updater cron for now [puppet] - 10https://gerrit.wikimedia.org/r/232752 (owner: 10BBlack) [16:28:49] <_joe_> ostriches: I actually don't like that patch (and I see it didn't get a +1, in fact). I'll ask you to work on it [16:31:56] (03CR) 10GWicke: "I have set up a meeting tomorrow to sync up on this." [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [16:32:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please see my comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [16:33:05] (03CR) 10Mobrovac: "Filed https://phabricator.wikimedia.org/T109727 for the service-runner side of things" [puppet] - 10https://gerrit.wikimedia.org/r/232722 (owner: 10Alexandros Kosiaris) [16:34:02] (03PS1) 10BBlack: Revert "temporarily disable OCSP stapling" [puppet] - 10https://gerrit.wikimedia.org/r/232754 [16:34:07] (03PS2) 10BBlack: Revert "temporarily disable OCSP stapling" [puppet] - 10https://gerrit.wikimedia.org/r/232754 [16:34:13] (03CR) 10BBlack: [C: 032 V: 032] Revert "temporarily disable OCSP stapling" [puppet] - 10https://gerrit.wikimedia.org/r/232754 (owner: 10BBlack) [16:34:48] (03CR) 10Mobrovac: "@Giuseppe, filed https://phabricator.wikimedia.org/T109727 to address the worker-restarts problem." [puppet] - 10https://gerrit.wikimedia.org/r/231790 (owner: 10Giuseppe Lavagetto) [16:35:05] <_joe_> mobrovac: I gifted you a token for writing the ticket I should've written [16:35:17] <_joe_> thanks :) [16:35:27] _joe_: Np, I'll work on it more [16:35:43] Krenair: do you want to test / verify your two other patches? [16:35:51] <_joe_> ostriches: it's mainly you removed a config I think should've been there :) [16:35:52] _joe_: :) [16:36:08] <_joe_> but consider me your patch's nanny from now on [16:38:04] YuviPanda, foundationwiki /api/ works [16:38:24] https://dumps.wikimedia.org/backup-index.html has updated to use the newer link [16:38:35] woo [16:38:36] so all good? [16:38:51] and terbium recognises codfw hosts without the codfw.wmnet [16:38:54] so I think it's all done [16:38:59] !log puppet swat done [16:39:22] thanks Krenair, _joe_, bd808, ostriches [16:39:44] _joe_: I think it's probably obsolete [16:39:54] Since prod doesn't do that there [16:40:00] thank you YuviPanda and _joe_! [16:40:07] I still have some things open, but I didn't put them up for swat because... 
well: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+owner:krenair,n,z [16:40:11] <_joe_> ostriches: not really, if you don't choose to run with mpm::worker [16:40:29] somebody should probably restart wm-bot I guess [16:40:36] <_joe_> bd808: I felt like overwheight homer, pressing "yes" multiple time [16:40:41] <_joe_> yes [16:41:34] one probably shouldn't happen because gitblit, another probably needs wider discussion among ops [16:41:39] _joe_: think of yourself more like George Jetson with his sore button pushing finger -- https://en.wikipedia.org/wiki/George_Jetson [16:41:50] and the last is only a few hours old with a sort-of-dependency and I want review from one specific person on it [16:43:35] Krenair: let's redirect it to github!!!1 [16:43:39] (03PS3) 10Jcrespo: Add extra grants for labswiki from tin [puppet] - 10https://gerrit.wikimedia.org/r/232750 (https://phabricator.wikimedia.org/T98682) [16:44:00] (03CR) 10Jcrespo: [C: 032] Add extra grants for labswiki from tin [puppet] - 10https://gerrit.wikimedia.org/r/232750 (https://phabricator.wikimedia.org/T98682) (owner: 10Jcrespo) [16:44:09] <_joe_> YuviPanda: let's use github!!!1! [16:44:26] Deskana|Away, moritzm, greg-g: parsoid would like to do an not-our-usual-schedule deploy today at 4pm EDT/1pm PDT/2000UTC, right after the mediawiki train [16:44:45] _joe_: :P [16:44:54] if that's alright with you all. [16:45:05] cscott: ok, should be fine. currently we're determining if the train is delayed or not, I'll let you know if it is (delayed) [16:45:07] maybe i should ping twentyafterfour as well, who is scheduled for the mediawiki train deploy. [16:45:16] yeah, ping him too :) [16:45:27] greg-g: if it's delayed, would it be later today? or not at all today? [16:45:38] 10Ops-Access-Requests, 6operations: Access to fluorine for Trey Jones (tjones) - https://phabricator.wikimedia.org/T108696#1557674 (10TJones) @Dzahn, I can log in. Thanks! 
[16:46:17] cscott: depends :) [16:50:36] Krenair: Rewrite it to use the diffusion repo for mw-config [16:50:51] cscott: the train deploy won't take long, once we are able to do it [16:51:33] Krenair: https://phabricator.wikimedia.org/diffusion/OMWC/ [16:52:38] (03CR) 10Dzahn: "re: the "TODO" part, that is https://gerrit.wikimedia.org/r/#/c/222522/ and https://gerrit.wikimedia.org/r/#/c/222519/ but not getting any" [puppet] - 10https://gerrit.wikimedia.org/r/227327 (owner: 10Alex Monk) [16:53:35] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1557736 (10DStrine) [16:53:54] greg-g, twentyafterfour: ok. i'll add us to the wiki, let us know how things go. [16:54:58] word [16:55:37] anyone know offhand what the default lifetime is for a wgMemc->set() call? [16:56:10] the googles, they do nothing [16:56:20] andrewbogott: it might be forever... [16:56:22] Krenair: ^ [16:56:56] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1557763 (10GWicke) Timely [blog post on lessons learned from post-mortems](http://danluu.com/postmortem-lessons/): > Configuration bugs,... [16:57:51] andrewbogott, I think that's indefinite unless you specify the expiry [16:58:17] yike, ok. [16:58:23] and the expiry is in seconds, yes? 
[16:58:29] https://github.com/wikimedia/mediawiki/blob/master/includes/objectcache/MemcachedClient.php#L624-L632 [16:58:38] :) [16:58:52] * ostriches whacks Krenair for the github url [16:59:03] silly me thinking there would be docs other than the source :/ [16:59:53] ostriches, :P [17:00:05] https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/objectcache/MemcachedClient.php;HEAD$624 [17:00:21] diffusion++ [17:01:25] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Database, 5Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1557773 (10jcrespo) 5Open>3Resolved a:3jcrespo So, I added wikiadmin to puppet from tin, and that should work and resolve the is... [17:01:30] https://doc.wikimedia.org/mediawiki-core/master/php/classMWMemcached.html#a31e5e2bd2ec5db2c9f8139dd51395aae [17:01:32] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1557776 (10Joe) >>! In T93428#1557148, @GWicke wrote: >> What kind of configuration parameters would be useful? > > The very first thing... [17:01:40] Krenair: this is half-baked but will help with an immediate issue I’m having. Mind looking? https://gerrit.wikimedia.org/r/#/c/232760 [17:01:46] (And, check the logic, it might be stupid) [17:01:59] one moment [17:06:45] YuviPanda, i think you need two puppet swats per day ;) [17:07:01] yurik: ? [17:07:03] why so? 
[17:07:10] YuviPanda, because there's already 8 patches :) [17:07:26] ah heh [17:07:28] they were all trivial [17:07:33] and it took only 35mins for all merging + verification [17:07:38] than bump the 8 max ;) [17:07:55] i will add one :) [17:08:01] 6operations, 10Traffic: ocsp updater: handle openssl "trylater" and similar more-gracefully - https://phabricator.wikimedia.org/T109737#1557803 (10BBlack) 3NEW a:3BBlack [17:08:21] yurik: it's done for today :D next one on tuesday [17:08:38] meh, not every day, horrible [17:08:48] (twice!) [17:09:30] 6operations, 10Traffic: ocsp updater: validate the signature expiry lifetime - https://phabricator.wikimedia.org/T109738#1557812 (10BBlack) 3NEW a:3BBlack [17:09:59] hoo: the "sql to labswiki" issue should be resolved now [17:10:09] Awesome [17:10:15] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1557821 (10ssastry) >>! In T93428#1557763, @GWicke wrote: > Timely [blog post on lessons learned from post-mortems](http://danluu.com/post... [17:10:19] well, from tin [17:12:09] 6operations, 6Labs, 10wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1557823 (10Dzahn) jcrespo added a user/grants on the mysql side. so connections should now work from tin. 
[17:15:35] 6operations, 10Traffic: ocsp updater: re-enable automatic updates - https://phabricator.wikimedia.org/T109740#1557846 (10BBlack) 3NEW a:3BBlack [17:15:47] 6operations, 10Traffic: ocsp updater: re-enable automatic updates - https://phabricator.wikimedia.org/T109740#1557854 (10BBlack) [17:15:49] 6operations, 10Traffic: ocsp updater: validate the signature expiry lifetime - https://phabricator.wikimedia.org/T109738#1557855 (10BBlack) [17:15:50] (03CR) 10Alex Monk: "If all groups have bastion access, you only need to add them to the group they actually requested and they will get bastion access, you do" [puppet] - 10https://gerrit.wikimedia.org/r/227327 (owner: 10Alex Monk) [17:15:51] 6operations, 10Traffic: ocsp updater: handle openssl "trylater" and similar more-gracefully - https://phabricator.wikimedia.org/T109737#1557856 (10BBlack) [17:17:03] (03PS1) 10Dzahn: silver/wikitech: also allow mysql from terbium [puppet] - 10https://gerrit.wikimedia.org/r/232763 (https://phabricator.wikimedia.org/T98682) [17:17:08] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107, and 4 others: move nova api to labnet1002 - https://phabricator.wikimedia.org/T109653#1557863 (10Andrew) The api is running and working on labnet1002. Unfortunately, wikitech has the labnet1001 endpoint cached so we can't actually switch o... 
[17:19:43] (03PS2) 10Dzahn: silver/wikitech: also allow mysql from terbium [puppet] - 10https://gerrit.wikimedia.org/r/232763 (https://phabricator.wikimedia.org/T98682) [17:20:00] Krenair: ^ we also need terbium, right [17:20:19] to add labswiki to the normal maintenance scripts, yes [17:20:20] maintenance scripts was the issue [17:20:28] yea [17:20:39] (03CR) 10Dzahn: [C: 032] silver/wikitech: also allow mysql from terbium [puppet] - 10https://gerrit.wikimedia.org/r/232763 (https://phabricator.wikimedia.org/T98682) (owner: 10Dzahn) [17:20:42] with tin we can at least run foreachwiki somewhere and have it pick up labswiki along the way [17:22:50] until today it would just leave a db connection error in the output :/ [17:23:38] (03PS1) 10Ottomata: Change main PTR record and mgmt records to rename analytics1018 -> kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/232765 [17:24:00] Krenair, I do not think terbium has the grants [17:24:15] (03CR) 10Ottomata: [C: 032] Change main PTR record and mgmt records to rename analytics1018 -> kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/232765 (owner: 10Ottomata) [17:24:16] freakin' ferm service.. failed to stop [17:25:02] Oh, was that not on the list? 
[17:25:33] jynus: it's partly my bad saying "let's just do tin" when the maintenance scripts are on terbium [17:25:51] ^we got them [17:26:04] get the pithforks [17:29:43] (03PS1) 10Ottomata: Rename analytics1018 -> kafka1018 in linux-host-entries [puppet] - 10https://gerrit.wikimedia.org/r/232769 (https://phabricator.wikimedia.org/T106581) [17:29:50] (03PS1) 10Jcrespo: Adding grants for queries from terbium to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/232770 [17:30:45] (03PS2) 10Ottomata: Rename analytics1018 -> kafka1018 in linux-host-entries [puppet] - 10https://gerrit.wikimedia.org/r/232769 (https://phabricator.wikimedia.org/T106581) [17:31:05] (03CR) 10Ottomata: [C: 032 V: 032] Rename analytics1018 -> kafka1018 in linux-host-entries [puppet] - 10https://gerrit.wikimedia.org/r/232769 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [17:31:14] (03CR) 10Jcrespo: [C: 032] Adding grants for queries from terbium to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/232770 (owner: 10Jcrespo) [17:31:46] deploy, deploy, ottomata [17:32:18] ha, ehhH? 
[17:32:40] (03PS1) 10Dzahn: silver/wikitech: fix ferm rule syntax [puppet] - 10https://gerrit.wikimedia.org/r/232771 [17:32:41] yea:) [17:32:43] I had a small collision [17:33:23] (03CR) 10Dzahn: [C: 032] silver/wikitech: fix ferm rule syntax [puppet] - 10https://gerrit.wikimedia.org/r/232771 (owner: 10Dzahn) [17:34:56] oh , ha [17:34:59] yes do [17:35:24] (03PS2) 10Jcrespo: Adding grants for queries from terbium to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/232770 (https://phabricator.wikimedia.org/T98682) [17:37:31] (03PS3) 10Jcrespo: Adding grants for queries from terbium to labswiki [puppet] - 10https://gerrit.wikimedia.org/r/232770 (https://phabricator.wikimedia.org/T98682) [17:39:13] !log stopping kafka on analytics1018 and bringing it down for reinstall as kafka1018 with Jessie [17:39:16] joal: ^ :) [17:39:37] Thanks ottomata :) [17:39:47] will follow [17:42:22] is anyone else having issues with images on phab loading? [17:48:04] 6operations, 10Incident-20150820-OCSP, 10Traffic: ocsp updater: handle openssl "trylater" and similar more-gracefully - https://phabricator.wikimedia.org/T109737#1557980 (10greg) [17:48:13] 6operations, 10Incident-20150820-OCSP, 10Traffic: ocsp updater: validate the signature expiry lifetime - https://phabricator.wikimedia.org/T109738#1557983 (10greg) [17:48:21] 6operations, 10Incident-20150820-OCSP, 10Traffic: ocsp updater: re-enable automatic updates - https://phabricator.wikimedia.org/T109740#1557985 (10greg) [17:48:31] 6operations, 10Incident-20150820-OCSP, 10Traffic, 7HTTPS: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1557988 (10greg) [17:49:54] (03CR) 10DCausse: [C: 04-1] "This would break the Translate extension used by some multilingual wikis such as meta, MediaWiki, Commons and others." 
[puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [17:53:47] 6operations, 10Traffic: Invalid OCSP signing certificate in OCSP response : can't read Wikimedia websites on Firefox - https://phabricator.wikimedia.org/T109712#1558005 (10BBlack) 5Open>3Resolved a:3BBlack We should be back to 100% ok on this issue from the user POV. Mostly-ok was about 15:10 UTC, with... [17:58:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T1800). Please do the needful. [18:02:10] 6operations, 10Incident-20150820-OCSP, 10Traffic: Invalid OCSP signing certificate in OCSP response : can't read Wikimedia websites on Firefox - https://phabricator.wikimedia.org/T109712#1558042 (10greg) [18:10:34] (03PS1) 10Ottomata: Rename A record for analytics1018 -> kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/232774 (https://phabricator.wikimedia.org/T106581) [18:11:08] (03PS2) 10Ottomata: Rename A record for analytics1018 -> kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/232774 (https://phabricator.wikimedia.org/T106581) [18:11:30] (03CR) 10Ottomata: [C: 032] Rename A record for analytics1018 -> kafka1018 [dns] - 10https://gerrit.wikimedia.org/r/232774 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:12:37] 6operations, 7Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1558064 (10Krenair) #Operations team controls the exim aliases, so adding that project as well as #Mail [18:17:58] bblack: Hey, just wondering: I'm trying to access phabricator.wikimedia.org (from Europe) which gets lots of stuff from wmfusercontent.org, and that triggers: [18:17:59] An error occurred during a connection to phab.wmfusercontent.org. 
[18:17:59] Invalid OCSP signing certificate in OCSP response
[18:18:56] bblack, Are you aware of that? / Do I "just" have to wait for that problem to vanish at some point?
[18:19:14] (PS1) Ottomata: Repuppetize kafka1018 as a broker [puppet] - https://gerrit.wikimedia.org/r/232776 (https://phabricator.wikimedia.org/T106581)
[18:20:47] (CR) Ottomata: [C: 2] Repuppetize kafka1018 as a broker [puppet] - https://gerrit.wikimedia.org/r/232776 (https://phabricator.wikimedia.org/T106581) (owner: Ottomata)
[18:22:32] has there been something related this new cateory watch stuff deployed?
[18:23:19] yes. it's being reverted
[18:23:50] my watchlist is currently flooded with then thousends of edits. the opt out feature seems removed
[18:24:16] ok I'm ready to deploy the train
[18:25:05] (diff | hist) . . Category:Uploaded with UploadWizard; 16:46 . . (0) . . Shesmax (talk | contribs | block) ([[:File:%D0%92%D1%8F%D0%B7%D1%8C%D0%BC%D0%B0._%D0%98%D0%BE%D0%B0%D0%BD%D0%BD%D0%BE-%D0%9F%D1%80%D0%B5%D0%B4%D1%82%D0%B5%D1%87%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_%D0%BC%D0%BE%D0%BD%D0%B0%D1%81%D1%82%D1%8B%D1%80%D1%8C._%D0%A6%D0%B5%D1%80%D0%BA%D0%BE%D0%...)
[18:25:30] actually, maybe not: Undefined index: query in /srv/mediawiki/php-1.26wmf19/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php on line 127
[18:25:30] Steinsplitter: uh, hiding categorization of pages in Special:Preferences#mw-prefsection-watchlist does not work? Meh
[18:25:42] no longer :/
[18:26:01] ^ that error is still prominent in the logs, it might start flooding logs if I deploy wmf19
[18:26:04] andre__: thanks! phab.wmfusercontent.org is separate from the main unified cert that covers basically "everything else", and I didn't think to fix it when I fixed everything else
[18:26:08] working on that now
[18:26:20] bblack: re andre__'s comment, eg https://phabricator.wikimedia.org/T109638 isn't loading for me
[18:26:22] Steinsplitter: it's being worked on
[18:26:26] ah, nvm :)
[18:26:33] legoktm: ok :) thx.
[18:26:33] thanks bblack
[18:26:43] bblack: ah cool. I also just dropped an email to ops@ ml as I wasn't sure you're around on irc
[18:26:45] thx
[18:30:54] greg-g / andre__ : should be fixed now?
[18:31:58] bblack: yep! thanks!
[18:32:11] bblack, yes, thanks a lot!
[18:32:47] operations, Labs, wikitech.wikimedia.org, Database, Patch-For-Review: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1558103 (Dzahn) also opened firewall to allow connections from terbium, in addition to tin
[18:36:50] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[18:41:02] (PS1) Jdlrobson: Enable Wikidata page banners on English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/232782 (https://phabricator.wikimedia.org/T108839)
[18:44:15] ottomata: analytics1052/1056 - are those renames to kafka? they show up as "server up" but "all services down" in icinga
[18:44:20] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0]
[18:45:38] (CR) MaxSem: [C: -1] "Just pass -I to shp2pgsql." [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[18:48:49] no. those are new hadoop nodes, sorry, saw that when in meeting earlier, forgot to cehck on them
[18:48:59] will do so in jsut a few, thanks
[18:49:03] mutante: ^
[18:50:21] ottomata: alright, thx
[18:52:08] (PS3) Ori.livneh: Introduce ConfigurationObserver class [debs/pybal] - https://gerrit.wikimedia.org/r/230931
[18:52:45] (PS1) Merlijn van Deen: tools: cdnjs-packages-gen: explicitly specify file encoding [puppet] - https://gerrit.wikimedia.org/r/232786
[18:53:01] YuviPanda: ^ I can haz merge? Needed to fix CRITICAL: tools-web-static-02/Puppet failure
[18:53:56] valhallasw`cloud: kk lookin
[18:53:57] g
[18:54:11] (PS2) Yuvipanda: tools: cdnjs-packages-gen: explicitly specify file encoding [puppet] - https://gerrit.wikimedia.org/r/232786 (owner: Merlijn van Deen)
[18:54:22] (CR) Yuvipanda: [C: 2 V: 2] tools: cdnjs-packages-gen: explicitly specify file encoding [puppet] - https://gerrit.wikimedia.org/r/232786 (owner: Merlijn van Deen)
[18:54:29] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1558158 (Papaul) eqdfw xe-0/0/0 cable ID = 11395 on Equinix patch panel ID= 17915019 eqdfw xe-1/0/0 cable ID = 11399 on Equinix patch panel ID = 20028800 eqdfw xe-1/1/0 cable ID = 11397 on Equinix patch pa...
[18:54:38] <3
[18:56:03] valhallasw`cloud: <3
[18:56:59] !log labvirt1007 "only" 29G space left - but since we have 2.2T there that means 99% full
[18:57:14] !log no log bot
[18:57:51] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[18:58:07] mutante: https://tools.wmflabs.org/sal :)
[18:58:19] it works \o/
[18:58:22] who has morebots access?
[18:58:44] 9 people
[18:59:10] valhallasw`cloud is one of them
[18:59:24] I'll give it a kick
[18:59:53] ah, that's cool, but just a temp. solution, or?
[19:00:02] i think we want the logs on wikitech-static
[19:00:16] (CR) Yurik: "Max, i don't think -I is a good approach for this - -I implies that it will create an index with some internal name that will be based on " [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[19:00:34] MaxSem, ^
[19:00:46] yes, they should probably keep going to wikitech
[19:01:04] it's pretty easy to adapt morebots to also send it elsewhere
[19:02:09] (CR) MaxSem: "Do you realise that when the table gets dropped, its indexes get dropped too? :P" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[19:02:34] I have 29 entries that I can add back to wikitech
[19:04:41] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1558176 (Dzahn) >>! In T109736#1557802, @Aklapper wrote: > Hi @eliza. What is this task about (or for who exactly)? needs a project tag for OIT things too
[19:08:54] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1558203 (Dzahn) Not sure what we are being asked to do exactly. Do you want it to be handled in Google entirely?
[19:12:13] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1558214 (eliza) Hello Dzahn, I was advised to leave the task open as I am uncertain as to whom to direct this to. I believe this task is about the exim alias names (trademarks vs. trademark) - though I'll let @jkrauska explain furth...
[19:14:02] (CR) Yurik: "do you realize that we generate the new temp table before we drop the old one?" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[19:14:07] MaxSem, ^
[19:15:21] https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=174361&oldid=174286
[19:17:52] Krenair: :) thanks
[19:19:28] there might be some missing things
[19:19:33] that's just what I have logged
[19:20:12] Krenair: thanks... it's still helpful :)
[19:23:59] Krenair: you were going to provide a tiebreak vote to unblock https://gerrit.wikimedia.org/r/214351 for me, right? ;)
[19:24:23] I was going to test it and merge it
[19:24:34] got restbase working yesterday so I can test the restbase part now
[19:25:02] andrewbogott: is "99% full" an issue for labvirt1007 that should have a ticket? i mean, 1% is still 29G but it does show up in Icinga
[19:25:37] mutante: yeah, it’s an issue, I’m sort of half-heartedly tinkering with solutions right now. Did it only just cross from warning to critical?
[19:25:50] andrewbogott: yea, looks like that
[19:26:53] i assume it's "find instances nobody is using"?
[19:27:41] Krenair: awesome!
[19:27:59] mutante: yeah, or migrate them elsewhere.
[19:28:11] mutante: also, swear a lot about nova’s stupid scheduling algorithm
[19:28:55] I reviewed your OSM patch btw, andrewbogott
[19:29:09] thanks
[19:29:10] looks fine, PS3 just makes it comply with coding conventions :p
[19:30:48] operations, Labs, Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1558240 (Dzahn) NEW
[19:31:05] operations, Labs, Labs-Infrastructure: disk space on labvirt1007 - https://phabricator.wikimedia.org/T109752#1558248 (Dzahn)
[19:32:24] ACKNOWLEDGEMENT - Disk space on labvirt1007 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 28722 MB (1% inode=99%): daniel_zahn https://phabricator.wikimedia.org/T109752
[19:36:24] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1558267 (Dzahn) Hi Eliza, so from our point of view in operations (we handle the exim aliases) it is just this: trademark@ is an alias for the list of people above and trademarks@ is just an alias for trademark@ so that both varia...
[19:37:50] Krenair: cool, thanks
[19:40:32] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0]
[19:40:42] !log moving enwiki_content_1432182861 elastic shard from 1022 to 1004 due to space (1022 is at 91%)
[19:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:41:52] PROBLEM - puppet last run on analytics1027 is CRITICAL Puppet last ran 1 day ago
[19:42:04] operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1558286 (Dzahn) connections should also work from terbium now, so it would be possible to run the maintenance scripts where all o...
[19:42:44] mutante: I need to get some lunch, will work on disk space when I get back.
[19:47:40] andrewbogott: sure, i didn't think it was that urgent, just that we should track it
[20:00:04] cscott arlolra: Respected human, time to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T2000). Please do the needful.
[20:03:21] greg-g, twentyafterfour: how's the train deploy?
[20:05:23] RECOVERY - puppet last run on analytics1027 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:06:55] (PS1) Ori.livneh: grafana: set Cache-Control: no-cache on dashboard definitions [puppet] - https://gerrit.wikimedia.org/r/232834
[20:07:07] (CR) Ori.livneh: [C: 2 V: 2] grafana: set Cache-Control: no-cache on dashboard definitions [puppet] - https://gerrit.wikimedia.org/r/232834 (owner: Ori.livneh)
[20:07:13] Krinkle: ^
[20:09:01] operations, Traffic, Graphite, Varnish: Varnish caches Grafana dashboard configuration too strongly - https://phabricator.wikimedia.org/T105734#1558382 (ori) Open>Resolved a:ori Resolved with T105734.
[20:09:14] ori: Ah, that works :D
[20:09:15] cscott: done, i believe
[20:09:37] wait...
[20:09:49] http://serverfault.com/questions/399814/varnish-purge-on-post-or-put
[20:10:11] (CR) MaxSem: "Just checked myself: https://phabricator.wikimedia.org/P1908" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[20:10:56] cscott: not done
[20:11:43] greg-g: ok, we're still working on our paperwork for the Parsoid deploy, but we're just about ready when you are.
[20:25:51] (PS4) Ori.livneh: Introduce ConfigurationObserver class [debs/pybal] - https://gerrit.wikimedia.org/r/230931
[20:30:58] cscott: the issue I referenced is now fixed, twentyafterfour will deploy soon
[20:32:46] ok syncing
[20:34:57] operations, ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1558531 (Papaul) eqdfw xe-0/0/0 cable ID = 11395 on Equinix patch panel ID= 20028800 eqdfw xe-1/0/0 cable ID = 11399 on Equinix patch panel ID = 20028799 eqdfw xe-1/1/0 cable ID = 11397 on Equinix patch p...
[20:38:46] !log twentyafterfour@tin Synchronized php-1.26wmf19: Silence the undefined index error in CirrusSearch (duration: 06m 24s)
[20:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:39:46] (PS1) 20after4: wikipedia wikis to 1.26wmf19 [mediawiki-config] - https://gerrit.wikimedia.org/r/232838
[20:40:00] (CR) 20after4: [C: 2] wikipedia wikis to 1.26wmf19 [mediawiki-config] - https://gerrit.wikimedia.org/r/232838 (owner: 20after4)
[20:40:07] (Merged) jenkins-bot: wikipedia wikis to 1.26wmf19 [mediawiki-config] - https://gerrit.wikimedia.org/r/232838 (owner: 20after4)
[20:41:06] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.26wmf19
[20:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:41:17] mw2180.codfw.wmnet returned [23]: rsync: mkstemp "/srv/mediawiki/.wikiversions.cdb.9T2MDq" failed: Read-only file system (30)
[20:41:42] !log scap failed to sync to mw2180.codwf.wmnet
[20:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:42:03] PROBLEM - puppet last run on analytics1049 is CRITICAL Puppet has 1 failures
[20:42:28] I thought mutante pulled that host from the dsh group earlier today?
[20:42:36] nope
[20:42:38] different one
[20:42:39] bd808: different one
[20:42:46] yuck
[20:42:46] mw2187
[20:43:17] I know we don't care, but wth?
[20:48:38] operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1558583 (Krenair) >>! In T107547#1558286, @Dzahn wrote: > connections should also work from terbium now, so it would be possible...
[20:49:45] twentyafterfour: i can't tell if you're done yet.
[20:50:48] cscott: he is
[20:50:49] 20:41 < logmsgbot> !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedia wikis to 1.26wmf19
[20:51:09] that means wikipedias are no on wmf19, which is the deploy for today
[20:51:13] now*
[20:52:19] ok, cool. we'll go ahead with the parsoid deploy shortly then.
[20:55:10] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1558637 (eliza) Thanks Daniel (mistakenly addressed you as Dzahn) We'll go ahead and wait for @jkrauska to see what to do and we'll get back to you. Eliza
[21:01:06] (PS1) Thcipriani: Add servicedeploy user; Modifiy keyholder service [puppet] - https://gerrit.wikimedia.org/r/232843
[21:06:52] Blocked-on-Operations, Flow, Collaboration-Team-Current, Patch-For-Review, Schema-change: Separate reference tables by wiki - https://phabricator.wikimedia.org/T107204#1558677 (DannyH)
[21:08:58] (CR) EBernhardson: [C: -1] "looks like this is also breaking the prefer-recent: feature. Usage of addScriptScoreFunction() become disallowed." [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[21:09:12] (PS2) Greg Grossmeier: Add servicedeploy user; Modifiy keyholder service [puppet] - https://gerrit.wikimedia.org/r/232843 (owner: Thcipriani)
[21:09:34] RECOVERY - puppet last run on analytics1049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:25:32] !log Ran FlowUpdateRevContentModelFromOccupyPages.php on all wikis
[21:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:26:18] operations, Wikimedia-General-or-Unknown, Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1558786 (Dzahn) I think tickets that are about updating a wiki page are hard to close in general. They are very open ended and we have a huge amount of p...
[21:26:28] operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1497611 (Krenair) (Nope.)
[21:28:39] starting the parsoid deploy
[21:30:22] operations, Wikimedia-General-or-Unknown, Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1558816 (Dzahn) p:Low>Lowest lower priority because it already is much better than before if it's mostly up to date, we can possibly close...
[21:31:06] operations, Wikimedia-General-or-Unknown, Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1558827 (Dzahn) a:Reedy>Krenair
[21:32:11] operations, Wikimedia-General-or-Unknown, Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#994703 (Dzahn) so either close it or wait until the next wiki gets created and check it one last time while actually doing it
[21:33:02] operations, Wikimedia-General-or-Unknown, Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1558838 (Krenair) Open>Resolved
[21:35:36] operations, Traffic, HTTPS, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1558846 (Aklapper) >>! In T105794#1487038, @BBlack wrote: > what is the issue with 1.7 that makes it difficult to support HTTPS for this case? @Merl: Could you elaborate?
[21:36:56] !log updated Parsoid to version db6e6404f67a9f971b4fbefe9de239735426c738
[21:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:45:53] operations: package and puppetize ishmael - https://phabricator.wikimedia.org/T82225#1558885 (Dzahn) I'm wondering if we should just reject this ticket then and make a new one to decom it.
[21:46:10] operations: package and puppetize ishmael - https://phabricator.wikimedia.org/T82225#1558888 (Dzahn) p:Normal>Low
[21:46:42] done with parsoid deploy.
[21:46:45] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1558891 (Eevans) >>! In T95253#1557554, @GWicke wrote: > Regarding metrics collection: Do we actually still need a separate collector? The regul...
[21:48:43] operations, Database: decom ishmael? - https://phabricator.wikimedia.org/T109777#1558908 (Dzahn) NEW
[21:49:21] operations: package and puppetize ishmael - https://phabricator.wikimedia.org/T82225#898114 (Dzahn)
[21:49:22] operations, Database: decom ishmael? - https://phabricator.wikimedia.org/T109777#1558919 (Dzahn)
[21:49:34] operations, Database: decom ishmael? - https://phabricator.wikimedia.org/T109777#1558922 (Dzahn) p:Triage>Low
[21:49:57] operations, Database: decom ishmael? - https://phabricator.wikimedia.org/T109777#1558908 (Dzahn)
[21:50:00] operations, Wikimedia-Logstash, Patch-For-Review: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1558929 (bd808)
[21:56:08] operations, Wikimedia-Mailing-lists: rename lists mwapi-team.disabled.T97148 and flowfunding.disabled.T97328 ? - https://phabricator.wikimedia.org/T109539#1558944 (Dzahn) @Robh would you be willing to take this since you did the original tasks to disable them? We now have the new script to disable them.
[21:57:33] operations: clean up admins module data file - https://phabricator.wikimedia.org/T109516#1558959 (Dzahn) I agree that cleaning it up is nice but there is also an advantage to not sorting the groups alphabetically. It's that if we leave them in the order they have been added it's easier to see the next GID to...
[22:03:53] (PS1) Mattflaschen: Note that people should not add new wmgFlowOccupyPages [mediawiki-config] - https://gerrit.wikimedia.org/r/232854 (https://phabricator.wikimedia.org/T105574)
[22:13:20] operations, Wikimedia-Mailing-lists: rename lists mwapi-team.disabled.T97148 and flowfunding.disabled.T97328 ? - https://phabricator.wikimedia.org/T109539#1559053 (RobH) I'm willing, but it'll go behind a bunch of other tasks I have that are higher priority. If you need it done in any kind of timely fas...
[22:13:53] (PS1) Andrew Bogott: Added a live-migrate script to wrap nova's standard block-migrate. [puppet] - https://gerrit.wikimedia.org/r/232856
[22:16:34] (PS2) Andrew Bogott: Added a live-migrate script to wrap nova's standard block-migrate. [puppet] - https://gerrit.wikimedia.org/r/232856
[22:17:21] operations, Traffic, fundraising-tech-ops, Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1559092 (CCogdill_WMF) I looked into Pimcore, Mautic, and OpenEMM. All documentation I can find on these tools suggests they are mea...
[22:17:35] (PS3) Ori.livneh: Set maximum execution time to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/231197 (https://phabricator.wikimedia.org/T97204)
[22:17:42] (CR) Ori.livneh: [C: 2] Set maximum execution time to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/231197 (https://phabricator.wikimedia.org/T97204) (owner: Ori.livneh)
[22:18:12] (CR) Ori.livneh: [V: 2] Set maximum execution time to 60 seconds [puppet] - https://gerrit.wikimedia.org/r/231197 (https://phabricator.wikimedia.org/T97204) (owner: Ori.livneh)
[22:33:33] (CR) Tim Landscheidt: "Probably fixed T109355." [puppet] - https://gerrit.wikimedia.org/r/232786 (owner: Merlijn van Deen)
[22:33:51] (PS3) Andrew Bogott: Added a live-migrate script to wrap nova's standard block-migrate. [puppet] - https://gerrit.wikimedia.org/r/232856
[22:35:56] (PS4) Andrew Bogott: Added a live-migrate script to wrap nova's standard block-migrate. [puppet] - https://gerrit.wikimedia.org/r/232856
[22:36:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[22:36:28] hmph
[22:36:35] files/misc/scripts/wikimedia-periodic-update.sh should probably reference FlaggedRevs in the name
[22:37:11] (CR) Andrew Bogott: [C: 2] Added a live-migrate script to wrap nova's standard block-migrate. [puppet] - https://gerrit.wikimedia.org/r/232856 (owner: Andrew Bogott)
[22:38:18] PageTriage is installed on test2wiki but we don't run cron/updatePageTriageQueue.php there?
[22:38:23] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[22:39:08] Why do we have both foreachwikiindblist and mwscriptwikiset?
[22:39:26] (PS5) Ori.livneh: Introduce ConfigurationObserver class [debs/pybal] - https://gerrit.wikimedia.org/r/230931
[22:41:14] operations, Traffic, Browser-Support-Internet-Explorer, HTTPS: Xbox 360 Internet Explorer unable to view Wikipedia - https://phabricator.wikimedia.org/T105455#1559183 (brion) I think the target was September...
[22:42:14] files/misc/scripts/update-special-pages looks like it could be replaced with a simple `foreachwiki updateSpecialPages.php` call
[22:42:53] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[22:47:59] !log ori@tin Synchronized php-1.26wmf19/includes/libs/CSSMin.php: Icc1c23a2: CSSMin: remove dot segments in relative local URLs (duration: 00m 12s)
[22:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:50:22] (PS1) Alex Monk: Maintenance script maintenance for labswiki [puppet] - https://gerrit.wikimedia.org/r/232866 (https://phabricator.wikimedia.org/T107547)
[22:53:43] operations, ops-ulsfo: connect ulsfo side of ulsfo-eqdfw connection - https://phabricator.wikimedia.org/T109788#1559196 (RobH) NEW a:RobH
[22:56:33] PROBLEM - Apache HTTP on mw1041 is CRITICAL - Socket timeout after 10 seconds
[22:58:23] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time
[23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150820T2300).
[23:00:04] jdlrobson matt_flaschen ebernharson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:15] Present
[23:01:21] ok
[23:05:42] jdlrobson, around?
[23:05:58] yup :)
[23:06:11] aroudn as well
[23:06:32] (CR) Alex Monk: [C: 2] Enable Wikidata page banners on English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/232782 (https://phabricator.wikimedia.org/T108839) (owner: Jdlrobson)
[23:06:57] (Merged) jenkins-bot: Enable Wikidata page banners on English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/232782 (https://phabricator.wikimedia.org/T108839) (owner: Jdlrobson)
[23:08:11] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232782/ (duration: 00m 12s)
[23:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:08:25] mutante, mw2187 is still unhappy
[23:08:57] (PS1) Tim Landscheidt: base: Don't install command-not-found-data either [puppet] - https://gerrit.wikimedia.org/r/232867
[23:09:07] jdlrobson, please test
[23:09:17] on it
[23:10:01] Krinkle....
[23:10:05] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1559250 (JKrauska) Please add kfrancis to trademark[s]
[23:10:36] Krenair: What have I done?
[23:10:47] looks great Krenair ! thanks a bunch
[23:10:59] Krinkle, you left https://gerrit.wikimedia.org/r/#/c/232864/ on the 1.26wmf19 branch undeployed
[23:11:09] That was 2 minutes ago
[23:11:12] I'm deploying it right now
[23:11:40] Okay
[23:13:35] !log krinkle@tin Synchronized php-1.26wmf19/includes/resourceloader/ResourceLoaderFileModule.php: T102578 (duration: 00m 13s)
[23:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:00] Oh SWAT started already
[23:14:05] Sorry, I'm in a different timezone.
[23:14:18] I'm done :)
[23:15:23] !log krenair@tin Synchronized php-1.26wmf19/extensions/LiquidThreads/classes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/232783/ (duration: 00m 12s)
[23:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:15:31] matt_flaschen, ^
[23:16:01] (CR) Tim Landscheidt: "(Note that this means that the send-echo-emails cron job on silver needs to be removed /manually/.)" [puppet] - https://gerrit.wikimedia.org/r/232866 (https://phabricator.wikimedia.org/T107547) (owner: Alex Monk)
[23:16:20] thedj, is the Special:NewMessages link on the watchlist gone?
[23:16:55] (CR) Alex Monk: "Thanks for pointing that out Tim, I'll do that when ops merges this." [puppet] - https://gerrit.wikimedia.org/r/232866 (https://phabricator.wikimedia.org/T107547) (owner: Alex Monk)
[23:20:24] (CR) EBernhardson: "any opinions? or would it be more reasonable to just assign ::vagrant and ::vagrant::lxc directly to a labs instance and skip the role?" [puppet] - https://gerrit.wikimedia.org/r/230928 (owner: EBernhardson)
[23:21:01] (CR) Alex Monk: [C: 2] Set wgFlowMigrateReferenceWiki to true to start ref_src_wiki population [mediawiki-config] - https://gerrit.wikimedia.org/r/232671 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen)
[23:21:25] (Merged) jenkins-bot: Set wgFlowMigrateReferenceWiki to true to start ref_src_wiki population [mediawiki-config] - https://gerrit.wikimedia.org/r/232671 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen)
[23:22:01] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/232671/ (duration: 00m 12s)
[23:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:22:48] (CR) Alex Monk: [C: 2] Note that people should not add new wmgFlowOccupyPages [mediawiki-config] - https://gerrit.wikimedia.org/r/232854 (https://phabricator.wikimedia.org/T105574) (owner: Mattflaschen)
[23:22:55] (Merged) jenkins-bot: Note that people should not add new wmgFlowOccupyPages [mediawiki-config] - https://gerrit.wikimedia.org/r/232854 (https://phabricator.wikimedia.org/T105574) (owner: Mattflaschen)
[23:23:41] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/232854/ (duration: 00m 13s)
[23:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:19] ebernhardson, want to do https://gerrit.wikimedia.org/r/#/c/232055/2/wmf-config/InitialiseSettings.php ?
[23:27:49] Krenair: sure
[23:27:55] (CR) EBernhardson: [C: 2] Upate CirrusSearch active user test [mediawiki-config] - https://gerrit.wikimedia.org/r/232055 (https://phabricator.wikimedia.org/T109018) (owner: EBernhardson)
[23:28:20] (Merged) jenkins-bot: Upate CirrusSearch active user test [mediawiki-config] - https://gerrit.wikimedia.org/r/232055 (https://phabricator.wikimedia.org/T109018) (owner: EBernhardson)
[23:28:40] * ebernhardson always finds it mildly amusing that log lines that are too long end up in the wrong files
[23:28:53] err, a long log line can cause the next line to show up in the wrong file
[23:29:40] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 13s)
[23:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:29:57] Krenair: 1 apache had sync errors, you run into that?
[23:29:59] * ebernhardson runs it agin
[23:30:03] yes
[23:30:04] mw2187?
[23:30:06] known
[23:30:13] yup, kk
[23:30:14] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 12s)
[23:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:30:29] https://phabricator.wikimedia.org/T109717
[23:31:04] no log explosions, search still works. but i just saw another logline i should fix...
[23:31:27] go for it, nothing else to deploy in this window
[23:39:49] (PS1) Alex Monk: General maintenance script cleanup [puppet] - https://gerrit.wikimedia.org/r/232871
[23:40:47] !log ebernhardson@tin Synchronized php-1.26wmf19/extensions/CirrusSearch/: Fix some cirrussearch logspam (duration: 00m 13s)
[23:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:42:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 9.09% of data above the critical threshold [500.0]
[23:43:50] # All language subdomains of content projects, unless they use comma count
[23:43:51] for set in wikinews wikipedia wikiquote wikisource wikiversity wikivoyage wiktionary; do
[23:43:57] * Krenair sighs
[23:50:10] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1559383 (Dzahn) done. ``` trademarks: trademark -trademark: kwadhwa, mpaulson, slaporte, ywelinder, hwalls, kmaher, rstallman, mbrar, jrogers +trademark: kwadhwa, mpaulson, slaporte, ywelinder, hwalls, kmaher, rstallman, mbrar,...
[23:50:23] operations, Mail: trademarks@ - https://phabricator.wikimedia.org/T109736#1559384 (Dzahn) Open>Resolved a:Dzahn
[23:50:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[23:50:35] operations, Mail: add kfrancis to trademarks@ alias - https://phabricator.wikimedia.org/T109736#1559386 (Dzahn)