[00:04:10] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 9.192 second response time
[00:07:20] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:09:19] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 4.963 second response time
[01:04:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0]
[01:23:30] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:11:09] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s)
[02:11:17] Logged the message, Master
[02:11:40] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11303 MB (3% inode=99%):
[02:12:17] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-09 02:11:13+00:00
[02:12:20] Logged the message, Master
[02:12:41] !log l10nupdate Synchronized php-1.25wmf16/cache/l10n: (no message) (duration: 00m 02s)
[02:12:46] Logged the message, Master
[02:13:48] !log LocalisationUpdate completed (1.25wmf16) at 2015-02-09 02:12:45+00:00
[02:13:51] Logged the message, Master
[02:47:11] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11091 MB (3% inode=99%):
[02:49:19] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11114 MB (3% inode=99%):
[02:59:40] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11187 MB (3% inode=99%):
[03:08:10] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11258 MB (3% inode=99%):
[03:22:10] PROBLEM - Disk space on praseodymium is CRITICAL: DISK CRITICAL - free space: /mnt/data 11267 MB (3% inode=99%):
[03:50:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Feb 9 03:48:57 UTC 2015 (duration 48m 56s)
[03:50:10] Logged the message, Master
[04:14:20] Wikimedia-Labs-Other,
operations: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1024066 (yuvipanda)
[04:14:21] operations, Tool-Labs: Replag on labsdb - https://phabricator.wikimedia.org/T88183#1024064 (yuvipanda) Open>Resolved I think this is fixed at least now. I'll follow up monitoring for this when I'm back from vacation
[04:22:59] RECOVERY - Disk space on xenon is OK: DISK OK
[04:23:20] RECOVERY - Disk space on praseodymium is OK: DISK OK
[04:31:20] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 11325 MB (3% inode=99%):
[04:52:28] (PS2) Ori.livneh: vbench: various improvements [puppet] - https://gerrit.wikimedia.org/r/189305
[05:00:50] RECOVERY - Disk space on cerium is OK: DISK OK
[05:05:29] operations, RESTBase-Cassandra: Upgrade cassandra test cluster to cassandra 2.1 - https://phabricator.wikimedia.org/T88956#1024125 (GWicke)
[05:06:19] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 7 data above and 46 below the confidence bounds
[05:07:09] operations, RESTBase-Cassandra: Upgrade cassandra test cluster to cassandra 2.1 - https://phabricator.wikimedia.org/T88956#1024119 (GWicke)
[05:10:30] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 7 data above and 45 below the confidence bounds
[05:15:59] PROBLEM - DPKG on xenon is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:19:10] RECOVERY - DPKG on xenon is OK: All packages OK
[05:23:29] PROBLEM - DPKG on praseodymium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:24:31] (CR) Springle: [C: 2] vbench: various improvements [puppet] - https://gerrit.wikimedia.org/r/189305 (owner: Ori.livneh)
[05:25:30] RECOVERY - DPKG on praseodymium is OK: All packages OK
[05:27:29] PROBLEM - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend
[05:29:08]
operations, RESTBase-Cassandra: Upgrade cassandra test cluster to cassandra 2.1 - https://phabricator.wikimedia.org/T88956#1024148 (GWicke)
[05:29:34] operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1024149 (GWicke)
[05:31:40] !log manually updated cassandra on cerium, praseodymium & xenon to 2.1.2 (see https://phabricator.wikimedia.org/T88956)
[05:31:47] Logged the message, Master
[05:32:34] !log stopped puppet on cerium, praseodymium & xenon
[05:32:38] Logged the message, Master
[05:34:40] RECOVERY - Kafka Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected
[05:39:31] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:51:29] RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself
[05:57:10] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:07:11] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:08:09] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59718 bytes in 0.082 second response time
[06:27:50] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:10] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:10] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:50] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:40] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:59] <_joe_> mmm no wikibugs?
[06:47:37] operations: Scribunto_LuaInterpreterNotFoundError in production - https://phabricator.wikimedia.org/T88942#1024176 (Joe)
[06:47:41] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:10] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:06:10] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:06:10] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:06:30] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:06:11] WMF-Legal, operations, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1024189 (Qgil) a:Qgil>MBrar.WMF @LuisV_WMF, the task was assigned to you (by me, without asking you explicitly) during January. It's ok, let's finish it. I'm assign i...
[08:40:48] (PS1) Giuseppe Lavagetto: mediawiki: send .phtml files to HHVM as well [puppet] - https://gerrit.wikimedia.org/r/189440
[08:40:58] operations: Scribunto_LuaInterpreterNotFoundError in production - https://phabricator.wikimedia.org/T88942#1024203 (Joe)
[08:41:51] operations: Scribunto_LuaInterpreterNotFoundError in production - https://phabricator.wikimedia.org/T88942#1023785 (Joe) https://gerrit.wikimedia.org/r/189440 should solve this, given no other .phtml file is present in our repository and FilesMatch catchalls are processed AFTER RewriteRules
[08:48:31] operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1024208 (Joe) Yes, all the memcached servers will be jessie at this point.
[08:49:20] (PS2) Giuseppe Lavagetto: mediawiki: do not escape urls in the catchall redirect to https [puppet] - https://gerrit.wikimedia.org/r/188762
[08:49:55] operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1024209 (Joe) p:Unbreak!>High
[08:50:24] operations: Redirects to https need to set NE (no escape) in apache - https://phabricator.wikimedia.org/T88359#1010014 (Joe) Lowered priority since it seems no one is in a hurry to review this :)
[09:03:15] (PS1) GWicke: Update for Cassandra 2.1.2 [puppet/cassandra] - https://gerrit.wikimedia.org/r/189444 (https://phabricator.wikimedia.org/T88956)
[09:05:26] (PS2) 01tonythomas: Un-subscribe frequently failing recipients [mediawiki-config] - https://gerrit.wikimedia.org/r/189316 (https://phabricator.wikimedia.org/T48640)
[09:06:01] (CR) 01tonythomas: "Changed the limit to 5 to match mailman configuration. Should be good to go!" [mediawiki-config] - https://gerrit.wikimedia.org/r/189316 (https://phabricator.wikimedia.org/T48640) (owner: 01tonythomas)
[09:06:11] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11241 MB (3% inode=99%):
[09:07:24] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1024223 (Joe) p:Normal>Unbreak! a:Joe
[09:07:57] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1018974 (Joe)
[09:08:00] (PS3) Nemo bis: Un-subscribe frequently failing recipients [mediawiki-config] - https://gerrit.wikimedia.org/r/189316 (https://phabricator.wikimedia.org/T48640) (owner: 01tonythomas)
[09:08:32] (CR) Nemo bis: [C: 1] "Sounds like a conservative setting for a start, can't do harm."
[mediawiki-config] - https://gerrit.wikimedia.org/r/189316 (https://phabricator.wikimedia.org/T48640) (owner: 01tonythomas)
[09:09:59] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:10:21] RECOVERY - Disk space on xenon is OK: DISK OK
[09:11:41] dataset again?
[09:11:56] !log cassandra load testing on xenon, praseodymium and cerium; disk space is tight, might run out on one of those boxes but they are purely test boxes right now, so np
[09:12:02] Logged the message, Master
[09:21:48] (CR) Phuedx: "This has been unblocked by Id927864733d58ac280f7f228bd6cac37d08e872c." [mediawiki-config] - https://gerrit.wikimedia.org/r/188731 (owner: Kaldari)
[09:23:51] apergos: here?
[09:23:58] yes
[09:24:03] ah ha
[09:24:07] checking
[09:24:10] I'm checking as well
[09:24:33] I found at least one unrelated issue
[09:24:52] mutante added IPv6 to the box (and DNS) but nginx doesn't listen to any IPv6 addresses
[09:25:26] error.log shows a bunch of error()s for not finding files that the msnbot requests
[09:25:44] but I don't see why that would make nginx unresponsive
[09:25:49] the machine has no load
[09:25:56] hmm, they're in a D state
[09:26:12] <_joe_> filesystem failures?
[09:26:24] dm-0 0.00 0.00 1064.20 67.40 133330.40 3570.40 241.96 9.04 7.99 8.44 0.78 0.88 100.00
[09:26:28] yeah, it's I/O starved
[09:26:41] <_joe_> shit.
[09:26:57] urgh
[09:27:00] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 9.401 second response time
[09:27:13] <_joe_> that's the dumps nfs.
[09:27:16] no
[09:27:20] that's the local FS
[09:27:25] yes
[09:27:47] <_joe_> he, s/./?/
[09:29:33] well it's serving 130-150MB/s
[09:29:47] it's at least reading that from disk, not sure if it's actually serving it to the web
[09:30:10] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:10] yup, it is
[09:30:34] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=dataset1001.wikimedia.org&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad
[09:31:01] 160MB/s right now
[09:31:11] over 550 HTTP connections
[09:31:15] we could reinstitute connection and/or bw caps
[09:31:43] yeah
[09:32:51] <_joe_> who's the nice one downloading 150 MB/s from us?
[09:33:42] there's a bunch of EC2 VMs
[09:36:39] 19 out of the top 20 IPs are EC2
[09:37:19] figures
[09:37:51] also Yandex fetching 2012's pagecounts
[09:38:02] AWS too
[09:38:23] anyway
[09:38:52] can you work on req/s & bw limits?
[09:38:57] and fix IPv6 while at it? :)
[09:38:59] yep
[09:39:02] thx
[09:39:07] already doing the limits
[09:39:10] awesome
[09:39:14] poor box
[09:41:34] it has a bunch of disks though, right?
[09:41:44] 160MB/s isn't that much
[09:42:00] even though it's all over the place I guess
[09:44:50] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 1.778 second response time
[09:45:26] (PS1) ArielGlenn: reinstate bandwidth and conn caps for dumps.wm.org [puppet] - https://gerrit.wikimedia.org/r/189447
[09:47:00] operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: integration-puppetmaster does not respond to other instances - https://phabricator.wikimedia.org/T88960#1024349 (hashar) Maybe ops have some idea?
:-(
[09:51:54] apergos: bandwidth levels are down now, but I'm wondering if it was a combination of req/s + rsync
[09:52:04] I didn't see rsync running
[09:52:06] lower I mean
[09:52:52] I saw one running a few minutes ago
[09:53:15] those are bw capped too
[09:53:20] ok
[10:03:43] Hi apergos
[10:03:53] hey Nemo_bis
[10:04:10] FYI I tried to encourage someone to send patches for MirrorBrain support https://meta.wikimedia.org/wiki/Talk:Data_dump_torrents#Alternatives_to_burnbit.com.3F
[10:04:17] (CR) Filippo Giunchedi: [C: 1] "LGTM, please get rid of trailing whitespace tho" [puppet/cassandra] - https://gerrit.wikimedia.org/r/189444 (https://phabricator.wikimedia.org/T88956) (owner: GWicke)
[10:04:35] ok great
[10:14:35] (PS2) ArielGlenn: reinstate bandwidth and conn caps for dumps.wm.org [puppet] - https://gerrit.wikimedia.org/r/189447
[10:17:11] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:17:38] (CR) ArielGlenn: [C: 2] reinstate bandwidth and conn caps for dumps.wm.org [puppet] - https://gerrit.wikimedia.org/r/189447 (owner: ArielGlenn)
[10:21:00] RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1024397 (faidon) Two questions: - I heard that we're moving off Titan, is this part of the ticket (@Smalyshev's involvement) obsolete now? - This reques...
[10:21:27] Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#1024398 (faidon) p:Normal>High
[10:21:38] apergos: ^^ was discussed during previous ops meeting
[10:21:49] ok
[10:21:59] PROBLEM - HTTP on ms1001 is CRITICAL: Connection refused
[10:22:39] ignore please, that' s me testing
[10:23:55] operations, MediaWiki-extensions-ConfirmEdit-(CAPTCHA-extension): bogus captchaid results in http 500, should be http 400 instead - https://phabricator.wikimedia.org/T88970#1024399 (fgiunchedi) NEW
[10:24:07] operations, ops-eqiad: rebalance memcached in eqiad - https://phabricator.wikimedia.org/T88710#1024409 (faidon) p:Triage>High a:Christopher>Joe
[10:25:24] operations, ops-eqiad: rebalance memcached in eqiad - https://phabricator.wikimedia.org/T88710#1018631 (faidon)
[10:25:55] operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#915542 (faidon)
[10:25:58] operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1024420 (faidon)
[10:26:16] operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#915542 (faidon) a:Cmjohnson>Joe
[10:27:05] operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024429 (hashar)
[10:28:14] operations, Incident-20150205-SiteOutage: sleeper database connection surges during outage - https://phabricator.wikimedia.org/T88770#1024436 (faidon)
[10:28:36] operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1024437 (akosiaris) NEW
[10:28:43] operations, ops-eqiad: rack and setup restbase
production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1024447 (faidon) p:Triage>High
[10:29:11] operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#999463 (akosiaris)
[10:29:13] operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1024451 (akosiaris)
[10:29:25] (PS1) ArielGlenn: connection limits in nginx need shared memory and key defn [puppet] - https://gerrit.wikimedia.org/r/189451
[10:29:26] operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1024437 (akosiaris)
[10:29:29] Deployment-Systems, operations: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#651916 (akosiaris)
[10:29:59] operations, ops-eqiad: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#1024468 (faidon) p:Triage>Normal
[10:30:43] (CR) ArielGlenn: [C: 2] connection limits in nginx need shared memory and key defn [puppet] - https://gerrit.wikimedia.org/r/189451 (owner: ArielGlenn)
[10:31:13] operations, ops-eqiad: wipe holmium disks - https://phabricator.wikimedia.org/T87391#1024473 (faidon) Note that one of holmium's disk is broken, T83734.
[10:31:13] <_joe_> how can I add a security group to an instance in OSM?
[10:31:25] <_joe_> does anyone have any ideas?
[10:33:30] _joe_: you can't
[10:33:56] and it is not an OpenStackManager limitation but rather Openstack limitation
[10:34:10] at least back then, not sure now
[10:34:38] <_joe_> oh my
[10:34:40] <_joe_> ok
[10:35:10] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:35:39] RECOVERY - HTTP on ms1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 0.003 second response time
[10:35:45] <_joe_> now I don't get why two instances in the same project can't speak to each other but on specific ports
[10:35:56] gwicke: can you add me and akosiaris to the services labs project when you read this? thanks!
[10:36:48] (PS2) Filippo Giunchedi: public entry point for restbase [dns] - https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194)
[10:36:58] (CR) Filippo Giunchedi: [C: 2 V: 2] public entry point for restbase [dns] - https://gerrit.wikimedia.org/r/188537 (https://phabricator.wikimedia.org/T78194) (owner: Filippo Giunchedi)
[10:37:25] (PS1) ArielGlenn: dumps nginx, fix up limit_conn_zone directive [puppet] - https://gerrit.wikimedia.org/r/189453
[10:38:14] (CR) ArielGlenn: [C: 2] dumps nginx, fix up limit_conn_zone directive [puppet] - https://gerrit.wikimedia.org/r/189453 (owner: ArielGlenn)
[10:44:45] godog: gwicke: I did. Both as members as well as projectadmins
[10:45:14] akosiaris: oh! didn't realize you were wiki admin, thanks :))
[10:46:16] <_joe_> how many times will I write sysctl instead of systemctl?
[10:46:40] heh... I am wondering too
[10:47:23] !log Manually removed wikidatawiki.wb_changes_dispatch entries for test wikis (test2wiki, testwiki, testwikidata).
[10:47:30] Logged the message, Master
[10:47:44] operations, Project-Creators, Phabricator, Triagers: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#1024495 (Qgil) @awight, if you need to create projects, just follow https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects and req...
[10:47:54] <_joe_> ohhh nice one debian
[10:48:16] <_joe_> they moved to systemd unit files for memcached, but hardcoded the values in the service file
[10:48:25] <_joe_> the config values I mean
[10:49:26] <_joe_> and they still ship /etc/memcached.conf, which is beautifully ignored by systemd
[10:52:17] Labs, operations: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1024517 (faidon) p:Triage>Unbreak!
[10:52:37] (CR) Qgil: [C: 1] Change 'Export to Excel' to 'Export (disabled)' [puppet] - https://gerrit.wikimedia.org/r/189327 (https://phabricator.wikimedia.org/T152) (owner: Merlijn van Deen)
[10:59:21] operations, Wikidata, wikidata-query-service: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1024545 (faidon) Open>stalled Based on what we heard regarding Titan/WQS lately, I think we can safely put this on hold and mark it as stalled, correct?
[11:01:24] operations, Wikidata, wikidata-query-service: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1024564 (Joe) Yes, I was waiting for this evening's WQS meeting before reassessing priority/status, but marking it stalled is fair.
[11:01:54] (CR) Aklapper: [C: 1] "Fine with me" [puppet] - https://gerrit.wikimedia.org/r/189327 (https://phabricator.wikimedia.org/T152) (owner: Merlijn van Deen)
[11:05:40] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago
[11:07:09] apergos: you might also want to help RobH with https://phabricator.wikimedia.org/T88497
[11:07:23] ok, will do
[11:07:44] seems easy enough :)
[11:10:18] please ignore any whines about ms1001 again, testing
[11:12:01] PROBLEM - HTTP on ms1001 is CRITICAL: Connection refused
[11:13:10] RECOVERY - HTTP on ms1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 382 bytes in 0.014 second response time
[11:19:04] PROBLEM - MySQL Slave Delay on db1045 is CRITICAL: CRIT replication delay 326 seconds
[11:19:07] PROBLEM - MySQL Replication Heartbeat on db1045 is CRITICAL: CRIT replication delay 328 seconds
[11:19:10] PROBLEM - MySQL Replication Heartbeat on db1026 is CRITICAL: CRIT replication delay 340 seconds
[11:19:29] PROBLEM - MySQL Slave Delay on db1026 is CRITICAL: CRIT replication delay 355 seconds
[11:19:32] PROBLEM - MySQL Replication Heartbeat on db1021 is CRITICAL: CRIT replication delay 355 seconds
[11:19:41] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CRIT replication delay 368 seconds
[11:19:59] can't be good
[11:20:02] springle: ^ ?
[11:20:05] hmm
[11:20:14] DELETE /* Wikibase\PruneChanges::pruneChanges
[11:20:19] taking forever
[11:20:20] RECOVERY - MySQL Slave Delay on db1045 is OK: OK replication delay 0 seconds
[11:20:23] RECOVERY - MySQL Replication Heartbeat on db1045 is OK: OK replication delay -0 seconds
[11:20:29] (PS5) KartikMistry: cxserver: Enable English to Russian MT [puppet] - https://gerrit.wikimedia.org/r/188517
[11:20:31] Well, Wikidata is locked up atm
[11:20:34] <_joe_> nice deletes are nice
[11:21:06] <_joe_> sjoerddebruin: what do you mean?
[11:21:28] "The wiki is currently in read-only mode"
[11:21:49] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay -1 seconds
[11:21:58] But seems to be done now.
[11:21:59] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 0 seconds
[11:22:12] deleting ~4 million rows unbatched
[11:22:42] <_joe_> sjoerddebruin: I was trying to reproduce it, and wasn't able in fact
[11:23:04] <_joe_> springle, mmmh.
[11:23:05] (PS1) ArielGlenn: dumps nginx: enable ipv6 [puppet] - https://gerrit.wikimedia.org/r/189459
[11:23:34] and that is innodb vagueness. some slaves think 5M or more
[11:23:40] for explain
[11:23:53] apergos: that won't work
[11:24:03] I think
[11:24:09] it's what I tried on ms1001
[11:24:15] after anotehr attempt failed
[11:24:26] hm, maybe it will, it really depends on bind_v6only
[11:24:31] git grep ipv6only
[11:24:37] or google search it
[11:24:37] hoo: killed the dispatch that was probably the result of those icinga warnings
[11:24:49] s/://
[11:25:17] I don't know what exactly happened, but it's recovering
[11:25:45] <_joe_> jzerebecki, hoo see #operations
[11:25:55] <_joe_> sorry, see above
[11:26:06] <_joe_> < springle> deleting ~4 million rows unbatched
[11:26:08] hoo: DELETE /* Wikibase\PruneChanges::pruneChanges
[11:26:17] springle: Ah ok
[11:26:18] unusually large
[11:26:25] You're only as far as I am
[11:26:33] see #wikidata
[11:27:28] logstash is empty :(
[11:27:55] I gtg
[11:28:11] jzerebecki: I noted it down and will take care of it later on... unless you want to beat me to it ;)
[11:28:25] <_joe_> jzerebecki: there is a good reason for logstash being empty
[11:28:30] two of five s5 slave still lagged. the others are ok
[11:28:30] <_joe_> it's disabled
[11:28:46] <_joe_> and will stay like that until we're able to make it not take down the whole cluster
[11:29:16] doh! totally forgot the recent outage for a second.
[11:31:46] operations, Wikimedia-Git-or-Gerrit: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1024635 (akosiaris) Open>Resolved I removed the watch via gerrit's web interface a couple of minutes ago, after getting the pass from the pri...
[11:35:11] operations, Wikimedia-Git-or-Gerrit: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1024651 (akosiaris)
[11:35:24] (CR) ArielGlenn: [C: 2] "seems like with the current version and bindv6only setting this should be ok." [puppet] - https://gerrit.wikimedia.org/r/189459 (owner: ArielGlenn)
[11:37:02] operations, MediaWiki-extensions-ConfirmEdit-(CAPTCHA-extension): bogus captchaid results in http 500, should be http 400 instead - https://phabricator.wikimedia.org/T88970#1024656 (Aklapper) p:Triage>Normal
[11:38:46] (CR) Alexandros Kosiaris: [C: 1] "I am fine with that approach." [puppet] - https://gerrit.wikimedia.org/r/188715 (https://phabricator.wikimedia.org/T86898) (owner: Dzahn)
[11:39:04] Are the lagged slaves being used for Wikidata watchlists?
[11:40:06] (CR) Alexandros Kosiaris: [C: 2] redisdb: add ferm::service for redis-server [puppet] - https://gerrit.wikimedia.org/r/188719 (https://phabricator.wikimedia.org/T86898) (owner: Dzahn)
[11:40:39] (PS2) Alexandros Kosiaris: redisdb: add ferm::service for redis-server [puppet] - https://gerrit.wikimedia.org/r/188719 (https://phabricator.wikimedia.org/T86898) (owner: Dzahn)
[11:41:19] (CR) Alexandros Kosiaris: [C: 2] redisdb: add ferm::service for redis-server [puppet] - https://gerrit.wikimedia.org/r/188719 (https://phabricator.wikimedia.org/T86898) (owner: Dzahn)
[11:42:20] RECOVERY - MySQL Slave Delay on db1026 is OK: OK replication delay 90 seconds
[11:43:11] RECOVERY - MySQL Replication Heartbeat on db1026 is OK: OK replication delay -0 seconds
[11:58:11] !log bounce mwprof-profiler-to-carbon on tungsten
[11:58:19] Logged the message, Master
[12:00:13] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1024698 (Joe) So, for rebalancing I just tested a valid workaround; now we define: servers: - 10.0.0.2:11211:1 - 10.0.0.3:11211:1...
[12:00:42] <_joe_> I'm sure you guys will love this ^^
[12:09:59] haha
[12:10:04] evil
[12:18:39] (CR) Alexandros Kosiaris: [C: -1] add network variables for dumps rsync clients (2 comments) [puppet] - https://gerrit.wikimedia.org/r/189196 (owner: John F. Lewis)
[12:19:07] (CR) Alexandros Kosiaris: [C: -1] add ferm service for rsyncd to dumps role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/188204 (owner: Dzahn)
[12:21:20] (CR) Faidon Liambotis: [C: -1] "I don't like keeping something that's clearly dataset-related in network.pp & in the base firewall. This belongs into the dataset manifest" [puppet] - https://gerrit.wikimedia.org/r/189196 (owner: John F.
Lewis)
[12:42:13] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1024712 (Joe) Changing the label of 1 out of 4 servers in a cluster (this would be equivalent to changing one IP in our current configuration) mak...
[12:48:30] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0; ge-0/0/0: down - Core: msw-oe12-esams
[12:50:10] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0]
[12:50:40] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0; ge-0/0/0: down - Core: msw-oe12-esams
[13:03:19] (PS1) JanZerebecki: Fix regexp for wikidata icinga check. [puppet] - https://gerrit.wikimedia.org/r/189474 (https://phabricator.wikimedia.org/T88980)
[13:06:20] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:11:13] (CR) Alexandros Kosiaris: [C: 2] Fix regexp for wikidata icinga check. [puppet] - https://gerrit.wikimedia.org/r/189474 (https://phabricator.wikimedia.org/T88980) (owner: JanZerebecki)
[13:16:25] (PS1) Alexandros Kosiaris: icinga: use DNS, not IP in wikidata check [puppet] - https://gerrit.wikimedia.org/r/189475
[13:59:57] (PS1) Mobrovac: vbench: Fix minor bug in std() [puppet] - https://gerrit.wikimedia.org/r/189477
[14:22:41] akosiaris: your thoughts on, https://gerrit.wikimedia.org/r/#/c/188796/ please! :)
[14:29:52] (PS1) Alexandros Kosiaris: Grant access to nuria to tin for deployment [puppet] - https://gerrit.wikimedia.org/r/189481 (https://phabricator.wikimedia.org/T88760)
[14:31:53] (CR) Hashar: [C: -1] "What is the use case? I don't see a need to ssh between CI labs instances."
[puppet] - https://gerrit.wikimedia.org/r/189132 (owner: Krinkle)
[14:37:52] (PS1) Alexandros Kosiaris: Grant access to milimetric to tin for deployment [puppet] - https://gerrit.wikimedia.org/r/189483 (https://phabricator.wikimedia.org/T88769)
[14:51:28] (CR) Hashar: "check experimental" [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm)
[14:52:10] Deployment-Systems, operations, Scrum-of-Scrums: Update wikitech wiki with deployment train - https://phabricator.wikimedia.org/T70751#1024835 (akosiaris) Let's move this forward again. Getting a server to host wikitech is easy. Actually hosting wikitech on a different server than virt1000... I have no idea i...
[14:57:44] (CR) Hashar: [C: -1] "I have added an experimental job that one can trigger by commenting 'check experimental'. It runs:" [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: Legoktm)
[15:00:28] !log cp1070 down for h/w troubleshooting. Already depooled by bblack
[15:00:33] Logged the message, Master
[15:03:14] operations, Incident-20150205-SiteOutage, ops-eqiad: Restore asw2-a5-eqiad redundant power - https://phabricator.wikimedia.org/T88792#1024839 (faidon) This sounds a bit dangerous, so let's reshuffle 2/3 of the servers first and then do this.
[15:05:28] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1024845 (Joe) Testing again, this time for connection failures. I tried causing network failures on a cluster I populated first with 1000 keys, by...
[15:14:51] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms
[15:17:19] PROBLEM - Varnish HTTP bits on cp1070 is CRITICAL: Connection timed out
[15:17:19] PROBLEM - puppet last run on cp1070 is CRITICAL: Timeout while attempting connection
[15:18:40] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[15:22:58] operations, ops-eqiad: cp1070 hardware failure - https://phabricator.wikimedia.org/T88889#1024889 (Cmjohnson) Swapped cpu1 and cpu2 to check if error follows cpu
[15:24:54] (CR) Alexandros Kosiaris: [C: 2 V: 2] "LGTM aside from really minor pedantic whitespace error" (6 comments) [puppet/cassandra] - https://gerrit.wikimedia.org/r/189444 (https://phabricator.wikimedia.org/T88956) (owner: GWicke)
[15:27:09] operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1024893 (akosiaris) Two comments on that change (one by Filippo, one by me), all LGTM aside from minor errors. I do have one question. Ticket says "install openjdk-8", change does not. What gives ?
[15:27:34] operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1024894 (akosiaris) a:akosiaris
[15:32:23] (PS1) ArielGlenn: dumps nginx: fixup ca cert name [puppet] - https://gerrit.wikimedia.org/r/189493
[15:32:59] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out.
[15:33:46] (03CR) 10ArielGlenn: [C: 032] dumps nginx: fixup ca cert name [puppet] - 10https://gerrit.wikimedia.org/r/189493 (owner: 10ArielGlenn) [15:34:18] 3operations: revisit what percentiles are calculated by txstatsd - https://phabricator.wikimedia.org/T88662#1024900 (10akosiaris) p:5Triage>3Normal [15:35:00] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 44, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 131, initializing_shards: 0, number_of_data_nodes: 3 [15:36:47] (03CR) 10Alexandros Kosiaris: "Pinging again" [puppet] - 10https://gerrit.wikimedia.org/r/145997 (https://bugzilla.wikimedia.org/67957) (owner: 10Ori.livneh) [15:37:38] any clue whether https://logstash.wikimedia.org being empty is normal ? [15:37:58] I think i remembered about some discussion related to logstash/hhvm badly interacting together last week [15:38:12] hashar: Yeah, it's supposed to be empty right now [15:38:22] hoo: thx [15:42:28] (03PS2) 10Glaisher: Add throttle rules for two workshops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187912 (https://phabricator.wikimedia.org/T88203) [15:46:59] 3operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1024921 (10ArielGlenn) This changeset (https://gerrit.wikimedia.org/r/#/c/189493/) will, after the old /etc/ssl/localcerts/dumps.wikimedia.org.chained.crt is moved out of the way, force the generation o... [15:48:07] Damn it Glaisher, not only did you edit conflict, you forgot to close a template!? The nerve of some people. :P [15:48:29] :) [15:48:48] anomie|sick: You want dibs on SWAT or should I? (|sick maybe means you don't want to...) 
[15:49:17] marktraceur: Please do, thanks [15:49:31] anomie|sick: But be ready to test yours [15:49:36] Already am [15:49:48] tonythomas, Glaisher, also ping for SWAT in ten minutes or so [15:49:59] pong :) [15:50:01] okey ! [15:50:25] 3Ops-Access-Requests: Requesting sudo for hafnium for nuria - https://phabricator.wikimedia.org/T88988#1024926 (10Nuria) 3NEW [15:52:02] 3operations, ops-eqiad: cp1070 hardware failure - https://phabricator.wikimedia.org/T88889#1024934 (10Cmjohnson) Work Order submitted, once approved I will update ticket Congratulations: Work Order WO6747225 was successfully submitted. [15:53:43] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024943 (10coren) 5Open>3Resolved a:3coren The project security group did not (was changed not to?) include allowing ss... [15:54:14] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024951 (10hashar) [15:55:40] (03PS1) 10QChris: Icinga: Drop qchris from analytics contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/189500 [15:59:56] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024977 (10hashar) 5Resolved>3Open The integration labs project was missing a security rule to allow ssh from gallium for... [16:00:04] manybubbles, anomie, ^d, marktraceur, anomie: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150209T1600). Please do the needful. [16:00:09] OK! [16:01:03] :) [16:01:52] First is anomie|sick. 
[16:03:03] 3operations: consider hybrid caching options for ssd+disk - https://phabricator.wikimedia.org/T88992#1024982 (10fgiunchedi) 3NEW a:3fgiunchedi [16:03:44] 3operations, Wikidata, Datasets-General-or-Unknown: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1025000 (10Lydia_Pintscher) [16:05:03] I guess I'm going to cancel my patches...sigh politics [16:05:19] 3operations: Document Debian/Ubuntu security update procedure & command - https://phabricator.wikimedia.org/T88469#1025009 (10akosiaris) I assume you are referring to: http://www.ubuntu.com/usn/usn-2489-1/ which from what I see has already made it across the cluster. The command is probably something along the... [16:06:03] 3operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1025011 (10Joe) HHVM by default generates a connection pool to memcached. In my test from the cli, I saw two connection threads. Once I dropped con... [16:06:46] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025012 (10coren) The puppetmaster issue did appear related: adding an explcit rule to allow it fixed the immediate problem,... [16:07:22] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025014 (10coren) 5Open>3Resolved [16:07:44] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025015 (10hashar) The deployment-prep labs project also uses a local puppetmaster but it does not need any specific security... 
[16:09:07] 3operations: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1025024 (10BBlack) [16:10:22] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025027 (10akosiaris) @hashar root@integration-slave1002:~# telnet 10.68.16.96 8140 Trying 10.68.16.96... Connected to 10.68... [16:10:47] 3operations: reclaim graphite1002 - https://phabricator.wikimedia.org/T88994#1025029 (10fgiunchedi) 3NEW a:3fgiunchedi [16:11:19] springle: I'm done btw with your db box loaner in T88994 if you are interested :) [16:11:58] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1025041 (10coren) Yeah, things are working fine now with an explicit rule - but the necessity of //having// the explicit rule... [16:12:03] 3operations: Document Debian/Ubuntu security update procedure & command - https://phabricator.wikimedia.org/T88469#1025042 (10akosiaris) p:5High>3Low [16:12:32] !log marktraceur Synchronized php-1.25wmf16/extensions/OAuth/: [SWAT] [wmf16] OAuth: Support ListDefinedTags and ChangeTagsListActive hooks (duration: 00m 11s) [16:12:33] 3operations: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1025045 (10BBlack) Currently we have reinstalled to the new jessie stack one of each type in eqiad (text -> cp1065, upload -> cp1064, bits -> cp1070, mobile -> cp1060) as well as amssq42 as text in esams for live... [16:12:39] Logged the message, Master [16:12:50] anomie|sick: Test! [16:12:51] marktraceur: Works [16:12:55] Sweet [16:13:00] Next up tonythomas. 
[16:13:06] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024242 (10hashar) Filled another ticket for investigation of the underlying issue: {T88995}. [16:13:15] Oh, a config patch, how nice [16:13:46] (03CR) 10MarkTraceur: [C: 032] Un-subscribe frequently failing recipients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189316 (https://phabricator.wikimedia.org/T48640) (owner: 1001tonythomas) [16:13:56] (03Merged) 10jenkins-bot: Un-subscribe frequently failing recipients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189316 (https://phabricator.wikimedia.org/T48640) (owner: 1001tonythomas) [16:14:15] marktraceur: yay ! [16:14:57] !log marktraceur Synchronized wmf-config/: [SWAT] [config] Un-subscribe frequently failing recipients (duration: 00m 05s) [16:14:57] <_joe_> Reedy: https://gerrit.wikimedia.org/r/#/c/188762/ waits for your review [16:15:00] Logged the message, Master [16:15:01] tonythomas: Test 'er [16:15:13] I figure that will take some time, but I can move on to Glaisher [16:15:23] (03Abandoned) 10Giuseppe Lavagetto: hhvm: make the puppet module more configurable [puppet] - 10https://gerrit.wikimedia.org/r/179108 (owner: 10Giuseppe Lavagetto) [16:16:18] marktraceur: we are working on that in #dev [16:16:57] (03CR) 10MarkTraceur: [C: 032] Add throttle rules for two workshops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187912 (https://phabricator.wikimedia.org/T88203) (owner: 10Glaisher) [16:17:09] Glaisher: I guess you can't really test yours, huh [16:17:22] yeah [16:17:22] (03Merged) 10jenkins-bot: Add throttle rules for two workshops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187912 (https://phabricator.wikimedia.org/T88203) (owner: 10Glaisher) [16:18:05] !log marktraceur Synchronized wmf-config/throttle.php: [SWAT] [config] Add throttle rules for two workshops (duration: 00m 
07s) [16:18:11] Logged the message, Master [16:18:20] Well, nothing bad happened [16:18:26] So, that's a SWAT [16:18:31] 3operations: Puppet broken on silver.wikimedia.org - https://phabricator.wikimedia.org/T88513#1025074 (10akosiaris) Well, Ubuntu 14.04.1 LTS auto-installed on Mon Feb 2 17:49:40 UTC 2015 and puppet is not failing anymore [16:18:32] marktraceur: … yet. [16:18:32] My patches are postponed indefinitely pending discussion [16:18:33] ;-) [16:19:30] (03PS3) 10Giuseppe Lavagetto: dsh: create files based on exported resources [puppet] - 10https://gerrit.wikimedia.org/r/179121 [16:20:02] 3operations: jessie kernel vm subsystem issues for upload caches - https://phabricator.wikimedia.org/T88996#1025082 (10BBlack) 3NEW a:3BBlack [16:20:05] marktraceur: Awesome. Thanks! [16:20:09] marktraceur: tested that one. works well and good [16:20:14] Thanks ;) [16:20:40] 3operations: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1025097 (10BBlack) [16:20:41] 3operations: jessie kernel vm subsystem issues for upload caches - https://phabricator.wikimedia.org/T88996#1025098 (10BBlack) [16:20:43] 3operations: improve graphite failover - https://phabricator.wikimedia.org/T88997#1025099 (10fgiunchedi) 3NEW [16:20:48] Great, tonythomas, cheers [16:20:58] cheers ! :) [16:21:15] cant wait to take the extension to the next level [16:21:30] 3RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1025110 (10GWicke) >>! In T85492#1024397, @faidon wrote: > Two questions: > - I heard that we're moving off Titan, is this part of the ticket (@Smalyshev'... 
[16:21:45] 3operations: jessie kernel vm subsystem issues for upload caches - https://phabricator.wikimedia.org/T88996#1025082 (10BBlack) [16:21:59] 3operations: jessie kernel vm subsystem issues for upload caches - https://phabricator.wikimedia.org/T88996#1025082 (10BBlack) [16:22:11] !log restarted eventlogging on hafnium for nuria via ~root/upgrade-eventlogging --no-update [16:22:15] Logged the message, Master [16:22:29] jgage: but we also need to deploy [16:22:43] jgage: the code is still from dec 16th [16:22:55] jgage: we are looking to deploy the latest from master [16:23:02] oh? ok. i'll run it without that arg then [16:23:14] it was unclear whether you'd already done the deploy from tin [16:23:35] jgage: sorry, no, do not have permits on tin yet either [16:23:47] ok, no problem [16:23:54] (03PS1) 10Filippo Giunchedi: gdash: fix graphite disk dashboard sda->md1 [puppet] - 10https://gerrit.wikimedia.org/r/189504 (https://phabricator.wikimedia.org/T85909) [16:23:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [16:25:20] (03PS2) 10Filippo Giunchedi: gdash: fix graphite disk dashboard sda->md1 [puppet] - 10https://gerrit.wikimedia.org/r/189504 (https://phabricator.wikimedia.org/T85909) [16:25:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: fix graphite disk dashboard sda->md1 [puppet] - 10https://gerrit.wikimedia.org/r/189504 (https://phabricator.wikimedia.org/T85909) (owner: 10Filippo Giunchedi) [16:25:37] (03CR) 10Faidon Liambotis: [C: 04-1] "Per T85492, let's a) remove Smalyshev for now, b) rename the group to "cassandra-test-roots" to be clear on what this is." [puppet] - 10https://gerrit.wikimedia.org/r/188605 (owner: 10Andrew Bogott) [16:25:55] akosiaris (with your duty hat on) ^^^ [16:26:28] (03PS2) 10GWicke: Update for Cassandra 2.1.2 [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/189444 (https://phabricator.wikimedia.org/T88956) [16:26:58] I lied! 
[16:27:02] SWAT continues with my patches. [16:27:55] (03CR) 10GWicke: "@akosiaris, @fgiunchedi: Stripped the whitespace." [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/189444 (https://phabricator.wikimedia.org/T88956) (owner: 10GWicke) [16:29:15] jgage: Ok, will wait for code to appear in "/srv/deployment/eventlogging/EventLogging" on hafnium [16:30:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update for Cassandra 2.1.2 [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/189444 (https://phabricator.wikimedia.org/T88956) (owner: 10GWicke) [16:30:28] 3operations, Citoid: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025136 (10Mvolz) [16:31:21] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:20] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59719 bytes in 0.430 second response time [16:36:50] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:37:02] (03PS1) 10Giuseppe Lavagetto: mediawiki: armonize HHVM settings with Zend ones [puppet] - 10https://gerrit.wikimedia.org/r/189505 [16:37:43] !log marktraceur Synchronized php-1.25wmf15/extensions/UploadWizard/resources/mw.FlickrChecker.js: [SWAT] [wmf15] Re-add flickrreview template to files imported from Flickr by UploadWizard (duration: 00m 05s) [16:37:47] Logged the message, Master [16:38:11] !log marktraceur Synchronized php-1.25wmf16/extensions/UploadWizard/resources/mw.FlickrChecker.js: [SWAT] [wmf16] Re-add flickrreview template to files imported from Flickr by UploadWizard (duration: 00m 06s) [16:38:15] Logged the message, Master [16:38:33] OK now SWAT is over for realsies [16:45:59] 3operations, Wikimedia-Git-or-Gerrit: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1025174 (10akosiaris) Given the phabricator plans, gerrit's inability to listen on port 22 and the minimal relevant traffic on this
ticket since 2 years ago, I am inclined to suggest we should... [16:47:00] Hm, maybe not - trying something different [16:47:33] !log restarted eventlogging on hafnium (with deploy from master on tin this time) [16:47:40] Logged the message, Master [16:48:09] !log marktraceur Synchronized php-1.25wmf15/extensions/UploadWizard/: [SWAT] [wmf15] Trying to force UploadWizard to update (duration: 00m 06s) [16:48:13] Logged the message, Master [16:48:45] !log marktraceur Synchronized php-1.25wmf16/extensions/UploadWizard/: [SWAT] [wmf16] Trying to force UploadWizard to update (duration: 00m 05s) [16:48:49] 3operations: jessie kernel vm subsystem issues for upload caches - https://phabricator.wikimedia.org/T88996#1025181 (10BBlack) Just to keep a more-detailed record of things that have been tried: At various points in this process, the frequency of the spikes can be as short as on every several minutes to as infr... [16:48:49] Logged the message, Master [16:49:38] (03PS1) 10Cenarium: Checking "autoreview" instead of "autoconfirmed" for enwiki FlaggedRevs restriction level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189513 [16:50:43] 3operations, Wikimedia-Git-or-Gerrit: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1025183 (10Dzahn) >>! In T88388#1024635, @akosiaris wrote: > @Dzanh, could you please undo the "meanwhile i have edited the list settings of the ops... [16:52:42] mutante: Can you create a private contactgroup with my phone number for me? [16:52:59] !log marktraceur Synchronized php-1.25wmf15/extensions/UploadWizard/resources/mw.FlickrChecker.js: [SWAT] [wmf15] Re-add flickrreview template to files imported from Flickr by UploadWizard (duration: 00m 05s) [16:54:46] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025195 (10Mvolz) [16:55:08] hoo: yes [16:55:25] Ok...PM? 
[16:55:41] ok [16:59:24] Is, um...is fluorine up? [16:59:39] Seems like it [16:59:48] it is [16:59:55] icinga would cry, if it weren't also [17:00:05] How did I get into it last time... [17:00:17] ssh fluorine :P [17:00:41] Ah, I didn't have it set up in my ssh config [17:03:02] don't you just have *.eqiad.wmnet set up? [17:03:25] Probably not [17:04:34] 3operations, Incident-20150205-SiteOutage, MediaWiki-Core-Team, Wikimedia-Logstash: Prototype Monolog and rsyslog configuration to ship log events from MediaWiki to Logstash - https://phabricator.wikimedia.org/T88870#1025205 (10bd808) I spent some time yesterday looking into what this will take. It turns out to... [17:08:20] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1025215 (10faidon) I think I'd prefer going with Gelf directly from MediaWiki rather than involving another component in the path. Are there an... [17:09:48] 3operations, Wikimedia-Git-or-Gerrit: Remove port 29418 from cloning process - https://phabricator.wikimedia.org/T37611#1025228 (10Dzahn) unless we still add the iptables rule on ytterbium itself (that's different from my abandoned patch above which expected we'd have to forward it between machines for a scenari... [17:14:41] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1025266 (10bd808) >>! In T88732#1025215, @faidon wrote: > I think I'd prefer going with Gelf directly from MediaWiki rather than involving anot...
[17:23:30] 3operations: zirconium: more space for /srv (take from /var/log) - https://phabricator.wikimedia.org/T89004#1025288 (10Dzahn) 3NEW [17:24:05] 3operations: zirconium: more space for /srv (take from /var/log) - https://phabricator.wikimedia.org/T89004#1025295 (10Dzahn) [17:25:09] 3operations: zirconium: more space for /srv (take from /var/log) - https://phabricator.wikimedia.org/T89004#1025288 (10Dzahn) [17:25:20] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.05 [17:28:23] 3operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1025303 (10GWicke) >>! In T88956#1024893, @akosiaris wrote: > Two comments on that change (one by Filippo, one by me), all LGTM aside from minor errors. I do have one question. Ticket says "install... [17:28:48] 3operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1025304 (10GWicke) [17:33:07] (03CR) 10Arlolra: [C: 031] mediawiki: do not escape urls in the catchall redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/188762 (owner: 10Giuseppe Lavagetto) [17:35:28] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025322 (10akosiaris) Hello, I 've encountered the zotero not running issue during the Dev Summit and I have talked with @Catrope about it and how to solve it. The issue stems from: a) The fact that the z... [17:35:41] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025323 (10akosiaris) p:5Unbreak!>3High [17:37:19] legoktm: when is global userpage supposed to be deployed ? [17:38:13] 3operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1025325 (10akosiaris) 5Open>3Resolved @Gwicke, OK good to know. 
On a side note, I 'd rather we avoid backported openjdk 8 to jessie to production for as much as we can (the openjdk 8 in the apt... [17:38:27] Needs to be scheduled still [17:39:08] (03CR) 10Ori.livneh: [C: 031] mediawiki: send .phtml files to HHVM as well [puppet] - 10https://gerrit.wikimedia.org/r/189440 (owner: 10Giuseppe Lavagetto) [17:39:17] ^ _joe_ i saw that one last night, i meant to +1 it [17:39:25] (03PS2) 10Ori.livneh: mediawiki: harmonize HHVM settings with Zend ones [puppet] - 10https://gerrit.wikimedia.org/r/189505 (owner: 10Giuseppe Lavagetto) [17:43:53] (03CR) 10Ori.livneh: "* 180s seems like an awful lot" [puppet] - 10https://gerrit.wikimedia.org/r/189505 (owner: 10Giuseppe Lavagetto) [17:44:13] what's an "AppleDouble encoded Macintosh file" good for [17:46:25] !log restarted elasticsearch on logstash1003; OOM [17:46:32] Logged the message, Master [17:50:49] (03PS5) 10Andrew Bogott: Rename cassandra-roots to cassandra-test-roots; add mobrovac and jdouglas. [puppet] - 10https://gerrit.wikimedia.org/r/188605 [17:51:39] 3RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1025354 (10Andrew) https://gerrit.wikimedia.org/r/#/c/188605/ [17:55:19] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "In order:" [puppet] - 10https://gerrit.wikimedia.org/r/189505 (owner: 10Giuseppe Lavagetto) [18:00:43] matanya: tentatively scheduled for the 18th [18:01:03] 3Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#1025362 (10Andrew) Zeljko, please respond [18:01:03] a dream coming true, thanks legoktm_ [18:01:38] what is? [18:02:40] (03CR) 10Faidon Liambotis: [C: 032] Rename cassandra-roots to cassandra-test-roots; add mobrovac and jdouglas. [puppet] - 10https://gerrit.wikimedia.org/r/188605 (owner: 10Andrew Bogott) [18:04:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I am not really fond of this approach.
Instead I would favor an approach where most of the anyway common config is in a .erb file and flag" [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [18:06:07] (03PS2) 10Ori.livneh: vbench: Fix minor bug in std() [puppet] - 10https://gerrit.wikimedia.org/r/189477 (owner: 10Mobrovac) [18:06:15] 3operations: Can protactinium be reclamed (was emergency gadolinium replacement) - https://phabricator.wikimedia.org/T89009#1025388 (10RobH) 3NEW [18:06:30] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: Fix minor bug in std() [puppet] - 10https://gerrit.wikimedia.org/r/189477 (owner: 10Mobrovac) [18:07:06] 3operations: Can protactinium be reclaimed (was emergency gadolinium replacement) - https://phabricator.wikimedia.org/T89009#1025402 (10faidon) p:5Triage>3Normal a:3Ottomata [18:09:30] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:10:30] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.002167 secs [18:13:00] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:14:10] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.001422 secs [18:16:40] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:17:40] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:17:49] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset -0.002329 secs [18:18:50] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -1.4e-05 secs [18:19:39] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:20:50] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset -8.5e-05 secs [18:24:19] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 11278 MB (3% inode=99%): [18:27:58] paravoid: there? 
[18:28:10] yes but about to jump into a meeting [18:30:45] paravoid: k, talk to you later, take a look at this (Chrome is dropping support for SPDY) [18:30:53] paravoid: http://blog.chromium.org/2015/02/hello-http2-goodbye-spdy-http-is_9.html [18:30:55] I saw [18:31:33] i like that url. "Hello, HTTP2. Goodbye, SPDY. HTTP is 9!" [18:33:07] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1025466 (10RobH) [18:34:35] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1020797 (10RobH) [18:34:37] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1025469 (10RobH) [18:40:59] PROBLEM - Disk space on xenon is CRITICAL: DISK CRITICAL - free space: /mnt/data 11261 MB (3% inode=99%): [18:41:16] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025492 (10GWicke) @akosiaris: Thanks for looking into a saner way to deploy zotero. Lets not over-complicate things though: I think it's fine for zotero & citoid to share hardware and IP. The zotero servi...
[18:45:18] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:46:20] (03PS1) 10Ori.livneh: vbench: create domain proxy objects for more python calling conventions [puppet] - 10https://gerrit.wikimedia.org/r/189528 [18:47:12] (03CR) 10Ori.livneh: [C: 032] vbench: create domain proxy objects for more python calling conventions [puppet] - 10https://gerrit.wikimedia.org/r/189528 (owner: 10Ori.livneh) [18:50:18] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [18:50:18] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [18:51:38] db1008 stuff is me [18:53:36] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025536 (10GWicke) [18:54:22] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#795651 (10GWicke) [18:55:18] PROBLEM - check_mysql on db1008 is CRITICAL: Cant connect to local MySQL server through socket /tmp/mysql.sock (2) [18:55:52] 3operations, ops-eqiad: Rack Setup new diskshelf for labstore1001 - https://phabricator.wikimedia.org/T88802#1025545 (10Cmjohnson) 5Open>3declined Okay, Will come back to it when Yuvi gets back [18:56:19] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [18:56:19] (03PS1) 10GWicke: Update cassandra submodule [puppet] - 10https://gerrit.wikimedia.org/r/189530 (https://phabricator.wikimedia.org/T88956) [18:57:35] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025555 (10Catrope) >>! In T76308#1025322, @akosiaris wrote: > b) The deployment method of zotero which right now is distributing a set of shared object files > > https://git.wikimedia.org/tree/mediawiki%... 
[18:58:50] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1025558 (10RobH) So this rebalancing is going to block the deployment of 2 of the 6 Restbase systems. Can we plan to move mc1017-mc1018 into row D? Or at minimum... [18:59:18] (03CR) 10Filippo Giunchedi: [C: 031] "this reminds me of a pitfall I noticed today: git grep doesn't recurse inside submodules :(" [puppet] - 10https://gerrit.wikimedia.org/r/189530 (https://phabricator.wikimedia.org/T88956) (owner: 10GWicke) [18:59:30] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 11280 MB (3% inode=99%): [18:59:44] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1025563 (10Mvolz) Some additional factors, which you probably have considered but just writing down here: The service, translation-server, is not the same thing as zotero-standalone. Zotero[1] is a submodu... 
[19:00:18] RECOVERY - check_mysql on db1008 is OK: Uptime: 241 Threads: 1 Questions: 18224 Slow queries: 0 Opens: 33 Flush tables: 2 Open tables: 42 Queries per second avg: 75.618 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [19:04:17] (03PS1) 10GWicke: Bump cassandra memory on restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/189531 [19:07:55] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1025595 (10GWicke) [19:09:17] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1020797 (10GWicke) Setup: - Debian Jessie - Partitioning: - small (~20G RAID-1) partition for `/` - bulk of SSDs as RAID-0 on top of LVM vg [19:09:39] 3operations, RESTBase-Cassandra: Update cassandra puppetization for 2.1 - https://phabricator.wikimedia.org/T88956#1025602 (10mobrovac) Why ver 8 when Cassandra seems to support [Java 7](http://www.datastax.com/documentation/cassandra/2.1/cassandra/install/installDeb_t.html) ? [19:13:27] (03CR) 10Mobrovac: [C: 031] Bump cassandra memory on restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/189531 (owner: 10GWicke) [19:17:54] (03PS1) 10Hoo man: Fix template type "paresercache" -> "parsercache" [puppet] - 10https://gerrit.wikimedia.org/r/189535 [19:18:13] springle: hey... around? [19:18:45] hoo: in ops meeting. desperate? [19:19:13] springle: No... 
just trying to understand the Icinga configuration [19:20:46] 3operations, Engineering-Community: date/budget proposal for 2015 Ops Offsite - https://phabricator.wikimedia.org/T89023#1025652 (10Rfarrand) 3NEW a:3Rfarrand [19:21:16] Doh [19:21:21] I made a typo in typo [19:21:57] (03PS2) 10Hoo man: Fix template name typo "paresercache" -> "parsercache" [puppet] - 10https://gerrit.wikimedia.org/r/189535 [19:23:56] (03PS1) 10Ori.livneh: vbench: use defer.inlineCallbacks to chain commands [puppet] - 10https://gerrit.wikimedia.org/r/189536 [19:24:05] RoanKattouw: ^ should fix things [19:24:14] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: use defer.inlineCallbacks to chain commands [puppet] - 10https://gerrit.wikimedia.org/r/189536 (owner: 10Ori.livneh) [19:24:54] (03CR) 10Kaldari: Adding original language of this work campaign for WikiGrok [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [19:29:13] ori: Interesting, that works but now there's an error *after* it runs [19:29:15] pastebinning [19:30:14] ori: http://pastebin.com/aw2NMqwk [19:31:09] (03PS1) 10Ori.livneh: vbench: remove debugging code [puppet] - 10https://gerrit.wikimedia.org/r/189539 [19:31:19] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: remove debugging code [puppet] - 10https://gerrit.wikimedia.org/r/189539 (owner: 10Ori.livneh) [19:31:55] RoanKattouw: ok, fixed too. [19:32:02] (and updated on osmium) [19:32:53] ori: Yup, working now, thanks [19:41:37] <^d> andrewbogott: Do you need those testelastic* nodes anymore? [19:41:54] <^d> Meh, meant for in #-labs [19:41:54] ^d: I don’t think I can be the andrew you want [19:42:19] hm, or maybe I am? [19:42:25] Anyway, in a meeting, bug me in 30? [19:42:32] <^d> mmk [19:42:54] If they’re actually mine, then I definitely don’t need them [19:43:05] Might check with manybubbles though since I probably made them for him [19:43:13] ? 
[19:43:39] I don't think I need them [19:54:17] (03CR) 10BryanDavis: "I think the test code should be executing `composer update` rather than `composer install`. If a composer.lock file is present the install" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [19:56:58] <^d> andrewbogott: I thought you maybe spun those up friday testing nfs. [19:57:16] Oh! ACtually I think I did [19:57:23] Um… Friday seems like so long ago [19:57:29] Feel free to delete them [19:57:33] <^d> Mmk [19:57:34] <^d> Thx [19:57:39] sorry for confusion [19:58:33] (03PS1) 10Cmjohnson: Adding mgmt dns for restbase [dns] - 10https://gerrit.wikimedia.org/r/189540 [19:58:51] akosiaris, _joe_: btw, I'm happy to answer questions about restbase any time, am just a bit tired of repeating myself right now [20:02:34] 3operations, Continuous-Integration, Wikimedia-Labs-Infrastructure: labs security rules are flappy / invalid cause network communications issues - https://phabricator.wikimedia.org/T88960#1024242 (10hashar) The doc publishing jobs are failing as well and there is no workaround for it :( T89026 [20:04:35] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for restbase [dns] - 10https://gerrit.wikimedia.org/r/189540 (owner: 10Cmjohnson) [20:06:51] (03PS1) 10Cmjohnson: Adding fwd dns entries for restbase servers [dns] - 10https://gerrit.wikimedia.org/r/189544 [20:07:05] (03CR) 10Andrew Bogott: [C: 032] Move puppet-lint options to .puppet-lint.rc [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [20:07:25] \O/ [20:07:43] (03PS2) 10Andrew Bogott: puppet-lint: ignore some var in single quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [20:08:06] (03CR) 10Cmjohnson: [C: 032] Adding fwd dns entries for restbase servers [dns] - 10https://gerrit.wikimedia.org/r/189544 (owner: 10Cmjohnson) 
[20:08:30] 3operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1025755 (10RobH) It is the same private key, no change. [20:09:10] (03CR) 10Andrew Bogott: [C: 032] puppet-lint: ignore some var in single quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [20:17:18] 3operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1025771 (10RobH) Hrmm, that change is correct, and this should work. Let me loop back to this shortly. [20:18:05] akosiaris, godog: if you have a moment, could you +2 the cassandra submodule update in https://gerrit.wikimedia.org/r/#/c/189530/ ? [20:18:24] would like to re-enable puppet on the test cluster [20:34:43] (03CR) 10Krinkle: "This was to allow using dsh/salt from e.g. integration-dev to orchestrate commands to the cluster. https://phabricator.wikimedia.org/T8781" [puppet] - 10https://gerrit.wikimedia.org/r/189132 (owner: 10Krinkle) [20:34:49] (03CR) 10Dzahn: [C: 031] Fix template name typo "paresercache" -> "parsercache" [puppet] - 10https://gerrit.wikimedia.org/r/189535 (owner: 10Hoo man) [20:35:49] (03CR) 10Dzahn: [C: 032] Icinga: Drop qchris from analytics contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/189500 (owner: 10QChris) [20:37:53] (03CR) 10Dzahn: [C: 032] Change 'Export to Excel' to 'Export (disabled)' [puppet] - 10https://gerrit.wikimedia.org/r/189327 (https://phabricator.wikimedia.org/T152) (owner: 10Merlijn van Deen) [20:38:25] greg-g, updated depl schedule [20:38:31] yurikR: with? [20:38:57] ah, I see now [20:39:02] greg-g, all by myself :( [20:39:11] good job! :P [20:39:17] * yurikR sings to himself quietly... 
[20:47:16] (03CR) 10BBlack: [C: 031] mediawiki: do not escape urls in the catchall redirect to https [puppet] - 10https://gerrit.wikimedia.org/r/188762 (owner: 10Giuseppe Lavagetto) [20:56:38] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet] - 10https://gerrit.wikimedia.org/r/163814 (owner: 10Hashar) [20:56:47] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/163814 (owner: 10Hashar) [20:57:29] (03CR) 10Dzahn: [C: 031] "agree, they are "blocked by" not "blocking", this confused me as well" [puppet] - 10https://gerrit.wikimedia.org/r/189329 (https://phabricator.wikimedia.org/T33) (owner: 10Merlijn van Deen) [20:58:11] Hey ^d [20:58:22] andrewbogott: thanks for the puppet-lint patches :] [20:58:36] <^d> Krenair: yes? [20:58:39] (03CR) 10Dzahn: [C: 031] Add documentation link to 'create bug by email' text. [puppet] - 10https://gerrit.wikimedia.org/r/189326 (https://phabricator.wikimedia.org/T865) (owner: 10Merlijn van Deen) [20:58:47] uh, actually, let's go to -releng. sorry [20:58:56] hashar: I hope they work! [21:00:05] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150209T2100). Please do the needful. [21:00:33] 3operations: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1025975 (10BBlack) NPN + SPDY is enabled in the current test software stack and was fairly trivial to turn and seems to work fine. ALPN is apparently supported by the nginx codebase we're running, but will require an upgrade to Open... [21:02:40] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1025983 (10hashar) With the operations/puppet.git patches merged above, puppet-lint v1.1.0 no more reports any error. Thus, the Jenkins job [[ https://integration.wikimedia.org/ci/job/operations-pup... [21:04:34] springle: Still i the meeting? 
[21:04:43] (03PS2) 10Merlijn van Deen: Add documentation link to 'create bug by email' text. [puppet] - 10https://gerrit.wikimedia.org/r/189326 (https://phabricator.wikimedia.org/T865) [21:05:00] hoo: no. whats up [21:06:00] springle: I'd find it useful to get Icinga notifications if a s5 slave is lagged behind. [21:06:12] What's the closest to that I can get right now? [21:06:47] I looked around puppet... and there's a lot of stuff which has to do with mysql/ maria and icinga... [21:07:10] (03PS2) 10Merlijn van Deen: Change Blocking Tasks to 'Blocked By' Tasks [puppet] - 10https://gerrit.wikimedia.org/r/189329 (https://phabricator.wikimedia.org/T33) [21:08:06] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1026000 (10GWicke) >>! In T76308#1025555, @Catrope wrote: > I didn't notice libssl was in there, that means I've done something much worse than I thought I was doing. Thanks for cleaning this up. While sca... [21:08:31] hoo: right now we'd need to tweak the mariadb::monitor_replication check in role:mariadb::core if shard=s5 [21:08:53] pass it a combination of contact groups i guess [21:09:20] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#1026002 (10greg) a:5greg>3None >>! In T86602#1019902, @greg wrote: > This was already done, right? Or do we still need a time? > > If it wasn't done, pick a time that wo... [21:10:04] springle: Is the "usual" monitoring you have noisy? (Will I go crazy if I just add myself to that?) [21:10:53] RECOVERY - Disk space on cerium is OK: DISK OK [21:10:58] I mean it's probably what I'd get anyway times three or so (given that s5/s1 are potentially more noisy, I guess) [21:11:38] i think the replag alert last night was the first in months on the production shards [21:12:08] Yeah ok... 
so I'll just subscribe to that [21:12:13] RECOVERY - Disk space on xenon is OK: DISK OK [21:12:15] I wont die on some noise [21:12:16] things always lag a bit, obviously, but the threshold is fairly high [21:12:27] 60s or something [21:12:45] yeah... a few seconds can happen every once in a while [21:12:58] (03CR) 10Legoktm: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947) (owner: 10Legoktm) [21:13:04] btw, I'm not tweaking that script to not delete more than 25k rows at once [21:13:12] which is still much, but we wont hit that usually anyway [21:13:24] "now" tweaking? [21:13:35] yeah, now :P [21:13:42] double negatives make my head hurt :) [21:13:45] cool [21:15:23] I got a bunch of 503 Service Unavailable from different https://bits.wikimedia.org/www.mediawiki.org/load.php?.. urls just now. causing pages to be unstyled and with js errors [21:16:56] 3Ops-Access-Requests: Sudo for Roan on osmium - https://phabricator.wikimedia.org/T89038#1026024 (10ori) 3NEW [21:17:14] (03PS1) 10Hoo man: Add hoo to the "dba" and "wikidata" contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/189585 [21:18:28] (03CR) 10Dzahn: [C: 031] "created that contact in private repo, can be used" [puppet] - 10https://gerrit.wikimedia.org/r/189585 (owner: 10Hoo man) [21:18:57] hoo: dba might annoy you as it has other stuff. it wouldn't be hard to send s5 replag alerts to wikidata (if that is a group) [21:19:17] others might hate it [21:20:04] Mh... can we do that for now and if it annoys me, we can do something more fancy? 
[21:20:13] heh [21:20:14] ok [21:20:44] dbstore and analytics might make you cry [21:21:21] I can imagine people to do nasty stuff there [21:21:30] nasty as in slow, not bad :P [21:22:00] 3operations: Support SPDY - https://phabricator.wikimedia.org/T35890#1026043 (10GWicke) The new jessie nginx test install already supports SPDY, and I believe is serving a fraction of the prod traffic: https://spdycheck.org/#cp1008.wikimedia.org So it looks like we'll gradually get wider SPDY support as the Jes... [21:23:17] (03PS1) 10Nuria: Correcting docs and thresholds for eventlogging alarms [puppet] - 10https://gerrit.wikimedia.org/r/189588 [21:25:40] (03PS3) 10Springle: Fix template name typo "paresercache" -> "parsercache" [puppet] - 10https://gerrit.wikimedia.org/r/189535 (owner: 10Hoo man) [21:30:58] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1026067 (10scfc) Nicely done, merci! If Jenkins would vote -1, would that reject merges? I. e., if in an emergency #operations needs to merge a change even if it doesn't lint, would they be able t... [21:32:02] (03CR) 10Springle: [C: 032] Fix template name typo "paresercache" -> "parsercache" [puppet] - 10https://gerrit.wikimedia.org/r/189535 (owner: 10Hoo man) [21:34:36] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1026073 (10hashar) >>! In T87132#1026067, @scfc wrote: > Nicely done, merci! > > If Jenkins would vote -1, would that reject merges? I. e., if in an emergency #operations needs to merge a change e... 
[21:40:50] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1026091 (10Matanya) https://gerrit.wikimedia.org/r/#/c/189589/ [21:55:25] (03PS2) 10Springle: Add hoo to the "dba" and "wikidata" contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/189585 (owner: 10Hoo man) [21:56:29] (03CR) 10Springle: [C: 032] Add hoo to the "dba" and "wikidata" contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/189585 (owner: 10Hoo man) [22:00:04] yurik: Dear anthropoid, the time has come. Please deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150209T2200). [22:00:41] (03PS1) 10Calak: Change templateeditor user group rights on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189594 (https://phabricator.wikimedia.org/T89040) [22:02:40] (03CR) 10Ebrahim: [C: 031] Change templateeditor user group rights on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189594 (https://phabricator.wikimedia.org/T89040) (owner: 10Calak) [22:08:22] (03CR) 10Cenarium: "Actually, this should stay for the user interface (protection summary). I'll remove "autoreview" instead." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189513 (owner: 10Cenarium) [22:08:51] Krinkle, is jenkins dead again ((( https://gerrit.wikimedia.org/r/#/c/189592/ [22:09:58] 3operations, MediaWiki-Core-Team: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1026201 (10ori) >>! In T88393#1016697, @ori wrote: > if we have a cron job pick up any files in archive/ that logrotate failed to compress for whatever reason, we could close this task and feel good abou... [22:11:35] yurikR1: Nope, but your job was not going anywhere it seems: https://integration.wikimedia.org/ci/job/mediawiki-extensions-zend/2600/console [22:11:38] I've aborted it [22:12:21] Krinkle, thx, overriding it ( [22:12:58] ori: Can you restart Chrome again? 
(sigh) [22:18:49] RoanKattouw: done [22:19:24] ori: Thanks [22:27:59] Krinkle, is jenkins even alive today? https://integration.wikimedia.org/zuul/ :((( [22:28:14] trying to merge 2 core patches to to branches, and it just hangs :((( [22:28:15] it is [22:28:19] yurikR1: it's working fine for 100s of jobs in the past hour. [22:28:27] 3operations, Release-Engineering, WMF-Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1026252 (10bd808) @greg is going to find a #release-engineering helper for this project [22:28:27] i'm special :( [22:30:05] yurikR1: It seems 1 in N mediawiki-extensions-zend is having phpunit go stuck somewhere half-way [22:30:30] should i force merge it? [22:30:38] sems like it passed all the tests [22:30:44] https://gerrit.wikimedia.org/r/#/c/189596/ [22:30:58] Krinkle, ^ [22:33:19] yurikR1: I'd wait a minute for the timeout to kick in (I just aborted it) and try again once [22:33:27] to verify it isn't deterministic with that branch [22:34:04] Krinkle, but all the tests have passed right before that for the same patch [22:34:07] yurikR1: Please note you can do abort yourself in the future. Just go to the Jenkins build via the Zuul dashboard and abort the build if the build log is not doing anything for that more than 10 minutes. [22:34:18] Anyone with wmf ldap can do that. Should not involve contint admins. [22:34:29] Gate pipeline is different [22:36:44] Krinkle, tried logging in ... got 503: Request: GET http://integration.wikimedia.org/ci/loginError, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 1035342331 [22:36:45] Forwarded for: 66.108.170.120, 10.64.0.172 [22:36:45] Error: 503, Service Unavailable at Mon, 09 Feb 2015 22:36:10 GMT [22:37:13] *shrugs* [22:37:13] File a bug :) [22:37:21] works for me. 
[22:37:46] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail [22:40:36] (03PS1) 10Ori.livneh: vbench: make it easier to log to a file [puppet] - 10https://gerrit.wikimedia.org/r/189603 [22:40:38] (03PS1) 10Ori.livneh: role::ve: allow wikidev to (re-)start hhvm, xvfb and chromium [puppet] - 10https://gerrit.wikimedia.org/r/189604 [22:41:28] (03PS2) 10Ori.livneh: vbench: make it easier to log to a file [puppet] - 10https://gerrit.wikimedia.org/r/189603 [22:43:18] (03CR) 10Ori.livneh: [C: 032] vbench: make it easier to log to a file [puppet] - 10https://gerrit.wikimedia.org/r/189603 (owner: 10Ori.livneh) [22:45:06] robh: i would like to emails about tasks with the tag operations in phab what needs to happen for that ? [22:46:04] i dont use the email functionality at all [22:46:10] Request: GET http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=user.groups&only=styles&skin=minerva&target=mobile&user=MaxSem&version=20150204T144534Z&*, from 10.128.0.103 via cp4003 cp4003 ([10.128.0.103]:80), Varnish XID 2174280559
Forwarded for: [trim]
Error: 503, Service Unavailable at Mon, 09 Feb 2015 22:42:57 GMT [22:46:18] msot of the time it seems tractional data not content [22:46:41] matanya: but, i think you are looking for subscribe right? [22:46:48] You can subscribe to a given project and get updates [22:47:06] https://phabricator.wikimedia.org/project/view/29/ [22:47:09] or is it watch? [22:47:15] i dunno, what does it show for you as options there? [22:47:29] You wouldn't join, thats something else entirely. [22:47:41] robh: i meant subscribe [22:47:51] i see only edit or flag [22:47:56] huh [22:48:15] before you did the merge i would get an email about any ops-reuqest ot ops-core [22:48:19] but no longer [22:48:20] no watch? [22:48:27] nope [22:48:34] missed a lot of fun :) [22:48:51] so thats an odd one, since its a closed group we cannot add non opsen to it [22:49:12] but im not sure how someone gets updates on a project when they arent allowed to join [22:49:16] chasemp: ^ any ideas? [22:49:31] You can't watch a project that you can't join. [22:50:03] Even though that project can be related to all sorts of publicly visible objects. [22:50:26] yea, we may need to modify how we use operations then... 
[22:50:45] matanya: I'd recommend filing an operations phab task detailing what you used to have and why, if only so it sits there [22:50:50] (03PS1) 10Dzahn: admin: add group for benchmarking with chromium [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) [22:51:09] i honestly dunno what the answer is, but i think it will involve changing how we administer membership to the ops project [22:51:24] which has implications to tickets that may be set to operations view only and should instead migrate to nda [22:51:38] (just need to do some ticket searches, but im not willing to do that since i just got back from the store and im gonna eat food ;) [22:51:47] (03CR) 10jenkins-bot: [V: 04-1] admin: add group for benchmarking with chromium [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) (owner: 10Dzahn) [22:52:03] thanks robh i'll fill a ticket, sorry for causing trouble always :) [22:52:13] nah its a good question and i didnt think about it [22:52:30] better to tackle it now with the workflows in flux from the migration [22:52:40] than try to fix it in a month ocne everyone is used to a certain way [22:53:39] (03PS2) 10Dzahn: admin: add group for benchmarking with chromium [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) [22:54:27] (03CR) 10jenkins-bot: [V: 04-1] admin: add group for benchmarking with chromium [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) (owner: 10Dzahn) [22:54:31] (03CR) 10Andrew Bogott: [C: 031] admin: add group for benchmarking with chromium [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) (owner: 10Dzahn) [22:55:03] greg-g, thx to jenkins, the depl is taking a bit longer (( [22:55:22] andrewbogott: thanks, but jenkins still doesnt like.. 
hmm [22:55:24] as in i have been sitting waiting for it for the past 1.5 hrs ) [22:55:40] expected , but found '' [22:56:08] yurikR1: https://phabricator.wikimedia.org/T89050#1026297 [22:56:46] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:56:56] 3operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1026396 (10Matanya) 3NEW [22:57:03] yurikR1: what's the change? [22:57:09] andrewbogott: ah, it's a single space. yaml is strict [22:57:15] (03CR) 10Ori.livneh: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) (owner: 10Dzahn) [22:58:17] (03PS3) 10Dzahn: admin: add group for benchmarking with chromium [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) [23:00:10] (03CR) 10Dzahn: [C: 032] "adding the group now, access request comes next :)" [puppet] - 10https://gerrit.wikimedia.org/r/189606 (https://phabricator.wikimedia.org/T89038) (owner: 10Dzahn) [23:00:45] !log yurik Synchronized php-1.25wmf15/extensions/ZeroBanner: cherry-picking 189553 (duration: 00m 06s) [23:00:52] Logged the message, Master [23:01:07] !log yurik Synchronized php-1.25wmf16/extensions/ZeroBanner: cherry-picking 189553 (duration: 00m 06s) [23:01:10] Logged the message, Master [23:01:47] 3Ops-Access-Requests: Sudo for Roan on osmium - https://phabricator.wikimedia.org/T89038#1026448 (10Dzahn) i made a new admin group for this first https://gerrit.wikimedia.org/r/#/c/189606/ [23:07:43] (03PS1) 10Dzahn: add chromium-admins to visual editor role [puppet] - 10https://gerrit.wikimedia.org/r/189611 (https://phabricator.wikimedia.org/T89038) [23:08:18] (03PS2) 10Dzahn: add chromium-admins to visual editor role [puppet] - 10https://gerrit.wikimedia.org/r/189611 (https://phabricator.wikimedia.org/T89038) [23:08:52] ori: ^ and that would be the 
actual access request. it's the first time i do this since it's hiera though [23:08:58] https://gerrit.wikimedia.org/r/#/c/189611/2/hieradata/role/common/ve.yaml [23:09:11] i _think_ that's how you do it now [23:09:13] 3operations, Citoid, Services: Give mvolz access to sha machine i.e. http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1026471 (10Mvolz) 3NEW [23:10:17] mutante: ve is not a cluster, though [23:11:16] 3Ops-Access-Requests: Sudo for Roan on osmium - https://phabricator.wikimedia.org/T89038#1026483 (10Dzahn) and the change in hieradata above is the actual access request using the new group please double check me though, it's the first one i do since we moved this to hieradata instead of including the admin gro... [23:11:43] ori: is cluster mandatory? [23:12:36] copied this from ocg, which used cluster: pdf [23:13:03] 3operations, Citoid, Services: Give mvolz access to sha machine i.e. http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1026499 (10Mvolz) [23:14:59] 3Ops-Access-Requests, operations, Citoid, Services: Give mvolz access to sha machine i.e. 
http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1026516 (10Krenair) [23:15:08] 3operations, Citoid, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1026518 (10Mvolz) [23:15:17] mutante: you can (and should) take out cluster [23:15:21] mutante: it lgtm otherwise [23:15:30] ori: 'k, thanks [23:17:36] (03PS3) 10Dzahn: add chromium-admins to visual editor role [puppet] - 10https://gerrit.wikimedia.org/r/189611 (https://phabricator.wikimedia.org/T89038) [23:17:54] (03CR) 10Ori.livneh: [C: 031] add chromium-admins to visual editor role [puppet] - 10https://gerrit.wikimedia.org/r/189611 (https://phabricator.wikimedia.org/T89038) (owner: 10Dzahn) [23:19:23] <_joe_> mutante: ve is in misc [23:19:32] <_joe_> cluster relates to nagios groups and ganglia [23:19:57] _joe_: his latest patch doesn't have cluster, which is correct imo [23:20:15] <_joe_> ori: yes it is [23:20:31] (03PS1) 10BBlack: reload nginx once a day on protoproxies [puppet] - 10https://gerrit.wikimedia.org/r/189613 [23:20:53] ah,it makes the groups for icinga.. yes, should have known from before, when it wasnt in hiera [23:21:12] _joe_: it doesn't need a three-day waiting period, right? it's not an access request, really [23:22:08] <_joe_> ori: uhm no idea, it seems like a change of permissions but I'm pretty involved in something else and I had no time to read the policy [23:22:09] i separated it on purpose into making the group and then using it [23:22:23] so that at least it's smaller :p [23:23:24] andrewbogott: do you know? [23:24:29] there's no way that restarting chromium on a private testing machine is an issue, imo [23:24:46] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR [23:25:01] It’s definitely not an issue, although in theory it is a change of access so probably subject to the policy. 
[23:25:05] lemme find the policy :) [23:26:19] yeah, policy mostly describes getting shell access, not slightly-modifying-existing-shell-access. [23:26:33] So I’d say it’s fine, but Robh is the Giver of Laws in this case [23:26:52] wha? [23:27:01] say no, quickly! [23:27:30] So im not the giver of laws, this shit was decided in the ops meetings [23:27:32] i just documented it [23:27:48] told you :) [23:27:50] its really well documented. [23:27:51] ok, then, the transcriber of laws [23:28:00] https://wikitech.wikimedia.org/wiki/Operations_requests [23:28:06] https://wikitech.wikimedia.org/wiki/Requesting_shell_access [23:28:15] 'Escalating Existing Shell Access' section [23:28:37] i dunno even know what your question is for the record ;D [23:28:50] robh: and yet you answered it [23:29:01] because my docs are that fucking good! [23:29:12] ori: so, pretty clear that the three-day policy still applies, sorry [23:30:11] now, there isnt a section called 'unwritten policy' but if there was it would be 'if you want to break any of these, you have to have the director of operations sign off' [23:30:32] but since mark is on vacation, i guess you could get faidon to sign off as acting director of ops during mark's vacation [23:31:03] the policy is intentionally strict since everyone always thinks they should get an exception (opsen included!) [23:31:13] (we aren't innocent of that shit at all) [23:31:52] so a lot of requests then required the ops clinic person to track down and annoy mark about items that really werent worth him getting involved in, and due to the time zone changes and what not usually only shaved a single day off the request ;D [23:33:24] this particular conversation i dont mind having since these things were decided in an actual ops meeting, so i dont have to take personal heat for them [23:33:26] greg-g, is there enough time for scap? 
[23:34:04] andrewbogott: there is the note that if its a correction of a misapplied permission that it doesn't need the 3 day wait [23:34:18] like user X asks for access to analytics, and then someone forgets to include bastion [23:34:34] yeah, this isn’t exactly that [23:34:49] yurikR1: should be [23:35:06] greg-g, ok, will commit https://gerrit.wikimedia.org/r/#/c/189617 and scap [23:35:14] (sync) [23:35:25] also, in cases where a user feels the 3 day is counter-productive, i urge them to note such on the task. the policy is just something we have created to try to work, doesn't mean it won't need correction in future [23:35:42] its been tweaked a lot in the past, i expect it to continue to change. [23:36:06] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [23:36:07] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. 
[23:36:14] ugh [23:36:56] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 0, number_of_data_nodes: 3 [23:38:06] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 133, initializing_shards: 1, number_of_data_nodes: 3 [23:38:34] robh, btw, re server naming: I'm fine with dsc* or ssd* [23:38:51] both describe the hardware, not the role [23:40:00] generally don't care too much though, as long as it's not actively misleading [23:42:15] (03PS1) 10Ori.livneh: vbench: use 24-hr clock in timestamps [puppet] - 10https://gerrit.wikimedia.org/r/189622 [23:42:30] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: use 24-hr clock in timestamps [puppet] - 10https://gerrit.wikimedia.org/r/189622 (owner: 10Ori.livneh) [23:42:34] csteipp: re private wikis in *.wikimedia.org : wouldn't make more sense to move those to a sub domain, e.g *.private.wikimedia.org and exclude that ? [23:43:26] 3operations, Release-Engineering, WMF-Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1026635 (10Technical13) >>! In T76560#847830, @Nirzar wrote: > Here's [[ http://nirzar.github.io/prototypes/error-pages/error-template.html | final template ]] for the error page. > > - It's mobile o... [23:43:52] Wasn't that in -mobile, matanya? [23:44:12] matanya: It would make my life easier, but there are lots of them already setup, so I guess I have little hope we could change them all any time soon. 
[23:44:27] we'd also need *.private.wikipedia.org I think [23:44:43] Krenair: i never know in which channel i am in such hours. [23:44:52] (03PS2) 10Dzahn: add Cyrillic project domain names [dns] - 10https://gerrit.wikimedia.org/r/189102 (https://phabricator.wikimedia.org/T88722) [23:45:05] csteipp: when there is a hope there is a way [23:45:37] i hope we would not use wikipedia domain for private wikis [23:45:43] only wikimedia [23:45:47] We already do. [23:45:52] for wg_enwiki, arbcom_*wiki, etc. [23:45:53] we should move it :p [23:45:54] arbcoms are privaet [23:45:57] and they are wikipedia [23:46:01] mutante, why? [23:46:01] nah makes sense on wikipedia for that [23:46:03] atleast create new ones there, and start migrating slowly [23:46:14] arbcom is project specific [23:46:18] and language specific [23:46:30] its actually the perfect example of a private wiki that belongs in wikipedia.org [23:46:36] They'd become $lang-wikipedia-arbcom.private.wikimedia.org? [23:46:36] that's horrible [23:46:43] more trouble created by me, i should really go to sleep [23:46:53] matanya: well played [23:47:09] not even intentional this time [23:47:18] is convinced easily. ok. [23:47:21] just a bug report escelation :) [23:47:33] mutante: i think i get where you are coming from though [23:47:43] in that a closed wiki is not in the general spirit of wikipedia like projects [23:48:10] and if there were wikimedia specific things squatting there, it would stink. [23:48:25] csteipp: is there a list of such wikis ? [23:48:34] private wikis? [23:48:37] this is very similar conversation to the chapters formatting fqdn issue [23:48:43] private+wikipedia wikis? 
[23:49:09] keep in mind SNI star cert issues with creating subdomains [23:49:16] (they only cover one level of star) [23:49:22] robh: yea, i still think sub-chapters within the US should just use wikimedia.us :) [23:49:24] English Wikinews actually has an arbcom, I'm sure they could get a wiki if they asked nicely [23:49:49] mutante: and then have a landing page for like countrychapter.wikimedia.org to all the us ones? [23:49:56] matanya: private.dblist? [23:50:03] thanks [23:50:04] i like that idea, good luck getting it implemented without pissing off ppl ;D [23:50:15] let's move everything under wikimedia.org [23:50:17] since most chapters have countrycode.wikimedia.org right? [23:50:23] m.arbcom.en.wikipedia.wikimedia.org [23:50:32] :D [23:50:46] ^ and then ask for https cert error [23:51:14] robh, I'd say chapters prefer having their own wikimedia.countrycode :P [23:51:15] we'll just get a few new subdomains *.*.*.wikimedia.org [23:51:18] what.dafuq.is.this.internet.org [23:51:32] Platonides: yea, and the country code for the US is .us :) [23:51:44] greg-g, et al: SCAP time! [23:51:58] *.*.*.*.wikimedia.org [23:52:09] how much can it be :) [23:52:20] now go configure a star cert for that [23:52:31] mutante, for cc != 'us', USA is special with its "subchapters" [23:52:44] it might be easier to setup our own CA and get a Wikipedia root it into browsers /me hides [23:52:53] must say it looks like you are using bad words in *.*.*.*.wikimedia.org [23:52:55] so who's going to swat today? [23:53:26] Krenair: want to do it? 
:) [23:53:39] Platonides: yea, my point was once the subchapters should be city.wikimedia.us [23:53:40] if nobody else wants to, I can [23:53:55] I bet RoanKattouw or ebernhardson can help [23:54:00] mutante: not all chapters are cities [23:54:06] !log yurik Synchronized php-1.25wmf16/extensions/ZeroBanner: cherry-picking 189617 (duration: 00m 07s) [23:54:11] Logged the message, Master [23:54:12] matanya: argg, i know, New England :p [23:54:19] !log yurik Synchronized php-1.25wmf15/extensions/ZeroBanner: cherry-picking 189617 (duration: 00m 05s) [23:54:21] Logged the message, Master [23:54:27] * RoanKattouw waves [23:54:36] matanya: need ISO names :) [23:54:37] I have stuff in the SWAT, so I'd be happy to do it [23:54:51] Or help someone do it [23:55:02] RoanKattouw, core submodule update for https://gerrit.wikimedia.org/r/#/c/189144/ ? :P [23:55:10] !log yurik Started scap: syncing ZeroBanner i18n [23:55:14] Krenair: Will build one [23:55:15] Logged the message, Master [23:55:30] mutante: i.e. m.arbcom.EN_us.wikimedia.org ? [23:55:36] that is even worse [23:56:26] stupid m. domains, why can't they just be the same? [23:56:47] RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1026656 (Andrew) James and Marco's access is now merged. Best to open a new ticket for Stas once the use case is determined. [23:57:03] (CR) Dzahn: "fine, let's use lower case" [dns] - https://gerrit.wikimedia.org/r/189102 (https://phabricator.wikimedia.org/T88722) (owner: Dzahn) [23:57:08] greg-g, i just realized that there is a swat coming up in 5 min ((( If i ctrl+C scap, will the cluster survive? 
:) [23:57:16] yes, but don't [23:57:28] matanya: lol, yea, EN_us sounds good [23:57:45] yurikR1: I thought you were ready to run scap when you asked :/ [23:58:16] greg-g: yes, WHY, the only exception is wikitech , and i abandoned the attempt to generate them [23:58:20] i was - but then was waiting for Jenkins again because i didn't want to +2 core changes unless ready to depl [23:58:31] * greg-g nods [23:58:32] RESTBase, Ops-Access-Requests, Services: Access to the Cassandra / RESTBase test cluster for Stas, Marko and James - https://phabricator.wikimedia.org/T85492#1026665 (Andrew) Open>Resolved [23:58:41] my bad, I should have had you wait [23:58:52] (to get the other changes in there pre-scap) [23:58:55] swat changes, that is [23:59:09] greg-g, i can Ctrl+C, just don't know the implications [23:59:35] where is it right now? what step? [23:59:46] Updating LocalisationCache for 1.25wmf15 using 4 thread(s) [23:59:52] tgr, ebernhardson: around for swat?