[00:07:12] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [00:17:35] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1483807 (10RobH) [00:19:49] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1483813 (10RobH) [00:36:27] (03PS1) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [00:38:45] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1483822 (10Philippe-WMF) Just noting at https://meta.wikimedia.org/wiki/Wikimedia_Pennsylvania has the pa.wikimedia address listed...I'l... [00:42:26] (03PS2) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [00:43:24] !log Depooled Precise image scalers (mw1159 and mw1160); watching for errors. [00:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:03:41] (03PS3) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [01:04:55] I see some errors. Not re-pooling just yet; doing some debugging first. [01:14:34] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1483825 (10Springle) +1 to the provisioning. Also +1 to Tim's plan, once we get the hardware. [01:14:55] (03CR) 10Alex Monk: "This commit exposes both of those bugs, but they are actually separate." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088) (owner: 10Alex Monk) [01:18:42] !log Re-pooling mw1159 and mw1160; ran out of time for debugging. 
[01:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:26:54] (03PS10) 10BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) [01:27:10] (03PS4) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [01:28:06] (03PS5) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [02:03:05] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-27 02:03:04+00:00 [02:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:07:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 27 02:07:15 UTC 2015 (duration 7m 14s) [02:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:08:55] (03PS6) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [02:11:53] (03PS7) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [02:22:55] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 07m 20s) [02:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:27:01] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-27 02:27:00+00:00 [02:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:42:49] (03PS8) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [03:39:21] !log upgrade & restart dbstore1002 [03:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:43:53] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (101028s 100000s) [04:21:13] PROBLEM - Restbase 
endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:23:03] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [04:27:43] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (8439 100000s) [04:43:14] (03PS9) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [04:44:01] (03CR) 10jenkins-bot: [V: 04-1] labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [04:45:14] (03PS10) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [04:54:02] (03PS11) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [04:59:13] PROBLEM - Cassandra database on restbase1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [05:00:03] PROBLEM - Cassanda CQL query interface on restbase1003 is CRITICAL: Connection refused [05:00:15] !log restarted cassandra on restbase1003 [05:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:01:22] RECOVERY - Cassandra database on restbase1003 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [05:02:02] RECOVERY - Cassanda CQL query interface on restbase1003 is OK: TCP OK - 0.010 second response time on port 9042 [05:07:05] (03PS12) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [05:15:42] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [05:16:13] PROBLEM - Cassanda CQL query interface on restbase1005 is CRITICAL: Connection refused [05:17:03] PROBLEM - Restbase endpoints health on 
restbase1006 is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from Parsoid returned the unexpected status 500 (expecting: 200): /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/r [05:17:42] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [05:18:13] RECOVERY - Cassanda CQL query interface on restbase1005 is OK: TCP OK - 0.009 second response time on port 9042 [05:19:12] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [05:19:53] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [05:21:53] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.003 second response time on port 9042 [05:47:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 27 05:47:31 UTC 2015 (duration 47m 30s) [05:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:33] PROBLEM - puppet last run on lvs1003 is CRITICAL Puppet has 1 failures [06:31:30] <_joe_> uh rb trouble [06:31:34] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures [06:32:22] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 2 failures [06:32:33] PROBLEM - puppet last run on db2044 is CRITICAL Puppet has 2 failures [06:32:33] PROBLEM - puppet last run on db2060 is CRITICAL Puppet has 1 failures [06:32:44] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 3 failures [06:32:53] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:32:53] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 2 failures [06:32:53] PROBLEM - puppet last run on mw2052 is CRITICAL Puppet has 1 failures [06:34:04] PROBLEM - 
puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:56:32] RECOVERY - puppet last run on db2044 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:56:32] RECOVERY - puppet last run on db2060 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:56:33] RECOVERY - puppet last run on lvs1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:44] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:56:52] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:33] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:58:23] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:53] RECOVERY - puppet last run on mw2052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:53] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:28] <_joe_> !log upgrading hhvm to the latest package across the cluster [06:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:12:44] (03PS3) 10Giuseppe Lavagetto: confctl: fix warning on regexes [software/conftool] - 10https://gerrit.wikimedia.org/r/226682 [07:14:32] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1484033 (10MoritzMuehlenhoff) We can surely add them for now so that you can experiment/test with it, I can do that this week. 
[07:14:33] (03CR) 10Giuseppe Lavagetto: [C: 032] confctl: fix warning on regexes [software/conftool] - 10https://gerrit.wikimedia.org/r/226682 (owner: 10Giuseppe Lavagetto) [07:16:58] (03PS2) 10Giuseppe Lavagetto: confctl: don't create inexistent entities [software/conftool] - 10https://gerrit.wikimedia.org/r/226683 (https://phabricator.wikimedia.org/T104574) [07:20:18] 6operations: Conftool and etcd should represent boolean values as booleans, not 'yes' / 'no' - https://phabricator.wikimedia.org/T106738#1484041 (10Joe) The reason why we used three values there ("yes", "no", "inactive") is to reproduce the state in which servers can be now in pybal: 1) Present and enabled 2) P... [07:22:14] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1484047 (10Joe) Making the proper package work on trusty is a real pain, I suggest building a lame binary package for that if we want to. For now clos... [07:22:24] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978#1484050 (10Joe) [07:22:27] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Package a modern version of etcd for jessie, trusty - https://phabricator.wikimedia.org/T97970#1484048 (10Joe) 5Open>3Resolved [07:23:28] (03PS2) 10Giuseppe Lavagetto: debian: fix email in changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/226909 (owner: 10Hashar) [07:23:53] (03CR) 10Giuseppe Lavagetto: [C: 032] "meh, thanks again hashar!" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/226909 (owner: 10Hashar) [07:24:15] (03CR) 10Giuseppe Lavagetto: [C: 032] confctl: don't create inexistent entities [software/conftool] - 10https://gerrit.wikimedia.org/r/226683 (https://phabricator.wikimedia.org/T104574) (owner: 10Giuseppe Lavagetto) [07:29:58] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1484075 (10MoritzMuehlenhoff) There's an upstream pull request, but it's unmerged for about a year, apparently it's slated for a future 2.0 release, so I wouldn't hold my breat... [07:34:31] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1484085 (10Joe) Instead of downgrading our security and/or do shady hacks via hiera, let's assume paramiko is broken (it is, in many ways!) and let's move on. [07:35:04] 7Puppet, 6operations, 6Discovery, 10Wikidata, and 2 others: Make a puppet role that sets up a query service and loads it - https://phabricator.wikimedia.org/T95679#1484087 (10Joe) [07:35:06] 10Ops-Access-Requests, 6operations, 6Discovery, 10Wikidata, and 3 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1484086 (10Joe) 5Open>3Resolved [07:38:51] 6operations, 7HHVM, 5Patch-For-Review: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1484088 (10Joe) All appservers upgraded. 
I'm resolving the ticket now, you can still reopen it later [07:39:42] 6operations, 5Patch-For-Review, 7discovery-system: Ensure alerts and notifications on confd failure modes - https://phabricator.wikimedia.org/T103360#1484091 (10Joe) 5Open>3Resolved [07:54:47] !log installed java security updates on xenon, cerium, praseodymium, maps-test* [07:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:03:21] Gutten Tag [08:04:38] _joe_: had a HHVM related question for you. You pushed new packages for hhvm hhvm-dev , I am wondering whether our hhvm extensions are back compatible (like hhvm-fss , hhvm-luasandbox ) etc [08:04:42] https://phabricator.wikimedia.org/T106699 [08:06:18] <_joe_> hashar: the point is that our HHVM package has [08:06:24] <_joe_> Provides: hhvm-api-20150212 [08:06:31] which is the dependency? [08:07:42] <_joe_> yes [08:07:52] <_joe_> this is the abi version, sort-of [08:08:04] <_joe_> we use it to indicate compatibility with extensions [08:08:36] <_joe_> extensions then depend on the version of that they've been built against [08:09:58] that is smart ! 
[08:15:55] hashar: PHP uses the same method in Debian; PHP addons depend on phpapi-DATE instead [08:17:47] I was wondering whether the extensions needed to be recompiled [08:18:45] <_joe_> moritzm: yeah I copied it from the php behaviour [08:24:23] 6operations, 7HHVM, 5Patch-For-Review: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1484133 (10hashar) [08:24:25] hashar_: extensions only need to be recompiled if the HHVM extension API has changed (as displayed in hhvm --version) [08:24:33] I have upgraded hhvm on the Jenkins slaves :) [08:25:13] I really need to have a course / training about system development and C [08:28:26] _joe_: and you can forget etcd for Trusty since the Jenkins jobs now run on Jessie :-} ( was https://phabricator.wikimedia.org/T97970 ) [08:29:01] <_joe_> hashar: you were not the only one requesting it :) [08:29:57] * hashar points folk at Jessie [08:50:54] 6operations: High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1484155 (10fgiunchedi) the increase seem to jump at 20150723-18:00 according to logstash: {F272666} also a jump in redis clients: {F272670} that seems to coincide with 1.25wmf16 ? ``` 23:02 logmsgbot:... [08:54:09] paravoid _joe_ ori ^ seems related to a deployment to me [08:58:59] (03PS1) 10Filippo Giunchedi: memcached: logrotate only csv files [puppet] - 10https://gerrit.wikimedia.org/r/227191 [09:01:16] _joe_: Jenkins is not allowed to submit change on operations/software/conftool so a CR+2 / successful tests do not get the change merged in (ex: https://gerrit.wikimedia.org/r/#/c/226909/ ) [09:01:29] _joe_: should I grant authorization for submit right ? 
[09:01:48] the default for operations/* repos is that Jenkins can't submit [09:06:29] <_joe_> hashar: yes [09:06:39] <_joe_> please auth that [09:07:49] (03CR) 10Hashar: [C: 032] debian: fix email in changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/226909 (owner: 10Hashar) [09:08:32] (03Merged) 10jenkins-bot: debian: fix email in changelog [software/conftool] - 10https://gerrit.wikimedia.org/r/226909 (owner: 10Hashar) [09:08:41] fixed! [09:09:00] !log Allowed JenkinsBot to submit changes on operations/software/conftool for CI purposes. [09:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:24:05] (03PS1) 10Muehlenhoff: Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) [09:24:47] !log reimage restbase1007, new disks installed [09:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:26:09] https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#Corrupted_or_infected_files [09:26:27] this may not be true, but FYI [09:28:03] RECOVERY - SSH on restbase1007 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [09:29:27] yannf: you probably want to fill it at phabricator.wikimedia.org [09:29:33] yannf: note the image is 28 128 × 23 334 pixels ! [09:29:44] yes, I saw that [09:30:32] it's probably phoned [09:31:50] i.e. bogus [09:32:50] I mean phony ;) [09:39:38] 6operations: High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1484299 (10jcrespo) I am going to depool db1035, it happens around the same time and it is giving me connections errors too. Jobs (redis) on that host are failing me. [09:42:35] jynus: not sure I got the last part about the jobs on that host failing? 
[09:43:32] well, jobs execute on mws [09:43:51] but the ones using db1035 seem to be failing more frequently [09:44:08] I want to discard those, and anyway, it is behaving badly [09:44:49] I do not see a direct connection between redis and db [09:45:15] but I see an indirect connection, maybe because number of connections on the same host? [09:45:29] let me try this, and I will have more info [09:46:54] sure [09:47:34] this will clarify more: https://phabricator.wikimedia.org/P1073 [09:48:18] <_joe_> the sessions redises and the jobqueue redises are separated [09:49:12] ok, in any case, independently of the new ticket, it is something I have to do [09:49:37] I supposed higher levels of errors for a while [09:50:00] but it's been 2 days and they are lower but still happening [09:50:48] I think there is something fishy with connection handling [09:51:31] so there is still a possibility of too many connections on mw being the cause (even if they are physically separated) [09:53:23] RECOVERY - dhclient process on restbase1007 is OK: PROCS OK: 0 processes with command name dhclient [09:53:59] another option is that they are the cause and not the consequence, but only db1035 is consistently failing, so that's strange [09:54:03] RECOVERY - configured eth on restbase1007 is OK - interfaces up [09:54:18] !log reimage restbase1009, new disks [09:54:23] RECOVERY - DPKG on restbase1007 is OK: All packages OK [09:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:54:33] RECOVERY - Disk space on restbase1007 is OK: DISK OK [09:54:53] RECOVERY - RAID on restbase1007 is OK Active: 6, Working: 6, Failed: 0, Spare: 0 [09:54:58] (03PS1) 10Jcrespo: Depool db1035 as we believe is the cause of disconnections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227193 [09:56:23] RECOVERY - puppet last run on restbase1007 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:56:30] ^feel free to disagree [09:57:22] RECOVERY - 
Host restbase1009 is UPING OK - Packet loss = 0%, RTA = 3.40 ms [10:02:03] (03CR) 10Jcrespo: [C: 032] Depool db1035 as we believe is the cause of disconnections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227193 (owner: 10Jcrespo) [10:04:36] !log jynus Synchronized wmf-config/db-eqiad.php: Depool db1035 (duration: 00m 12s) [10:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:06:44] so, there are 3 possible outcomes: the errors now go to db1044 (regardless of whether it is related or not), the errors disappear from the db but not redis (not related), or the errors disappear from both (related) [10:11:22] (03PS1) 10Muehlenhoff: Add ferm rules for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/227197 (https://phabricator.wikimedia.org/T104939) [10:12:19] 7~/win 22 [10:12:21] er :) [10:15:32] PROBLEM - Host restbase1009 is DOWN: PING CRITICAL - Packet loss = 100% [10:24:13] RECOVERY - Host restbase1009 is UPING OK - Packet loss = 0%, RTA = 0.29 ms [10:25:43] RECOVERY - salt-minion processes on restbase1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:31:02] PROBLEM - RAID on db1059 is CRITICAL 1 failed LD(s) (Degraded) [10:31:23] arg [10:31:32] at least it is >1050 [10:35:04] so, the redis/mc thing. It is too early to say, but I think it fixes the db job problem but not redis session problem [10:38:47] 6operations: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1484465 (10fgiunchedi) I agree that's confusing, though I'm not sure if `mwscript` (part of scap) is of any real use on `tin` other than convenience? 
(cc @bd808 @ori @greg) [10:39:19] jynus: ack, thanks [10:39:38] will update the ticket when I have more data, ok [10:43:04] PROBLEM - Host restbase1009 is DOWN: PING CRITICAL - Packet loss = 100% [10:43:53] 6operations, 10ops-eqiad: db1059 raid degraded - https://phabricator.wikimedia.org/T107024#1484468 (10jcrespo) 3NEW [10:45:37] ACKNOWLEDGEMENT - RAID on db1059 is CRITICAL 1 failed LD(s) (Degraded) Jcrespo T107024 [10:47:31] (03PS1) 10Faidon Liambotis: Remove custom fact ec2id (2nd try), unused [puppet] - 10https://gerrit.wikimedia.org/r/227201 [10:53:03] RECOVERY - Host restbase1009 is UPING OK - Packet loss = 0%, RTA = 6.28 ms [10:58:42] RECOVERY - RAID on restbase1009 is OK Active: 6, Working: 6, Failed: 0, Spare: 0 [11:03:02] (03PS2) 10Giuseppe Lavagetto: mediawiki: catch thumb_handler.php to HHVM as well [puppet] - 10https://gerrit.wikimedia.org/r/227000 [11:03:10] <_joe_> godog, paravoid ^^ [11:03:30] <_joe_> if you want to take a look since you were involved in the revert this weekend [11:03:32] (03PS2) 10Filippo Giunchedi: Also add srijan to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/226941 (https://phabricator.wikimedia.org/T106407) (owner: 10Alex Monk) [11:03:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Also add srijan to bastiononly [puppet] - 10https://gerrit.wikimedia.org/r/226941 (https://phabricator.wikimedia.org/T106407) (owner: 10Alex Monk) [11:05:03] I don't like making a special exception like that [11:05:20] especially doing it specifically on the imagescalers [11:05:30] what happens if someone decided to rename thumb_handler to thumbhandler for example? [11:05:38] or move it around or whatever [11:05:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1484542 (10fgiunchedi) >>! In T106407#1481526, @srijan wrote: > Hi! > I am not able to login to stat1003. Here is what I am getting: > $ ssh srijan@stat1003.eqiad.w... 
[11:07:28] <_joe_> paravoid: well, if you have better suggestions. I don't like having multiple, varying entrypoints for our application either [11:07:42] what about what ori proposed on that changeset? [11:08:00] <_joe_> where? [11:08:14] there's a review by ori there [11:08:16] https://gerrit.wikimedia.org/r/227000 [11:08:18] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: NRPE: Command check_endpoints_restbase not defined [11:08:19] <_joe_> no I think that's a terrible idea [11:08:26] <_joe_> like, the worst thing we could do [11:08:37] PROBLEM - Restbase root url on restbase1009 is CRITICAL: Connection refused [11:09:02] <_joe_> if we ever have any static asset that includes either ".php" or ".hh" somewhere in the name or the path [11:09:18] <_joe_> it would end up being interpreted by hhvm [11:09:34] <_joe_> I don't even know if we have such corner-cases either [11:09:51] <_joe_> we probably have [11:10:06] .php[$/] ? :) [11:10:50] <_joe_> I would prefer to make a specific exception for the only documented case where we have the php file embedded in the url [11:11:00] that we know of :) [11:11:09] <_joe_> no I'm pretty confident this is it [11:11:21] <_joe_> on the normal appserver we use mpm_worker [11:11:31] <_joe_> so we don't have zend's mod_php anymore [11:12:14] <_joe_> I didn't switch the scalers as given how they work, prefork is easier to "control" [11:13:39] <_joe_> my point is that my patchset has no potential side effects, but I guess we can hope .php[$/] could work too [11:14:36] <_joe_> the correct way of doing this, btw, would be to do something a bit different, but I don't have the time for thorough testing of a "really correct solution" right now [11:15:44] <_joe_> for the record, that would be doing something like and [11:16:03] <_joe_> instead of all the proxypasses we set there [11:16:10] <_joe_> sorry, bbiab [11:18:17] RECOVERY - puppet last run on restbase1009 is OK Puppet is currently enabled, last run 26 seconds ago 
with 0 failures [11:20:35] I'm leaving running some caching/opening table process on db1035 while it is depooled [11:37:47] (03CR) 10Giuseppe Lavagetto: "Simply removing the dollar sign from the catchall would expose us to all kind of potential unintended consequences. I'd avoid that if poss" [puppet] - 10https://gerrit.wikimedia.org/r/227000 (owner: 10Giuseppe Lavagetto) [11:39:15] Where are the Labs ops? [11:39:32] asleep? [11:39:34] what's up? [11:40:49] We changed some MediaWiki configuration file to be less Wikipedia-centric and it broke a meta-table that tools rely on: https://phabricator.wikimedia.org/T106897 [11:42:27] I was hoping to catch Yuvi or Marc, but maybe someone else can review and deploy https://gerrit.wikimedia.org/r/226939 ? [11:42:54] not sure what they could do about that [11:43:48] I thought they maintained maintain-replicas. [11:44:09] jynus ^^ labsdb related thing(s) above [11:45:29] i'll ask them to look into it when they show up [11:45:45] All right, thanks. [11:50:58] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1484607 (10fgiunchedi) I've reimaged 1007 and 1009, currently running `stressdisk` on `/var/tmp` [11:55:33] (03CR) 10MZMcBride: mediawiki: catch thumb_handler.php to HHVM as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227000 (owner: 10Giuseppe Lavagetto) [12:07:45] !log deployed https://gerrit.wikimedia.org/r/#/c/227205/ and restarted apache2 on iridium [12:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:20] why did apache2 need a restart? 
[12:17:48] JohnFLewis, I can sync the table, but if it is the script that broke it (I do not know, but that is what the ticket suggests) it will break again [12:18:18] I was waiting for confirmation [12:18:40] (03PS2) 10Faidon Liambotis: Enable ferm on lead [puppet] - 10https://gerrit.wikimedia.org/r/226707 (https://phabricator.wikimedia.org/T104979) (owner: 10Muehlenhoff) [12:18:53] (03CR) 10Faidon Liambotis: [C: 032] Enable ferm on lead [puppet] - 10https://gerrit.wikimedia.org/r/226707 (https://phabricator.wikimedia.org/T104979) (owner: 10Muehlenhoff) [12:23:30] jynus: I think it's fallout from a change but that script did cause it is my understanding [12:24:27] do you mean that if it is run again, it will not break it, but it has to be fixed manually? [12:24:54] (03PS2) 10Faidon Liambotis: Enable ferm for polonium [puppet] - 10https://gerrit.wikimedia.org/r/226708 (https://phabricator.wikimedia.org/T104979) (owner: 10Muehlenhoff) [12:25:08] or you do not know (as I do) [12:26:30] (03CR) 10Faidon Liambotis: [C: 032] Enable ferm for polonium [puppet] - 10https://gerrit.wikimedia.org/r/226708 (https://phabricator.wikimedia.org/T104979) (owner: 10Muehlenhoff) [12:27:18] PROBLEM - DPKG on lead is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:28:49] as mark said, I want input from the script owners before doing something that could make things worse [12:29:36] (03PS1) 10Jcrespo: Repool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227206 [12:30:08] jynus: the script itself from what I can tell us the issue but there is a patch. 
Katie can tell you more [12:30:58] And yeah, best wait but thought I'll ping you since its databases :) [12:32:30] RECOVERY - DPKG on lead is OK: All packages OK [12:36:07] (03CR) 10Jcrespo: [C: 032] Repool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227206 (owner: 10Jcrespo) [12:36:25] it is on my alley, but I need to coordinate with application level almost always or problems will happen again [12:38:31] PROBLEM - DPKG on polonium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:30] RECOVERY - DPKG on polonium is OK: All packages OK [12:47:20] (03PS2) 10BBlack: Remove wap and mobile subdomains [dns] - 10https://gerrit.wikimedia.org/r/223972 (https://phabricator.wikimedia.org/T104942) [12:47:48] (03PS1) 10Hashar: Revert "Remove w/COPYING and w/CREDITS dead symlinks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227210 (https://phabricator.wikimedia.org/T107007) [12:48:05] (03PS2) 10Hashar: Revert "Remove w/COPYING and w/CREDITS dead symlinks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227210 (https://phabricator.wikimedia.org/T107007) [12:48:10] (03PS3) 10Hashar: Revert "Remove w/COPYING and w/CREDITS dead symlinks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227210 (https://phabricator.wikimedia.org/T107007) [12:48:34] 6operations: High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1484639 (10jcrespo) I depooled db1035 at 10:07 and it solved the database issues, but I think it is unrelated to this issue, my apologies: ``` $ grep -c '2015-07-27 07:' redis.log 0 $ grep -c '2015-07-27... [12:48:59] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1484640 (10BBlack) wikitech-l thread seems in agreement as well, so moving forward on this. 
Thanks for looking into it everyone :) [12:49:34] (03CR) 10BBlack: [C: 032] Remove wap and mobile subdomains [dns] - 10https://gerrit.wikimedia.org/r/223972 (https://phabricator.wikimedia.org/T104942) (owner: 10BBlack) [12:51:05] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1484644 (10BBlack) [12:51:06] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1484645 (10BBlack) [12:51:08] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1484642 (10BBlack) 5Open>3Resolved a:3BBlack [12:51:09] !log jynus Synchronized wmf-config/db-eqiad.php: Repool db1035 after maintenance (duration: 00m 12s) [12:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:31] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1484650 (10BBlack) The pa.us issue is in wiki**M**edia.org, which has a separate ticket here: T102826 This ticket is just tracking wiki**... [13:03:53] (03PS1) 10BBlack: Remove multi-level subdomains from wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/227214 (https://phabricator.wikimedia.org/T102814) [13:05:04] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1484669 (10BBlack) ^ will merge the above tomorrow (Tues). 
[13:06:08] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1484671 (10BBlack) [13:06:22] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1374642 (10BBlack) [13:06:24] 6operations, 10Traffic: Fix/decom multiple-subdomain wikis in wikimedia.org - https://phabricator.wikimedia.org/T102826#1484672 (10BBlack) [13:06:59] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1484675 (10BBlack) [13:07:00] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1374553 (10BBlack) [13:07:12] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1484676 (10BBlack) [13:07:14] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1432996 (10BBlack) [13:13:45] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1484687 (10BBlack) [13:13:46] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1484686 (10BBlack) [13:13:57] 6operations: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1484692 (10Legoktm) `mscript` needs to be installed on tin for scripts like `updateinterwikicache` to work. 
[13:14:53] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1484694 (10Andrew) As per https://phabricator.wikimedia.org/T105723, we now have hot spares for all vital labs services /except/ internal DNS. The associated task https://pha... [13:15:09] (03CR) 10Hashar: "Maybe we can add some tests to prevent deletion of files under /w/ :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227210 (https://phabricator.wikimedia.org/T107007) (owner: 10Hashar) [13:16:32] (03PS1) 10Muehlenhoff: WIP/RfC: Allow multiple/dynamic range of ports for ferm services [puppet] - 10https://gerrit.wikimedia.org/r/227216 (https://phabricator.wikimedia.org/T104981) [13:19:12] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1484699 (10mark) It's a bit unclear to me what lives where now, and what the plan for this is. Also serving our documentation, could you make a simple map of what essential ma... [13:20:22] !log powering down logstash1003 to relocate to rack d3 [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:48] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1484705 (10mark) Is the missing link light on the card the only indication the card isn't working? [13:29:15] (03PS1) 10Muehlenhoff: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227218 (https://phabricator.wikimedia.org/T104981) [13:33:11] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1484712 (10faidon) I turned the port up on labnet1002 (the interface is named "rename6", probably due to some Ubuntu udev bug) and I managed to ping over it and confirmed via tcpdump: traffic seemed to pass... 
[13:35:20] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1484713 (10Cmjohnson) The link light and every attempt to pxe resulted in media failure [13:37:35] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1484721 (10BBlack) Thanks everyone for pitching in and cleaning up the list considerably! Took another sample today: ``` 197 Peachy MediaWiki Bot API Version 2.0 (alpha 8) 154 Mo... [13:44:45] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1484730 (10BBlack) ^ Added Merl (I'm guessing is the maintainer of MerlBot). The plog4u ones seem to the same as the gwtwiki ones, and the most-current iteration of that codebase seems to b... [13:48:38] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1484740 (10fgiunchedi) another consideration is disk utilization, we're roughly at 50% in eqiad ATM (each machine has raid1 2x500... 
[13:51:48] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1484744 (10BBlack) 3NEW [13:52:04] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1484752 (10BBlack) [13:52:06] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1484751 (10BBlack) [13:55:54] !log uploaded linux 3.19.3-7 (based on 3.19.8-ckt4 plus the recent NMI security fixes) to carbon [13:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:14] (03Abandoned) 10Chmarkine: Remove old double-subdomain aliases [dns] - 10https://gerrit.wikimedia.org/r/224309 (https://phabricator.wikimedia.org/T102814) (owner: 10Chmarkine) [14:14:03] !log logstash1001 going down to relocate to row A [14:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:24] PROBLEM - Host logstash1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:44] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1484831 (10mark) I think if we provide you with access to the log files & service user accounts you listed, you should be able to do most o... [14:24:18] (03CR) 10Cmjohnson: [C: 032] "Merging this now that the servers have moved" [dns] - 10https://gerrit.wikimedia.org/r/226722 (owner: 10Cmjohnson) [14:30:37] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1484853 (10coren) @joe: The alternative is shelling out to SSH in this case, which was my original idea but @Faidon had a strong preference to using paramiko instead. @MoritzM... 
[14:33:10] (03PS4) 10BBlack: move majority of privates/files usage to secret() [puppet] - 10https://gerrit.wikimedia.org/r/224213 [14:33:10] _joe_, why was https://gerrit.wikimedia.org/r/#/c/226941/ necessary? [14:41:36] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1484884 (10Andrew) I've updated https://wikitech.wikimedia.org/wiki/Labs_infrastructure#dns and added diagrams. [14:43:20] <_joe_> Krenair: you mean godog, right? [14:43:22] <_joe_> :) [14:44:47] well, anyone from ops that might know would be good.. [14:45:24] mostly legacy from previous admins.pp shenanigans where bastion access was this weird quasi-priv [14:45:34] in theory it's sane to say all groups should exist on bastion hosts [14:45:34] but [14:45:37] (03CR) 10BBlack: [C: 032] "I manually merged up the few files/ updates in the private repo that happened while this was lingering. Yuvi solve the labs issues over a" [puppet] - 10https://gerrit.wikimedia.org/r/224213 (owner: 10BBlack) [14:45:52] there are a number of folks who only connect to bastion hosts and run actual mysql stuff? and other things from there [14:46:11] so the weird abstraction continued as there wasn't enough good enough on how to break it up and make it more sane [14:46:26] this is from fuzzy memory a year or so ago tho [14:46:40] (03Abandoned) 10Paladox: Adding task support instead of using Bug: which was for bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/209741 (owner: 10Paladox) [14:48:08] wasn't enough good info ^ [14:49:22] so some people need to remain in bastiononly because even if we added all real groups to the bastion hosts, if they left one of their groups they might still want bastion access? [14:49:27] is that right? 
[14:50:46] sort of, some people actually use prod only from bastion hosts and thus a bastin only group(ing) was created long ago [14:51:03] urandom mobrovac gwicke, FYI there's the java 7 package upgrade and java8 downgrade pending, moritzm has updated packages T104887 [14:51:06] it I guess was a mix of ppl using it as a proxy host to other prod hosts and as a primary landing host [14:51:30] there isn't a hard and fast reason not to just say....ops is on bastion [14:51:57] godog: yeah, we were waiting for things to settle before doing the downgrade [14:52:04] the reasoning at transition time from admins.pp was essentially, this is a mess let's port the mess as is and straighten this out [14:52:08] and it wasn't straightend out [14:52:13] gwicke: what things to settle? [14:52:54] godog: primarily load reaching our target limits [14:54:41] gwicke: namely what limits? [14:55:30] Krenair: the other reason historically afaik is that a bastiononly group allows denial to prod from removal of one group [14:55:34] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [14:56:00] and in a world where account cleanup was suspect and ppl had accounts in weird places from weird things that was some minimal level of sanity for access denial [14:56:30] the weird accounts in weird places is all cleaned up now right? [14:56:55] afaik yes [14:57:25] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.099 second response time [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150727T1500). Please do the needful. [15:00:04] James_F: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:16] * James_F waves. 
[15:00:31] Krenair: if you are thinking let's add groups to bastion hosts as well as bastiononly and keep both abstractions. that is the next step I think, and then paring down the bastiononly group to see who really needs it and why. [15:00:42] I can SWAT today, unless anyone else feels a deep need. [15:00:45] godog: sorry, was distracted; right now we are shooting for ideally <600G storage load per node [15:00:55] 3T raw data [15:01:02] chasemp, okay... which bastions, by the way? [15:01:08] just bast1001? [15:01:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226337 (owner: 10Jforrester) [15:01:51] Whee. [15:02:12] (03Merged) 10jenkins-bot: Enable VisualEditor for auto-created accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226337 (owner: 10Jforrester) [15:02:43] Krenair: not sure which hosts get the group, but bast1001 almost certainly [15:02:50] gwicke: how would that affect jdk7 vs jdk8? allegedly 600G per instance is related to bootstrapping [15:02:52] at the moment just bast1001 [15:02:58] but deployers/restricted also get access to hooft [15:02:59] the weird history of bastiononly I have is primarily from mutante [15:03:24] anyway that looks like another 3/4 days [15:03:30] which parsoid-admin, ocg-render-admins and bastiononly do not get [15:03:44] gwicke: is there one restbase* host which is within the target area? (We'd like to update one system in advance of the others) [15:03:59] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for auto-created accounts on enwiki [[gerrit:226337]] (duration: 00m 13s) [15:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:05] ^ James_F check please [15:04:06] Whee. [15:04:09] godog, bootstrapping is just one of the things it's connected to: https://wikitech.wikimedia.org/wiki/Cassandra/Hardware#Instance_sizing [15:04:45] thcipriani: Yup, working. 
[15:04:52] James_F: cool, thanks! [15:05:02] moritzm: http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad?panelId=12&fullscreen [15:05:33] Glaisher: ping for SWAT [15:05:34] 1003 is the closest, but still at ~700G [15:05:35] (03PS1) 10BBlack: switch zerofetcher auth to secret() [puppet] - 10https://gerrit.wikimedia.org/r/227226 [15:05:37] (03PS1) 10BBlack: switch nagios private contacts to secret() [puppet] - 10https://gerrit.wikimedia.org/r/227227 [15:06:04] thcipriani: pong [15:06:36] (03CR) 10BBlack: [C: 032] switch zerofetcher auth to secret() [puppet] - 10https://gerrit.wikimedia.org/r/227226 (owner: 10BBlack) [15:06:39] gwicke: all of that was observed when the pagination bug was also on though, right? [15:07:16] !log logstash1001 and logstash1003 offline for physical move and reimaging to jessie. kibana data will be degraded until they are back [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:23] (03CR) 10BBlack: [C: 032] switch nagios private contacts to secret() [puppet] - 10https://gerrit.wikimedia.org/r/227227 (owner: 10BBlack) [15:07:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226483 (https://phabricator.wikimedia.org/T106337) (owner: 10Glaisher) [15:07:57] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-default' on cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226483 (https://phabricator.wikimedia.org/T106337) (owner: 10Glaisher) [15:08:36] 6operations, 10ops-eqiad, 10Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1484968 (10coren) @yuvipanda: No, the switchover test never took place and other concerns overrode this, and now labstore1001 is disconnected from the... 
[15:09:26] godog: at least part of those issues like timeouts were observed (to a lesser degree) as soon as we hit 600G; for some of the others like bootstrapping and metric reporter failures it's still open whether it will be resolved <600G [15:09:48] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Set wgCategoryCollation to uca-default on cswiktionary [[gerrit:226483]] (duration: 00m 12s) [15:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:53] ^ Glaisher check please [15:10:11] thcipriani: updateCollation.php also needs to be run :) [15:10:23] I mentioned that in the commit message [15:10:36] Glaisher: ah, must've missed, looking [15:10:38] godog: the limit issue was something that was around for several months before we ran into trouble; it wasn't an issue as long as the cluster kept up, as the number of renders was low [15:11:14] however, once delete requests started to fail at high load, it turned into a major issue [15:11:45] as failed deletes means more matched items next time, means more attempted deletes [15:13:31] gwicke: though the added load could have been a factor to skew the numbers [15:13:42] Glaisher: mwscript updateCollation.php --wiki=cswiktionary --previous-collation=uppercase <-- look right? [15:14:12] godog: we actually have metrics for the write request rate, so can look it up [15:14:30] godog: http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad?panelId=25&fullscreen [15:14:41] looking [15:14:44] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1485005 (10coren) @yuvipanda: No, the hardware is known to have issues - though up to date it's always been fully working once it gets working at all (all the issues... [15:15:00] are you doing extra db writes? 
[15:15:09] because we are having issues right now [15:15:18] with io on s3 [15:15:26] 6operations: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1485010 (10bd808) >>! In T103198#1484465, @fgiunchedi wrote: > I agree that's confusing, though I'm not sure if `mwscript` (part of scap) is of any real use on `tin` other than convenienc... [15:15:40] jynus: do you mean me? [15:15:53] jynus: we are talking about C* [15:15:59] ok, then [15:16:05] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1485018 (10Andrew) We can live without pxe since there's already an OS installed. Last time I worked on this box I disabled the 1g nic (as with labnet1001) and the server dropped off the network entirely.... [15:16:30] thcipriani: yep, looks correct [15:16:53] gwicke: not sure what to make of that [15:17:00] (03PS6) 10BBlack: No need for wgSecureLogin on our wikis, HTTPS is forced everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) [15:17:32] godog: there were definitely periods where write request rates were similar to how they are now [15:17:37] Glaisher: kk, running [15:17:39] (03PS5) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [15:18:00] godog: including mid-June to end of June [15:19:00] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1485030 (10MoritzMuehlenhoff) I had a brief look and there's quite a bit of code churn so that we cannot easily build a 1.15 package with the patch applied on top; we would nee... [15:19:04] gwicke: what about the last week? [15:20:03] godog: looks moderate to me; what about it? 
[15:20:32] (there was one spike last night from the thin-out script) [15:20:54] heh we've been running way past 600G for the last week, is there any obvious indication of problems? [15:21:13] Glaisher: 192653 rows processed [15:21:16] we are seeing the usual trickle of timeouts that shouldn't happen [15:21:35] should be finished now, check please :) [15:21:36] (03CR) 10BryanDavis: Ferm rules for Logstash log ingestion (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [15:21:56] godog: it's not catastrophic by any means, but we shouldn't see those at base load [15:22:42] thcipriani: Looks like it's working. The category which I was looking at has changed. [15:22:53] gwicke: heh I was looking for the timeouts dashboard, which one is it? [15:22:53] Glaisher: kk, thanks [15:22:59] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 3 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1485053 (10coren) This should be done, or very near completion. As far as I can tell, there is no unpuppetized configuration but I'm no... 
[15:23:26] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225021 (https://phabricator.wikimedia.org/T103263) (owner: 10Glaisher) [15:23:35] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1485056 (10coren) [15:23:50] godog: one is http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad?panelId=22&fullscreen [15:23:56] (03Merged) 10jenkins-bot: Enable Quiz extension at French Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225021 (https://phabricator.wikimedia.org/T103263) (owner: 10Glaisher) [15:25:03] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [15:26:01] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Quiz extension at French Wikibooks [[gerrit:225021]] (duration: 00m 12s) [15:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:07] ^ Glaisher check please [15:26:21] chasemp, is it possible to ssh to lead.wikimedia.org, polonium.wikimedia.org or rhenium.wikimedia.org without going via a bastion? [15:26:35] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:43] bd808: you can access logstash1001 [15:27:27] (03PS2) 10Muehlenhoff: Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) [15:27:36] thcipriani: works [15:27:42] Glaisher: thanks! [15:27:53] :) [15:28:11] (03CR) 10jenkins-bot: [V: 04-1] Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [15:28:34] looks like rhenium allows it but not the other two [15:28:39] (03CR) 10BBlack: "I'd like to go ahead and merge this so we can move forward with bits-cluster decom and misc-cluster plans, etc. 
There's still about 160 r" [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [15:29:35] Glaisher: what db changes does education program need? [15:29:39] cmjohnson1: thanks. I think I got all the good stuff copied off [15:29:42] sec [15:30:04] gwicke: that graph doesn't seem related to space used by sstables, https://graphite.wikimedia.org/render/?width=754&height=450&_salt=1438010966.366&from=-12weeks&target=cassandra.restbase*.org.apache.cassandra.metrics.Connection.TotalTimeouts.1MinuteRate&target=secondYAxis(cassandra.restbase1001.org.apache.cassandra.metrics.ColumnFamily.all.LiveDiskSpaceUsed.value) [15:30:29] (03PS3) 10Muehlenhoff: Ferm rules for Logstash log ingestion [puppet] - 10https://gerrit.wikimedia.org/r/227192 (https://phabricator.wikimedia.org/T104964) [15:30:30] Krenair: I think the same [15:30:35] which is interesting [15:30:36] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1485076 (10BBlack) Are we basically done with all of the bits.wm.o traffic removals we can accomplish quickly and easily? I'd like to merge the cluster over into text-l... [15:30:52] thcipriani: https://github.com/wikimedia/mediawiki-extensions-EducationProgram/blob/master/sql/EducationProgram.sql [15:32:16] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1485079 (10BBlack) @Chmarkine - Sorry! I thought I looked here before uploading that new patch, or else I would've just used yours. I fo... [15:33:57] Coren: andrewbogott Hi? [15:34:21] Could you take a look at https://gerrit.wikimedia.org/r/#/c/226939/ ? [15:34:28] That bug is breaking several tools. 
[15:34:34] godog: keep in mind that this graph looks at total disk used, which includes heap dumps [15:34:55] oh, wait, no [15:35:05] you used ColumnFamily.all.LiveDiskSpaceUsed.value [15:35:08] hm… jynus, is https://gerrit.wikimedia.org/r/#/c/226939/2/maintain-replicas/maintain-replicas.pl your domain? [15:35:51] you should ask Coren or someone else [15:35:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225019 (https://phabricator.wikimedia.org/T105853) (owner: 10Glaisher) [15:36:21] (03Merged) 10jenkins-bot: Enable EducationProgram extension at French Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225019 (https://phabricator.wikimedia.org/T105853) (owner: 10Glaisher) [15:36:28] godog: the graph is dominated by the big timeout bursts, which we probably agree are more likely with high load and memory pressure [15:36:29] Glaisher: going to run mwscript sql.php --wiki=frwikisource php-1.26wmf15/extensions/EducationProgram/sql/EducationProgram.sql pre-file-sync, anything else need to happen? [15:36:56] (03PS3) 10BBlack: misc-web varnish: retab [puppet] - 10https://gerrit.wikimedia.org/r/224997 (owner: 10Dzahn) [15:37:03] don't think so. looks fine [15:37:16] kk, going [15:37:16] (03CR) 10BBlack: [C: 032] misc-web varnish: retab [puppet] - 10https://gerrit.wikimedia.org/r/224997 (owner: 10Dzahn) [15:37:23] (03CR) 10BBlack: [V: 032] misc-web varnish: retab [puppet] - 10https://gerrit.wikimedia.org/r/224997 (owner: 10Dzahn) [15:38:01] jynus: really? Isn’t db replicas your think? I have no idea who has worked on that in the past. [15:38:11] gwicke: do you see a relation between the two? 
I don't [15:38:26] (03PS2) 10BBlack: Update links on dumps.wm.org to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/224750 (owner: 10Chmarkine) [15:38:36] andrewbogott, I mantain the databases [15:38:46] i know nothing about random script other people use [15:38:49] I've seen coren doing that when echowikis.dblist was removed [15:39:06] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable EducationProgram extension at French Wikisource [[gerrit:225019]] (duration: 00m 12s) [15:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:12] ^ Glaisher check please [15:39:17] jynus: ok, fair enough. Best to find Coren then [15:39:20] looking [15:39:25] (03CR) 10BBlack: [C: 032] Update links on dumps.wm.org to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/224750 (owner: 10Chmarkine) [15:39:30] godog: you are saying that timouts caused by OOM are unrelated to memory pressure? [15:39:43] and/or load? [15:40:19] (03PS2) 10BBlack: Rename 'cookie_munging' VCL subroutine to 'stash_cookie' [puppet] - 10https://gerrit.wikimedia.org/r/225281 (owner: 10Ori.livneh) [15:40:44] and specially, i know nothing when I try to suggest changes and people respond with "I know better than you, your suggestions are ignored" [15:41:04] (03CR) 10BBlack: [C: 032 V: 032] Rename 'cookie_munging' VCL subroutine to 'stash_cookie' [puppet] - 10https://gerrit.wikimedia.org/r/225281 (owner: 10Ori.livneh) [15:41:08] gwicke: no, that's you twisting my words [15:41:10] godog: my understanding is that the bulk of the memory pressure is caused by a) writes, and b) compactions [15:41:37] b) is influenced by the total storage load [15:41:43] thcipriani: Looks good. [15:41:48] Thanks a lot! [15:41:55] Glaisher: awesome. Thank you! [15:42:06] With that, SWAT deploy is complete. 
[15:42:24] there's also c), which is metadata structures for large storage load taking up more heap [15:42:26] gwicke: I was specifically talking about timeouts and their allegedly relation with 600G instances, which ATM don't seem related [15:42:37] \o/ [15:43:03] <_joe_> !log upgrading the jobrunners to the latest HHVM packlage [15:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:13] <_joe_> ebernhardson: ^^ should be all done for the mobile team [15:43:29] _joe_: awsome thanks! [15:45:41] godog: lets try it out ;) [15:46:40] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1485112 (10EBernhardson) 5Open>3Resolved Joe has deployed this to th... [15:47:07] gwicke: the point I'm trying to make is that there is no reason to wait to downgrade [15:47:26] well, we had three nodes OOM last night [15:47:52] !log Added bgerstile and coreyfloyd to github "owners" team [15:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:05] the primary trigger was a write burst, but if you look at the heap usage it was already very high before the burst [15:48:20] 10Ops-Access-Reviews, 6operations: Review access to stat1003, eventlogging for legoktm - https://phabricator.wikimedia.org/T106315#1485121 (10Joe) [15:48:49] godog: http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-gc?panelId=34&fullscreen [15:50:19] gwicke: ouch, what nodes OOM? 
[15:50:27] (03CR) 10BBlack: [C: 04-1] "Looks sane in general, but in reviewing this it caused me to question the (shared with this patch) logic of hit-for-pass in fetches for X-" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [15:51:34] godog: 1003, 1005 and then 1004 [15:51:58] (03Abandoned) 10BBlack: Drop AES256 from mid/compat lists [puppet] - 10https://gerrit.wikimedia.org/r/224445 (https://phabricator.wikimedia.org/T105716) (owner: 10BBlack) [15:52:13] I assume it was the replicas belonging to that one article with lots of renders that the thin-out script finally dropped [15:52:14] PROBLEM - puppet last run on mw1011 is CRITICAL Puppet has 1 failures [15:55:23] 6operations: Migrate access-requests@ from RT to Phabricator - https://phabricator.wikimedia.org/T84861#1485145 (10Krenair) Sometimes it's useful to be able to refer back to old tickets easily. [15:55:55] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 66927 MB (3% inode=99%) [15:56:11] (03PS4) 10EBernhardson: Add statsd reporting plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/223202 [15:56:47] godog: my preference would be to wait with the downgrade for a couple more days, but if you really want to downgrade as quickly as possible, then go for it [15:57:13] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Drop AES-256 mid/compat lists. - https://phabricator.wikimedia.org/T105716#1485156 (10BBlack) 5Open>3declined On further reflection, now's not really the time to disable AES256 like this. If anything we could start with just the ones that are non-PFS,... [15:58:42] gwicke, godog: let's do it Thursday or Friday, then? 
[15:59:46] moritzm: thurs perhaps, I'd avoid friday [16:00:00] yeah, +1 for Thursday [16:02:39] (03PS4) 10Tim Landscheidt: Ignore warnings about URLs without modules for private repository [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) [16:03:59] gwicke, godog: Thursday, then :-) [16:04:21] kk, sounds good [16:09:22] (03CR) 10Tim Landscheidt: "Thanks to @BBlack's work in I0db6fdb1c75355b58095e0ec29d6028bbc614649 & Co., there are only three (3) occurrences of the warning for the p" [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [16:13:34] 10Ops-Access-Requests, 6operations, 6Reading-Admin, 5Patch-For-Review: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1485192 (10kevinator) [16:14:23] (03CR) 10BBlack: "Yeah, those last few cases all seem to be related to the deployment stuff on tin/mira. They looked complex so I left them alone for now i" [puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [16:15:48] PROBLEM - DPKG on logstash1001 is CRITICAL: Connection refused by host [16:15:49] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL - elasticsearch http://10.64.0.122:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.122, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [16:16:08] PROBLEM - Disk space on logstash1001 is CRITICAL: Connection refused by host [16:16:10] PROBLEM - configured eth on logstash1001 is CRITICAL: Connection refused by host [16:16:28] PROBLEM - RAID on logstash1001 is CRITICAL: Connection refused by host [16:16:29] PROBLEM - dhclient process on logstash1001 is CRITICAL: Connection refused by host [16:16:39] PROBLEM - puppet last run on logstash1001 is CRITICAL: Connection 
refused by host [16:16:49] PROBLEM - salt-minion processes on logstash1001 is CRITICAL: Connection refused by host [16:17:59] RECOVERY - puppet last run on mw1011 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:18:10] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1485204 (10yuvipanda) Paramiko is already fairly heavily used in labs hosts(openstack + designate). Let's park any move away from paramiko until we can get the rest of the labs... [16:21:49] (03PS13) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [16:21:54] can anyone help with gitblit? https://integration.wikimedia.org/ci/job/parsoidsvc-php-parsertests/5117/console [16:29:04] !log restarted gitblit on antimony (AGAIN...) [16:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:16] arlolra: should recover in a few mins, normally [16:31:18] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61481 bytes in 0.193 second response time [16:33:00] 6operations, 10Analytics-Cluster: Build new latest stable (0.8.2.1?) 
Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1485264 (10kevinator) [16:33:11] (03PS1) 10Cmjohnson: adding netboot.cfg for logstash1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/227235 [16:33:35] godog: check plz ^ [16:34:10] 6operations, 10RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1485267 (10fgiunchedi) reporting from irc, @moritzmuehlenhoff has updated openjdk 7 packages, downgrading on Thurs [16:34:24] !log switched operations/dns to ff-only like operations/puppet in gerrit config [16:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:07] (03PS2) 10Filippo Giunchedi: install_server: adding netboot.cfg for logstash1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/227235 (owner: 10Cmjohnson) [16:35:37] (03PS3) 10Filippo Giunchedi: install_server: adding netboot.cfg for logstash1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/227235 (owner: 10Cmjohnson) [16:35:58] (03CR) 10Filippo Giunchedi: [C: 031] "I've tweaked the commit message, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/227235 (owner: 10Cmjohnson) [16:36:43] bblack: thanks [16:36:58] (03PS4) 10Cmjohnson: install_server: adding netboot.cfg for logstash1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/227235 [16:38:45] (03CR) 10Cmjohnson: [C: 032] install_server: adding netboot.cfg for logstash1001-3 [puppet] - 10https://gerrit.wikimedia.org/r/227235 (owner: 10Cmjohnson) [16:49:18] PROBLEM - NTP on logstash1001 is CRITICAL: NTP CRITICAL: No response from NTP server [16:51:40] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:53:09] RECOVERY - Host logstash1001 is UPING OK - Packet loss = 0%, RTA = 0.94 ms [16:58:08] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:19] RECOVERY - Host logstash1001 is UPING OK - Packet loss = 0%, RTA = 1.61 ms [17:05:42] 10Ops-Access-Requests, 6operations, 6Reading-Admin, 
5Patch-For-Review: Requesting access to stat1002 (Hadoop / HDFS / Hue) for tbayer - https://phabricator.wikimedia.org/T105748#1485362 (10dr0ptp4kt) I received approval on this request from Terry as well. [17:05:48] RECOVERY - dhclient process on logstash1001 is OK: PROCS OK: 0 processes with command name dhclient [17:05:56] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1485364 (10GWicke) We discussed this on IRC, and decided to switch RESTBase directly to api.srv.eqiad.wmnet for now. This means that we'll need to explicitly override the host: header... [17:06:04] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1485365 (10GWicke) p:5Triage>3Normal [17:07:08] RECOVERY - DPKG on logstash1001 is OK: All packages OK [17:07:19] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 5, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 123, initializing_shards: 0, number_of_data_nodes: 3 [17:07:19] RECOVERY - Disk space on logstash1001 is OK: DISK OK [17:07:30] RECOVERY - configured eth on logstash1001 is OK - interfaces up [17:07:34] bd808 1001 is ready for you...1003 shortly [17:07:39] RECOVERY - RAID on logstash1001 is OK no disks configured for RAID [17:07:59] PROBLEM - puppet last run on logstash1001 is CRITICAL Puppet has 1 failures [17:08:23] !log updated mc200[34] to linux 3.19.3-7 for some testing on hardware [17:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:29] anyone around for a quick chat about mailman? 
[17:13:35] cmjohnson1: logstash1001 seems to be trusty rather than jessie [17:14:05] grr..fixing [17:14:17] thanks! [17:23:21] (03PS1) 10BBlack: geolanglist: always use text-lb for primary [dns] - 10https://gerrit.wikimedia.org/r/227243 [17:23:23] (03PS1) 10BBlack: disable geolang zerodot for all but wp.org [dns] - 10https://gerrit.wikimedia.org/r/227244 [17:23:40] cajoel: hi (probably best you'll get at the minute :) ) [17:23:52] (03PS1) 10Cmjohnson: Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545 [puppet] - 10https://gerrit.wikimedia.org/r/227245 [17:24:09] hi cajoel [17:24:10] JohnFLewis: I was hoping for mutante [17:24:15] any hoo. [17:24:27] Q that I couldn't figure out in 30s. [17:24:28] hmm, you need hoo? [17:24:41] cajoel: mutante is out I think? [17:24:54] does our exim config try sending to a mailman list if the end address doesn't work? [17:25:24] (03PS2) 10Cmjohnson: Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545 [puppet] - 10https://gerrit.wikimedia.org/r/227245 [17:25:24] meaning, if I try to email wmfall@wikimedia.org, will it attempt delivery to wmfall@lists.wikimedia.org as a last resort default? [17:25:48] and if it doesn't, what would you think of me adding those as aliases to direct? [17:26:03] I don't believe so. I'd imagine it'll get bounced or sent to roots [17:26:34] they can be added but they'll be auto-moderated if they're not added to the list config /me finds the exact name [17:26:39] (03CR) 10Cmjohnson: [C: 032] Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545 [puppet] - 10https://gerrit.wikimedia.org/r/227245 (owner: 10Cmjohnson) [17:26:40] cajoel: no it doesn't [17:27:00] would an exim alias resolve this? [17:27:24] I don't know what the problem is? people emailing the wrong address? 
[17:28:30] cajoel: an exim alias would though what chasemp said. anyhow, it'll need to be added to https://lists.wikimedia.org/mailman/admin//?VARHELP=privacy/recipient/acceptable_aliases for it to be accepted by mailman otherwise it'll bounce or be held in mod [17:28:48] I could do that. [17:29:24] chasemp: it's a little complicated, but yes, I'd like to catch accidental deliveries to the address without the lists. [17:30:16] paravoid: yt? [17:33:39] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105, 3Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1485426 (10Andrew) [17:37:00] (03CR) 10Matthias Mullie: "Do all of those wikis have Parsoid, Echo, VE, ... and whatever other dependencies Flow needs?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [17:37:55] 6operations, 6Labs, 3Labs-Sprint-107, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1485449 (10Andrew) [17:38:49] PROBLEM - dhclient process on logstash1001 is CRITICAL: Connection refused by host [17:40:09] PROBLEM - DPKG on logstash1001 is CRITICAL: Connection refused by host [17:40:28] PROBLEM - Disk space on logstash1001 is CRITICAL: Connection refused by host [17:40:29] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL - elasticsearch http://10.64.0.122:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.122, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [17:40:39] PROBLEM - configured eth on logstash1001 is CRITICAL: Connection refused by host [17:40:49] PROBLEM - RAID on logstash1001 is CRITICAL: Connection refused by host [17:41:26] cajoel: IDK man, obfuscating mailing list interaction doesn't seem good? 
I don't understand the use case, seems social and good luck :) [17:42:46] (03CR) 10Alex Monk: "No. VisualEditor is not on any wikisource or wiktionary, and Echo is not on legalteamwiki or zerowiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [17:43:23] cajoel: chasemp it'd also confuse the distinction between google hosted "lists" and our mailman lists [17:43:28] I'd vote no [17:44:43] (03CR) 10Alex Monk: "Except legalteamwiki and zerowiki are private, so that's not the issue here. Still, no VE on wikisource or wiktionary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [17:45:02] cajoel: aliasing sounds a bit dangerous, but a 'hey, this user is unknown, but maybe you meant this mailing list: xxxxxx@lists.wikimedia.org' kind of reply could work? [17:51:01] (03PS2) 10BBlack: Fix ICO MIME regexps [puppet] - 10https://gerrit.wikimedia.org/r/225852 (https://phabricator.wikimedia.org/T63443) (owner: 10Gilles) [17:52:15] (03CR) 10BBlack: [C: 032] Fix ICO MIME regexps [puppet] - 10https://gerrit.wikimedia.org/r/225852 (https://phabricator.wikimedia.org/T63443) (owner: 10Gilles) [17:53:28] (03PS6) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [17:53:45] Krenair: can we just revert the wikipedia urls cleanup patch? unless you know how to fix SiteMatrix? [17:54:42] valhallasw`cloud: I would be interested in making this change only for 2 lists which are WMF specific. 
[17:54:45] wmfall and wmfsf [17:55:12] 6operations: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1485551 (10CCogdill_WMF) 3NEW [17:55:19] 6operations: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1485561 (10CCogdill_WMF) @EWilfong_WMF is the relevant contact at Trilogy. Eric, if you have more specific questions, can you add them in a comment to this task? [17:55:24] legoktm, I was hoping to get maintain-replicas fixed first [17:55:59] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1485562 (10fgiunchedi) a:5EBernhardson>3RobH @robh we should refresh the quote we for had elasticsearch hw in codfw in RT #85... [17:56:17] legoktm, also I'd prefer to only re-add defaults to wgcanonicalurl and wgsitename, rather than a full revert [17:56:23] sure [17:56:51] I don't understand why ops haven't dealt with it yet [17:56:59] RECOVERY - configured eth on logstash1001 is OK - interfaces up [17:57:09] RECOVERY - RAID on logstash1001 is OK no disks configured for RAID [17:57:11] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1485565 (10RobH) s3500 max out at 800 gb, larger than that moves up to the s3700 series [17:57:13] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1485566 (10fgiunchedi) [17:57:19] RECOVERY - dhclient process on logstash1001 is OK: PROCS OK: 0 processes with command name dhclient [17:57:21] 6operations, 6Labs, 3Labs-Sprint-107, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - 
https://phabricator.wikimedia.org/T99738#1485567 (10Andrew) Let's schedule this for one of the live labvirts next week. [17:58:29] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 7967.73774525 [17:58:38] RECOVERY - DPKG on logstash1001 is OK: All packages OK [17:58:43] cajoel: I see no technical issue with it though there seems to be a social issue. mailman handles aliasing already and it's capable of it. assuming that change is done, everything will work fine :) [17:58:50] RECOVERY - Disk space on logstash1001 is OK: DISK OK [17:58:54] [my comment] [17:59:24] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1485587 (10greg) Still on the plate for this week, right? [17:59:52] bd808: both are finished [17:59:54] Distributor ID: Debian [17:59:54] Description: Debian GNU/Linux 8.1 (jessie) [17:59:56] Release: 8.1 [17:59:57] lfaraone, was it you who complained about the favicon of special wikipedias at wikimania? [17:59:58] Codename: jessie [18:00:39] cmjohnson1: awesome. I'll hop on and finish out the parts that are missing from apt still [18:01:11] (03CR) 10Dduvall: "That's a fair point I think, and now I'm even wondering if bypassing the cache is desirable for security scanning. 
If not, we can easily r" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [18:01:44] RECOVERY - Host logstash1003 is UPING OK - Packet loss = 0%, RTA = 1.46 ms [18:01:45] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL - elasticsearch http://10.64.48.113:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.113, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [18:02:03] PROBLEM - NTP on logstash1003 is CRITICAL: NTP CRITICAL: Offset unknown [18:02:03] RECOVERY - RAID on logstash1003 is OK no disks configured for RAID [18:02:03] PROBLEM - puppet last run on logstash1003 is CRITICAL Puppet has 5 failures [18:10:47] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1485616 (10Mattflaschen) [18:12:14] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 5, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 123, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards [18:12:23] PROBLEM - DPKG on logstash1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:13:17] YuviPanda, what needs to be done to get that commit merged and the script run? [18:13:29] Krenair: Coren is doing that right after the ops meeting. 
[18:13:54] Krenair: longer term, we'll need to split meta_p populating script out of the other one, convert it to python3 and so we can run it without problems [18:14:38] (03CR) 10Merlijn van Deen: [C: 04-1] labstore: Make script use exceptions instead of return value checking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [18:15:07] (03CR) 10Merlijn van Deen: [C: 031] Labs: Fix puppetmaster::certcleaner for self-hosted puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/226455 (https://phabricator.wikimedia.org/T106627) (owner: 10Tim Landscheidt) [18:15:15] (03CR) 10Yuvipanda: labstore: Make script use exceptions instead of return value checking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [18:15:34] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1485642 (10chasemp) I would like to help with this so I understand where things are at, even if it's just gettin the coffee :) 2 creams, 1 sugar? [18:16:17] (03CR) 10Legoktm: "Lets just enable on Wikipedias for now then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [18:18:14] RECOVERY - NTP on logstash1003 is OK: NTP OK: Offset -0.005491375923 secs [18:18:24] RECOVERY - DPKG on logstash1001 is OK: All packages OK [18:19:24] RECOVERY - puppet last run on logstash1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:20:11] !log logstash1001 back up and running [18:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:21:16] 6operations, 6Multimedia, 10Wikimedia-Media-storage, 7Monitoring: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1485653 (10chasemp) @mark is this worthy of a catchpoint alert? 
It seems like it may be a good external sanity check. [18:22:15] 6operations, 6Multimedia, 10Wikimedia-Media-storage, 7Monitoring: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1485655 (10chasemp) p:5Triage>3High [18:23:24] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1485658 (10Krenair) > donate.wikimedia.org has a wildcard SSL cert, so we could conceivably use that for the events site But that's hosted by WMF... Shouldn't third-party hosted sites only get ce... [18:24:35] (03CR) 10coren: [C: 04-1] "Looks good, except for the rm -rf which must take place even if the umount fails." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [18:25:03] !log No mediawiki, hhvm or apache2 logs going to logstash1001:10514 [18:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:13] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1485661 (10Krenair) [18:25:14] cmjohnson1: ^ any idea about that? [18:25:42] I'm seeing input on :12201 (the gelf port) [18:25:58] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1485664 (10kevinator) Approved! [18:26:29] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1485666 (10Krenair) > donate.wikimedia.org has a wildcard SSL cert, so we could conceivably use that for the events site But that's hosted by WMF... Shouldn't third-parties only get certs which m... [18:27:05] oh. did the ip address happen to change? [18:27:32] Coren, are you really doing https://phabricator.wikimedia.org/T106963 ? 
[18:27:56] I thought you'd only do the (related) labs task [18:28:12] 6operations: Update wikimedia apt repo to include debs for shiny-server - https://phabricator.wikimedia.org/T106435#1485669 (10EBernhardson) The package is not currently in use in production, its proposed for use in mediawiki-vagrant (https://gerrit.wikimedia.org/r/#/c/221827/). The server is currently manually... [18:28:14] Krenair: ... no. I grabbed the wrong task. [18:29:00] (03CR) 10Andrew Bogott: [C: 032] Labs: Fix puppetmaster::certcleaner for self-hosted puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/226455 (https://phabricator.wikimedia.org/T106627) (owner: 10Tim Landscheidt) [18:29:06] (03PS2) 10Andrew Bogott: Labs: Fix puppetmaster::certcleaner for self-hosted puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/226455 (https://phabricator.wikimedia.org/T106627) (owner: 10Tim Landscheidt) [18:29:08] Krenair: Sorry for the confusion [18:29:09] (03CR) 10Alex Monk: "Or we could just exclude those same two projects as VisualEditor does. Or hide the include of Flow behind a UseVisualEditor check." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [18:33:14] !log logstash1003 salt key not accepted by master [18:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:33:19] cmjohnson1: ^ [18:34:15] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1485685 (10MoritzMuehlenhoff) Since it was mentioned in the Ops meeting, let's add it to the task: xulrunner is only present in Wheezy, starting with 31, Firefox/Iceweasel no longer use/bu... 
[18:35:09] bd808 salt-key resolved [18:35:10] (03PS1) 10BryanDavis: logstash: change ip address for logstash1001 and logstash1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227258 (https://phabricator.wikimedia.org/T97545) [18:36:09] cmjohnson1: thanks. puppet run is looking better there now [18:36:53] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 123, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards [18:37:05] RECOVERY - puppet last run on logstash1003 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:37:41] (03CR) 10BryanDavis: [C: 032] logstash: change ip address for logstash1001 and logstash1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227258 (https://phabricator.wikimedia.org/T97545) (owner: 10BryanDavis) [18:37:47] (03Merged) 10jenkins-bot: logstash: change ip address for logstash1001 and logstash1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227258 (https://phabricator.wikimedia.org/T97545) (owner: 10BryanDavis) [18:38:53] !log bd808 Synchronized wmf-config/InitialiseSettings.php: logstash: change ip address for logstash1001 and logstash1003 (duration: 00m 12s) [18:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:28] !log fatalmonitor full of errors from mw1011 [18:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:48] !log manually ran sync-common on mw1011 [18:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:21] that didn't fix it [18:42:36] what's up with "Fatal error: Call to undefined function headers_sent() in 
/srv/mediawiki/hhvm-fatal-error.php on line 81" [18:43:50] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1485715 (10mobrovac) >>! In T96017#1485685, @MoritzMuehlenhoff wrote: > xulrunner is only present in Wheezy, starting with 31, Firefox/Iceweasel no longer use/build a separate library of X... [18:44:03] bd808, is that coming from mw1011? [18:44:08] yeah [18:44:16] in a constant stream [18:44:40] very weird [18:44:59] I ran php -a on mw1011 and headers_sent() returned false to me [18:45:09] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1485718 (10GWicke) [18:45:40] headers_sent() is on line 815 of that file too, not 81 [18:46:01] oh that's just a bad paste by me [18:46:08] log message is line 815 [18:46:19] "Call to undefined function defined()" [18:46:35] something is way messed up there [18:47:17] can a root depool mw1011 until we can figure out what is wrong with it? [18:47:33] bd808: on it [18:48:34] RECOVERY - Host labnet1002 is UPING OK - Packet loss = 0%, RTA = 1.55 ms [18:49:24] !log stop jobrunner/jobchron/hhvm on mw1011 [18:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:59] (03CR) 10Jforrester: "Scheduled for tomorrow morning's SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226338 (owner: 10Jforrester) [18:52:25] godog: thanks. It is at least not shouting at the hhvm.log file any more [18:53:30] !log restarted populateContentModel.php --wiki=enwiki on terbium with modification to occasionally clear the link cache so it doesn't OOM. [18:53:34] PROBLEM - puppet last run on labnet1002 is CRITICAL puppet fail [18:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:30] cmjohnson1: so ... I think we have to restart rsyslog on all of the MW servers because of the ip address change of logstash1001. 
[18:55:14] PROBLEM - check_puppetrun on beryllium is CRITICAL Puppet has 1 failures [18:55:18] they are using the hostname and not a hardcoded ip, but nothing seems to have picked up the ip change. [18:55:43] RECOVERY - puppet last run on labnet1002 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:56:12] !log rsyslog forwarded hhvm and apache2 logs still not hitting logstash1001; rsyslog restarts may be needed [18:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:14] RECOVERY - check_puppetrun on beryllium is OK Puppet is currently enabled, last run 65 seconds ago with 0 failures [19:01:00] 6operations, 6Labs, 3Labs-Sprint-107, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1485787 (10yuvipanda) use labvirt1009, has only 3 tools instances and they all can be failed over or sustain downtime. [19:01:27] (03CR) 10coren: [C: 032] "Yep." [software] - 10https://gerrit.wikimedia.org/r/226939 (https://phabricator.wikimedia.org/T106897) (owner: 10Alex Monk) [19:01:37] (03CR) 10coren: [V: 032] "Yep." [software] - 10https://gerrit.wikimedia.org/r/226939 (https://phabricator.wikimedia.org/T106897) (owner: 10Alex Monk) [19:03:01] ^Coren: what will be the status after deployment/run? [19:03:21] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1485790 (10yuvipanda) Heh, it did thankfully work when it was rebooted last time accidentally. [19:03:31] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1485791 (10yuvipanda) (thankfully - let's not do that again, etc) [19:03:47] jynus: That just changes the family parsed from the dblist files in the meta table. There still is need to check/fix what happened with wgCanonicalServer [19:04:06] software? [19:04:09] or data? 
[19:04:36] ... what? Sorry, I must have misunderstood your question? [19:04:40] basically, will you need a DBA? [19:04:42] We know exactly what happened with wgCanonicalServer and this change should fix it. [19:04:48] jynus: For that? No. [19:05:07] ok, thanks [19:05:09] Krenair: I was about to check that exactly. :-) [19:05:45] Coren, sorry, but like 3 or 4 people told me to fix this and I did not know what they were talking about :-) [19:06:16] jynus: That's okay - it's dbish so people naturally look at the DBAs. :-) But it's just data - not DB. [19:06:35] if at any time Krenair, Coren you need help, #wikimedia-databases I will be happy to help [19:06:40] and sean [19:06:46] ty, jynus [19:09:40] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1485805 (10Jdlrobson) Thanks @ebernhardson for letting us know! [19:11:21] bd808: do we need to restart rsyslog? [19:11:36] godog: I think so :/ [19:11:56] godog: this dashboard is still empty -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm [19:12:24] ugh that 403s on me [19:12:42] and this one too https://logstash.wikimedia.org/#/dashboard/elasticsearch/apache2log [19:12:50] Krenair: Rebuilding the data now, c1 is done and looks okay. [19:12:58] godog: 403? hmmm [19:14:19] hrmm, is etherpad down for folks? [19:14:28] cuz it shows error page for me. [19:14:29] bd808: yeah I'm trying to understand if it is me [19:14:36] Coren, remind me what c[1-3] are? [19:14:51] just each different labs db server? [19:15:02] godog: it could be misc-varnish related too. /me looks to see how that is setup [19:15:06] Krenair: The underlying replicas; they're three servers but they all have all the projects. 
[19:15:17] robh: yeah [19:15:21] robh: yeah same here [19:16:10] well, other items on misc web are working [19:16:15] so ill poke at the host system [19:16:47] Krinkle: afaict, now, the only row with a NULL url is centralauth (which is expected) [19:16:53] Well. [19:16:57] That's arguable. [19:16:59] (03CR) 10Faidon Liambotis: [C: 031] "Sounds sane." [dns] - 10https://gerrit.wikimedia.org/r/227243 (owner: 10BBlack) [19:17:17] You've probably seen my patch in Gerrit to just completely remove that centralauth row. [19:17:23] kicked apache, works [19:17:26] (03CR) 10Faidon Liambotis: [C: 031] "Sounds sane. There is no non-Wikipedia Zero as far as I know too." [dns] - 10https://gerrit.wikimedia.org/r/227244 (owner: 10BBlack) [19:17:36] !log etherpad was giving errors, apache restart fixed [19:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:14] Krenair: I like the table to reflect every available replicated database. [19:19:18] godog: misc-varnish is configured to use hostnames too for the logstash backend. I wonder if it needs a hup to pick up new ips for the hosts as well? [19:19:30] Coren, then you probably shouldn't've called it wiki [19:20:08] logstash.wikimedia.org points to misc-varnish which then round robins between logstash100[1-3] [19:20:32] bd808: the 403 is from apache so something is replying [19:20:33] logstash1001 and 1003 got new ips today when cmjohnson1 moved them to new racks [19:20:40] 6operations, 10Beta-Cluster, 6Labs, 7Monitoring: Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production - https://phabricator.wikimedia.org/T97865#1485870 (10hashar) Will be done with Jenkins, see {T106421}. [19:20:59] bd808: ah it isn't from apache, nevermind [19:23:12] bd808: anyways no it DTRT cp1044.eqiad.wmne:60622 logstash1001.eqiad:http ESTABLISHED [19:23:41] godog: you may be right about the 403 being an apache problem. 
testing from logstash1001 locally now [19:24:08] (03PS1) 10Andrew Bogott: Set up labnet1002 as a spare for labnet1001. [puppet] - 10https://gerrit.wikimedia.org/r/227270 [19:24:25] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1485888 (10jcrespo) 3NEW a:3jcrespo [19:24:33] logstash1003 gives expected response but 1001 does not [19:25:02] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1485902 (10jcrespo) [19:25:53] bd808: I think we can try bouncing apache, lots of files in /etc/apache2 show up modified right when apache2 started [19:26:39] godog: I just bounced it and am not seeing a difference [19:27:17] "AH01797: client denied by server configuration: /srv/deployment" [19:27:20] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1485918 (10jcrespo) All errors seem caused by `/rpc/RunJobs.php`, but with no specific parameter in par... 
[19:27:44] (03PS1) 10Alex Monk: Re-add default=wikipedia lines to wgCanonicalServer and wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227273 (https://phabricator.wikimedia.org/T106963) [19:28:59] (03PS2) 10Alex Monk: Re-add default=wikipedia lines to wgCanonicalServer and wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227273 (https://phabricator.wikimedia.org/T106963) [19:29:08] (03CR) 10Alex Monk: [C: 032] Re-add default=wikipedia lines to wgCanonicalServer and wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227273 (https://phabricator.wikimedia.org/T106963) (owner: 10Alex Monk) [19:29:15] (03Merged) 10jenkins-bot: Re-add default=wikipedia lines to wgCanonicalServer and wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227273 (https://phabricator.wikimedia.org/T106963) (owner: 10Alex Monk) [19:29:19] bd808: heh 1003 and 1001 are running apache2.4, 1002 is running 2.2 [19:29:24] yeah [19:29:40] 1003 gives the expected 401 for `curl -v -H'X-Forwarded-Proto: https' localhost` [19:29:48] but 1001 gives a 403 instead [19:29:59] 1002 hasn't been reiamged yet [19:30:14] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 795.180312529 [19:30:38] bd808: /srv/deployment isn't on 1001 [19:30:50] (03CR) 10Yuvipanda: dynamicproxy/tools: set up outage error system (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [19:30:55] well there you go [19:30:58] trebuchet fail [19:31:32] I saw puppet restarting the salt minion too. 
[19:31:34] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227273/ (duration: 00m 13s) [19:31:37] which is probably realted [19:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:42] Krinkle_, legoktm: There's still other visible breakage in sitematrix, but there you go [19:32:27] actually let's see if I can make that a bit better while I'm at it [19:32:48] godog: "The Salt Master has rejected this minion's public key!" [19:32:58] bd808: le sigh, fixing [19:33:23] why call it a "public key" when there are unused salt-related nouns to use that can make the purpose and function obscure [19:33:44] RECOVERY - salt-minion processes on logstash1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:33:52] meh... not much better. [19:34:34] call it "Brine" or something [19:34:37] ori: should be something like "The Salt Master has rejected this minion's isotope" [19:34:45] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine - https://phabricator.wikimedia.org/T107043#1485959 (10Ironholds) This is for Discovery's new analyst. [19:34:57] nacl not found [19:34:59] or crystal lattice [19:35:07] bd808: yep should be good, are you deploying again or I do? [19:35:17] 'rejected' remains mostly unambiguous though [19:35:23] perhaps it should be 'flavored' [19:35:24] godog: I'll force a puppet run [19:35:33] "The Salt Master has flavored this minion's brine" [19:35:39] Much better. [19:36:17] wait wait no, I have it [19:36:51] godog: \o/ getting a 401 now [19:37:01] which is totally expected [19:37:12] bd808: \o/ now for the real problem... 
[19:37:36] !log godog fixed salt key for logstash1001 which fixed trebuchet install of kibana [19:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:53] PROBLEM - Brine salinity ratio anomaly detection on logstash1001 is CRITICAL: CRITICAL: Anomaly detected: dissolved salts 11 PPI above and 0 below the confidence bounds [0.035] [19:38:21] hahaha [19:38:45] !ack pass the salt? [19:39:05] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1485980 (10yuvipanda) [19:40:12] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1485269 (10yuvipanda) Can you put your public key on your Office Wiki user page to verify? Thanks [19:41:10] (03PS1) 10coren: Labs: Ignore users with no home when adding DB users [puppet] - 10https://gerrit.wikimedia.org/r/227324 [19:41:16] YuviPanda: ^^ that one [19:41:29] robh: does https://phabricator.wikimedia.org/T107043 need a 3 day wait? [19:41:31] cmjohnson1: are you going to reimage logstash1002 today too or save that for later? [19:41:45] i am going to save that for tomorrow [19:41:51] unless you want it done later [19:42:03] YuviPanda: yes [19:42:09] robh: ok! [19:42:11] all access requests are at minimum 3 day [19:42:17] cmjohnson1: tomorrow works for me [19:42:29] also, he doesnt have to put it on officewiki, i'll claim and update [19:42:31] cool [19:42:35] bd808: logstash1001 seems to be receiving syslog traffic from mw [19:42:37] robh: oh, I see [19:42:37] ok [19:42:45] YuviPanda: actually, i lied [19:42:52] if they add their key via web its fine [19:42:56] i have no idea what 'old owrkld' is [19:42:59] import? [19:43:27] godog: from mw itself, yes. 
Not from hhvm & apache2; https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm https://logstash.wikimedia.org/#/dashboard/elasticsearch/apache2log [19:43:50] godog: I fixed mw logging with https://gerrit.wikimedia.org/r/#/c/227258/ [19:45:12] 6operations, 10Beta-Cluster, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1486006 (10thcipriani) p:5Normal>3High Sounds like Varnish packages won't be getting built for Trusty any longer, upping priority. [19:46:32] godog: see /etc/rsyslog.d/40-mediawiki.conf on any mw host to see the bits that aren't working yet since logstash1001 got a new ip [19:47:07] indeed :( [19:47:14] !log bounce rsyslog on mw1235 [19:47:14] (03CR) 1020after4: [C: 032] Revert "Remove w/COPYING and w/CREDITS dead symlinks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227210 (https://phabricator.wikimedia.org/T107007) (owner: 10Hashar) [19:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:27] my guess is that rsyslog caches the ip and doesn't check for updates [19:47:41] cajoel, do you still need to be able to log in to those MXes? [19:47:42] (03Merged) 10jenkins-bot: Revert "Remove w/COPYING and w/CREDITS dead symlinks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227210 (https://phabricator.wikimedia.org/T107007) (owner: 10Hashar) [19:47:56] Krenair: yes, it's incredibly useful for mail delivery debug [19:48:42] Krenair: I can sit tight for the moment and not have access for a short term, but it would be very helpful to find a work around. [19:50:28] it got broken at some point recently when ops added firewalls to those hosts stopping you from SSHing into them externally. 
They would be able to fix it by either adding you to the bastiononly group or adding the oit group to a bastion host [19:50:50] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1486022 (10RobH) a:3mpopov So there are a number of things that have to be met for us to process this access request. All of these items are outlined on: https://wiki... [19:51:25] (03PS1) 1020after4: Fix symlinks for w/COPYING and w/CREDITS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227326 [19:51:34] Krenair: yep, moritzm brought this up about 2 hours ago.. [19:51:57] thanks for asking me about it [19:52:26] (03PS2) 1020after4: Fix symlinks for w/COPYING and w/CREDITS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227326 (https://phabricator.wikimedia.org/T107007) [19:52:33] I noticed it when going through that bastiononly group earlier, am uploading a commit that changes this all around though [19:53:21] (03CR) 1020after4: [C: 032] Fix symlinks for w/COPYING and w/CREDITS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227326 (https://phabricator.wikimedia.org/T107007) (owner: 1020after4) [19:53:25] Krenair: thanks [19:53:26] (03Merged) 10jenkins-bot: Fix symlinks for w/COPYING and w/CREDITS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227326 (https://phabricator.wikimedia.org/T107007) (owner: 1020after4) [19:54:47] !log twentyafterfour Synchronized w/: deploy https://gerrit.wikimedia.org/r/#/c/227326/ (duration: 00m 12s) [19:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:07] bd808: yup, will need to bounce rsyslog [19:56:18] godog: k. pretty obviously we should have something better for this (like pybal for logstash cluster?) 
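Note on the rsyslog guess above ("caches the ip and doesn't check for updates"): a minimal sketch of a sender that re-resolves its target instead of holding one address forever. This is not how rsyslog works internally. The class name, the `ttl_seconds` parameter, and the injectable `resolver`/`clock` arguments are all hypothetical and exist only for illustration.

```python
import socket
import time


class ReResolvingUdpSender:
    """Hypothetical UDP log sender that looks its target up again after a TTL."""

    def __init__(self, host, port, ttl_seconds=60.0,
                 resolver=socket.getaddrinfo, clock=time.monotonic):
        self.host = host
        self.port = port
        self.ttl_seconds = ttl_seconds
        self._resolver = resolver      # injectable so the lookup can be faked
        self._clock = clock            # injectable so time can be faked
        self._addr = None
        self._resolved_at = None
        self._sock = None              # created lazily on first send

    def _current_addr(self):
        """Return the cached address, resolving again once it is stale."""
        now = self._clock()
        stale = (self._resolved_at is None
                 or now - self._resolved_at >= self.ttl_seconds)
        if stale:
            infos = self._resolver(self.host, self.port,
                                   socket.AF_INET, socket.SOCK_DGRAM)
            self._addr = infos[0][4]   # sockaddr of the first result
            self._resolved_at = now
        return self._addr

    def send(self, message):
        """Send one message to whatever the hostname currently resolves to."""
        if self._sock is None:
            self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.sendto(message.encode("utf-8"), self._current_addr())
```

With something like this, a host that moves racks and gets a new IP would be picked up after at most `ttl_seconds`, without bouncing the sender on every client.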
[19:57:03] (03PS1) 10Alex Monk: Add all groups to bast1001, empty bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227327 [19:57:28] chasemp, ^ [19:58:46] !log bounce rsyslog on mw in codfw in batches [19:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:05] gwicke cscott arlolra subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150727T2000). Please do the needful. [20:01:32] 6operations: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1486053 (10JKrauska) 3NEW [20:01:52] bd808: for example, yeah [20:02:01] (03CR) 10Yuvipanda: [C: 032] "...ok!" [puppet] - 10https://gerrit.wikimedia.org/r/227324 (owner: 10coren) [20:03:19] (03CR) 10Alex Monk: "hooft is an even weirder case because it allows only a subset of the groups which bast1001 allows. I'm not sure what purpose it's supposed" [puppet] - 10https://gerrit.wikimedia.org/r/227327 (owner: 10Alex Monk) [20:03:53] (03CR) 10CSteipp: "Anomie, does votewiki use ULS? I'm not familiar enough with that setup to know if we're relying on it there." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [20:06:29] (03CR) 10Anomie: "> Anomie, does votewiki use ULS?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [20:07:23] 6operations, 6Multimedia, 10Wikimedia-Media-storage, 7Monitoring: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#1486072 (10Bawolff) >In general, if that happens, it indicates a pretty serious problem with upload or new thumbnails. There c... 
[20:07:35] !log bounce rsyslog on mw in eqiad in batches [20:07:37] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1486073 (10Ironholds) The public key is in the initial comment. [20:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:16] (03PS2) 10Alex Monk: Add all groups to bast1001, empty bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227327 [20:08:29] (03PS8) 10Yuvipanda: dynamicproxy/tools: set up outage error system [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [20:14:07] bd808: there's also an exception on logstash1001's logstash log [20:15:02] godog: yuck. [20:16:56] !log Depooled Precise scalers (mw1159 and mw1160) again, for testing. [20:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:16] !log (A rise in 503s/minute expected. I'll keep it brief.) [20:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:20] (03PS1) 10Jforrester: Enable VisualEditor for 10% of new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227329 [20:18:22] (03PS1) 10Jforrester: Enable VisualEditor for 20% of new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227330 [20:18:42] is anyone investigating https://phabricator.wikimedia.org/T106986 ? [20:18:46] (03CR) 10Anomie: [C: 04-1] Disable a bunch of extensions on loginwiki/votewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [20:20:39] (03CR) 10Jforrester: [C: 04-1] "Not for a while." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227329 (owner: 10Jforrester) [20:20:47] (03CR) 10Jforrester: [C: 04-1] "Not yet." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/227330 (owner: 10Jforrester) [20:21:01] deploying new parsoid code now. [20:24:20] (03PS4) 10Alex Monk: Disable a bunch of extensions on loginwiki/votewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) [20:25:53] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1486107 (10Krenair) [20:27:49] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1486053 (10Krenair) I tried to find the relevant lines from #wikimedia-operations earlier: ```Jul 27 18:24:54 does our exim config try sending to a mailman li... [20:33:26] !log deployed parsoid version 92f1cd6d [20:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:05] (03PS1) 10Alex Monk: Add azbwiki.labsdb to tools hosts [puppet] - 10https://gerrit.wikimedia.org/r/227333 (https://phabricator.wikimedia.org/T107081) [20:34:21] (03CR) 10Yuvipanda: [C: 032] "Thanks for the patch :)" [puppet] - 10https://gerrit.wikimedia.org/r/222753 (https://phabricator.wikimedia.org/T102971) (owner: 10Merlijn van Deen) [20:36:15] Krenair: ok to merge? [20:36:19] (the hosts patch) [20:36:54] YuviPanda, I think it's fine, yeah [20:37:00] your call though.. 
[20:37:25] (03PS2) 10Yuvipanda: tools: add azbwiki.labsdb to tools hosts [puppet] - 10https://gerrit.wikimedia.org/r/227333 (https://phabricator.wikimedia.org/T107081) (owner: 10Alex Monk) [20:37:31] (03PS3) 10Yuvipanda: tools: add azbwiki.labsdb to tools hosts [puppet] - 10https://gerrit.wikimedia.org/r/227333 (https://phabricator.wikimedia.org/T107081) (owner: 10Alex Monk) [20:37:43] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: add azbwiki.labsdb to tools hosts [puppet] - 10https://gerrit.wikimedia.org/r/227333 (https://phabricator.wikimedia.org/T107081) (owner: 10Alex Monk) [20:38:02] Krenair: done [20:38:24] ty [20:38:45] Krenair: yw. thanksf or the patch [20:38:47] *for [20:40:03] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1486145 (10JKrauska) @Krenair @JohnFLewis @greg-g The ask seems to directly mimic the exiting groups alias for ops@. I feel like it's a low bar to ask for the same tr... [20:40:40] cajoel: fyi I'm JohnLewis on phab and greg-g is greg :) [20:41:11] @subscribers [20:41:14] :) [20:42:15] (03PS1) 10GWicke: Lower the InitiatingHeapOccupancyPercent from 40% to 35% [puppet] - 10https://gerrit.wikimedia.org/r/227335 [20:43:12] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1486155 (10JKrauska) The 'prior art' got turned in to markup in the description. ``` ops: ops@lists.wikimedia.org ``` [20:43:36] cajoel, thanks for explaining [20:43:38] sounds good to me [20:44:19] Krenair: amazingly Lisa just happened to showcase why it's a needed thing.. 
:) [20:44:29] (03PS1) 10Yuvipanda: dynamicproxy: Make sure /var/www exists too [puppet] - 10https://gerrit.wikimedia.org/r/227336 [20:44:32] greg-g: want to make sure you see it too [20:44:41] ^^^ [20:44:58] (03PS2) 10Yuvipanda: dynamicproxy: Make sure /var/www exists too [puppet] - 10https://gerrit.wikimedia.org/r/227336 [20:45:13] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Make sure /var/www exists too [puppet] - 10https://gerrit.wikimedia.org/r/227336 (owner: 10Yuvipanda) [20:45:28] (03PS2) 10GWicke: Lower the InitiatingHeapOccupancyPercent from 40% to 35% [puppet] - 10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) [20:47:06] godog: I wonder if we need to install libbcprov-java to fix that class not found for org.bouncycastle.jcajce.provider.digest.MD5$Digest on the logstash1001 [20:49:45] Coren: I'm trying to understand https://gerrit.wikimedia.org/r/#/c/226937/2/modules/labstore/files/storage-replicate. What happens if the umount fails, leaves it mounted and we run an rm -rf? [20:50:22] YuviPanda: It can't. It's a forced lazy unmount - the only way 'umount' can fail then is if there was nothing mounted at that point in the first place. [20:50:48] Coren: sure, but then we shouldn't need an unconfitional rm -rf there, right? [20:51:09] I'm just vary of an rm -rf that could also potentially start rm-rfing random things :) [20:51:11] YuviPanda: Yes, because otherwise the mountpoint /directory/ and the lockdir will still be there. [20:51:23] Coren: yes, but if the umount fails don't we want them to be there? [20:51:25] bd808: sure worth it a try, I don't think logstash1001 is back judging from the load before/after reimage [20:51:32] Coren: if it fails it'll crash and we'll have to manually see what's up [20:52:18] godog: I'll add that package manually and restart the service to see if that makes things better [20:52:28] Coren: isn't that what we want? 
[20:52:29] and of course make a ticket to track it [20:52:38] YuviPanda: No, if the unmount fails it means the filesystem was never mounted - basically that can happen at any point between the acquiring of the lockdir and the sucessful mount. [20:52:59] YuviPanda: I don't *mind* having to have a manual step - but that doesn't seem to be a very useful one. [20:53:18] YuviPanda: Although it should be very rare. The most likely scenario is the lvcreate failing [20:53:49] Coren: right and i would rather have it crash and notify us than go on with it. [20:54:02] Allright. It's a reasonable position. [20:54:02] Coren: I don't think that umount failing is recoverable automatically [20:54:06] So it should crash [20:54:09] 6operations, 10RESTBase, 10RESTBase-Cassandra: test Cassandra 2.1.7 - https://phabricator.wikimedia.org/T101745#1486198 (10GWicke) 5Open>3Resolved a:3GWicke We have been running 2.1.7 in production for a while now. It's been mostly working fine, but also didn't resolve bootstrap issues we are seeing wi... [20:54:10] !log installed libbcprov-java and restarted logstash on logstash1001 [20:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:23] With this I feel like we should let it crash and notify... [20:54:51] You're still not getting it though - the umount can't fail unless there was nothing mounted, which can happen anywhere between ll. 223-240 [20:55:00] Most of those are harmless (usually out of space, etc) [20:55:12] (03PS1) 10Ori.livneh: $wgEventLoggingSchemaApiUri: http -> https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227345 [20:55:17] But they should also be rare enough that I don't mind it not recovering automatically. [20:55:19] Coren: sure, so if it can't fail and it does fail we want to be notified right? [20:55:33] Right. 
[20:55:38] (03CR) 10Ori.livneh: [C: 032] $wgEventLoggingSchemaApiUri: http -> https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227345 (owner: 10Ori.livneh) [20:55:49] I don't think we should auto recover there [20:56:03] (03Merged) 10jenkins-bot: $wgEventLoggingSchemaApiUri: http -> https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227345 (owner: 10Ori.livneh) [20:56:06] (03CR) 10CSteipp: "Done: I2af26d23b9343e90db2f01f099c1292914bd7ac3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [20:56:14] Heh. It's not "can't fail" so much as "will have an error return iff there was nothing to unmount" which means an exception but is not anomalous. [20:56:43] Coren: I guess I'm being overly cautious but I feel that's OK in this context :) [20:56:46] I would have thought that the actual failiure in the log was sufficient - but like I said it should be rare enough. [20:56:53] So I don't mind. [20:57:02] BTW I had another Q [20:57:15] !log ori Synchronized wmf-config/CommonSettings.php: I1ca47ebc4: $wgEventLoggingSchemaApiUri: http -> https (duration: 00m 12s) [20:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:25] Coren: sure [20:58:20] subprocess.check_output() doesn't say what happens to stderr of the subprocess - afaict it just discards it? [20:58:32] Coren: that info is in the exception [20:58:48] And hence echoed into the log [20:59:17] Is it? My understanding is that you can only grab stderr by merging it into stdout? [21:00:39] (Loosing stderr would be an issue if it's lvs or lvcreate that fails - the actual error message is very important then) [21:00:48] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1486237 (10RobH) [21:01:03] (03PS2) 10Andrew Bogott: Set up labnet1002 as a spare for labnet1001. 
[puppet] - 10https://gerrit.wikimedia.org/r/227270 [21:01:43] YuviPanda: Yeah, "subprocess.CalledProcessError: Command '['/bin/ls', '/foo']' returned non-zero exit status 2" [21:01:56] stderr is just dumped to stderr by default [21:01:59] YuviPanda: Having the return code is not useful. [21:02:03] is there any problem with xml dumps? [21:03:17] (03CR) 10Andrew Bogott: [C: 032] Set up labnet1002 as a spare for labnet1001. [puppet] - 10https://gerrit.wikimedia.org/r/227270 (owner: 10Andrew Bogott) [21:03:20] valhallasw`cloud: Hm. If the proces stderr ends up in the log then we don't lose the info - that's better than nothing, though just a bit less clear than it could be. Not a big issue then. [21:03:52] legoktm, wondering if we should group the special-wikipedia sites together somehow [21:04:07] either by their own dblist included in most of the same places as wikipedia.dblist [21:04:15] or just directly into wikipedia.dblist [21:04:21] https://www.irccloud.com/pastebin/gpL16fTF/ [21:04:24] Coren: ^ alternatively [21:04:33] * Coren still very very much dislikes tracebacks in logs as opposed to cleanly caught errors with a clear message that creates a single log entry with adequate severity. Ohwells. [21:04:55] I'm not sure why the error object doesn't just have an .stderr attribute [21:05:08] valhallasw`cloud: Yeah, but merging stderr into stdout is very very bad. [21:05:41] valhallasw`cloud: I know - that's why I did the popen and used subprocess.communicate() in my (admitedly much more verbose) version. [21:07:06] Coren: brb meeting [21:07:22] godog: I think I got it working again (logstash1001) [21:07:41] load is back up to ~1 and I'm finally seeing some hhvm and apache2 logs [21:13:10] bd808: cool! [21:16:05] 6operations, 10ops-eqiad, 10Analytics-Cluster, 5Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1486306 (10Ottomata) 1042-1045 are installed and part of the Hadoop cluster. 
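Note on the stderr question above: a minimal sketch of the Popen-plus-communicate approach mentioned in the chat, which keeps stderr separate from stdout and attaches it to the raised error. `CommandError` and `run_checked` are hypothetical names for this sketch, not taken from the labstore script.

```python
import subprocess


class CommandError(Exception):
    """Raised when a command exits non-zero; carries its stderr text."""

    def __init__(self, cmd, returncode, stderr):
        super().__init__(
            "Command %r exited with status %d: %s" % (cmd, returncode, stderr)
        )
        self.cmd = cmd
        self.returncode = returncode
        self.stderr = stderr


def run_checked(cmd):
    """Run cmd, return its stdout bytes, raise CommandError with stderr on failure."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()   # reads both pipes without deadlocking
    if proc.returncode != 0:
        raise CommandError(cmd, proc.returncode,
                           err.decode("utf-8", "replace").strip())
    return out
```

This gives a single error whose message contains what the command printed (the important part if it is lvs or lvcreate that fails) without merging stderr into stdout. On Python 3.5 and later, `subprocess.run(..., stderr=subprocess.PIPE, check=True)` raises a `CalledProcessError` that carries a `stderr` attribute, which covers the same need with less code.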
[21:17:04] PROBLEM - puppet last run on strontium is CRITICAL puppet fail [21:24:36] !log rebooting labnet1002, just to see if I can [21:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:38] (03PS1) 10Alex Monk: Add special wikipedias to wikipedia.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227351 [21:29:32] (03CR) 10Hashar: [C: 04-1] "I love this new scap feature and I am eager to see being used on beta cluster to deploy all the backend services whenever a merge occur on" (0317 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 (owner: 10Thcipriani) [21:33:30] PROBLEM - NTP on labnet1002 is CRITICAL: NTP CRITICAL: Offset unknown [21:34:40] RECOVERY - puppet last run on strontium is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [21:34:53] !log ori Synchronized php-1.26wmf15/extensions/AbuseFilter: I13d29ea6: Revert "Conversion to using getMainStashInstance()" (duration: 00m 12s) [21:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:35:02] (03PS2) 10Ori.livneh: memcached: logrotate only csv files [puppet] - 10https://gerrit.wikimedia.org/r/227191 (owner: 10Filippo Giunchedi) [21:35:12] (03CR) 10Ori.livneh: [C: 032 V: 032] memcached: logrotate only csv files [puppet] - 10https://gerrit.wikimedia.org/r/227191 (owner: 10Filippo Giunchedi) [21:35:14] (03PS1) 10Eevans: enabled GC logging [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) [21:35:30] RECOVERY - NTP on labnet1002 is OK: NTP OK: Offset -5.125999451e-05 secs [21:37:04] 10Ops-Access-Requests, 6operations, 10Graphoid: Allow mobrovac to restart Graphoid - https://phabricator.wikimedia.org/T106814#1486375 (10RobH) This was discussed in the operations weekly meeting, and the ability to restart graphoid as a restricted admin type account (non sudo) was approved. This will still... 
[21:37:21] Coren: valhallasw`cloud re: stderr, well, it gets out into syslog just before the exception stacktrace so I think that's ok [21:37:56] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1486379 (10RobH) a:3RobH [21:40:05] (03CR) 10BryanDavis: [C: 031] "Works well for me on 2 different trusty deploy servers I built in labs." [puppet] - 10https://gerrit.wikimedia.org/r/225554 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [21:41:15] bd808: want me to merge? I can do so if you're going to be around to tell me if it fucked up :) [21:41:21] tgr: ^ [21:41:35] (03PS1) 10RobH: adding user madhuvishy to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/227357 [21:41:59] YuviPanda: sure. we will know if it's bad if trebuchet stops working on tin [21:42:21] (03PS2) 10Yuvipanda: trebuchet: Update deployment_server to support Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225554 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [21:42:25] I can push a test repo deploy after you merge and force puppet on tin to test [21:42:28] (03PS3) 10Yuvipanda: trebuchet: Update deployment_server to support Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225554 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [21:42:33] (03CR) 10Yuvipanda: [C: 032 V: 032] trebuchet: Update deployment_server to support Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225554 (https://phabricator.wikimedia.org/T84956) (owner: 10Gergő Tisza) [21:43:00] bd808: there. wnat me to force ar un on tin? 
[21:43:03] err [21:43:14] yes I do [21:44:44] (03PS8) 10Alex Monk: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [21:45:23] bd808: done [21:45:56] (03PS1) 10Krinkle: Add temporary rl-test.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227358 (https://phabricator.wikimedia.org/T105255) [21:46:10] YuviPanda: testing.... [21:46:30] (03PS2) 10Ori.livneh: Enable ESI for testwiki [puppet] - 10https://gerrit.wikimedia.org/r/225243 [21:46:38] (03CR) 10Ori.livneh: "bump! :)" [puppet] - 10https://gerrit.wikimedia.org/r/225243 (owner: 10Ori.livneh) [21:47:30] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1486406 (10RobH) Analytics-admins grants the user rights to sudo as hdfs, as well as oozie and hive. I think this is indeed the correct group to... [21:48:14] (03CR) 10RobH: [C: 031] "I think this is right, but I'd like someone with more in depth analytics cluster experience to review as well.
(I see no better way to gr" [puppet] - 10https://gerrit.wikimedia.org/r/227357 (owner: 10RobH) [21:48:31] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1486407 (10RobH) a:5RobH>3Ottomata [21:49:10] (03CR) 10Alex Monk: [C: 031] Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [21:49:53] YuviPanda: tin isn't broken by it :) [21:50:28] !log updated scap to dc8eda5 (Don't exclude PHP files from being synced) [21:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:55] 10Ops-Access-Requests, 6operations: Add Matanya to "restricted" to perform server side uploads - https://phabricator.wikimedia.org/T106447#1486414 (10RobH) a:3mark This was discussed in the operations meeting. The end result is we (ops) dislike that a restricted shell access group is required to accomplish... [21:51:39] bd808: wheeeeee [21:52:23] YuviPanda: https://gerrit.wikimedia.org/r/#/c/225552/ is related and could use some review and ops input [21:53:51] bd808: hmm, probably someone with more apache knowledge than me, I guess :( [21:53:59] or maybe that's just me punting. not sure... [21:54:06] heh [21:54:29] the best way to get apache knowledge is to break all apaches right? [21:55:09] YuviPanda: do you think it's a sane approach (vs. doing a separate patch per module)? 
[21:56:15] it's half-finished and I'm not sure if I should continue or split it up or just drop it and leave it to people who actually encounter problems with apache modules on trusty to fix them [21:59:09] (03CR) 10CSteipp: "We don't necessarily need to bypass the cache for scanning, although caching 500 different urls for one page because we're fuzzing one par" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [21:59:33] tgr: if I were you I'd do the last bit (leave it to people who actually encounter them) but I'm not a very nice person anymore :) [21:59:45] tgr: but as for reviewing / merging, I think one per module would be easiest? [22:00:22] (03CR) 10Alex Monk: [C: 04-1] Move sourceswiki special.dblist->wikisource.dblist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [22:01:34] (03PS9) 10Alex Monk: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [22:01:41] (03CR) 10Alex Monk: [C: 031] Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) (owner: 10coren) [22:03:57] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1486448 (10dcausse) We scheduled the upgrade on Thu, Jul 30, 5PM UTC. Should be 7PM CET (for me) and 10AM PST (for Erik) @chasemp is this ok for you? [22:09:22] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1486451 (10chasemp) >>! In T106090#1486448, @dcausse wrote: > We scheduled the upgrade on Thu, Jul 30, 5PM UTC. > Should be 7PM CET (for me) > and 10AM PST (for Erik) > > @cha... 
[22:09:32] (03CR) 10Yuvipanda: labstore: Make script use exceptions instead of return value checking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [22:09:37] Coren: ^ can you remove your -1? [22:10:43] (03CR) 10coren: [C: 031] "Given how rare the divergent case is, having to intervene manually is not a hardship. Since the code is cleaner for it, switching to +1 a" [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [22:11:22] (03PS3) 10Yuvipanda: labstore: Make script use exceptions instead of return value checking [puppet] - 10https://gerrit.wikimedia.org/r/226937 [22:11:58] (03CR) 10Eevans: "> Lower the InitiatingHeapOccupancyPercent from 40% to 35%" [puppet] - 10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) (owner: 10GWicke) [22:13:11] valhallasw`cloud: do you anna remove your -1 too? :) [22:19:17] !log rebooting labvirt1005 [22:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:19:40] PROBLEM - puppet last run on labvirt1005 is CRITICAL: Timeout while attempting connection [22:20:29] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [22:21:54] (03CR) 10Yuvipanda: [C: 032] labstore: Make script use exceptions instead of return value checking [puppet] - 10https://gerrit.wikimedia.org/r/226937 (owner: 10Yuvipanda) [22:25:40] RECOVERY - Host labvirt1005 is UPING OK - Packet loss = 0%, RTA = 0.44 ms [22:26:45] Yes, Krenair. [22:27:03] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1486473 (10mpopov) @RobH Yep, I am an analyst for the discovery team. Just signed the doc and, as @Ironholds said, the public key is in the task description. //"We need... 
[22:29:57] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1486485 (10RobH) a:5mpopov>3Tfinc [22:30:43] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1485269 (10RobH) @mpopov: I could look it up, but please list your wikitech username (so I'm not guessing ;) We base your UID off that. [22:31:19] (03PS3) 10MaxSem: role::maps::master: Import waterlines on init and then weekly [puppet] - 10https://gerrit.wikimedia.org/r/225702 [22:32:41] (03CR) 10MaxSem: role::maps::master: Import waterlines on init and then weekly (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/225702 (owner: 10MaxSem) [22:37:36] (03CR) 10BBlack: [C: 032] geolanglist: always use text-lb for primary [dns] - 10https://gerrit.wikimedia.org/r/227243 (owner: 10BBlack) [22:37:50] (03CR) 10BBlack: [C: 032] disable geolang zerodot for all but wp.org [dns] - 10https://gerrit.wikimedia.org/r/227244 (owner: 10BBlack) [22:38:30] 6operations, 6Services, 10Traffic: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1486521 (10BBlack) [22:38:56] 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1486527 (10RobH) 3NEW a:3RobH [22:39:25] 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1486536 (10RobH) Please note the watchmouse checks should remain in place. 
[22:41:47] (03CR) 10GWicke: "@eevans, on the linked Jira ticket others report that lowering the initiating threshold helps to avoid old gen pauses:" [puppet] - 10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) (owner: 10GWicke) [22:42:04] (03PS3) 10GWicke: Lower the InitiatingHeapOccupancyPercent from 45% to 35% [puppet] - 10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) [22:42:57] (03PS4) 10GWicke: Lower the InitiatingHeapOccupancyPercent from 40% to 35% [puppet] - 10https://gerrit.wikimedia.org/r/227335 [22:46:09] (03PS4) 10MaxSem: role::maps::master: Import waterlines on init and then weekly [puppet] - 10https://gerrit.wikimedia.org/r/225702 [22:48:03] bd808: thoughts on my comment at https://gerrit.wikimedia.org/r/#/c/221827/ [22:48:10] (if you have time, etc) [22:48:53] thoughts about forking the dev environment per project? How about "yuck" and "gross" :) [22:49:21] bd808: that's what I'm reccomending to things that are not dependent on MW [22:49:46] ok. have fun supporting those things [22:50:23] 6operations, 6Services, 10Traffic: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1486589 (10GWicke) @bblack, that sounds like a good idea to me. Do you think the backend Varnish layer would be the easiest / least invasive place for now? [22:50:24] I would actually like to make MW optional in mw-vagrant (weird I know) [22:50:29] bd808: that'll work too [22:50:39] the quarry one was a 9line bash script, and so was the ORES one. [22:51:02] I just don't think having to figure out the entire mw clone / composer dance to get something that doesn't depend on it at all working. [22:51:19] bd808: the other problem is that I'm reccomending people base stuff off jessie as well, which doesn't fit into MWV. [22:51:27] I know, I know, patches welcome... 
[22:51:54] mostly because techops doesn't support jessie for MW ;) [22:52:00] indeed [22:52:15] which makes jessie a bit of a weird thing right? [22:52:34] it is, but all non-MW new things are jessie now. we're in weird state here, I agree. [22:52:37] techops loves it but not enough to test and then reimage 400+ servers [22:52:50] unfortunately, indeed. [22:53:47] * bd808 still needs to learn how systemd actually works [22:54:13] * SPF|Cloud wonders why Debian chose systemd [22:54:38] bd808: it's been pretty cool. [22:54:48] SPF|Cloud: https://wiki.debian.org/Debate/initsystem/systemd [22:54:58] SPF|Cloud: look up 'devuan' as well [22:55:05] I know [22:55:10] http://without-systemd.org [22:56:09] hah [22:56:11] mediawiki [22:56:25] And Debian 7 without systemd [22:56:29] hahaha [22:56:32] 1.19.20+dfsg-0+deb7u3 [22:56:59] legoktm: ^ they need you to finish your new package :) [22:57:14] :< [22:57:19] sigh. [22:57:27] I'm making progress! [22:57:43] I'm not sure if jessie introduced "systemctl" because of systemd but if yes then I'd like to have sysvinit back [22:58:22] I thought jessie was cool, fun and the better but actually I sometimes just want wheezy back :( [22:58:32] you kids have just been spoiled by unification of init systems for the last 15 years or so. [22:58:39] heh [22:58:43] * YuviPanda likes systemd [22:58:47] back in my day every un*x was different and we liked it [22:58:54] heh [22:58:57] * SPF|Cloud likes wheezy [22:59:28] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1486657 (10RobH) I've requested updated quotes on https://rt.wikimedia.org/Ticket/Display.html?id=8524 and... [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150727T2300). 
[23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:24] and we wrote un*x on things to avoid AT&T's ® [23:00:55] after quite a while https://gerrit.wikimedia.org/r/#/c/187654/ is ready for code review. However, I'm a little bit afraid that this change might create trouble [23:01:18] * bd808 has a business card that says "UN*X-based Computer Technician" [23:01:38] * SPF|Cloud has 13 months of experience with Linux so he says nothing [23:01:58] I'll do SWAT [23:02:13] physikerwelt: "This change will purge caches for all pages that include math." sounds a bit scary [23:02:44] * ebernhardson shows up [23:02:46] SPF|Cloud: And you are afraid of a bit of change? [23:02:46] (03CR) 10Alex Monk: "I kind of meant this commit. So are we good to go here now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [23:03:28] bd808: That's why I want that someone merges it without carfully considering the consequences [23:03:47] MERGE ALL TEH THINGS [23:03:56] THEN DON'T BE AROUND FOR TEH DEPLY [23:04:36] back under your bridge troll Reedy. It's not feeding time yet. ;) [23:05:08] I wanted to visit the bugzilla quips page so I can add that comment, but then I saw Wikimedia migrated to phabricator [23:05:38] Hey RoanKattouw, how do you feel about trying https://gerrit.wikimedia.org/r/#/c/226971/ in this swat? [23:05:42] Reedy: :-) [23:05:46] someday quips will return to https://tools.wmflabs.org/bash/. Someday. 
[23:05:53] SPF|Cloud: Important task is important [23:05:53] https://phabricator.wikimedia.org/T73245 [23:06:15] nothing is more important than T73245 [23:06:30] (03CR) 10Mobrovac: [C: 031] "This is not a miracle-worker patch, but it is true that this prompts G1GC to start the mark phase earlier, thus potentially giving us some" [puppet] - 10https://gerrit.wikimedia.org/r/227335 (owner: 10GWicke) [23:06:30] bd808 ... I mean That's why I do NOT want that someone merges it, without carfully considering the consequences. [23:06:31] I actually want to find some time to write a phab extension for quips [23:07:46] Krenair: Ahm, maybe not [23:07:48] I'm doing 3 things at the same time [23:07:50] Reedy: do it! we put the bugzilla database dump somewhere on dumps, have a scavenger hunt too [23:07:51] (03CR) 10CSteipp: "I'd support deploying this as soon as I2af26d23b9343e90db2f01f099c1292914bd7ac3 is in production, so we can measure the impact." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [23:07:54] RoanKattouw, heh, okay [23:08:01] there's no shortage of nicer things [23:08:03] physikerwelt: what part of it will purge everything? 
[23:08:08] JohnFLewis: Yeah, there's a few copies of it around [23:08:19] I can't imagine it'd be too hard [23:08:28] The hardest part would be familiarising with the phab codebase [23:08:45] quips.wikimedia.org [23:08:56] (03CR) 10Eevans: "> @eevans, on the linked Jira ticket others report that lowering the" [puppet] - 10https://gerrit.wikimedia.org/r/227335 (owner: 10GWicke) [23:08:57] I should hope after spending at least two months talking with people to get reviews, approvals and so on [23:09:04] bd808 https://gerrit.wikimedia.org/r/#/c/187654/29/Math.hooks.php l99 [23:09:09] Krenair: Some other day maybe [23:09:14] :) [23:09:26] (03CR) 10Mobrovac: [C: 04-1] enabled GC logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [23:09:49] My latest evil plan is to store quips in an elasticsearch index, allow new ones to be entered with '!bash' irc commands and make a pretty page for viewing on tools [23:09:54] * Krenair added some to the calendar [23:10:21] Reedy: it'll always exist anyway, no one could be bothered to remove the database from whichever share db2030 was temp allocated to ;) [23:10:30] *shard [23:10:37] Is https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/specials/SpecialGlobalRenameUser.php#L36 this way of creating forms documented somewhere? [23:10:38] bd808: see https://gerrit.wikimedia.org/r/#/c/187654/29/tests/MathUtilsTest.php for how the integer constants where renamed [23:10:57] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1486712 (10Merl) >>! In T105794#1484730, @BBlack wrote: > ^ Added Merl (I'm guessing is the maintainer of MerlBot). Thx. But my bot send its first request using http and then is redirected... 
[23:11:06] bd808: see https://gerrit.wikimedia.org/r/#/c/187654/29/tests/MathUtilsTest.php for how the integer constants were renamed [23:11:10] physikerwelt: any idea how many pages will be effected? [23:11:14] SPF|Cloud: You must be new around here [23:11:21] kinda :p [23:11:40] 30k [23:11:53] in NS0 [23:12:06] I want to blame the one who invented that thing because I wasted an hour of my life because I develop my new extension with Xml::XXX Reedy [23:12:19] o_0 [23:12:37] And HTMLForm doesn't work for me because of a imo stupid HHVM bug [23:12:46] Last time I did it, I did it by example [23:12:55] SPF|Cloud: https://www.mediawiki.org/wiki/HTMLForm/tutorial [23:12:55] for the contactpage extension rewrite [23:13:11] SPF|Cloud: looks at includes/htmlform/HTMLForm.php, the class docblock explains it [23:13:12] bd808: I know how that works :) [23:13:15] s/looks/look/ [23:13:37] But https://phabricator.wikimedia.org/T59463 makes it impossible for me to work with [23:13:40] oh, might not explain why it doesn't work :P but thats the doc that exists [23:14:20] Or someone just gotta tell me that's not because of HTMLForm.... 
then I'll leave this MediaWiki world and blame myself all day [23:14:53] JohnFLewis would be (or wouldn't be) happy I guess :D [23:14:58] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1486747 (10Tfinc) Approved [23:15:24] SPF|Cloud: finish the extension then you can leave [23:15:48] * Reedy labels JohnFLewis as a tarp [23:16:23] * JohnFLewis cautiously says thanks then backs closer to the door away from Reedy [23:17:01] One more word about that John and you may write CreateWiki yourself [23:17:36] (03PS2) 10Eevans: enabled GC logging [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) [23:17:42] SPF|Cloud: you know you're better at that stuff than me :) [23:18:16] Never [23:18:41] (03CR) 10Eevans: enabled GC logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [23:18:55] SPF|Cloud: I'm not saying it's not, but why would htmlform cause DatabaseMysqli->conn to be an invalid db handle? [23:19:08] Krenair: Oh crap only just noticed your additions [23:19:16] The core queue is congested anyway so I'll just do those now then [23:19:26] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1486781 (10Mattflaschen) Can the current External Store be backed up before being decommissioned? 
[23:19:30] RoanKattouw, they're not high priority [23:19:33] (03CR) 10Catrope: [C: 032] Add HTTPS variants for RSS feed whitelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222691 (https://phabricator.wikimedia.org/T104727) (owner: 10Jeremyb) [23:19:45] (03CR) 10Catrope: [C: 032] NewUserMessageOnAutoCreate = true for gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226966 (https://phabricator.wikimedia.org/T106169) (owner: 10Alex Monk) [23:19:55] (03CR) 10Catrope: [C: 032] Enable GuidedTour on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226967 (https://phabricator.wikimedia.org/T103659) (owner: 10Alex Monk) [23:19:57] (03Merged) 10jenkins-bot: Add HTTPS variants for RSS feed whitelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222691 (https://phabricator.wikimedia.org/T104727) (owner: 10Jeremyb) [23:20:11] (03CR) 10Catrope: [C: 032] Fix site name and meta namespace of zh_min_nanwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226968 (https://phabricator.wikimedia.org/T106639) (owner: 10Alex Monk) [23:20:27] (03Merged) 10jenkins-bot: NewUserMessageOnAutoCreate = true for gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226966 (https://phabricator.wikimedia.org/T106169) (owner: 10Alex Monk) [23:20:30] bd808: it would be great if you could help with https://gerrit.wikimedia.org/r/#/c/187654/29 the goal is to find someone doing a real code review... 
-1 or -2 is not a problem but if it gets merged by chance or is ignored forever that would not be optimal [23:20:37] bd808: wfMessage, something else, idk why [23:20:49] (03Merged) 10jenkins-bot: Enable GuidedTour on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226967 (https://phabricator.wikimedia.org/T103659) (owner: 10Alex Monk) [23:20:51] (03CR) 10Catrope: [C: 032] Localise suwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226970 (https://phabricator.wikimedia.org/T106784) (owner: 10Alex Monk) [23:21:12] (03Merged) 10jenkins-bot: Fix site name and meta namespace of zh_min_nanwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226968 (https://phabricator.wikimedia.org/T106639) (owner: 10Alex Monk) [23:21:19] But when re-writing the complete extension using Xml:: things it apparently works under HHVM [23:21:32] (03Merged) 10jenkins-bot: Localise suwikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226970 (https://phabricator.wikimedia.org/T106784) (owner: 10Alex Monk) [23:23:21] !log catrope Synchronized w/static/images/project-logos/suwikiquote.png: Localized logo for suwikiquote (duration: 00m 12s) [23:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:58] SPF|Cloud: $this->mConn looks like it should only be false or a resource. getBindingHandle() throws an exception if it is false. There must be a failure mode that messes up mConn but it's not jumping out at me. 
I don't think it has anything to do with htmlform though [23:24:27] !log catrope Synchronized wmf-config/InitialiseSettings.php: SWAT (duration: 00m 12s) [23:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:35] Krenair: Those are all 5 of yours [23:24:42] thanks [23:25:33] JohnFLewis ^^ interesting information for your (non-)commercial MediaWiki business :p [23:25:36] SPF|Cloud: If you can recreate easily, you should try to catch the exception in mysqlRealEscapeString() and var_dump $conn [23:25:51] Anyway (serious), idk why it throws the exception [23:26:19] Well I probably can do that [23:27:50] I will tomorrow see if I can var_dump that and if I can I will post the results in the related phab task [23:28:05] night for now [23:28:24] DatabaseMysqli::mysqlConnect() could have some better logging in the failure case [23:28:50] I guess so [23:28:50] s/better/any/ [23:29:43] (03PS1) 10Alex Monk: Follow-up I6e77eb39: Actually configure new logo for suwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227371 (https://phabricator.wikimedia.org/T106784) [23:30:24] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1486868 (10mpopov) @RobH Sure. My wikitech username is Bearloga. [23:32:05] * AaronSchulz wonders how the trigger_error in https://github.com/facebook/hhvm/blob/master/hphp/system/php/redis/Redis.php does not end up in logstash [23:33:17] AaronSchulz: does it end up in hhvm.log on fluorine? 
[23:34:17] I see it there, yes [23:34:33] #012Warning: Failed connecting to redis server at 10.64.0.184: Connection timed out [23:35:09] there are lots of those in logstash [23:35:14] (03CR) 10Mobrovac: [C: 031] enabled GC logging [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [23:35:58] bd808: searching gives empty results [23:36:39] go to https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm, expand time to lat day and search for "Failed connecting to redis server" [23:36:47] *last day [23:37:30] there is a big hole in logstash data today from the jessie upgrade we did [23:38:19] !log catrope Started scap: SWAT [23:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:27] 6operations, 5Patch-For-Review, 5WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1486888 (10aaron) Looks like the hhvm.log errors at "Connection timed out" (from https://github.com/facebook/hhvm/blob/master/hphp/system/php/redi... [23:40:22] (03CR) 10GWicke: "@eevans, https://issues.apache.org/jira/browse/CASSANDRA-7486 (linked in the config comment)." [puppet] - 10https://gerrit.wikimedia.org/r/227335 (owner: 10GWicke) [23:43:38] AaronSchulz: what's our connection timeout there in prod? 
[23:45:32] looks like the RedisConnectionPool default, which is 1sec [23:46:06] we use 250ms for memcached [23:48:19] (03PS1) 10BryanDavis: Initialise repo [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/227375 [23:49:00] (03PS2) 10BryanDavis: Initialise repo [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/227375 (https://phabricator.wikimedia.org/T99735) [23:49:47] (03CR) 10BryanDavis: [C: 032] Initialise repo [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/227375 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [23:52:42] (03CR) 10BryanDavis: [V: 032] Initialise repo [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/227375 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [23:53:38] !log Re-pooling mw1159 and mw1160 [23:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:09] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1486947 (10Cyberpower678) >>! In T105794#1484730, @BBlack wrote: > ^ Added Merl (I'm guessing is the maintainer of MerlBot). > > The plog4u ones seem to the same as the gwtwiki ones, and th... [23:59:44] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1486957 (10Cyberpower678) >>! In T105794#1486947, @Cyberpower678 wrote: >>>! In T105794#1484730, @BBlack wrote: >> ^ Added Merl (I'm guessing is the maintainer of MerlBot). >> >> The plog4u... [23:59:51] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1486958 (10RobH) Part of this project has involved evaluation our current rotation. As such, I put in everyone's waking hours (assuming daytime hours of 8AM-11PM as acceptable paging.) This result set is stored...