[00:00:04] Deploy window US Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160215T0000) [00:01:39] jouncebot, next [00:01:39] In 33 hour(s) and 58 minute(s): CXserver to Jessie/SCB (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1000) [00:01:39] T1000: Update Beta Cluster status documentation (re Q3 intradepartamental priority) - https://phabricator.wikimedia.org/T1000 [00:02:08] heh. false positive there [00:04:31] missing a \b in the regex? [00:05:42] yup [00:33:40] (03PS1) 10Alex Monk: New group/right/protection level for the English Wikipedia: establishededitor (?) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) [00:34:30] (03CR) 10jenkins-bot: [V: 04-1] New group/right/protection level for the English Wikipedia: establishededitor (?) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: 10Alex Monk) [00:35:46] suppose it would help if it was valid syntax.. [00:36:20] missing comma [00:36:39] (03PS2) 10Alex Monk: New group/right/protection level for the English Wikipedia: establishededitor (?) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) [00:45:56] (03CR) 10Alex Monk: "thoughts on the name would be welcome" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: 10Alex Monk) [01:20:08] any roots around who wouldn't mind running a command in a labs instance via salt for me? [01:45:56] (03CR) 10TTO: [C: 04-1] "Ideally the right would have a different name from the group, to minimise confusion, but it seems that ship sailed long ago. 
Given the pre" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: 10Alex Monk) [01:48:54] (03PS3) 10Alex Monk: New group/right/protection level for the English Wikipedia: establishededitor (?) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) [01:50:34] (03CR) 10Alex Monk: "good to hear the name is not completely crazy :) thanks for catching that error" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: 10Alex Monk) [01:58:25] basically I'd like someone to apply https://wikitech.wikimedia.org/w/index.php?title=Hiera:Phabricator&diff=305870&oldid=289411 manually [01:58:34] on phab-01.phabricator.eqiad.wmflabs and phab-02.phabricator.eqiad.wmflabs [01:58:38] if possible [01:58:53] it worked by simply running puppet on phab-03, but login is broken on those two [02:08:58] 6operations, 10MediaWiki-Authentication-and-authorization, 5MW-1.27-release-notes, 5Patch-For-Review: ~3000% increase in session redis memory usage, causing evictions and session loss - https://phabricator.wikimedia.org/T125267#2027695 (10Anomie) It doesn't look so bad if you look at all the servers instea... 
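[editor's note] The jouncebot false positive discussed at 00:02-00:05 (the "T1000" at the end of a deploycal-item timestamp being treated as a Phabricator task) can be reproduced with a small sketch. The bot's actual regex is not shown in the log, so both patterns below are assumptions chosen only to show what a missing \b does.

```python
import re

# Hypothetical patterns: the bot's real regex does not appear in the log.
LOOSE_TASK = re.compile(r"T\d+")        # also matches "T1000" inside a longer token
STRICT_TASK = re.compile(r"\bT\d+\b")   # requires a word boundary on both sides

# URL taken from the log: the timestamp "20160216T1000" ends in "T1000".
calendar_url = (
    "https://wikitech.wikimedia.org/wiki/Deployments"
    "#deploycal-item-20160216T1000"
)
# A genuine task reference, also taken from the log.
task_url = "https://phabricator.wikimedia.org/T126607"


def find_tasks(pattern, text):
    """Return every task-like token the pattern finds in text."""
    return pattern.findall(text)


loose_hits = find_tasks(LOOSE_TASK, calendar_url)    # false positive
strict_hits = find_tasks(STRICT_TASK, calendar_url)  # nothing matched
real_hits = find_tasks(STRICT_TASK, task_url)        # real task still found
```

With \b in place, the digit "6" directly before "T" means there is no word boundary, so the timestamp is no longer mistaken for a task ID, while "/T126607" still matches.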
[02:10:55] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: Puppet has 1 failures [02:27:46] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 71581 bytes in 1.079 second response time [02:28:05] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 498 bytes in 0.077 second response time [02:30:15] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [02:37:05] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [03:29:14] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail [03:58:55] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:18:53] (03PS1) 10KartikMistry: CX: add 'ady' and 'azb' wikis [puppet] - 10https://gerrit.wikimedia.org/r/270668 [04:26:52] (03PS2) 10KartikMistry: CX: add 'ady' and 'azb' wikis [puppet] - 10https://gerrit.wikimedia.org/r/270668 [05:41:35] (03PS1) 10KartikMistry: Add initial Debian package for giella-core [debs/contenttranslation/giella-core] - 10https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) [06:00:09] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 526 MB (5% inode=62%) [06:21:12] (03PS1) 10Dereckson: Namespace configuration on ja.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270679 (https://phabricator.wikimedia.org/T126914) [06:26:15] RECOVERY - Disk space on cp3040 is OK: DISK OK [06:26:52] (03CR) 10Dereckson: [C: 031] New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [06:27:59] RECOVERY - MariaDB disk space on silver is OK: DISK OK [06:30:44] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet 
last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:45] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:14] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:46] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:57:15] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:15] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:57:17] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:17] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:45] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:45] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:05] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:58:35] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:14] RECOVERY - puppet last run 
on mw1102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:28] twentyafterfour: thanks for the acl setup [07:02:55] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 1 failures [07:20:56] (03CR) 10Tulsi Bhagat: [C: 031] New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [07:23:22] (03CR) 10Tulsi Bhagat: [C: 031] Adding WP and WT as namespace aliases for tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269970 (https://phabricator.wikimedia.org/T126604) (owner: 10MarcoAurelio) [07:28:55] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:40:21] 6operations, 6Phabricator, 6Project-Admins: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#2027920 (10Danny_B) [07:49:13] (03PS1) 10Muehlenhoff: Blacklist aufs kernel module [puppet] - 10https://gerrit.wikimedia.org/r/270690 [08:02:58] <_joe_> moritzm: it's not used with docker anymore right? 
[08:03:03] <_joe_> (aufs [08:05:39] not that I know of, it's on out of tree patch and only present in trusty, precise and jessie (but got dropped from Debian after that, so our 3.19 kernel already doesn't have it any longer), but will double-check to be sure [08:06:06] <_joe_> I'm pretty sure it doesn't [08:06:09] <_joe_> but it used to [08:22:50] (03PS1) 10Muehlenhoff: Update to 3.19.8-ckt15 [debs/linux] - 10https://gerrit.wikimedia.org/r/270693 [08:30:45] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [08:34:16] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [08:34:25] 6Operations, 6Phabricator, 6Project-Admins: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#2027962 (10Aklapper) [08:38:41] 7Puppet, 10Beta-Cluster-Infrastructure, 5Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2027982 (10hashar) @Danny_b please bring the discussion on some list. #tracking vs Goal has nothing to do with the task at hand `Remove all ::beta roles in pup... [08:42:35] 6Operations, 10Incident-20150205-SiteOutage, 10MediaWiki-Debug-Logger, 6Reading-Infrastructure-Team, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#2027999 (10hashar) [08:47:55] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:48:14] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:49:26] PROBLEM - RAID on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:24] PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:25] PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:50:34] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:35] PROBLEM - puppet last run on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:36] <_joe_> and it's a memleak on the API cluster, how nice [08:51:04] PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:23] <_joe_> hashar: so, have you seen mobrovac's suggestion on integrating the puppet compiler with zuul? [08:53:35] <_joe_> his idea was that we add a line to the commit like [08:53:46] <_joe_> Affects: node1.eqiad.wmnet,node2.codfw.wmnet [08:53:56] <_joe_> and that triggers a run of the puppet compiler [08:54:06] <_joe_> do you think that would be hard to implement? [08:54:08] _joe_: yup I have had that idea for a while, wasn't sure whether it would be of any use to anyone [08:54:17] <_joe_> oh it would be [08:54:39] <_joe_> I think it would be even more useful once I implement regex matching for the compiler :) [08:55:34] PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:55:34] PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:55:35] PROBLEM - nutcracker port on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:55:44] PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:56:15] PROBLEM - Disk space on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:56:30] <_joe_> I'm going to powercycle mw1131 [08:57:41] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028014 (10elukey) Better from last week, but we have still not fully recovered: p75 {F3360499} p50 {F3360501} [08:57:54] PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
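[editor's note] The "Affects:" idea from 08:53 (a line in the commit message that lists hosts and triggers a puppet compiler run) could be parsed roughly as follows. This is only a sketch: the trailer format is taken from the single example in the discussion, the function name is made up, and nothing here reflects how zuul or the puppet compiler were actually wired together.

```python
import re

# Assumed trailer format, copied from the example line in the discussion:
#   Affects: node1.eqiad.wmnet,node2.codfw.wmnet
AFFECTS_RE = re.compile(r"^Affects:[ \t]*(.+)$", re.MULTILINE)


def affected_nodes(commit_message):
    """Return host names from all Affects: lines, in order, without duplicates."""
    nodes = []
    for match in AFFECTS_RE.finditer(commit_message):
        for name in match.group(1).split(","):
            name = name.strip()
            if name and name not in nodes:
                nodes.append(name)
    return nodes


# Hypothetical commit message; only the Affects: line comes from the log.
message = """Example change (hypothetical)

Affects: node1.eqiad.wmnet,node2.codfw.wmnet
Change-Id: I0000000000000000000000000000000000000000
"""

nodes = affected_nodes(message)
```

A CI job could then hand the resulting list to the compiler; regex matching of host names, which _joe_ mentions as future work, would need an extra expansion step that is not sketched here.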
[08:58:40] _joe_: heading to meeting with zeljko . Lets talk about it in an hour from now [08:58:48] <_joe_> hashar: ok [08:59:09] (03PS14) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [09:05:54] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [09:08:36] RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full [09:08:45] RECOVERY - RAID on mw1131 is OK: OK: no RAID installed [09:09:18] <_joe_> !log powercycled mw1131, OOM'd [09:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:35] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [09:09:36] RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm [09:09:36] RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:09:44] RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212 [09:09:44] RECOVERY - configured eth on mw1131 is OK: OK - interfaces up [09:09:45] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:09:55] RECOVERY - DPKG on mw1131 is OK: All packages OK [09:09:55] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 50 minutes ago with 0 failures [09:10:14] RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient [09:10:16] RECOVERY - Disk space on mw1131 is OK: DISK OK [09:10:45] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 499 bytes in 0.384 second response time [09:11:06] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 71591 bytes in 1.695 second response time [09:13:22] 6Operations, 
6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028053 (10elukey) Kernel versions: 10 3.19.0-2-amd64 2 3.2.0-31-generic 1 3.2.0-34-generic 1 3.2.0-38-generic 2 3.2.0-45-generic... [09:14:27] (03PS1) 10ArielGlenn: dumps cron: snapshot host that dumps en wp can dump regular wikis afterwards [puppet] - 10https://gerrit.wikimedia.org/r/270696 [09:17:40] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028057 (10elukey) Memcached current items stored, if I read it correctly, should show a good overview of the status of the caches: {F3360557} [09:17:51] (03CR) 10ArielGlenn: [C: 032] dumps cron: snapshot host that dumps en wp can dump regular wikis afterwards [puppet] - 10https://gerrit.wikimedia.org/r/270696 (owner: 10ArielGlenn) [09:19:54] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [09:26:54] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [09:37:03] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028108 (10elukey) At this point I think it would be really useful to understand what is the relationship between edit actions and memcached before proceeding, to es... [09:38:23] !log swift codfw-prod: ms-be2020 / ms-be2021 weight to 3500 [09:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:45] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028136 (10Joe) if we look at all the latency metrics, they are nowhere near being back to the mean and honestly the latency we still see cannot be due to cache cold... 
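[editor's note] The labstore1001 alerts above report "50.00% of data above the critical threshold [24.0]" and recover with "Less than 50.00% above the threshold [16.0]". A minimal sketch of that kind of percentage-over-threshold check is below. The real check's code is not in the log; treating 16.0 as a warning threshold, the function name, and the handling of missing datapoints are all assumptions.

```python
def check_series(datapoints, warning, critical, percent=50.0):
    """Classify a series by the share of datapoints above each threshold.

    Assumed semantics: CRITICAL if at least `percent` of the points exceed
    `critical`, WARNING if at least `percent` exceed `warning`, else OK.
    """
    values = [v for v in datapoints if v is not None]  # skip gaps in the series
    if not values:
        return "UNKNOWN"

    def pct_above(threshold):
        return 100.0 * sum(v > threshold for v in values) / len(values)

    if pct_above(critical) >= percent:
        return "CRITICAL"
    if pct_above(warning) >= percent:
        return "WARNING"
    return "OK"


# Illustrative load averages, not values from the log.
high = check_series([30.0, 30.0, 10.0, 10.0], warning=16.0, critical=24.0)
elevated = check_series([20.0, 20.0, 20.0, 10.0], warning=16.0, critical=24.0)
calm = check_series([10.0, 10.0, 10.0, 20.0], warning=16.0, critical=24.0)
```

Judging by a share of datapoints rather than the latest value keeps a single spike from paging anyone, which fits the way the alert flaps only when load stays high for a while.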
[09:51:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 3.19.8-ckt15 [debs/linux] - 10https://gerrit.wikimedia.org/r/270693 (owner: 10Muehlenhoff) [09:54:33] (03PS3) 10Muehlenhoff: Add ferm rule for graphite/labs web service [puppet] - 10https://gerrit.wikimedia.org/r/270306 [09:54:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rule for graphite/labs web service [puppet] - 10https://gerrit.wikimedia.org/r/270306 (owner: 10Muehlenhoff) [09:55:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 672 [09:58:35] (03PS1) 10Hoo man: Replace the sidebar link to commons with the commons category [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270701 (https://phabricator.wikimedia.org/T126960) [10:00:06] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 633 [10:05:59] (03CR) 10Ricordisamoa: Replace the sidebar link to commons with the commons category (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270701 (https://phabricator.wikimedia.org/T126960) (owner: 10Hoo man) [10:15:29] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028204 (10elukey) Memcached configs: mc1008.eqiad.wmnet: (DEBIAN) nobody 792 46.0 92.4 92497732 91531152 ? Ssl Feb11 2592:06 /usr/bin/memcached -p 1121... 
[10:25:06] RECOVERY - check_mysql on db1008 is OK: Uptime: 2314015 Threads: 2 Questions: 15811396 Slow queries: 15522 Opens: 5067 Flush tables: 2 Open tables: 404 Queries per second avg: 6.832 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:51:25] (03PS1) 10Hoo man: Throttle exception for de:WP:Wikimedia Deutschland/WPFF Berlinale2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270706 [10:52:31] (03CR) 10Hoo man: [C: 032] "Trivial" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270706 (owner: 10Hoo man) [10:52:57] (03Merged) 10jenkins-bot: Throttle exception for de:WP:Wikimedia Deutschland/WPFF Berlinale2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270706 (owner: 10Hoo man) [10:55:11] crap, deploying is broken [10:56:03] it's not [10:56:32] <_joe_> hoo|busy: it's still not, it will be in a few... [10:57:15] hm… it was just complaining to me about .git/objects not being writable because of permission foo [10:57:20] but ok now [10:58:00] _joe_: I guess you're going to send an email once you do the mira -> tin switch? [10:58:19] !log hoo@mira Synchronized wmf-config/throttle.php: Throttle exception for [[de:WP:Wikimedia Deutschland/WPFF Berlinale2016]] (duration: 01m 28s) [10:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:20] <_joe_> hoo|busy: yes [10:59:23] seems I just added an interwiki link to the SAL page... 
Wikitext is hard [11:04:34] (03PS1) 10Filippo Giunchedi: eqiad: add restbase1007-{b,c} instances [dns] - 10https://gerrit.wikimedia.org/r/270707 [11:08:47] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1007-{b,c} to seeds [puppet] - 10https://gerrit.wikimedia.org/r/270708 [11:10:12] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1007-b instance [puppet] - 10https://gerrit.wikimedia.org/r/270709 [11:10:43] (03PS1) 10Giuseppe Lavagetto: deployment: switch back to tin [puppet] - 10https://gerrit.wikimedia.org/r/270710 (https://phabricator.wikimedia.org/T124024) [11:11:12] hoo|busy: :o Interwikis should be done with WikiData! [11:11:38] <_joe_> hoo|busy: I'm going to switch now unless you need me not to [11:11:41] p858snake|L2_: morebots is the new Wikidata [11:11:46] _joe_: Think we're good [11:11:51] <_joe_> ok [11:12:33] jynus: 'wikidata' of course is two wikis, wikidatawiki and testwikidatawiki :) And this is not an extension but a core feature, which is at present only used by an extension [11:13:05] 6Operations, 6Phabricator, 6Project-Admins: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#2028350 (10Krenair) [11:13:45] ah! [11:13:49] better, then [11:15:46] still, if it is core, it should be applied to all wikis [11:16:05] What are you talking about? [11:16:07] sites? [11:16:46] page languages [11:17:07] Do you mean splitting the parser cache by lang? [11:17:22] Or do you have ticket at hand? [11:17:29] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] eqiad: add restbase1007-{b,c} instances [dns] - 10https://gerrit.wikimedia.org/r/270707 (owner: 10Filippo Giunchedi) [11:17:31] hoo|busy, I have no idea, ask here https://phabricator.wikimedia.org/T69223 [11:18:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1007-{b,c} to seeds [puppet] - 10https://gerrit.wikimedia.org/r/270708 (owner: 10Filippo Giunchedi) [11:18:38] Oh, that's actually a thing? 
[11:18:44] Blocked-on-Operations, Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028356 (ArielGlenn) Hey folks, so I was looking into doing this build but it seems you're wanting a build against master (yes?) which includes a bunch of commit... [11:18:44] ? [11:19:18] I wonder how this aligns with the parser target language RfC [11:19:36] can you add any feedback or questions there? [11:19:48] before I make an irreversible change to all wikis [11:21:12] db1063 crashed? [11:21:45] jynus: https://phabricator.wikimedia.org/T114640#2028360 I can try to stomp people on that [11:22:15] yes, not you personally [11:22:25] but I want assurance there is agreement on the change [11:23:55] Makes sense [11:24:20] PROBLEM - MariaDB Slave Lag: s2 on db1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:24:38] and here we go :S [11:26:01] RECOVERY - MariaDB Slave Lag: s2 on db1063 is OK: OK slave_sql_lag Seconds_Behind_Master: 40 [11:27:15] potential disk issue [11:27:44] I am going to put offline one disk [11:28:27] jynus: ack [11:28:54] !log offlining disk 32:5 on db1063 [11:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:29:57] _joe_: is the wmfreimage stuff complete, do you know? or is there still testing / updating to be done? [11:30:04] Adapter: 0: EnclId-32 SlotId-5 state changed to OffLine. 
[11:30:11] (03PS1) 10Giuseppe Lavagetto: deployment: switch back to tin [dns] - 10https://gerrit.wikimedia.org/r/270713 (https://phabricator.wikimedia.org/T124024) [11:30:18] <_joe_> apergos: tsting definitely [11:30:47] ok [11:31:07] <_joe_> just the script, the salt runner works [11:31:12] (03PS2) 10Giuseppe Lavagetto: deployment: switch back to tin [dns] - 10https://gerrit.wikimedia.org/r/270713 (https://phabricator.wikimedia.org/T124024) [11:31:16] 63 is a new server, that is strange [11:31:28] (03PS2) 10Giuseppe Lavagetto: deployment: switch back to tin [puppet] - 10https://gerrit.wikimedia.org/r/270710 (https://phabricator.wikimedia.org/T124024) [11:32:22] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment: switch back to tin [dns] - 10https://gerrit.wikimedia.org/r/270713 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [11:32:30] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment: switch back to tin [puppet] - 10https://gerrit.wikimedia.org/r/270710 (https://phabricator.wikimedia.org/T124024) (owner: 10Giuseppe Lavagetto) [11:32:45] PROBLEM - RAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [11:33:41] Oh, really? 
[11:33:46] <_joe_> lol [11:34:01] I do not know way we pay you, mister icinga-wm [11:34:28] if I detect RAID issues before you [11:35:27] I need some time to confirm it really solves the issues [11:44:26] BTW, please take note on this, as this is a very common issue, documented here https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Caused_by_hardware and all ops should know how to react to it [11:46:16] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: Puppet has 77 failures [11:48:27] <_joe_> !log restarted hhvm on mw1130, memory exhausted [11:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:57] 6Operations, 10Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2028409 (10ArielGlenn) This will happen after the curren dump run completes. ETA 5-6 days from now. [11:56:30] 6Operations, 10ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2028413 (10jcrespo) 3NEW [11:57:13] <_joe_> !log switching the deployment host back to tin [11:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:07] 6Operations, 10Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2028425 (10hoo) Please let me know if there are disruptions to be expected Monday or early Tuesday. [12:00:43] <_joe_> this can cause a few salt minions to fail [12:04:48] (03CR) 10MarcoAurelio: "Scheduled for SWAT deployement ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [12:04:51] (03CR) 10ArielGlenn: "So do we want to push to some subdirectory under /srv/dumps or under /srv/statistics? 
(And called what?)" [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [12:07:20] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028445 (10mmodell) @arielglenn: I've updated the changelog and tagged it as [[ https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/;3.0.1 | 3.0.1 ]]. W... [12:07:41] 6Operations, 6Phabricator, 6Project-Admins, 6Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2028446 (10Krenair) [12:11:40] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2028448 (10elukey) Ganglia is still not my best friend, mc1004 had a spike the first time that I tried to remove it from the pool and the graph wasn't showing all th... [12:12:27] (03PS4) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [12:13:15] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:14:30] (03CR) 10jenkins-bot: [V: 04-1] send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [12:16:21] (03PS1) 10Giuseppe Lavagetto: deployment::redis: explicit binding to all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/270719 [12:17:00] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] deployment::redis: explicit binding to all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/270719 (owner: 10Giuseppe Lavagetto) [12:20:20] (03PS5) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 
(https://phabricator.wikimedia.org/T118739) [12:20:54] (03PS1) 10Giuseppe Lavagetto: deployment::redis: add bind to all servers. [puppet] - 10https://gerrit.wikimedia.org/r/270721 [12:21:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] deployment::redis: add bind to all servers. [puppet] - 10https://gerrit.wikimedia.org/r/270721 (owner: 10Giuseppe Lavagetto) [12:23:05] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028474 (10fgiunchedi) @mmodell that indeed begs the question of what to do with debian revision (`-1`) vs upstream version (`3.0.1`). I've tagged `debian/3.0-1` w... [12:30:25] <_joe_> !log deployment master is now tin [12:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:46] https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&diff=306838&oldid=280537 [12:43:26] I took the opportunity to finally remove reference to either particular server from the docs [12:43:49] <_joe_> Krenair: whoa thanks you were fast :) [12:44:05] I happened to be around at the right moment :) [12:45:27] To be honest, it sounds little bit strange that deployment.codfw.wmnet points to an eqiad server :) [12:45:40] deployment.eqiad.wmnet used to point to mira.codfw.wmnet [12:45:51] <_joe_> SPF|Cloud: why? 
[12:45:51] and when it changes you get a lovely warning about someone potentially doing something nasty [12:46:12] <_joe_> Krenair: heh, I know [12:46:42] _joe_: well when you have "codfw" somewhere in the dns name you would expect a codfw server instead of an eqiad server [12:46:56] I don't think it would cause big issues, but it just seems a bit strange [12:47:01] <_joe_> not really, given it's a CNAME [12:47:21] <_joe_> and it definitely doesn't cause any issue [12:55:32] 6Operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2028543 (10Gilles) It's hard to say if latency is a good enough indicator for overall connection quality, but like this experiment, we can try something like @Bbla... [12:55:45] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [13:03:05] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [13:03:52] (03PS1) 10Phedenskog: Collect Navigation Timing metrics with higher sane values [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) [13:04:07] (03CR) 10Joal: [C: 031] "Good for me :)" [puppet] - 10https://gerrit.wikimedia.org/r/269759 (owner: 10Ottomata) [13:05:00] (03CR) 10jenkins-bot: [V: 04-1] Collect Navigation Timing metrics with higher sane values [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [13:05:37] do we have a dashboard of puppet runs in failure / not reporting ? 
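[editor's note] _joe_'s point at 12:47, that a name containing "codfw" can legitimately resolve to an eqiad host because it is a CNAME, can be illustrated with a toy resolver. The record table is hypothetical: the log only says the deployment names are aliases and that the master is now tin, so the fully qualified name for tin is assumed and the address is a placeholder from the documentation range 192.0.2.0/24.

```python
# Hypothetical zone data for illustration; not the real wmnet records.
RECORDS = {
    "deployment.eqiad.wmnet": ("CNAME", "tin.eqiad.wmnet"),
    "deployment.codfw.wmnet": ("CNAME", "tin.eqiad.wmnet"),
    "tin.eqiad.wmnet": ("A", "192.0.2.10"),  # placeholder address
}


def resolve(name, records, max_hops=8):
    """Follow CNAME aliases until an A record is reached.

    Returns (canonical_name, address). Raises KeyError for unknown names
    and RuntimeError if the alias chain exceeds max_hops.
    """
    for _ in range(max_hops):
        rtype, value = records[name]
        if rtype == "A":
            return name, value
        name = value  # CNAME: continue with the alias target
    raise RuntimeError("CNAME chain too long")


canonical, address = resolve("deployment.codfw.wmnet", RECORDS)
```

Switching the deployment master then only means repointing the aliases; clients keep using the same service name, which is also why a change of target can trigger the host-key warning Krenair mentions.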
[13:06:27] (03PS2) 10Phedenskog: Collect Navigation Timing metrics with higher sane values [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) [13:06:35] I discovered by accident this weekend that deployment-elastic05.deployment-prep.eqiad.wmflabs was in puppet failure for probably quite some time. [13:07:07] <_joe_> gehel: you won't see a lot of surprised faces here :P [13:07:16] gehel: no, but you should receive an e-mail (that is, as long as puppet has run at least once somewhere in the last month or so) [13:08:06] valhallasw`cloud: I did not receive an email, so I guess I'm missing something ... [13:08:09] <_joe_> gehel: meaning that we're usually not good enough to always remember beta when we change things [13:08:43] _joe_: I have no problem with things being broken on beta, I'd just like to know when they are ... [13:09:09] gehel: and there's http://shinken.wmflabs.org/problems?start=0&search=deployment&end=30 , but that might not be fully up to date [13:09:22] (login with guest/guest) [13:11:06] <_joe_> ...and now it's logged!!!1! how can we protect that secret anymore? [13:12:06] _joe_: but I only typed *****! [13:13:56] that was not a very good secret to start with :-) [13:15:37] so if I want to receive mails, I should add myself to nagios_common/.../contactgroups... ? [13:15:59] gehel: for shinken, as yuvi [13:16:03] (03PS2) 10Giuseppe Lavagetto: admin: add entry for Riccardo [puppet] - 10https://gerrit.wikimedia.org/r/269690 (https://phabricator.wikimedia.org/T126434) (owner: 10Volans) [13:16:33] gehel: for emails about puppet failing specifically, that /should/ be automatic (but it was added maybe two weeks ago?). 
Check with andrewbogott about that [13:17:15] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: add entry for Riccardo [puppet] - 10https://gerrit.wikimedia.org/r/269690 (https://phabricator.wikimedia.org/T126434) (owner: 10Volans) [13:20:11] (03PS2) 10Giuseppe Lavagetto: admin: Added Riccardo to the Ops group [puppet] - 10https://gerrit.wikimedia.org/r/269691 (https://phabricator.wikimedia.org/T126434) (owner: 10Volans) [13:20:42] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: Added Riccardo to the Ops group [puppet] - 10https://gerrit.wikimedia.org/r/269691 (https://phabricator.wikimedia.org/T126434) (owner: 10Volans) [13:27:47] 6Operations, 10Ops-Access-Requests, 5Patch-For-Review: root shell for Riccardo - https://phabricator.wikimedia.org/T126434#2028586 (10Joe) 5Open>3Resolved a:3Joe [13:27:49] 6Operations, 10Ops-Access-Requests: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2028589 (10Joe) [13:28:00] 6Operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Be able to switch programmatically between deployment servers in codfw and eqiad - https://phabricator.wikimedia.org/T124024#2028590 (10Joe) 5Open>3Resolved [13:28:02] 6Operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#2028591 (10Joe) [13:42:30] (03CR) 10Bmansurov: [C: 04-1] "missing URL parameter name" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [13:43:32] 7Puppet, 7Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#2028613 (10zeljkofilipin) a:5zeljkofilipin>3None [13:43:46] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/ParallelAssignment offense [puppet] - 10https://gerrit.wikimedia.org/r/259726 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:43:53] (03Abandoned) 10Zfilipin: RuboCop: fixed 
Style/NumericLiterals offense [puppet] - 10https://gerrit.wikimedia.org/r/259725 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:00] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/Not offense [puppet] - 10https://gerrit.wikimedia.org/r/259724 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:03] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/NegatedIf offense [puppet] - 10https://gerrit.wikimedia.org/r/259722 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:11] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/MultilineIfThen offense [puppet] - 10https://gerrit.wikimedia.org/r/259719 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:13] (03CR) 10Bmansurov: Enable survey at reduced sample rate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [13:44:17] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/MethodCallParentheses offense [puppet] - 10https://gerrit.wikimedia.org/r/259718 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:23] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/LeadingCommentSpace offense [puppet] - 10https://gerrit.wikimedia.org/r/259717 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:30] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/IfUnlessModifier offense [puppet] - 10https://gerrit.wikimedia.org/r/259716 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:36] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/DotPosition offense [puppet] - 10https://gerrit.wikimedia.org/r/259712 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:41] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/EmptyLiteral offense [puppet] - 10https://gerrit.wikimedia.org/r/259710 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:49] (03Abandoned) 10Zfilipin: RuboCop: Fixed Style/DefWithParentheses offence 
[puppet] - 10https://gerrit.wikimedia.org/r/259708 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:52] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/CommandLiteral offense [puppet] - 10https://gerrit.wikimedia.org/r/259706 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:57] (03Abandoned) 10Zfilipin: RuboCop: fixed Style/ColonMethodCall offence [puppet] - 10https://gerrit.wikimedia.org/r/259702 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:47:41] (03PS1) 10Muehlenhoff: Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 [13:48:57] (03CR) 10jenkins-bot: [V: 04-1] Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 (owner: 10Muehlenhoff) [13:57:05] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [13:57:33] (03PS1) 10Gehel: Adding gehel to some shinken notifications for labs [puppet] - 10https://gerrit.wikimedia.org/r/270729 [13:59:28] ACKNOWLEDGEMENT - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] Filippo Giunchedi expected, swiftrepl running [14:00:45] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [14:02:16] 6Operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2028648 (10JanZerebecki) > assuming you mean latency This problem is likely not root alone in latency. There may be cases where because of buffer bloat and conge... 
[14:03:02] (03PS2) 10Muehlenhoff: Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 [14:14:47] 6Operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2028677 (10BBlack) @JanZerebecki - no idea on your questions yet, but really we should look at those questions with the HTTP/2 code rather than the SPDY code, as t... [14:15:30] 6Operations, 10Traffic, 6Zero, 3Mobile-Content-Service, and 2 others: Send X-Carrier + X-Carrier-Meta headers on all responses - https://phabricator.wikimedia.org/T126053#2028678 (10BBlack) 5Open>3Resolved a:3BBlack [14:18:07] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2028685 (10BBlack) @Danny_B - we can't purge caches on response body content matching, that's why I'm asking about at least a Date range if we have no other metadata to go on.... [14:19:11] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2028689 (10BBlack) (To be clear in reference to earlier comments - last edit date doesn't matter, just the date range in which the MediaWiki servers were emitting bad content, d... 
[14:21:08] (03PS2) 10Filippo Giunchedi: cassandra: add restbase1007-b instance [puppet] - 10https://gerrit.wikimedia.org/r/270709 [14:21:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1007-b instance [puppet] - 10https://gerrit.wikimedia.org/r/270709 (owner: 10Filippo Giunchedi) [14:21:37] (03PS1) 10BBlack: cache_text: re-enable SPDY [puppet] - 10https://gerrit.wikimedia.org/r/270736 (https://phabricator.wikimedia.org/T125979) [14:22:06] (03PS2) 10BBlack: cache_text: re-enable SPDY [puppet] - 10https://gerrit.wikimedia.org/r/270736 (https://phabricator.wikimedia.org/T125979) [14:22:14] (03CR) 10BBlack: [C: 032 V: 032] cache_text: re-enable SPDY [puppet] - 10https://gerrit.wikimedia.org/r/270736 (https://phabricator.wikimedia.org/T125979) (owner: 10BBlack) [14:24:07] 6Operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Upgrade ElasticSearch to 1.7.4 - https://phabricator.wikimedia.org/T122697#2028696 (10Gehel) To do (as discussed with @dcausse): 1) upgrade Vagrant so that we can test in dev, with no impact 2) upgrade Cindy (browser test bot) to validate 3) u... [14:24:29] !log start restbase1007-b cassandra instance, bootstrapping T119935 [14:24:32] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2028699 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLlUX7o-0X0Il_jxrUT} [2016-02-15T14:24:29Z] ... 
[14:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:52] 6Operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2028701 (10dcausse) [14:28:06] 6Operations, 10Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2028704 (10mark) p:5Normal>3High [14:29:35] 6Operations, 10ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2028708 (10Cmjohnson) The disk is covered under warranty and requested a new one. @jcrespo did the controller show the disk failed before you manually took it offline? If so, please copy/paste that output in the future.... [14:31:00] 6Operations, 10ops-eqiad, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2028710 (10Cmjohnson) labsdb1002 is a CISCO server. I can pull a disk out of one of the decommissioned servers and replace but we should really start working on replacing this server. [14:37:03] does someone know how to enable https://phabricator.wikimedia.org/T126901 this? [14:37:46] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: Connection refused [14:38:41] _joe_: ma rk suggested we could just comment out phabricator::main from iridium in the site manifests for now, and re-enable puppet on the box (it's one week plus now). I looked at the cron jobs and they look like we could leave them going [14:39:06] (community metrics, dump of phab, etc) [14:39:11] what do you think about that?
[14:39:27] <_joe_> +1 [14:39:40] ok I'm gonna do that right now [14:42:29] (03PS3) 10Muehlenhoff: Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 [14:43:19] (03PS1) 10ArielGlenn: comment out phab role on iridium so puppet can run safely [puppet] - 10https://gerrit.wikimedia.org/r/270741 [14:44:16] (03CR) 10jenkins-bot: [V: 04-1] comment out phab role on iridium so puppet can run safely [puppet] - 10https://gerrit.wikimedia.org/r/270741 (owner: 10ArielGlenn) [14:44:19] there's an issue with the admin data.yaml [14:44:41] ahahahaha [14:44:53] how short is that change and I still fscked it up [14:45:21] (03PS2) 10ArielGlenn: comment out phab role on iridium so puppet can run safely [puppet] - 10https://gerrit.wikimedia.org/r/270741 [14:46:14] (03CR) 10Alex Monk: Adding akumar, mnoushad to bastion only and perf-roots group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270337 (owner: 10Cmjohnson) [14:46:25] (03CR) 10DCausse: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/269759 (owner: 10Ottomata) [14:46:48] (03CR) 10ArielGlenn: [C: 032] comment out phab role on iridium so puppet can run safely [puppet] - 10https://gerrit.wikimedia.org/r/270741 (owner: 10ArielGlenn) [14:47:54] 6Operations, 6Performance-Team, 10Traffic, 5Patch-For-Review: Disable SPDY on cache_text for a week - https://phabricator.wikimedia.org/T125979#2028741 (10Gilles) @JanZerebecki SPDY doesn't have re-prioritization, only HTTP/2 does. SPDY can only set the priority of an asset at the beginning. Indeed it see... 
[14:49:15] (03PS1) 10Andrew Bogott: Move the wikitech apache config template into the openstack module [puppet] - 10https://gerrit.wikimedia.org/r/270743 [14:49:19] (03PS1) 10Alex Monk: admin: properly indent mnoushad's UID [puppet] - 10https://gerrit.wikimedia.org/r/270744 [14:51:54] !log phab role commented out on iridium and puppet re-enabled, no phab changes will be applied there [14:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:29] (03CR) 10Andrew Bogott: [C: 032] "puppet compiler approved" [puppet] - 10https://gerrit.wikimedia.org/r/270743 (owner: 10Andrew Bogott) [14:53:32] (03CR) 10Alex Monk: Adding akumar, mnoushad to bastion only and perf-roots group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270337 (owner: 10Cmjohnson) [14:54:15] PROBLEM - SSH on iridium is CRITICAL: Connection refused [14:56:38] <_joe_> apergos: ^^ [14:56:48] grrrrr [14:56:50] <_joe_> were you doing something on iridium? [14:57:28] this while I"m actually on the host [14:57:35] just that one thing [14:57:38] nothin else [15:00:24] as in I ran puppet agent --test and that's all [15:05:09] <_joe_> apergos: puppet agent -t runs puppet [15:05:14] yes [15:05:17] that was the point [15:05:18] <_joe_> apergos: is the server really unreachable? [15:05:23] well I'm on it [15:05:26] <_joe_> usually I'd try --noop first [15:05:29] give me a second, I'm checking something [15:05:45] (03PS1) 10Andrew Bogott: Move the wikitech apache config, again [puppet] - 10https://gerrit.wikimedia.org/r/270747 [15:09:01] 6Operations, 10ops-eqiad: ms-be1008.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T126627#2028795 (10fgiunchedi) thanks @cmjohnson ! 
looks like the disk was `sdd` again but I've powercycled the machine to make sure [15:09:20] (03CR) 10Andrew Bogott: [C: 032] Move the wikitech apache config, again [puppet] - 10https://gerrit.wikimedia.org/r/270747 (owner: 10Andrew Bogott) [15:09:30] (03CR) 10BBlack: [C: 04-1] "This should probably be split into three changes to avoid failure on race conditions during the update process, right? First copy them to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395) (owner: 10Krinkle) [15:10:38] (03PS2) 10Ema: Rename vcl_recv_purge into recv_purge [puppet] - 10https://gerrit.wikimedia.org/r/270392 (https://phabricator.wikimedia.org/T124279) [15:11:09] Steinsplitter, for those i18n extensions I'd ask Nikerabbit [15:11:26] or someone from twn or wmf's i18n teamm [15:11:26] 6Operations, 10ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2028798 (10jcrespo) @Cmjohnson, the controlled did not depool (yet) the disk, but icinga showed full system contention (completelly unrestposive for seconds) until I did that. Replication went to the roof in an IO-bound... [15:11:27] (03CR) 10Ema: [C: 032 V: 032] Rename vcl_recv_purge into recv_purge [puppet] - 10https://gerrit.wikimedia.org/r/270392 (https://phabricator.wikimedia.org/T124279) (owner: 10Ema) [15:11:29] --test != -t, confusingly [15:11:40] (03PS2) 10Ema: Omit thread_pool_add_delay on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269686 (https://phabricator.wikimedia.org/T126206) [15:11:50] (03CR) 10Ema: [C: 032 V: 032] Omit thread_pool_add_delay on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269686 (https://phabricator.wikimedia.org/T126206) (owner: 10Ema) [15:11:57] err wait, it does [15:12:12] it really is called "test", even though it should stand for "manual terminal invocation" or something [15:12:49] who pinged? 
:o [15:13:47] it was less than 10 lines ago Nikerabbit, read up :p [15:13:47] ah [15:14:11] Krenair: yes but what were you replying to, but found the bug [15:14:14] btw, any op should be able to review https://gerrit.wikimedia.org/r/270744 [15:16:09] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2028803 (10BBlack) [15:16:12] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2028802 (10BBlack) [15:16:18] Krenair: thanks, I'll merge that [15:16:25] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1954739 (10BBlack) [15:16:28] 6Operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#1998513 (10BBlack) [15:16:36] (03PS2) 10Filippo Giunchedi: admin: properly indent mnoushad's UID [puppet] - 10https://gerrit.wikimedia.org/r/270744 (owner: 10Alex Monk) [15:16:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: properly indent mnoushad's UID [puppet] - 10https://gerrit.wikimedia.org/r/270744 (owner: 10Alex Monk) [15:18:03] ty [15:18:24] !log Changed email for global account "Frau pomerenke" [15:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:57] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2028810 (10BBlack) Note, I've reversed 
the... [15:19:21] morebots: poke [15:19:22] I am a logbot running on tools-exec-1210. [15:19:22] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [15:19:22] To log a message, type !log . [15:20:21] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [15:20:26] 6Operations, 10ops-eqiad: ms-be1008.eqiad.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T126627#2028812 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi disk rebuilding [15:20:43] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028815 (10mmodell) @fgiunchedi: tagged and pushed 3.0.1-1, does everything look ok now? [15:21:22] RECOVERY - SSH on iridium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [15:22:01] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028816 (10mmodell) btw I don't see a tag literally named `debian/3.0-1`, am I missing something? [15:22:09] !log disabled puppet again on iridium, had to restore sshd config from filebucket after puppet run [15:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:31] (03PS10) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. 
[puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [15:28:34] (03PS1) 10Volans: Depool of db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270752 (https://phabricator.wikimedia.org/T120122) [15:30:11] ^ yay [15:32:38] <_joe_> !log restarting the salt minions on all deployment targets [15:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:08] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028852 (10fgiunchedi) @mmodell doh, I'm trying to `git push --tags` to phabricator without success so far, so yeah I had a `debian/3.0-1` local tag not pushed yet... [15:36:48] (03PS1) 10Andrew Bogott: Move horizon apache config into a vhost [puppet] - 10https://gerrit.wikimedia.org/r/270753 [15:38:23] (03CR) 10Jcrespo: [C: 031] "Looks ok." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270752 (https://phabricator.wikimedia.org/T120122) (owner: 10Volans) [15:38:28] (03CR) 10Andrew Bogott: "I have really no idea if horizon.wikimedia.org.erb is valid." [puppet] - 10https://gerrit.wikimedia.org/r/270753 (owner: 10Andrew Bogott) [15:39:30] I could use a hand from someone who knows how to write apache site configs. https://gerrit.wikimedia.org/r/#/c/270753/ [15:40:10] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028867 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi and uploaded to `carbon` now ``` root@carbon:~# reprepro list trusty-wikimedia scap trusty-wikimedia|mai... [15:40:50] (03CR) 10Ottomata: "I think we can store these on stat1002 in /a/log/webrequest/archive/dumps.wikimedia.org." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [15:45:44] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [15:45:46] (03CR) 10Krinkle: "It leaves symlinks, so the old-first race condition is fine. The new-first race-condition is valid, but in mw deploys the convention is no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395) (owner: 10Krinkle) [15:46:17] ACKNOWLEDGEMENT - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] Filippo Giunchedi new cassandra instance online [15:48:28] (03PS2) 10Ottomata: Add $check_jar and $camus_jar parameterize to camus::job [puppet] - 10https://gerrit.wikimedia.org/r/269759 [15:49:00] (03CR) 10BBlack: [C: 031] "OK" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395) (owner: 10Krinkle) [15:49:59] 6Operations, 10RESTBase-Cassandra: cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#2028886 (10fgiunchedi) during `restbase1007-b` bootstrap data is streaming from `restbase1007-a` (i.e. localhost) though the observed speeds are the same, in the order of 4.5MB/s [15:51:01] <_joe_> wow, my wifi network is definitely faster godog [15:53:00] hehehe [15:53:00] 6Operations, 7Availability, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2028898 (10fgiunchedi) after 56M thumbnail requests from `ms-fe1001` the size distribution looks like this ```... 
[15:53:10] (03PS4) 10Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [15:57:56] (03PS2) 10Volans: Depool of db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270752 (https://phabricator.wikimedia.org/T120122) [16:01:54] (03CR) 10Ottomata: [C: 032] Add $check_jar and $camus_jar parameterize to camus::job [puppet] - 10https://gerrit.wikimedia.org/r/269759 (owner: 10Ottomata) [16:02:43] (03CR) 10Volans: [C: 032] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270752 (https://phabricator.wikimedia.org/T120122) (owner: 10Volans) [16:03:09] (03Merged) 10jenkins-bot: Depool of db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270752 (https://phabricator.wikimedia.org/T120122) (owner: 10Volans) [16:16:50] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028948 (10mmodell) @fgiunchedi: I added #acl_operations-team to the push policy for {rMSCA} [16:17:20] Krenair: Cool. Thanks! :-) [16:18:07] (03PS3) 10Alexandros Kosiaris: CX: add 'ady' and 'azb' wikis [puppet] - 10https://gerrit.wikimedia.org/r/270668 (owner: 10KartikMistry) [16:18:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: add 'ady' and 'azb' wikis [puppet] - 10https://gerrit.wikimedia.org/r/270668 (owner: 10KartikMistry) [16:20:04] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028962 (10fgiunchedi) >>! In T126660#2028948, @mmodell wrote: > @fgiunchedi: I added #acl_operations-team to the push policy for {rMSCA} thanks! works for me! [16:20:41] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[16:21:29] ^that is normal, it is taking a bit longer than expected [16:26:34] 7Blocked-on-Operations, 3Scap3: rebuild scap debian package (we forgot to include refreshCdbJsonFiles) - https://phabricator.wikimedia.org/T126660#2028973 (10fgiunchedi) also I've moved `3.0.1-1` tag to `debian/3.0.1-1` for consistency, can be revised too of course [16:27:39] !log installing postgres security updates on maps* [16:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:55] 6Operations, 6Analytics-Kanban, 10Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2028985 (10elukey) [16:28:28] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 855m 18s) [16:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:05] !log Killed stuck rsync by mw1119 that was keeping l10nudpate running [16:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:14] (03PS1) 10Ottomata: Fix path to camus jars [puppet] - 10https://gerrit.wikimedia.org/r/270759 [16:32:28] (03CR) 10Ottomata: [C: 032 V: 032] Fix path to camus jars [puppet] - 10https://gerrit.wikimedia.org/r/270759 (owner: 10Ottomata) [16:32:45] akosiaris: mergin? [16:32:51] cxserver.yaml [16:32:51] ? [16:33:43] 6Operations, 10procurement, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2029006 (10faidon) 3NEW [16:34:22] !log Ran sync-common on mw1119 after stuck scap update was killed there [16:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:31] jynus: ^ I think you should be clear [16:34:39] kart_: ok to merge that [16:34:40] ? [16:34:48] akosiaris: gerrit merged but didn't yet merge on puppet master [16:34:48] thanks, bd808 [16:34:51] ? 
[16:34:57] https://gerrit.wikimedia.org/r/#/c/270668/ [16:35:20] 6Operations, 10procurement, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2029020 (10faidon) 3NEW [16:35:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:36:22] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:36:35] oook, i'm merging... [16:36:50] ottomata: damn, sorry [16:36:53] thanks [16:37:01] soook [16:37:10] just had a fix i needed to push [16:37:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:38:12] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:38:17] (03PS2) 10Andrew Bogott: Use keystone v3 api for horizon [puppet] - 10https://gerrit.wikimedia.org/r/270593 (https://phabricator.wikimedia.org/T123310) [16:38:21] !log volans@tin Synchronized wmf-config/db-eqiad.php: depool db1022 (duration: 00m 58s) [16:38:24] (03PS3) 10Andrew Bogott: Use keystone v3 api for horizon [puppet] - 10https://gerrit.wikimedia.org/r/270593 (https://phabricator.wikimedia.org/T123310) [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:53] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [16:40:14] (03CR) 10Andrew Bogott: [C: 032] Use keystone v3 api for horizon [puppet] - 10https://gerrit.wikimedia.org/r/270593 (https://phabricator.wikimedia.org/T123310) (owner: 10Andrew Bogott) [16:40:41] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:42:21] bd808: is there a phab project for... "mediawiki logging"? [16:44:13] #MediaWiki-Logging, I guess? 
[16:44:31] 6Operations, 10MediaWiki-Logging: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2029043 (10faidon) 3NEW [16:44:37] bd808: ^^ [16:45:03] paravoid: you probably want https://phabricator.wikimedia.org/tag/mediawiki-debug-logger/ [16:45:24] "logging" is generally about audit tables in the databases [16:45:51] paravoid: we also have https://phabricator.wikimedia.org/tag/wikimedia-logstash/ [16:46:05] 6Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2029043 (10faidon) [16:46:23] !log Deployed patch for T126897 [16:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:39] (03PS2) 10Andrew Bogott: keystone policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/270597 (https://phabricator.wikimedia.org/T123310) [16:48:32] (03CR) 10Andrew Bogott: [C: 032] keystone policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/270597 (https://phabricator.wikimedia.org/T123310) (owner: 10Andrew Bogott) [16:50:42] (03PS1) 10Andrew Bogott: keystone policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/270763 [16:52:20] (03CR) 10Andrew Bogott: [C: 032] keystone policy changes: [puppet] - 10https://gerrit.wikimedia.org/r/270763 (owner: 10Andrew Bogott) [16:53:32] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet last ran 20 hours ago [16:55:04] 6Operations, 10Beta-Cluster-Infrastructure, 6Services, 5Patch-For-Review: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#2029087 (10hashar) [16:55:22] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:57:08] 6Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#2029090 (10bd808) Logging from MediaWiki to Logstash is done via syslog formatted UDP 
datagrams. Logging from MediaWiki to Fluorine is a UDP datagram flow using the custom u... [17:00:31] 6Operations, 10Parsoid, 6Services: Switch Parsoid to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T125017#2029105 (10hashar) [17:03:18] (03PS6) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [17:03:21] 6Operations, 10RESTBase-Cassandra, 6Services: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2029119 (10Eevans) >>! In T125906#2024014, @GWicke wrote: > Update on the test in staging: > > - Using 8m chunks, compression approached 5% on at least one of the nodes when co... [17:05:00] (03CR) 10ArielGlenn: "Ottomata: not too worried about /srv vs /a tbh, about time for /a to be gone." [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [17:08:23] 6Operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2029135 (10jcrespo) This is fixed, but shorty after, it crashed with corrupted InnoDB pages. Related? [17:10:14] 6Operations, 10RESTBase-Cassandra, 6Services: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2029142 (10GWicke) [17:10:31] 6Operations, 10RESTBase-Cassandra: cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#2029144 (10Eevans) p:5Triage>3High We have a number of cluster changes coming down the pipe (T119935, T125842, and T95253) that would benefit from higher throughput; Bumping priorit... [17:11:21] Trying to understand how to upgrade elasticsearch to 1.7.5. I expected to find a reprepro config to update from upstream repo, but it seems configured only for elasticsearch 1.6 (we are on 1.7.1). 
[17:12:31] 6Operations, 10RESTBase-Cassandra, 6Services: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2029171 (10GWicke) A full dump run at concurrency 50 has finished without issues. As a further stress test, I have now started another dump run, this time with concurrency 100. [17:12:45] Also, `reprepro checkupdate` lets me know that the key of the jenkins upstream repo is unknown. Which leads me to think that we do not actually use updates from upstream [17:12:58] Can anyone confirm? [17:13:31] gehel: we generally do use updates from upstream, it looks like elasticsearch was updated to 1.7.1 manually I'm assuming, in September [17:13:48] judging from the emails I got from reprepro that is [17:14:49] godog: I know next to nothing about reprepro, but it seems to me that "reprepro checkupdate" fails on the first unknown GPG key. In this case, the jenkins key. [17:15:27] Which makes me think it is kind of broken at the moment, which made me think we do not use it. I'm probably missing something ... [17:15:48] we use reprepro for the jenkins .deb package. It is copy pasted from upstream [17:16:04] update doc for Jenkins is at https://wikitech.wikimedia.org/wiki/Jenkins#Updating [17:16:08] maybe their key has changed [17:16:38] do you know where reprepro gets the pub keys from? Download them on the fly from a key server? [17:16:46] absolutely zero clue [17:16:47] gehel: yeah that looks like another problem currently, to answer your question re: upgrade the approach is generally to manually upgrade the package on one/more canary machines and then make the new version available on the repo [17:17:14] Jenkins has instructions at http://pkg.jenkins-ci.org/debian-stable/ [17:17:26] not sure how one would verify the key at http://pkg.jenkins-ci.org/debian-stable/jenkins-ci.org.key though [17:17:26] godog: when you say manually, means downloading it locally and dpkg -i?
[17:17:40] (03PS1) 10Andrew Bogott: Horizon: update nova policies [puppet] - 10https://gerrit.wikimedia.org/r/270773 [17:17:50] gehel: correct, there's a task for improving that https://phabricator.wikimedia.org/T115758 [17:18:08] note: I'm not trying to upgrade jenkins, but elasticsearch, it's just that reprepro fails on jenkins check [17:19:17] (03CR) 10Andrew Bogott: [C: 032] Horizon: update nova policies [puppet] - 10https://gerrit.wikimedia.org/r/270773 (owner: 10Andrew Bogott) [17:22:23] ok looks like we have to fix the jenkins key anyway, can't remember offhand which gpg keyring is used by reprepro [17:23:44] gehel: I'm not getting the jenkins key error btw when running reprepro checkupdate on carbon [17:24:04] there might be a new Jenkins available, feel free to bump it in apt.wikimedia.org and I will get jenkins taken care of [17:25:19] 6Operations, 10Ops-Access-Requests: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2029242 (10Volans) [17:25:21] 6Operations: Add Riccardo to ops email aliases - https://phabricator.wikimedia.org/T126433#2029239 (10Volans) 5Open>3Resolved a:3Volans Added myself to the ops alias. [17:25:44] (03PS1) 10Andrew Bogott: Temporarily cripple all Horizon actions [puppet] - 10https://gerrit.wikimedia.org/r/270774 [17:27:11] (03CR) 10Andrew Bogott: [C: 032] Temporarily cripple all Horizon actions [puppet] - 10https://gerrit.wikimedia.org/r/270774 (owner: 10Andrew Bogott) [17:27:22] hashar: Could you review https://gerrit.wikimedia.org/r/#/c/270712/ please. [17:27:29] hashar: Could you also review https://gerrit.wikimedia.org/r/#/c/270734/ please. [17:28:41] paladox: yeah looks that will do :-} [17:28:51] paladox: can't baby sit them though. Busy migrating npm jobs [17:29:09] hashar: Ok. [17:35:22] godog: I have probably missed something. `sudo reprepro checkupdate`still gives me the error. Missing env variable ? [17:37:22] su - maybe [17:37:28] to pick up the root environ? 
[17:37:32] sudo -E, my mistake [17:38:07] hm missing really? maybe (but I have never used checkupdate there) [17:41:57] 7Blocked-on-Operations, 10Beta-Cluster-Infrastructure, 6Discovery, 6Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2029305 (10debt) Hi @hashar - @ksmith is correct. We want to be able to deploy the code to beta and test... [17:42:47] (03PS1) 10Andrew Bogott: nova policy.json updates [puppet] - 10https://gerrit.wikimedia.org/r/270781 (https://phabricator.wikimedia.org/T126765) [17:43:33] gehel: yeah it isn't sudo-proof :( [17:44:35] 6Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2029316 (10Eevans) [17:47:48] (03PS1) 10Andrew Bogott: Add a customized glance policy file. [puppet] - 10https://gerrit.wikimedia.org/r/270783 (https://phabricator.wikimedia.org/T126765) [17:48:22] 6Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2029329 (10Eevans) p:5Triage>3High We have a number of cluster changes coming down the pipe (T119935, T125842, and T95253), and the compaction related implications could prove problematic... [17:49:45] (03CR) 10Andrew Bogott: [C: 032] nova policy.json updates [puppet] - 10https://gerrit.wikimedia.org/r/270781 (https://phabricator.wikimedia.org/T126765) (owner: 10Andrew Bogott) [17:50:48] 6Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2029345 (10GWicke) > the compaction related implications could prove problematic This is rather vague. Are you aiming to compare DTCS (which uses STCS) to plain STCS? [17:50:55] (03CR) 10Andrew Bogott: [C: 032] Add a customized glance policy file. 
[puppet] - 10https://gerrit.wikimedia.org/r/270783 (https://phabricator.wikimedia.org/T126765) (owner: 10Andrew Bogott) [17:55:48] 6Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2029378 (10Eevans) >>! In T126221#2029345, @GWicke wrote: >> the compaction related implications could prove problematic > > This is rather vague. Are you aiming to compare DTCS (which uses... [18:13:52] (03PS1) 10Andrew Bogott: Glance policy.json: Allow everyone to read and list. [puppet] - 10https://gerrit.wikimedia.org/r/270785 [18:16:14] (03PS1) 10Krinkle: mediawiki: Resolve docroot symlink used by bits /w/extension [puppet] - 10https://gerrit.wikimedia.org/r/270786 [18:16:16] (03PS1) 10Krinkle: mediawiki: Remove outdated bits config for /static/current fonts [puppet] - 10https://gerrit.wikimedia.org/r/270787 [18:18:48] PROBLEM - cassandra-a service on xenon is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:19:20] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: Connection refused [18:19:27] (03PS1) 10Krinkle: mediawiki: Remove dead-end redirect at /stats for chapter wikis [puppet] - 10https://gerrit.wikimedia.org/r/270788 [18:23:43] !log Upgrading db1022 (MariaDB and kernel) already depooled and put in scheduled downtime [18:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:36] (03PS5) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [18:38:41] 6Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2029503 (10GWicke) > That out-of-order writes are causing newer/smaller sstables to prematurely become candidates for compaction with older/larger sstables. We know that read repairs cause s... 
[18:47:30] (03PS6) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [18:49:44] (03PS7) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [19:03:15] (03PS8) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [19:20:22] 6Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2029581 (10Eevans) >>! In T126221#2029503, @GWicke wrote: >> That out-of-order writes are causing newer/smaller sstables to prematurely become candidates for compaction with older/larger ssta... [19:20:34] (03PS2) 10Bmansurov: Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [19:20:53] (03CR) 10Bmansurov: [C: 031] Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [19:21:05] (03CR) 10Nikerabbit: "Where is the expiry of font files defined nowadays?" 
[puppet] - 10https://gerrit.wikimedia.org/r/270787 (owner: 10Krinkle) [19:26:36] (03PS9) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [19:27:03] (03CR) 10Krinkle: "Currently this is set by the /static handler which is Apache, which sets max-age = 1 year in https://github.com/wikimedia/operations-puppe" [puppet] - 10https://gerrit.wikimedia.org/r/270787 (owner: 10Krinkle) [19:27:43] (03PS1) 10Bmansurov: Run the survey at normal rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270792 (https://phabricator.wikimedia.org/T125946) [19:28:09] (03PS1) 10Ori.livneh: Don't serve HiDPI thumbs on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) [19:29:19] (03PS10) 10Ottomata: Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) [19:32:22] 6Operations, 10RESTBase-Cassandra, 6Services: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2029599 (10mobrovac) I'm -1 on going to prod with Brotli before we finish the cluster expansion and complete the multi-instance set-up. I think we should be looking at Brotli as... 
[19:32:37] (03CR) 10Ottomata: [C: 032] Create refinery classes in analytics_cluster role, apply them to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/270103 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [19:37:59] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [19:40:19] RECOVERY - cassandra-a CQL 10.64.0.202:9042 on xenon is OK: TCP OK - 0.003 second response time on port 9042 [19:43:56] (03PS1) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [19:44:56] (03CR) 10jenkins-bot: [V: 04-1] Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [19:53:08] (03PS2) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator Maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160215T2000). [20:01:04] hey twentyafterfour [20:01:08] we doin this? 
[20:01:15] apergos: yep [20:01:26] I'm ready if you are [20:01:27] ok lemme revert my change (not apply) [20:01:33] then yours and not apply [20:01:42] :) [20:01:43] I mean merge yours and not apply [20:02:15] (03PS3) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:03:17] (03PS1) 10ArielGlenn: Revert "comment out phab role on iridium so puppet can run safely" [puppet] - 10https://gerrit.wikimedia.org/r/270799 [20:03:27] (03PS2) 10ArielGlenn: Revert "comment out phab role on iridium so puppet can run safely" [puppet] - 10https://gerrit.wikimedia.org/r/270799 [20:04:07] (03CR) 1020after4: [C: 031] Revert "comment out phab role on iridium so puppet can run safely" [puppet] - 10https://gerrit.wikimedia.org/r/270799 (owner: 10ArielGlenn) [20:04:49] (03PS4) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:05:45] (03CR) 10ArielGlenn: [C: 032] Revert "comment out phab role on iridium so puppet can run safely" [puppet] - 10https://gerrit.wikimedia.org/r/270799 (owner: 10ArielGlenn) [20:07:31] twentyafterfour: how can I verify that these tags are correct, in the https://gerrit.wikimedia.org/r/#/c/268351/2/manifests/role/phabricator.pp changeset? [20:08:31] apergos: checking [20:08:48] I need you to tell me how to check, too, when you are done. [20:09:05] actually extension_tag looks wrong somehow ... 
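Editorial sketch, not part of the original conversation: one way to answer the question above (how to verify that each checked-out repository sits at the expected tag) is to compare the commit behind `HEAD` with the commit behind the tag. The function names and the `resolve` parameter are hypothetical, introduced only for illustration; the only external command assumed is `git rev-parse`.

```python
import subprocess


def rev_parse(repo_dir, ref):
    """Return the commit hash that `ref` resolves to inside `repo_dir`."""
    # `^{commit}` peels an annotated tag down to the commit it points at,
    # so a tag and HEAD compare equal when they name the same commit.
    # check=True raises CalledProcessError if the ref does not exist.
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--verify", f"{ref}^{{commit}}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def repos_not_at_tag(repo_dirs, tag, resolve=rev_parse):
    """Return the repos in `repo_dirs` whose HEAD is not the commit `tag` names."""
    return [d for d in repo_dirs if resolve(d, "HEAD") != resolve(d, tag)]
```

An empty result means every listed repository is at the tag. Passing `resolve` explicitly makes the comparison testable without a real checkout.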
[20:09:15] if you got to /srv/phab on iridium [20:09:19] and run git submodule foreach ' [20:09:29] git submodule foreach 'git describe' [20:09:35] it should show each tag [20:10:02] looking [20:10:28] honestly every one of them should be release/2015-02-04/1 [20:10:58] https://phabricator.wikimedia.org/P2618 [20:11:01] results [20:11:31] and I can tell you nothing changed on today's puppet run so it's been like that since puppet was disabled a week plus ago [20:12:53] http://pastebin.com/JxEaYKiW this was your pastebin from whenever it hit the fan [20:13:45] (03CR) 10Krinkle: Collect Navigation Timing metrics with higher sane values (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/270725 (https://phabricator.wikimedia.org/T125381) (owner: 10Phedenskog) [20:14:06] git submodule foreach git submodule foreach 'if [ `git rev-parse release/2016-02-04/1` != `git rev-parse HEAD` ]; then echo "no match"; fi' [20:14:30] (03PS5) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:14:40] apergos: the thing is git describe shows whichever tag it wants [20:14:53] I'm sure it does [20:14:54] I am quite certain that I created the tag release/2016-02-04/1 on every repo [20:15:09] so setting all of them to release/2016-02-04/1 is the safest bet [20:15:20] but I can fix them all quickly if it changes them to the wrong thing [20:15:41] (and as we discussed, the code is cached by apache so it won't bring down phab) [20:15:56] I'm git status in each dir to see how that is [20:16:54] (03CR) 1020after4: phabricator: forward the old tag system to current release/2015-11-18/1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268351 (owner: 10Rush) [20:17:50] arcanist, extensions have HEAD detached at release/2016-02-04/1 [20:18:26] (03PS2) 10Andrew Bogott: Glance policy.json: Allow everyone to read and list. 
[puppet] - 10https://gerrit.wikimedia.org/r/270785 [20:18:28] libphutil, phabricator have HEAD detached at release/2015-11-18/1 [20:18:32] (03PS6) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:19:41] (03CR) 10Andrew Bogott: [C: 032] Glance policy.json: Allow everyone to read and list. [puppet] - 10https://gerrit.wikimedia.org/r/270785 (owner: 10Andrew Bogott) [20:19:47] (03PS3) 1020after4: phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 (owner: 10Rush) [20:20:06] apergos: updated the changeset [20:20:34] I see it [20:20:47] if these turn out to be wrong, how do we know it? [20:20:49] does release/2015-11-18/1 point to the same commit as 02-04/1 ? [20:20:54] checking [20:21:04] if so then we're good [20:21:25] (03PS7) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:23:22] not knowing the right way to check that, I checked the contents of phabricator/.git/HEAD and extensions/.git/HEAD [20:23:24] they differ [20:23:45] git submodule foreach git submodule foreach 'if [ `git rev-parse release/2016-02-04/1` != `git rev-parse HEAD` ]; then echo "no match"; fi' [20:24:12] empty [20:24:13] lol @ "git submodule foreach git submodule foreach" [20:24:20] ahah [20:24:31] copypaste error [20:24:33] yep [20:24:49] apergos: if that didn't output any "no match" lines then it's all good to go [20:24:59] assuming you strip the extra foreach [20:25:11] (03PS8) 10Ottomata: Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 (https://phabricator.wikimedia.org/T109859) [20:25:18] (03CR) 10Ottomata: [C: 032 V: 032] Apply new analytics_cluster role to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/270795 
(https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:25:29] it did not print any 'no match' lines [20:25:43] that simply checks that all the heads are pointing to the right tag which will match the tag I submitted to gerrit in cs3 [20:26:05] ok great [20:26:17] going to merge, and apply both at once [20:26:24] then I'll tell you so you can ssh in [20:26:25] * twentyafterfour is pretty confident [20:26:29] thanks apergos [20:26:58] (03PS4) 10ArielGlenn: phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 (owner: 10Rush) [20:27:17] (03CR) 1020after4: [C: 031] phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 (owner: 10Rush) [20:27:23] (03CR) 10Alex Monk: "might have once appeared under www2.knams.wikimedia.org based on https://meta.wikimedia.org/wiki/WQ/Draft - https://meta.wikimedia.org/wik" [puppet] - 10https://gerrit.wikimedia.org/r/270788 (owner: 10Krinkle) [20:28:56] (03CR) 10ArielGlenn: [C: 032] phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 (owner: 10Rush) [20:30:07] just puppet merged someone else's [20:30:09] no biggie [20:31:03] (03PS2) 10Ori.livneh: Don't serve HiDPI thumbs on mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) [20:32:02] twentyafterfour: time for you to hop on and check things [20:32:09] puppet is currently enabled and I have done one run [20:33:08] ok trying [20:35:03] (03PS1) 10Ottomata: Remove some unsed role::analytics::* classes, more to come [puppet] - 10https://gerrit.wikimedia.org/r/270808 (https://phabricator.wikimedia.org/T109859) [20:36:24] (03PS1) 10Andrew Bogott: Update designate policy.conf [puppet] - 10https://gerrit.wikimedia.org/r/270809 (https://phabricator.wikimedia.org/T126765) [20:37:05] ah I should say that nothing was 
updated during the one run [20:37:10] of course there are the "lock files" [20:37:13] apergos: looks good [20:37:29] (03CR) 10Mholloway: [C: 031] Don't serve HiDPI thumbs on mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) (owner: 10Ori.livneh) [20:37:46] now it won't screw up even if the lock files go away [20:37:56] the only way to be sure we're ok is to remove lock files, do a puppet run, see what it does [20:38:14] (03CR) 10Ottomata: [C: 032 V: 032] Remove some unsed role::analytics::* classes, more to come [puppet] - 10https://gerrit.wikimedia.org/r/270808 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:38:26] apergos: ok... one sec [20:38:28] (03PS2) 10Andrew Bogott: Update designate policy.conf [puppet] - 10https://gerrit.wikimedia.org/r/270809 (https://phabricator.wikimedia.org/T126765) [20:38:46] please do the needful so that if puppet does stupid things we can put things back together quickly [20:40:46] I'm trying to copy the entire source tree so it can be put back in place quickly if need be. the biggest challenge to that is /srv/phab/repos is inside /srv/phab which sucks (because it's huge) [20:40:52] 47g right [20:40:58] iirc from last time [20:41:06] will the backup job care if /srv/phab/repos is a symlink? [20:41:14] (03CR) 10Andrew Bogott: [C: 032] Update designate policy.conf [puppet] - 10https://gerrit.wikimedia.org/r/270809 (https://phabricator.wikimedia.org/T126765) (owner: 10Andrew Bogott) [20:41:15] well we're outside the deployment window now [20:41:20] oh :-/ [20:41:26] so make your copy [20:41:30] ok [20:41:36] we'll do the test with puppet another time [20:41:40] (tomorrow?) [20:41:40] I'll copy everything except repos/ [20:41:45] ok sounds good [20:42:14] you want to log what's been done and what we plan to do, on whichever one of the tickets?
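Editorial sketch, not part of the original conversation: the plan above is to copy the source tree while leaving out the large `repos/` directory that lives inside it. The log does not say which tool was used for the copy, so the following is only a minimal illustration, assuming Python's standard `shutil.copytree`; the function name `copy_tree_without` is hypothetical.

```python
import os
import shutil


def copy_tree_without(src, dst, skip_name):
    """Copy `src` to `dst`, leaving out the top-level entry `skip_name`."""
    src_root = os.path.abspath(src)

    def ignore(current_dir, names):
        # Skip only at the top level; a nested directory with the same
        # name elsewhere in the tree is still copied.
        if os.path.abspath(current_dir) == src_root and skip_name in names:
            return {skip_name}
        return set()

    # symlinks=True copies symlinks as links instead of following them.
    # `dst` must not exist yet; copytree raises FileExistsError otherwise.
    shutil.copytree(src, dst, symlinks=True, ignore=ignore)
```

Restricting the exclusion to the top level matters here because a plain name pattern would also drop any nested directory that happens to be called `repos`.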
[20:42:34] I'm going to disable puppet again because we didn't get to run that test [20:42:47] tomorrow if test goes well we can enable it for good [20:44:06] I don't know if the symlink will be a problem or not tbh [20:44:21] it might back up a symlink and not the files [20:46:54] ok logging [20:47:18] apergos: I won't mess with the symlink until I can confirm that and schedule another maintenance [20:47:29] great [20:47:55] !log brought iridium up to date with a current puppet run and checked that all repositories are at the correct tag [20:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:40] !log backing up /srv/phab to /srv/phab.bak in case it needs to be restored in a hurry. Note: the backup excludes /srv/phab/repos which is backed up separately [20:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:07] oh, that's good too. I was thinking it would go on a phab task (whichone?) but logging it here is smart [20:49:09] as well [20:49:22] (03PS1) 10Ottomata: Remove unused analytics role classes [puppet] - 10https://gerrit.wikimedia.org/r/270851 (https://phabricator.wikimedia.org/T109859) [20:50:10] yeah I'll update tasks but people don't always know what tasks to check when things come up unexpectedly [20:50:15] me neither :-D [20:50:30] oh repos is only 17G [20:50:34] (03PS2) 10Ottomata: Remove unused analytics role classes [puppet] - 10https://gerrit.wikimedia.org/r/270851 (https://phabricator.wikimedia.org/T109859) [20:50:50] everything is 47G combined [20:50:55] yeah total is 47 [20:54:10] (03CR) 10Ottomata: [C: 032] Remove unused analytics role classes [puppet] - 10https://gerrit.wikimedia.org/r/270851 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:55:37] !log moved contents of /srv/phab/dumps into /srv/dumps [20:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:53] is that going to break any rsyncs, 
twentyafterfour? [20:56:58] PROBLEM - Disk space on ms-be1008 is CRITICAL: DISK CRITICAL - free space: / 1619 MB (3% inode=79%) [20:57:13] I think we get copies of those over on the dataset host so they can be downloaded [20:57:19] (03PS1) 10Krinkle: wmfstatic: Match cache behaviour of "/static" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270854 [20:59:15] apergos: I don't think so? /srv/phab/dumps is obsolete, /srv/dumps is the new location [20:59:22] (03PS1) 10Ottomata: Apply analytics_cluster::database::meta role on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/270855 [20:59:46] apergos: shouldn't it just copy the phabricator_public.dump? [21:00:24] yes [21:00:31] just checked the cron job to be sure [21:00:32] all good [21:00:42] I was tempted to just delete those other obsolete dumps but I preserved them just in case [21:00:46] sure [21:01:13] ok /srv/phab.bak has everything now [21:02:02] sweet [21:02:15] so get yerself a slot for tomorrow, let me know when it is, if tz works I'll be here, etc [21:02:25] same bat time, same bat channel [21:02:47] (03CR) 10Krinkle: [C: 032] wmfstatic: Match cache behaviour of "/static" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270854 (owner: 10Krinkle) [21:03:13] Testing the above in beta and will roll out afterward if all ok [21:03:14] (03Merged) 10jenkins-bot: wmfstatic: Match cache behaviour of "/static" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270854 (owner: 10Krinkle) [21:03:26] shouldn't affect prod (path is still unused) [21:04:01] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: puppet fail [21:04:30] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: puppet fail [21:04:34] (03PS2) 10Ottomata: Apply analytics_cluster::database::meta role on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/270855 [21:04:58] hmm looking into puppet fail [21:08:12] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: puppet fail [21:08:57] (03PS3) 
10Ottomata: Apply analytics_cluster::database::meta role on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/270855 [21:10:01] oh and twentyafterfour: send mail about the slot. [21:10:04] (03PS1) 10Ottomata: Remove unused anayltics role from analytics kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/270857 (https://phabricator.wikimedia.org/T109859) [21:10:04] today even :-P :-D [21:10:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 644 [21:10:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 645 [21:10:35] apergos: about the next maintenance? [21:10:38] yep [21:10:50] well about the puppet test window [21:11:01] (03CR) 10Ottomata: [C: 032] Apply analytics_cluster::database::meta role on analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/270855 (owner: 10Ottomata) [21:11:03] because if it goes back there might be a short outage. I know, small possibilitiy etc but. [21:11:07] *goes bad [21:11:09] still not sure when is best to do it but I'll try to find a good time tomorrow (difficult because of busy deployment day) [21:11:14] yep I know [21:11:22] if tomorrow doesn't work then Wed I guess [21:11:31] just lemme know [21:11:43] wednesday I have a slot already scheduled ... [21:11:43] (03PS2) 10Ottomata: Remove unused anayltics role from analytics kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/270857 (https://phabricator.wikimedia.org/T109859) [21:11:49] (03CR) 10Ottomata: [C: 032 V: 032] Remove unused anayltics role from analytics kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/270857 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:11:53] ah right for phab [21:12:03] backup::set { 'srv-phab-repos': } <- does it use the srv-phab-repos to figure out the backup path? [21:12:14] let's see [21:12:20] so I can update that to srv-repos .... 
I think [21:13:26] bacula::director::fileset { 'srv-phab-repos': [21:13:26] includes => [ '/srv/phab/repos' ], [21:13:26] } [21:13:29] so no [21:13:40] but hm then.... [21:13:49] then hrm ew [21:14:00] manifests/role/backup.pp [21:14:06] it's there too so if you were to update things [21:14:13] (03PS15) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) [21:14:15] you'd need to change the name ot be nice and update the path there, it seems [21:14:38] ah I see [21:14:44] yes, that's currently in use [21:15:10] RECOVERY - check_mysql on lutetium is OK: Uptime: 3040106 Threads: 1 Questions: 23594841 Slow queries: 56010 Opens: 126643 Flush tables: 3 Open tables: 64 Queries per second avg: 7.761 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [21:15:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 662 [21:15:13] that seems like an odd indirection of the config - why not just pass the path as a parameter to backup::set? [21:16:30] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:22] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [21:19:50] (03PS1) 10Ottomata: Move burrow role to kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/270859 [21:20:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 2353314 Threads: 1 Questions: 16287484 Slow queries: 15682 Opens: 5259 Flush tables: 2 Open tables: 399 Queries per second avg: 6.921 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [21:20:12] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[21:20:48] (03CR) 10jenkins-bot: [V: 04-1] Move burrow role to kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/270859 (owner: 10Ottomata) [21:20:52] (03PS2) 10Ottomata: Move burrow role to kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/270859 (https://phabricator.wikimedia.org/T109859) [21:22:13] (03CR) 10jenkins-bot: [V: 04-1] Move burrow role to kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/270859 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:24:33] (03PS3) 10Ottomata: Move burrow role to kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/270859 (https://phabricator.wikimedia.org/T109859) [21:25:49] twentyafterfour: ask a kosiaris about it sometime I guess [21:26:31] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:26:47] !log krinkle@tin Synchronized w/static.php: (no message) (duration: 00m 58s) [21:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:20] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[21:28:06] (03CR) 10Ottomata: [C: 032] Move burrow role to kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/270859 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [21:29:12] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:30:39] (03PS1) 10Ottomata: Mv burrow hiera configs to proper place [puppet] - 10https://gerrit.wikimedia.org/r/270861 [21:31:30] (03CR) 10Ottomata: [C: 032 V: 032] Mv burrow hiera configs to proper place [puppet] - 10https://gerrit.wikimedia.org/r/270861 (owner: 10Ottomata) [21:35:02] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:37:21] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: puppet fail [21:38:59] (03PS1) 10Andrew Bogott: Wikitech: Turn down nutracker logging [puppet] - 10https://gerrit.wikimedia.org/r/270866 [21:40:22] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0] [21:47:31] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:01:29] (03CR) 10Ori.livneh: [C: 031] Wikitech: Turn down nutracker logging [puppet] - 10https://gerrit.wikimedia.org/r/270866 (owner: 10Andrew Bogott) [22:02:06] (03CR) 10Andrew Bogott: [C: 032] Wikitech: Turn down nutracker logging [puppet] - 10https://gerrit.wikimedia.org/r/270866 (owner: 10Andrew Bogott) [22:05:51] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:16:12] bblack, i just added you to the caching task - this way we will make better http://en.wikipedia.beta.wmflabs.org/wiki/Sparql [22:21:42] andrewbogott, around? [22:21:54] Krenair: what’s up? 
[22:22:14] andrewbogott, I'd like to look into the issues with phab-01.phabricator.eqiad.wmflabs and phab-02 [22:22:28] sure, ok [22:22:30] however login and puppet are broken, so my key hasn't been added to root there like phab-03 etc. [22:22:40] I'm about to head off to sleep but can someone ping me with the results? [22:22:49] because that impacts my getting salt running on them [22:23:02] oh, salt isn't working there? [22:23:05] maybe my key isn't on them [22:23:08] nope it's not [22:23:08] Is there any historical context for how they broke in the first place? [22:23:18] I don't have any [22:23:18] without ssh or salt... [22:23:27] the next set of possible solutions are pretty ugly [22:24:05] sorry, is there a ticket for this? Or some background? [22:24:19] nope [22:24:23] we should file one at some point [22:24:27] and I guess I should dig out the irc logs [22:24:32] https://phabricator.wikimedia.org/T126323 here's where I listed known broken instances for me [22:24:55] included ate phab02 and phab01 [22:25:40] *are [22:26:36] if someone makes a specific task for phab01 and 02 maybe they can link it to that one [22:26:55] andrewbogott, what are the next possible solutions when this happens to instances? [22:27:14] user ssh, root ssh, salt... [22:27:23] reboot, try all of the above again... [22:27:31] oh, I thought there'd be something else [22:27:41] then we get into weird territory like ‘shut down instance, mount drive on the virt host, go digging in the file system' [22:27:43] doesn't nova provide an interactive console? [22:28:08] some nova implementations do, ours does not at the moment. [22:28:19] yeah it's on my wish list [22:28:25] "equiv to mgmt console" [22:28:26] It’s fine for me to reboot those instances, right? 
[22:28:41] all right, I'm really out, good luck fols
[22:28:43] folks
[22:28:52] Well I think they're still running web services
[22:28:58] but it's all testing/dev stuff
[22:29:07] ok
[22:29:11] I would just reboot
[22:31:11] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures
[22:35:28] ok, I have salt access at least
[22:35:33] andrewbogott, no luck with ssh after they rebooted, anything with salt?
[22:35:37] ok
[22:35:57] puppet is disabled on 01
[22:36:02] so that explains… roughly everything
[22:40:25] hashar: The php53 tests are failing. Please see https://phabricator.wikimedia.org/T125965
[22:50:59] Krenair: I stashed the mess that was in /var/lib/git/operations/puppet on the phabricator master...
[22:51:09] moved to a different branch, fetched and updated
[22:51:21] enabled puppet on phab-01, ran it, signed the key, ran it again...
[22:51:31] and now you can ssh in
[22:52:39] I think twentyafterfour was working on that
[22:55:40] paladox: no idea
[22:55:50] hashar: Oh/
[22:55:54] paladox: the task got hijacked so I reverted the recent changes
[22:56:08] paladox: it pointed to https://gerrit.wikimedia.org/r/#/c/269310/ which does not show much failure
[22:56:10] but seems some git clone takes age
[22:56:17] anyway midnight here
[22:56:29] and it is time for a sleep :-} I am not too worried a recheck will get it properly
[22:56:33] hashar: Oh ok/
[22:57:14] Krenair: why do you care about these instances?
[22:57:20] paladox: https://gerrit.wikimedia.org/r/#/c/269310/ got merged already :-}  no need to 'recheck' it!
[22:57:37] hashar: Ok.
[22:57:38] Sorry.
[22:57:40] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:58:00] been attempting to do a bit of phabricator development, and apparently these were up but nobody could get in
[22:58:21] figured someone with salt access might be able to sort it out
[23:01:00] I don't have a huge attachment to these particular instances but I hate to leave broken stuff lying around when I can try to get people to fix it
[23:01:58] Krenair, 02 doesn’t have the default security group but it does have a security group which is described as 'Alternate ssh port for regular shell access when git is on port 22'
[23:02:03] to which I say: I am done working on this one
[23:02:41] both instances are throwing a bunch of puppet errors, but should at least be installing keys
[23:09:48] andrewbogott, salt is working there right?
[23:10:26] (updating ariel's ticket)
[23:10:45] Krenair: salt is working, probably ssh is too, maybe, somewhere other than port 22
[23:11:28] aha
[23:11:31] ssh phab-02 -p 222
[23:11:31] works
[23:12:27] (CR) Cenarium: "I forgot to mention it, but also: bots and sysops shouldn't be autopromoted to the new usergroup." [mediawiki-config] - https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: Alex Monk)
[23:14:16] (CR) Alex Monk: "Yes, it took me a while to understand that properly. Instead of those users (sysops/bots) being autopromoted to the group, their groups ha" [mediawiki-config] - https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: Alex Monk)
[23:15:09] (CR) Alex Monk: "Oh, you mean having either of those groups should prevent you from being autopromoted?" [mediawiki-config] - https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: Alex Monk)
[23:17:20] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[23:22:31] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[23:30:16] (PS2) ArielGlenn: dumps mirroring tool, don't assume dest is local filesystem [dumps] (ariel) - https://gerrit.wikimedia.org/r/270001
[23:31:51] ok now gone for realz
[23:38:11] PROBLEM - Disk space on cp3040 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=87%)
[23:46:12] (CR) Tim Starling: "Well, you have to choose one or the other. If .git is group-writable and you git pull as root, then that would give escalation from wikide" [puppet] - https://gerrit.wikimedia.org/r/270026 (owner: Subramanya Sastry)