[00:03:33] (03PS2) 10Dzahn: Minor tweaks to my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/238191 (owner: 10Chad) [00:03:51] !log tstarling@tin Synchronized php-1.26wmf22/extensions/ParsoidBatchAPI: for I56d28e9a for RT testing, not live yet (duration: 00m 13s) [00:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:04:53] (03CR) 10Dzahn: [C: 032] Minor tweaks to my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/238191 (owner: 10Chad) [00:08:31] (03PS3) 10Dzahn: contint: for Jessie s/ruby1.9.3/ruby2.1/ [puppet] - 10https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: 10Hashar) [00:08:59] (03Abandoned) 10Dzahn: contint: add nagios contact group in hiera common [puppet] - 10https://gerrit.wikimedia.org/r/237301 (owner: 10Dzahn) [00:09:37] (03PS2) 10Dzahn: WIP: just testing something [puppet] - 10https://gerrit.wikimedia.org/r/237412 (owner: 10Alexandros Kosiaris) [00:11:11] (03CR) 10Tim Starling: [C: 032] Update personal .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/238363 (owner: 10Tim Starling) [00:11:41] (03PS4) 10Dzahn: contint: for Jessie s/ruby1.9.3/ruby2.1/ [puppet] - 10https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: 10Hashar) [00:13:04] (03CR) 10Dzahn: [C: 032] "really, no merge since June? https://phabricator.wikimedia.org/T103600 is already resolved" [puppet] - 10https://gerrit.wikimedia.org/r/220308 (https://phabricator.wikimedia.org/T103600) (owner: 10Hashar) [00:16:59] (03CR) 10Dzahn: "@gwicke is this correct and still current? both exist, test.wikiPedia and test.wikiMedia, or would we add both?" 
[puppet] - 10https://gerrit.wikimedia.org/r/236687 (owner: 10Alex Monk) [00:18:45] (03PS2) 10Dzahn: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [00:22:20] (03PS2) 10Dzahn: hiera: Remove phab-02 data [puppet] - 10https://gerrit.wikimedia.org/r/234808 (owner: 10Negative24) [00:22:42] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640012 (10JMinor) I tried to create an account at the link you provided, however I get this error: Account creation error The user name "JMino... [00:23:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640016 (10JMinor) a:5JMinor>3Dzahn [00:23:46] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640018 (10Dzahn) @JMinor please use just "jminor" here and don't worry about the WMF thing. This will also be your shell user. [00:24:20] (03PS3) 10Dzahn: hiera: Remove phab-02 data [puppet] - 10https://gerrit.wikimedia.org/r/234808 (owner: 10Negative24) [00:24:22] (03CR) 10GWicke: [C: 031] "Judging by each wiki's identical content and recent changes, they seem to map to the same wiki. 
If so, using one of the two names should b" [puppet] - 10https://gerrit.wikimedia.org/r/236687 (owner: 10Alex Monk) [00:25:16] (03CR) 10Dzahn: [C: 032] hiera: Remove phab-02 data [puppet] - 10https://gerrit.wikimedia.org/r/234808 (owner: 10Negative24) [00:26:14] (03PS2) 10Dzahn: Fix restbase on test.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/236687 (owner: 10Alex Monk) [00:27:37] (03CR) 10Dzahn: [C: 032] Fix restbase on test.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/236687 (owner: 10Alex Monk) [00:28:30] (03PS2) 10Dzahn: udp2log: Use the DNS name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/234978 (owner: 10Muehlenhoff) [00:29:18] (03CR) 10Dzahn: [C: 031] udp2log: Use the DNS name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/234978 (owner: 10Muehlenhoff) [00:29:23] (03PS3) 10Dzahn: udp2log: Use the DNS name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/234978 (owner: 10Muehlenhoff) [00:31:50] (03CR) 10Dzahn: [C: 031] Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [00:31:54] (03PS2) 10Dzahn: Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [00:33:07] (03PS2) 10Dzahn: Kill pa.us.wikimedia.org from dns [dns] - 10https://gerrit.wikimedia.org/r/227173 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [00:33:13] (03CR) 10Dzahn: [C: 031] Kill pa.us.wikimedia.org from dns [dns] - 10https://gerrit.wikimedia.org/r/227173 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [00:33:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640032 (10Krenair) The important thing is that the shell username (I think it's 
called 'instance shell account name' on the signup form) matches... [00:35:24] (03CR) 10Dzahn: "@ArielGlenn it's not dumps but also not misc-web, but text-lb meanwhile. abandon?" [dns] - 10https://gerrit.wikimedia.org/r/120999 (owner: 10ArielGlenn) [00:35:55] (03CR) 10Dzahn: [C: 04-2] download.wikimedia.org moved to misc-web [dns] - 10https://gerrit.wikimedia.org/r/120999 (owner: 10ArielGlenn) [00:36:01] (03PS1) 10EBernhardson: alt language search configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238366 [00:38:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [00:41:16] (03CR) 10EBernhardson: [C: 032] alt language search configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238366 (owner: 10EBernhardson) [00:41:23] (03Merged) 10jenkins-bot: alt language search configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238366 (owner: 10EBernhardson) [00:41:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640052 (10Dzahn) a:5Dzahn>3JMinor Yes, you can use anything you like if you want a different shell user name, it doesn't have to be identica... [00:43:16] RECOVERY - DPKG on lvs3001 is OK: All packages OK [00:43:19] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-labs.php: noop sync of labs config change (duration: 00m 11s) [00:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:46:43] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:47:23] 6operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1640063 (10Dzahn) p:5Normal>3Low [00:48:27] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:48:27] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:57] !log reinstalling lvs300[34] to jessie [00:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:50:05] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [00:50:05] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [00:51:56] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:05] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [01:00:05] (03Abandoned) 10BBlack: Kill pa.us.wikimedia.org from dns [dns] - 10https://gerrit.wikimedia.org/r/227173 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [01:01:06] (03PS2) 10BBlack: remove www.nl and noboard.chapters [dns] - 10https://gerrit.wikimedia.org/r/237190 (https://phabricator.wikimedia.org/T102826) [01:01:30] (03CR) 10BBlack: [C: 032] remove www.nl and noboard.chapters [dns] - 10https://gerrit.wikimedia.org/r/237190 (https://phabricator.wikimedia.org/T102826) (owner: 10BBlack) [01:02:13] (03PS2) 10BBlack: Remove redirects for www.nl, noboard.chapters [puppet] - 10https://gerrit.wikimedia.org/r/237192 (https://phabricator.wikimedia.org/T102826) [01:02:43] (03CR) 10BBlack: [C: 032 V: 032] Remove redirects for www.nl, noboard.chapters [puppet] - 10https://gerrit.wikimedia.org/r/237192 (https://phabricator.wikimedia.org/T102826) (owner: 10BBlack) [01:03:15] 6operations, 10Traffic, 7HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#1640095 (10BBlack) [01:03:15] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1640096 (10BBlack) [01:03:17] 6operations, 6Community-Advocacy, 10Traffic, 5Patch-For-Review: Fix/decom multiple-subdomain wikis in wikimedia.org - https://phabricator.wikimedia.org/T102826#1640093 (10BBlack) 5Open>3Resolved 
a:3BBlack [01:03:58] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1640098 (10BBlack) [01:04:00] 6operations, 10Traffic, 7HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#1411365 (10BBlack) [01:04:13] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1374642 (10BBlack) [01:04:15] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1374553 (10BBlack) [01:07:33] (03CR) 10GWicke: [C: 031] cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [01:07:40] (03PS3) 10GWicke: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [01:27:57] legoktm: deploying I presume? I'm pushing a RL core fix [01:28:07] (in a minute) [01:28:21] Krinkle: in a bit, you should go first [01:28:26] k [01:30:38] !log krinkle@tin Synchronized php-1.26wmf22/resources/src: T112287 (duration: 00m 11s) [01:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:33:20] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1640126 (10BBlack) Since we've had a second cert expire on us unexpectedly now in a span of a few days, I went ahead and auditing the expiries on all of the cert files stored in puppet's files/ssl/ dir... [01:40:30] Krinkle: are you finished? 
[01:40:37] Woops, let me check [01:40:59] Yes [01:41:02] legoktm: Go ahead :) [01:43:58] PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100% [01:44:19] RECOVERY - Host lvs3004 is UP: PING OK - Packet loss = 0%, RTA = 88.19 ms [01:44:49] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1640146 (10BBlack) [01:45:10] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1215460 (10BBlack) Backup LVSes in ulsfo and esams upgraded. Should fail over to them for testing before upgrading the primaries. [01:51:52] uhoh [01:51:55] that totally didn't work [01:51:58] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2021_v6 [01:52:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640158 (10JMinor) a:5JMinor>3Dzahn Okay, confirmed account for jminor is there. Thanks for the help guys.
[01:53:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640160 (10Dzahn) Confirmed too :) user is there and the UID is: 12948 [01:53:27] (03CR) 10Thcipriani: [C: 032] "Innocuous change, adds to what already exists" [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 (owner: 10Thcipriani) [01:53:29] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [01:53:45] (03Merged) 10jenkins-bot: Add service deploy via scap [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 (owner: 10Thcipriani) [01:55:12] (03PS2) 10Dzahn: admin: create shell account for Joshua Minor [puppet] - 10https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) [01:55:27] (03CR) 10Thcipriani: [C: 032] Add pattern-matching arg to limit deploy hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/238208 (owner: 10Thcipriani) [01:55:41] (03Merged) 10jenkins-bot: Add pattern-matching arg to limit deploy hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/238208 (owner: 10Thcipriani) [01:56:27] 6operations, 10Wikimedia-Mailing-lists, 7user-notice: announce scheduled downtime - https://phabricator.wikimedia.org/T110133#1640166 (10Dzahn) scheduled next one for: Friday, September 18, 2015 at 2:00:00 PM UTC (Friday, September 18, 2015 at 7:00:00 AM PDT) [01:59:01] (03CR) 10Tim Landscheidt: [C: 031] toollabs: remove redis Sysctl[vm.overcommit_memory] [puppet] - 10https://gerrit.wikimedia.org/r/237895 (owner: 10Merlijn van Deen) [02:01:55] (03PS1) 10Dzahn: admin: add jminor to research,stats,analytics-priv [puppet] - 10https://gerrit.wikimedia.org/r/238376 (https://phabricator.wikimedia.org/T111872) [02:02:06] (03PS2) 10Dzahn: toollabs: remove redis Sysctl[vm.overcommit_memory] [puppet] - 10https://gerrit.wikimedia.org/r/237895 (owner: 10Merlijn van Deen) [02:02:32] 6operations: audit all SSL certificates expiry on ops tracking gcal - 
https://phabricator.wikimedia.org/T112542#1640174 (10scfc) Wasn't there an Icinga check that tested that certificates were good for another x days? [02:02:33] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1640172 (10Dzahn) There are 2 patches in code review now. I will finish this tomorrow. [02:03:37] (03CR) 10Dzahn: [C: 032] "yes, same thing, duplicate, and belongs in the redis module" [puppet] - 10https://gerrit.wikimedia.org/r/237895 (owner: 10Merlijn van Deen) [02:06:59] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4015_v6 [02:08:29] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1640176 (10BBlack) We only have that icinga check on the primary unified cert, which covers the production endpoints for: - wikipedia.org - mediawiki.org - wikibooks.org - wikidata.org - wikimediafoun... 
[02:08:39] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:11:07] (03PS1) 10Dzahn: mailman: don't exclude "last_mailman_version" [puppet] - 10https://gerrit.wikimedia.org/r/238379 [02:18:22] !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo: Revert Echo to 1.26wmf21 state (duration: 00m 12s) [02:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:18:43] !log legoktm@tin Synchronized php-1.26wmf22/extensions/MobileFrontend: Revert Echo to 1.26wmf21 state (duration: 00m 11s) [02:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:21:14] (03PS2) 10Dzahn: mailman: don't exclude "last_mailman_version" [puppet] - 10https://gerrit.wikimedia.org/r/238379 [02:21:37] (03CR) 10Dzahn: [C: 032] mailman: don't exclude "last_mailman_version" [puppet] - 10https://gerrit.wikimedia.org/r/238379 (owner: 10Dzahn) [02:31:46] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1640205 (10Dzahn) 5Resolved>3Open reopening because we keep getting warning emails that it's about to expire [02:32:29] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1640211 (10Dzahn) Robh has mailed Stephen 4 days ago and again today. [02:33:28] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1640215 (10Dzahn) a:5RobH>3Slaporte @slaporte please advise **If this Domain Name is not renewed by 21 Sep 2015, the domain name will be DEACTIVATED** [02:38:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[02:40:04] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 10m 53s) [02:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:42:06] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1640250 (10Jdforrester-WMF) [02:46:50] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-15 02:46:50+00:00 [02:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:01:21] legoktm: you deploying? [03:01:37] operations/mediawiki-config? [03:02:38] he was sending out an emergency echo revert I think [03:02:47] ah [03:04:29] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:06:18] mafk: I finished deploying [03:23:32] Hm.. why is it that http://wikimedia.com./ and http://wikimedia.xyz./ don't redirect and show "Domain unconfigured", but the non-redirect domains like http://wikipedia.org./ or http://wikimedia.org./ do [03:23:55] there's something about the trailing dot that only turns into a redirect on the main domains, not the redirect domains [03:27:39] RECOVERY - Disk space on labstore1002 is OK: DISK OK [03:31:38] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3016_v6 [03:33:18] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [03:46:48] mutante: thanks!
[04:04:09] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:21:11] (03PS1) 10Yuvipanda: quarry: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/238387 [04:21:46] (03PS2) 10Yuvipanda: quarry: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/238387 [04:21:52] (03CR) 10Yuvipanda: [C: 032 V: 032] quarry: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/238387 (owner: 10Yuvipanda) [04:22:09] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [100000000.0] [04:22:50] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [04:43:50] (03CR) 10Dzahn: "hashar: should i restore ?:)" [puppet] - 10https://gerrit.wikimedia.org/r/237301 (owner: 10Dzahn) [04:44:07] kart_: welcome [04:46:33] apergos: (for later) this is probably 'abandon', but is it? https://gerrit.wikimedia.org/r/#/c/120999/ #oldgerrit [04:51:50] (03CR) 10Dzahn: "anyone want to review creation of new shell user? 
check if the key is here is the same as on the ticket and if the user who put it on the " [puppet] - 10https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) (owner: 10Dzahn) [05:02:09] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [05:50:19] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:10:52] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 15 06:10:52 UTC 2015 (duration 10m 51s) [06:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:20:19] 6operations, 3Discovery-Maps-Sprint: Puppets have not run on maps-test{2,3,4} in many weeks - https://phabricator.wikimedia.org/T112613#1640445 (10Yurik) 3NEW a:3akosiaris [06:29:40] 6operations, 3Discovery-Maps-Sprint: Puppet has not run on maps-test{2,3,4} in many weeks - https://phabricator.wikimedia.org/T112613#1640464 (10yuvipanda) [06:30:09] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: puppet fail [06:30:10] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail [06:30:48] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [06:31:09] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:10] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:10] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:29] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:59] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:00] PROBLEM - puppet last run on db2055 is CRITICAL: 
CRITICAL: Puppet has 1 failures [06:32:38] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:39] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:49] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:09] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:20] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:20] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:29] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:34] <_joe_> uhm this is getting worse [06:36:03] heh, you were on IRC much before it happened today [06:36:10] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:35] <_joe_> akosiaris: I guess no progress on upgrading the puppetmasters, right? 
[06:49:47] (03PS1) 10Giuseppe Lavagetto: Avoid including webscalesqlclient completely [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238390 [06:56:49] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:56:49] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:56:49] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:57:28] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:29] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:57:50] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:50] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:50] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:51] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:58:08] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:08] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet 
is currently enabled, last run 1 minute ago with 0 failures [06:58:19] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:20] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:58:29] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:30] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:59] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:10:20] (03PS4) 10Muehlenhoff: udp2log: Use the DNS name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/234978 [07:14:37] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1640548 (10jcrespo) [07:18:20] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=429.70 Read Requests/Sec=854.89 Write Requests/Sec=236.43 KBytes Read/Sec=3979.01 KBytes_Written/Sec=945.71 [07:26:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] udp2log: Use the DNS name in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/234978 (owner: 10Muehlenhoff) [07:27:41] 6operations: bios defaults on new hardware orders - https://phabricator.wikimedia.org/T112627#1640623 (10fgiunchedi) 3NEW a:3RobH [07:28:15] 6operations, 10ops-codfw, 5Patch-For-Review: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#1640637 (10fgiunchedi) 5Open>3Resolved resolving, I've opened {T112627} to track the takeaways above [07:31:59] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=181.80 Read Requests/Sec=282.22 Write Requests/Sec=126.27 KBytes Read/Sec=13345.05 KBytes_Written/Sec=984.22 [07:36:27] 6operations, 6Phabricator, 7Database, 
5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1640645 (10jcrespo) My original complain, which is the service loss due to the spikes is solved at mysql configuration, and other changes done at application side... [07:36:54] (03PS2) 10Giuseppe Lavagetto: poolcounter: Add configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238099 [07:42:54] (03PS2) 10Giuseppe Lavagetto: Avoid including webscalesqlclient completely [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238390 [07:47:00] (03CR) 10Filippo Giunchedi: [C: 031] Raise default conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: 10Muehlenhoff) [07:55:32] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1640690 (10hashar) Can you add this task to the next week meeting agenda please? No worries there... [07:58:26] 6operations, 10Beta-Cluster, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1640693 (10hashar) Following on @Dzahn comment, should probably use Jessie instead of Trusty. If so: * rephrase the task summary * remove blocker {T65899} * maybe cre... 
[07:59:42] (03CR) 10Filippo Giunchedi: [C: 031] poolcounter: Add configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238099 (owner: 10Giuseppe Lavagetto) [08:00:49] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=187.92 Read Requests/Sec=1603.01 Write Requests/Sec=82.23 KBytes Read/Sec=8408.03 KBytes_Written/Sec=2447.39 [08:03:59] (03CR) 10Hashar: [C: 031] "I have cherry picked it on beta cluster and integration puppet masters:" [puppet] - 10https://gerrit.wikimedia.org/r/238221 (https://phabricator.wikimedia.org/T112537) (owner: 10Hashar) [08:04:18] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.70 Read Requests/Sec=0.40 Write Requests/Sec=0.30 KBytes Read/Sec=1.60 KBytes_Written/Sec=1.20 [08:06:48] (03PS1) 1020after4: Beginnings of some scap3 documentation [tools/scap] - 10https://gerrit.wikimedia.org/r/238391 (https://phabricator.wikimedia.org/T109515) [08:07:09] (03CR) 10jenkins-bot: [V: 04-1] Beginnings of some scap3 documentation [tools/scap] - 10https://gerrit.wikimedia.org/r/238391 (https://phabricator.wikimedia.org/T109515) (owner: 1020after4) [08:09:10] <_joe_> twentyafterfour: around? how do I test a mediawiki-config change in deployment-prep? I just merge it? 
[08:09:43] _joe_: yeah I think so [08:10:15] I mean, I think deployment-prep automatically syncs every commit [08:10:29] <_joe_> ok [08:14:26] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall" (034 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) (owner: 10Giuseppe Lavagetto) [08:17:01] (03PS2) 1020after4: Beginnings of some scap3 documentation [tools/scap] - 10https://gerrit.wikimedia.org/r/238391 (https://phabricator.wikimedia.org/T109515) [08:17:18] (03PS9) 10Muehlenhoff: Raise default conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) [08:18:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Raise default conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: 10Muehlenhoff) [08:18:53] (03PS3) 1020after4: Beginnings of some scap3 documentation [tools/scap] - 10https://gerrit.wikimedia.org/r/238391 (https://phabricator.wikimedia.org/T112554) [08:19:23] (03PS3) 10Giuseppe Lavagetto: poolcounter: Add configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238099 [08:20:33] (03CR) 10Giuseppe Lavagetto: [C: 032] poolcounter: Add configuration for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238099 (owner: 10Giuseppe Lavagetto) [08:22:07] !log bumped default size of iptables connection tracking table to 256k [08:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:24:36] !log bounce ms-be2006, xfs [08:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:28:30] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 988 MB (2% inode=58%) [08:29:29] _joe_: ^ [08:30:43] I'll also free up some diskspace, there are some Linux kernel builds which can be yanked [08:31:52] (03PS1) 10Alexandros Kosiaris: maps: Fix group membership for postgres log [puppet] - 
10https://gerrit.wikimedia.org/r/238392 (https://phabricator.wikimedia.org/T112613) [08:32:46] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Fix group membership for postgres log [puppet] - 10https://gerrit.wikimedia.org/r/238392 (https://phabricator.wikimedia.org/T112613) (owner: 10Alexandros Kosiaris) [08:32:57] mhh also part of the problem is the unused 400g vg [08:36:59] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [08:38:52] indeed :-) [08:40:20] 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: Puppet has not run on maps-test{2,3,4} in many weeks - https://phabricator.wikimedia.org/T112613#1640783 (10akosiaris) 5Open>3Resolved [08:40:29] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - free space: / 29 MB (0% inode=61%) [08:40:38] _joe_: not nothing yet [08:40:45] s/not/no,/ [08:41:32] <_joe_> uhm copper must be me [08:42:09] RECOVERY - Disk space on copper is OK: DISK OK [08:42:27] <_joe_> akosiaris: so, copper just has a 40 GB raid root partition [08:42:32] <_joe_> where is all the rest of the disk? [08:42:47] <_joe_> should I build a partition for /tmp where package building happens? [08:43:12] IIRC that was caused by a partman bug/misconfig for early jessie installs [08:43:39] <_joe_> md1 : active (auto-read-only) raid1 sdb2[1] sda2[0] [08:44:25] <_joe_> oh that's the lvm partition [08:44:49] <_joe_> ok, I'm adding a /tmp on that space [08:46:13] did we manage to run out of space on copper ? 
[08:46:15] * moritzm has freed 5GB [08:46:38] akosiaris: yes, Icinga pinged us at 10:28 [08:46:52] ah yes just saw it [08:47:34] so, /var/cache/pbuilder/result is basically garbage, it can be cleaned up [08:48:14] _joe_: I 'd say add LV extents to / [08:48:31] we got like 400G free on that box [08:49:33] <_joe_> akosiaris: / is on a raid1 partition [08:49:36] <_joe_> not on LV [08:49:59] <_joe_> so I'd say we create a separate partition for /var/cache/pbuilder as well if we want to [08:50:13] <_joe_> /tmp is fundamental as it's the place where builds take place [08:52:18] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [08:53:16] <_joe_> !log created a 100 G partition on a LV on copper, for /tmp [08:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:14] sigh, indeed [08:54:15] yeah OK [08:56:44] (03PS6) 10Alexandros Kosiaris: rubocop: do not run for upstream code [puppet] - 10https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [08:56:53] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] rubocop: do not run for upstream code [puppet] - 10https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [08:58:20] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1640793 (10akosiaris) Done. I noticed that git submodules are being excluded right now in https://gerrit.wikimedia.org/r/#/c/235695/6/.rubocop.yml,cm.... [08:59:42] (03CR) 10Hoo man: [C: 031] "@Lokal profil: Please remember keeping https://github.com/Wikimedia-Sverige/DCAT in sync." 
[puppet] - 10https://gerrit.wikimedia.org/r/229136 (owner: 10Lokal Profil) [09:01:51] (03PS1) 10Bene: Whitelist m.wikidata.org for central auth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) [09:02:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:06:01] (03PS1) 10Filippo Giunchedi: cassandra: force dependency on openjdk-8-jdk [puppet] - 10https://gerrit.wikimedia.org/r/238395 [09:07:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: force dependency on openjdk-8-jdk [puppet] - 10https://gerrit.wikimedia.org/r/238395 (owner: 10Filippo Giunchedi) [09:09:21] (03PS3) 10ArielGlenn: Localisation updates from translatewiki.net [puppet] - 10https://gerrit.wikimedia.org/r/229136 (owner: 10Lokal Profil) [09:10:48] there's a report of cert expiry for extdist.wmflabs.org [09:10:48] (03CR) 10ArielGlenn: [C: 032] Localisation updates from translatewiki.net [puppet] - 10https://gerrit.wikimedia.org/r/229136 (owner: 10Lokal Profil) [09:10:53] [02:14] Hi, I just want to tell you that there is a problem when downloading an mediawiki extension. --2015-09-15 10:57:29-- https://extdist.wmflabs.org/dist/extensions/LdapAuthentication-REL1_25-d4db6f0.tar.gz Resolving extdist.wmflabs.org (extdist.wmflabs.org)... 208.80.155.156 Connecting to extdist.wmflabs.org (extdist.wmflabs.org)|208.80.155.156|:443... connected. ERROR:... [09:10:54] ...cannot verify extdist.wmflabs.org's certificate, issued by [09:11:50] <_joe_> bawolff: yes it's known, see -labs channel topic :( [09:11:59] ah, thanks [09:12:51] <_joe_> yeah, pretty sad :/ [09:13:05] (03CR) 10Hoo man: [C: 031] "After looking at the CentralAuth code briefly: This should work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [09:13:28] yeah... [09:17:23] well I guess at least most tools aren't strict-transport-security... 
[09:17:27] on the bright side [09:22:31] any more oddball certificates set to expire soon? ;-) [09:32:03] Hi there is a problem with the extension download [09:32:14] --2015-09-15 10:57:29-- https://extdist.wmflabs.org/dist/extensions/LdapAuthentication-REL1_25-d4db6f0.tar.gz Resolving extdist.wmflabs.org (extdist.wmflabs.org)... 208.80.155.156 Connecting to extdist.wmflabs.org (extdist.wmflabs.org)|208.80.155.156|:443... connected. ERROR: cannot verify extdist.wmflabs.org's certificate, issued by ‘/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3’: [09:32:21] Issued certificate has expired. [09:33:22] I told this before in the normal channel and they want me to post this here too. You can ignore the certificate error and then you will be able to download the file. [09:33:34] <_joe_> askdklasjklfjals: we know that [09:34:17] assuming people will read the topic ;-) [09:34:54] although I guess it might even be midnight UTC if it's midday west coast time :/ [09:35:29] (03CR) 10Hoo man: [C: 031] "There might be more, but this looks good for starters." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [09:36:41] (03PS1) 10Bene: DNM Enable automatic redirect to mobile Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) [09:39:06] (03CR) 10Aude: "let's put this at swat today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [09:41:53] Can't ExtensionDistributor be pointed to http in the meanwhile? 
[09:41:58] (03PS1) 10Yurik: Updated Kartotherian & Tilerator ports [puppet] - 10https://gerrit.wikimedia.org/r/238399 [09:42:41] <_joe_> Nemo_bis: that's a cure that would be worse than the problem, IMO [09:42:54] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1640871 (10jcrespo) [09:44:23] only for knowledgeable users [09:44:33] <_joe_> https is needed to ensure identity of the distributor, not just for encryption [09:44:35] akosiaris, ^^^ - we can upgrade earlier if possible, we are going public in a few days [09:44:48] <_joe_> especially for non-knowledgeable users :) [09:45:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [09:45:33] I don't know. I think teaching users to ignore certificate warnings is more dangerous then temporarily disabling https while its broken [09:45:55] yurik: what ? [09:46:07] akosiaris, https://gerrit.wikimedia.org/r/#/c/238399/ [09:46:41] lemme get this correctly, we want to change ports because some viruses use them as well ? [09:46:58] akosiaris, it was annoying because i couldn't even do localhost:4000 [09:47:00] try that :) [09:47:14] ERR_CONNECTION_REFUSED [09:47:18] as expected [09:47:34] thankfully, I would be very suprised if something was listening there [09:47:53] <_joe_> bawolff: I think we should not teach users to ignore cert warnings [09:48:12] yurik: maybe something to do with your laptop ? [09:48:15] <_joe_> I don't really get how you could infer that suggestion from what I said. [09:48:33] well that's what's happening now. To quote #mediawiki [09:48:34] [02:16] But I ignored the certificate error and was able to download the file via wget [09:48:50] yes! [09:48:51] akosiaris, i saw it multiple times in chrome - a different error, something like "connection not allowed" [09:48:57] due to restrictions [09:49:05] <_joe_> yurik: so it's your antivirus? 
[09:49:12] i don't have one in ubuntu [09:49:17] yurik: what restictions ? [09:49:23] Users need files. If the cert is broken, telling people to download it anyways, teaches them to ignore certificate errors [09:49:25] port is retstricted [09:49:34] let me see how i got it before, esc [09:49:36] sec [09:49:36] <_joe_> yurik: which port? [09:49:38] 4000 [09:49:44] <_joe_> yurik: I doubt it [09:49:46] also 6000 was causing it [09:49:57] try http://localhost:6000/ [09:50:05] it shows ERR_UNSAFE_PORT [09:50:37] so i decided to find a port that was not well known - so picked 6533 & 6534 [09:50:45] safe, unknown, good for the two new services [09:50:51] we don't have to change production [09:50:54] --explicitly-allowed-ports=xxx [09:51:03] but would be nice to keep it consistent [09:51:05] we do have to change production [09:51:17] ? [09:51:30] well, LVS, monitoring, varnish [09:51:33] multiple places [09:51:50] production has its own config file, so it can be anything, that is true. If its not too hard, can we change it? [09:51:58] plus 4000 does not emit an ERR_UNSAFE_PORT for me [09:52:16] it is hard which is why I am complaining [09:52:22] and trying to find an alternative [09:52:31] akosiaris, up to you, we are ok not to change it in prod [09:52:42] i have changed it for the default configuration [09:52:48] and all docs [09:53:31] also, that patch changes varnishes and configs, so only LVS needs to change [09:53:40] https://src.chromium.org/viewvc/chrome/trunk/src/net/base/net_util.cc?view=markup [09:53:45] for the love of god [09:53:50] or whatever [09:54:01] so, I don't see 4000 or 4100 in that list [09:54:11] so, it probably is not that the problem you are facing ? 
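[annotation] The port discussion above comes down to a membership test: Chrome refuses to connect to ports on its built-in restricted list (ERR_UNSAFE_PORT) unless they are passed via --explicitly-allowed-ports. A minimal sketch of that check follows. The function name and the short port set are illustrative assumptions; the set is only an excerpt, not Chromium's full list, and the authoritative source is the net_util.cc file linked above.

```python
# Illustrative excerpt only: a handful of ports known to be on Chromium's
# restricted list (FTP, SSH, Telnet, SMTP, POP3, IMAP, X11). This is NOT the
# complete list; consult the Chromium source for that.
RESTRICTED_PORTS_EXCERPT = frozenset({21, 22, 23, 25, 110, 143, 6000})


def is_blocked(port, restricted=RESTRICTED_PORTS_EXCERPT, explicitly_allowed=()):
    """Return True if a browser applying `restricted` would refuse `port`.

    `explicitly_allowed` models the --explicitly-allowed-ports override
    mentioned in the log; the function itself is a hypothetical helper.
    """
    return port in restricted and port not in explicitly_allowed


if __name__ == "__main__":
    # Ports mentioned in the conversation: 4000 and 6000 (old), 6533 and 6534 (new).
    for candidate in (4000, 6000, 6533, 6534):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

Under this excerpt only 6000 is reported as blocked, which matches what was observed in the channel: 6000 gave ERR_UNSAFE_PORT while 4000 merely refused the connection because nothing was listening.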
[09:54:37] akosiaris, its not a big deal for prod, if its too hard, lets keep it [09:54:58] 6operations, 5Patch-For-Review: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1640897 (10MoritzMuehlenhoff) 5Open>3Resolved nf_conntrack_max has been bumped to 256k, also the hash table size was increased to 32k. I double-checked via salt that the value has properly pr... [09:55:11] i will simply use ssh -L 5633:localhost:4000 maps2001... [09:55:37] it will be simpler to cause less confusion, but not that much of a deal [09:55:49] well, it is true that there is nothing forcing us to use the default ports for services [09:56:12] and given the birthday paradox it is obvious as some point 2 services will want to use the same port [09:56:19] especially given the love for number 8 [09:56:25] like 8888, 8080 etc [09:56:56] the good thing is that its ok to bring the service down right now because of few uses [09:57:06] so if we would want t ochange it, nows the time [09:57:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:58:21] that's true [10:06:36] (03PS1) 10Muehlenhoff: Enable ferm on mw1018 [puppet] - 10https://gerrit.wikimedia.org/r/238403 [10:08:47] (03CR) 10Giuseppe Lavagetto: [C: 031] Enable ferm on mw1018 [puppet] - 10https://gerrit.wikimedia.org/r/238403 (owner: 10Muehlenhoff) [10:09:12] 6operations, 5Patch-For-Review, 7Swift: swift eqiad capacity planning - https://phabricator.wikimedia.org/T1268#1640922 (10fgiunchedi) procurement for expansion in https://rt.wikimedia.org/Ticket/Display.html?id=9624 [10:09:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Avoid including webscalesqlclient completely [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238390 (owner: 10Giuseppe Lavagetto) [10:09:38] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1640928 (10jcrespo) [10:11:59] RECOVERY - RAID on ms-be2006 is 
OK: OK: optimal, 13 logical, 13 physical [10:12:58] RECOVERY - very high load average likely xfs on ms-be2006 is OK: OK - load average: 7.42, 2.35, 0.84 [10:16:46] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1640940 (10jcrespo) The original intention of this ticket is already being discussed on T104735, that is no more relevant here. We need to update the ticket to clarify that this tick... [10:16:58] (03PS2) 10Giuseppe Lavagetto: Backport of D44265: filter_var_array: do not fall back to FILTER_DEFAULT [debs/hhvm] - 10https://gerrit.wikimedia.org/r/237861 (https://phabricator.wikimedia.org/T107677) (owner: 10BryanDavis) [10:17:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Backport of D44265: filter_var_array: do not fall back to FILTER_DEFAULT [debs/hhvm] - 10https://gerrit.wikimedia.org/r/237861 (https://phabricator.wikimedia.org/T107677) (owner: 10BryanDavis) [10:19:31] (03PS2) 10Giuseppe Lavagetto: Backport of D37899: Fix ReflectionClass::getMethods filter [debs/hhvm] - 10https://gerrit.wikimedia.org/r/237860 (https://phabricator.wikimedia.org/T95864) (owner: 10BryanDavis) [10:19:43] akosiaris, btw, tilerator still shows "tin" as one of the depl targets [10:21:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Backport of D37899: Fix ReflectionClass::getMethods filter [debs/hhvm] - 10https://gerrit.wikimedia.org/r/237860 (https://phabricator.wikimedia.org/T95864) (owner: 10BryanDavis) [10:24:03] yurik: those were fixed, so it was readded somehow [10:24:16] apergos: ^ ? 
[10:24:45] grrrr
[10:24:46] I did remove tin from the redis db and it seems to have been added somehow
[10:25:18] hmm
[10:25:19] no
[10:25:28] keys 'deploy:tilerator/deploy:minions*'
[10:25:28] 1) "deploy:tilerator/deploy:minions"
[10:25:28] 2) "deploy:tilerator/deploy:minions:maps-test2004.codfw.wmnet"
[10:25:28] 3) "deploy:tilerator/deploy:minions:maps-test2002.codfw.wmnet"
[10:25:28] 4) "deploy:tilerator/deploy:minions:maps-test2003.codfw.wmnet"
[10:25:29] 5) "deploy:tilerator/deploy:minions:maps-test2001.codfw.wmnet"
[10:25:33] so it's not there
[10:25:44] so ?
[10:26:36] where are you seeing tin as one of the targets then?
[10:27:34] yurik: ^
[10:29:46] apergos, last time i was doing git deploy sync it showed it there
[10:29:51] (which was today
[10:30:04] i had 4/5 count
[10:30:26] akosiaris, ^
[10:32:37] indeed when I do git deploy report sync --detailed I see tin
[10:33:06] probably some other key that didn't get cleared out, since all timestamps for it are None
[10:33:40] Puppet, Continuous-Integration-Config, Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1640982 (JanZerebecki) Sure you can add them. I would advise to enable the same Jenkins jobs on the submodule repo itself first (I didn't check if tha...
[10:44:40] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[10:44:52] I think you want the values in the set "deploy:tilerator/deploy:minions"
[10:45:02] that's where it gets the initial set of minions from
[10:47:00] operations, HHVM: Package and deploy HHVM 3.6.5+dfsg1-1+wm3 - https://phabricator.wikimedia.org/T112640#1640999 (Joe) NEW a:Joe
[10:48:35] (PS1) Giuseppe Lavagetto: Version Bump [debs/hhvm] - https://gerrit.wikimedia.org/r/238408 (https://phabricator.wikimedia.org/T112640)
[10:54:56] redis 127.0.0.1:6379> srem "deploy:tilerator/deploy:minions" "tin.eqiad.wmnet"
[10:54:56] (integer) 1
[10:54:56] redis 127.0.0.1:6379> smembers "deploy:tilerator/deploy:minions"
[10:54:56] 1) "maps-test2003.codfw.wmnet"
[10:54:56] 2) "maps-test2002.codfw.wmnet"
[10:54:56] 3) "maps-test2004.codfw.wmnet"
[10:54:56] 4) "maps-test2001.codfw.wmnet"
[10:55:05] if it shows up again I want to hear about it
[11:10:16] operations, Graphite, Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1641025 (Gilles) Hasn't this been implemented now? Yesterday I was using grafana and it only prompted me for credentials when I tried to save a dashboard.
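[annotation] The redis cleanup above hinged on a distinction that is easy to miss: the per-minion keys ("deploy:<repo>:minions:<host>") had already been cleared, but the host was still a member of the set stored at "deploy:<repo>:minions", which is where the initial list of targets is read from. The sketch below mirrors the srem/smembers sequence from the log. The helper name and the in-memory FakeSetStore are hypothetical and exist only so the example runs without a server; the key layout is inferred from the log output and may not cover everything the deploy tooling stores.

```python
class FakeSetStore:
    """Hypothetical in-memory stand-in for the three set commands used here."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        target = self._sets.setdefault(key, set())
        before = len(target)
        target.update(members)
        return len(target) - before  # number of newly added members

    def srem(self, key, *members):
        target = self._sets.get(key, set())
        removed = len(target & set(members))
        target.difference_update(members)
        return removed  # number of members actually removed

    def smembers(self, key):
        return set(self._sets.get(key, set()))


def drop_stale_minion(store, repo, host):
    """Remove `host` from the minion set for `repo`; report what is left.

    `store` is any object with srem() and smembers(), such as a redis-py
    client (which returns bytes unless created with decode_responses=True).
    """
    key = f"deploy:{repo}:minions"
    removed = store.srem(key, host)
    remaining = sorted(store.smembers(key))
    return removed, remaining


if __name__ == "__main__":
    store = FakeSetStore()
    store.sadd(
        "deploy:tilerator/deploy:minions",
        "tin.eqiad.wmnet",
        "maps-test2001.codfw.wmnet",
        "maps-test2002.codfw.wmnet",
    )
    print(drop_stale_minion(store, "tilerator/deploy", "tin.eqiad.wmnet"))
```

A return value of 0 for the removed count would mean the host was not in the set at all, in which case the stale target is coming from somewhere else.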
[11:13:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0] [11:13:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [11:20:09] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [11:23:58] !log depooled mw1018 (for enabling ferm) [11:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [11:26:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1018 [puppet] - 10https://gerrit.wikimedia.org/r/238403 (owner: 10Muehlenhoff) [11:27:40] PROBLEM - puppet last run on mw2026 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:22] (03CR) 10Filippo Giunchedi: [C: 031] reprepro: add new distro jessie for mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [11:42:38] !log repool mw1018 (with ferm enabled) [11:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:43:28] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1641069 (10fgiunchedi) [11:43:31] 6operations, 10RESTBase-Cassandra, 10hardware-requests, 5Patch-For-Review: codfw 3x spares for cassandra encryption testing - https://phabricator.wikimedia.org/T111382#1641067 (10fgiunchedi) 5Open>3Resolved machines are setup [11:44:33] 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1641071 (10Aklapper) >>! 
In T99132#1334919, @dr0ptp4kt wrote: > I have asked the contact at Google if a feature request could be put in to increase the sites limit and to also support the notion of granting a parti... [11:46:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:50:37] 6operations: bios defaults on new hardware orders - https://phabricator.wikimedia.org/T112627#1641077 (10fgiunchedi) [11:53:19] RECOVERY - puppet last run on mw2026 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:55:23] 6operations, 5codfw-appserver-setup, 5wikis-in-codfw: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1641087 (10Aklapper) What is left to do here? All blocker tasks are closed. [11:55:56] 6operations, 10CirrusSearch, 6Discovery, 7Documentation: Decide on and document the implementation for multi-DC CirrusSearch - https://phabricator.wikimedia.org/T105708#1641096 (10Aklapper) [12:07:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [12:13:25] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1641121 (10BBlack) >>! In T104730#1640940, @jcrespo wrote: > The original intention of this ticket is already being discussed on T104735, that is no more relevant here. We need to up... [12:17:48] (03PS1) 10BBlack: Remove wikidata CA cookie hacks [puppet] - 10https://gerrit.wikimedia.org/r/238418 (https://phabricator.wikimedia.org/T109072) [12:18:47] RECOVERY - DPKG on mc2016 is OK: All packages OK [12:29:58] (03CR) 10CSteipp: [C: 04-1] "This should be unnecessary as soon as https://gerrit.wikimedia.org/r/#/c/233091/ is merged. 
I think wikidata should be handled the same as" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [12:35:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 8 below the confidence bounds [12:39:36] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:40:27] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1641167 (10jcrespo) @BBlack, then maybe report it upstream and close it. Monitor if it can be done in the future. [12:47:38] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [12:49:16] (03PS1) 10Muehlenhoff: Enable ferm on mw1114 (API server) [puppet] - 10https://gerrit.wikimedia.org/r/238425 [12:53:33] !log depooled mw1114 (for enabling ferm) [12:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:37] (03PS1) 10Hashar: Generate HTML coverage report [tools/scap] - 10https://gerrit.wikimedia.org/r/238428 [12:56:40] (03PS1) 10Hashar: Add some coverage to scap.cdlib [tools/scap] - 10https://gerrit.wikimedia.org/r/238429 [12:57:12] (03CR) 10Hashar: "That helps tracks coverage. We can even get it published under https://integration.wikimedia.org/cover/ when patches are merged." [tools/scap] - 10https://gerrit.wikimedia.org/r/238428 (owner: 10Hashar) [12:57:38] (03CR) 10Hashar: "Not that useful but at least give some usage example and slightly increase test coverage." 
[tools/scap] - 10https://gerrit.wikimedia.org/r/238429 (owner: 10Hashar) [12:58:26] 6operations, 10RESTBase: restbase staging cluster uses the same metric name as production cluster - https://phabricator.wikimedia.org/T112644#1641198 (10fgiunchedi) 3NEW a:3fgiunchedi [12:59:39] (03PS2) 10Andrew Bogott: Turn puppet autosign back on beta/integration [puppet] - 10https://gerrit.wikimedia.org/r/238221 (https://phabricator.wikimedia.org/T112537) (owner: 10Hashar) [12:59:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1114 (API server) [puppet] - 10https://gerrit.wikimedia.org/r/238425 (owner: 10Muehlenhoff) [13:01:44] (03PS3) 10Andrew Bogott: Turn puppet autosign back on beta/integration [puppet] - 10https://gerrit.wikimedia.org/r/238221 (https://phabricator.wikimedia.org/T112537) (owner: 10Hashar) [13:03:32] (03PS1) 10Filippo Giunchedi: restbase: make statsd metric prefix configurable [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) [13:03:34] (03PS1) 10Filippo Giunchedi: restbase: override statsd metric prefix for restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) [13:05:31] (03CR) 10Andrew Bogott: [C: 032] Turn puppet autosign back on beta/integration [puppet] - 10https://gerrit.wikimedia.org/r/238221 (https://phabricator.wikimedia.org/T112537) (owner: 10Hashar) [13:10:09] 7Puppet, 10Continuous-Integration-Config: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1641274 (10zeljkofilipin) [13:11:55] !log failing over LVS service in ulsfo to secondariess (400[12] pybal stopped, traffic on jessie-based 400[34]) [13:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:19] !log repool mw1114 (with ferm enabled) [13:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:17] 6operations, 10RESTBase: enable restbase 
syslog/file logging - https://phabricator.wikimedia.org/T112648#1641285 (10fgiunchedi) 3NEW [13:21:52] (03PS3) 10Andrew Bogott: toolschecker: Added db tests [puppet] - 10https://gerrit.wikimedia.org/r/238323 (https://phabricator.wikimedia.org/T107449) [13:22:51] (03PS1) 10Hashar: contint: remove obsolete ruby related packages [puppet] - 10https://gerrit.wikimedia.org/r/238436 [13:23:18] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Added db tests [puppet] - 10https://gerrit.wikimedia.org/r/238323 (https://phabricator.wikimedia.org/T107449) (owner: 10Andrew Bogott) [13:23:41] (03CR) 10Hashar: [C: 031] "I think that was an experiment with Alexandros/Andrew/Subramanya. The packages are no more in use." [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [13:28:27] (03PS1) 10Hashar: contint: remove postgresql [puppet] - 10https://gerrit.wikimedia.org/r/238438 [13:29:15] (03CR) 10Hashar: "I think that was to potentially run MediaWiki jobs against a postgresql backend." [puppet] - 10https://gerrit.wikimedia.org/r/238438 (owner: 10Hashar) [13:29:57] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on http://tools-checker.wmflabs.org:80/nfs/home - 184 bytes in 0.016 second response time [13:31:02] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.035 second response time [13:34:29] phew, that made me faint [13:36:40] (03PS1) 10Hashar: contint: remove subversion::client [puppet] - 10https://gerrit.wikimedia.org/r/238442 [13:37:32] 7Puppet: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1641344 (10zeljkofilipin) 3NEW a:3zeljkofilipin [13:38:43] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1641354 (10zeljkofilipin) 
a:3zeljkofilipin [13:38:55] (03PS1) 10Andrew Bogott: toolschecker: Fix up db tests a bit. [puppet] - 10https://gerrit.wikimedia.org/r/238443 [13:40:33] PROBLEM - NFS read/writeable on labs instances on labstore1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on http://tools-checker.wmflabs.org:80/nfs/home - 184 bytes in 0.070 second response time [13:40:50] (03CR) 10Filippo Giunchedi: reprepro: add new distro jessie for mediawiki releases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [13:40:57] (03CR) 10Filippo Giunchedi: [C: 04-1] reprepro: add new distro jessie for mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [13:41:08] <_joe_> what's this check about? [13:41:10] <_joe_> the NFS one [13:41:15] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Fix up db tests a bit. [puppet] - 10https://gerrit.wikimedia.org/r/238443 (owner: 10Andrew Bogott) [13:41:59] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1641369 (10fgiunchedi) @dzahn 8 I think will work, +1 on changing default distro too [13:43:53] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.180 second response time [13:43:58] 7Puppet, 10Continuous-Integration-Config: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1641377 (10zeljkofilipin) > 15:41 you can start by adding a bit of doc at https://www.mediawiki.org/wiki/Continuous_integration... [13:45:26] akosiaris: any objections on the name for these pageview api restbase servers? [13:45:30] https://phabricator.wikimedia.org/T111053#1638833 [13:46:07] aqs ? [13:46:13] arg.. [13:46:31] can't it be pageview or something ? 
[13:49:57] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1641402 (10mark) >>! In T111053#1638833, @Ottomata wrote: > The specs of those are all the same. > > We'll use > > - analytics1011 > - analytics1016 > - analytics1019 > > These will be reins... [13:50:29] 7Puppet, 10Continuous-Integration-Config: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1641403 (10zeljkofilipin) @akosiaris: A quick look at the submodules (searching for `.rb` files) says there are none. Am I missing them? [13:50:33] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1641404 (10Ottomata) uh oh! [13:50:33] akosiaris: i think pageview is too limiting [13:50:39] there will be more than pageviews in here eventually [13:50:41] i can feel it [13:51:27] perhaps keep their current names? ;) [13:51:50] i'd prefer to name them something else, buuuut, i don't care very much [13:52:05] ottomata: what do you mean more than pageviews ? [13:52:08] akosiaris: name brainstorm happened here [13:52:08] https://etherpad.wikimedia.org/p/pageview_api_nodes [13:52:09] another service ? [13:52:26] it's the new rage [13:52:28] akosiaris: this is a restful service api to public analytics aggregate data sets [13:52:39] 7Puppet, 10Continuous-Integration-Config: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1641411 (10zeljkofilipin) 5Open>3Resolved I think this task is resolved. The only subtask that is created is T110019. I plan to work on it this week. If you think there... 
[13:52:42] might be a new KPI, "nr of services deployed per quarter" ;) [13:52:46] haha [13:53:13] pageview is just the big one right now [13:53:52] and, i agree, aqs isn't a great name, just the best we came up with [13:53:59] so, maybe i should just leave them with their names [13:54:42] milimetric: ^^ [13:54:57] I 'd rather we did not leave them with their names [13:55:06] as I would rather did not leave them in the analytics VLAN either [13:55:17] it's a public facing service [13:55:43] well, not public as in public IPs for the boxes, but end-users will access it [13:56:00] so the convention is going for the service/role the box has [13:56:51] ottomata: what else is to be there except pageview api in the {foreseeable, not so forseeable} future ? [13:57:13] yeah, analytics query service was thought up in that spirit (role the box has) [13:58:06] akosiaris: any analytics data we currently have in other places should be served from the same place really. But that's long term [13:58:24] edit data, event logging analyses, etc. [13:59:17] akosiaris: yeah, any aggregate public datasets that are regularly generated and supported [13:59:18] (03PS1) 10DCausse: TTMServer: enable wikimedia extra plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238446 [13:59:18] not one offs [13:59:23] milimetric, do you have any page/summary/info with all storage technologies your team are already using? [13:59:35] (03PS1) 10Muehlenhoff: Exclude DNS requests from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) [14:00:03] jynus: we're not using anything other than mysql and hdfs now [14:00:05] (03CR) 10DCausse: [C: 04-1] "Requires wikimedia extra plugin (elasticsearch) v1.7.1." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238446 (owner: 10DCausse) [14:00:39] here's an idea? 
dunno what if anything if this would go, but it is just a sample [14:00:40] http://datasets.wikimedia.org/public-datasets/ [14:00:57] milimetric, regarding your email, I got some suggestions from domas to check [14:01:10] will go back to you soon [14:01:16] ok [14:01:18] jynus: we did make a page of some candidates a while ago [14:01:18] https://wikitech.wikimedia.org/wiki/Analytics/DataStore/Evaluation [14:01:51] we need to kill mysql there [14:02:00] some presentation [14:02:01] https://docs.google.com/presentation/d/1YvswGSk7JWPbfshf3YVTLnAbG1_RQikHLZZ8sxmlX1Q/edit#slide=id.g75f4a0847_0_0 [14:02:05] maybe some answers for akosiaris in there? [14:02:10] thanks, ottomata [14:03:32] akosiaris: also, making a new wikistats (stats.wikimedia.org) is in the goals for some quarter soon, and who knows, maybe it'll be better to serve that data via an api rather than having copies of files all over the place [14:03:40] ottomata: I am this close to proposing bda => "big data API" :P [14:03:48] haha [14:03:51] nawWWW [14:04:00] let's drop the 'big' from it [14:04:40] yeah, this is not just for big data [14:05:04] and wikistats is the main goal for Q2 [14:07:17] I was joking, relax guys [14:07:31] hahahah [14:07:40] although I though big data was for everything, that's why it was called "big" [14:07:42] :P [14:08:04] the universe consists purely of information [14:08:12] information is the fundamental substance [14:08:29] information is data [14:08:32] the universe is big [14:09:09] the police officer is a trompone [14:09:31] that's my favourite example of false generalization [14:09:37] trompone is an instrument [14:09:47] <_joe_> my data are bigger than yours [14:09:51] the police officer is an instrument (of the law, but who cares) [14:09:59] so ... policer officer => trompone!!! 
[14:10:22] trombone [14:10:39] lol on many levels Nemo_bis [14:10:41] haha [14:10:49] maybe not literally kill mysql, he mentioned Presto on top of mysql [14:10:58] (03CR) 10Mobrovac: [C: 031] "This will be a no-op for prod, so can be safely +2-ed" [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [14:12:07] presto ON mysql? [14:12:14] ottomata: anyway... the good thing about aqs is that it reminds wdqs and given that both are powered in some form by java products it might not be that bad [14:12:21] (03CR) 10Mobrovac: [C: 031] restbase: override statsd metric prefix for restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [14:12:24] brb, meeting [14:12:27] so... I hate to be a party pooper but aqs? [14:12:31] ha, akosiaris i think the java part is irrelevant, but whatever :) [14:12:34] akosiaris: i want to reinstall these [14:12:38] OH, you want them out of analytics vlan [14:12:41] Uh, ok new IPs! [14:12:45] hm. [14:12:52] we lost him! [14:13:14] ottomata, actually I asked for the opposite, an OLAP on top of mysql fronted [14:13:17] aw :) but he said brb [14:13:30] ohhh, like using mysql client? 
[14:13:35] ottomata: it is irrelevant but still funny [14:13:37] ha [14:13:38] yes, [14:13:49] jynus: might be easier to talk about in a meeting yeah [14:14:07] but he could only recommend me for SQL, PResto or hive [14:14:14] "SQL-like" [14:14:23] (CR) Zfilipin: [C: 1] contint: remove obsolete ruby related packages [puppet] - https://gerrit.wikimedia.org/r/238436 (owner: Hashar) [14:14:30] jynus: impala [14:14:36] we have it installed, but not fully productionized [14:14:47] it needs some hugs [14:14:47] yes, yes, it is just that this 2 weeks are wuite impossible [14:15:00] ha, um, not sure what we are talkigna bout anymore [14:15:05] to many "high"s + a presentation offsite [14:15:41] (CR) Zfilipin: [C: 1] contint: remove subversion::client [puppet] - https://gerrit.wikimedia.org/r/238442 (owner: Hashar) [14:15:46] plus to be fair, I do not know much about requirements except at very high level [14:16:11] so it sounds wrong for me go to meetings unprepared [14:17:02] (CR) Zfilipin: [C: 1] contint: remove postgresql [puppet] - https://gerrit.wikimedia.org/r/238438 (owner: Hashar) [14:18:33] but I would really to know more [14:18:43] *love [14:18:47] (CR) Filippo Giunchedi: [C: 1] "LGTM, will merge/deploy shortly if no objections" [puppet] - https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619) (owner: Eevans) [14:18:55] jynus: this meeting is just to talk about the kinds of queries we don't want to pre-aggregate data for, because that would create too much data [14:19:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [14:19:58] (CR) Filippo Giunchedi: [C: 1] Allow full path to hosts file [tools/scap] - https://gerrit.wikimedia.org/r/238213 (owner: Thcipriani) [14:19:58] operations, ops-eqiad, netops: cr2-eqiad PEM 2 failure - https://phabricator.wikimedia.org/T112000#1641473 
(10Cmjohnson) Pickup for RMA requested for 16 Sept. [14:19:58] the first example is an endpoint that gets the top pageviews for one or all projects, by access method (desktop, mobile, app), for an arbitrary time range [14:21:00] yeah, seems reasonable [14:21:09] (03CR) 10Zfilipin: [C: 031] Enable captchas on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460) (owner: 10CSteipp) [14:21:11] jynus: we asked the analytics list and that's what they wanted, so we thought of Druid and then wanted to check with you [14:21:31] I am not familiar with it [14:21:39] that's all this first meeting would cover [14:21:58] so that is why I may be unuseful [14:22:28] plus too busy short term (some of it working on your tasks :-) ) [14:22:30] !log swapped disk on db1043 [14:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:43] thanks, Chris! [14:23:00] jynus: I figured you would want to be included in the decision though. You can learn more here or I could go over it in the meeting: http://druid.io [14:23:30] ok, no pressure, when you have some time [14:23:42] I also got some ideas to get you pure mysql (which I suppose it is something that you still want to have) [14:24:10] 6operations, 10ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1641482 (10Cmjohnson) Swapped failed disk....currently in rebuild state Enclosure Device ID: 32 Slot Number: 1 Drive's position: DiskGroup: 0, Span: 0, Arm: 1 Enclosure position: N/A Device Id: 1 WWN: 5000C50028EA0BF0 S... [14:24:12] but with higher compression rate so that I can give you x1, etc. 
[14:24:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [14:26:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [14:26:29] is there someone who can review https://gerrit.wikimedia.org/r/#/c/238413/ (changing gerrit access for a new wikidata gerrit repo) ? [14:27:02] this happens everytime there is a new gerrit repo under wikidata/* [14:27:34] * aude thinks some ops people + release engineering people can [14:33:02] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:33:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [14:35:18] (03PS1) 10Merlijn van Deen: toollabs: add python-psycopg2 [puppet] - 10https://gerrit.wikimedia.org/r/238450 [14:35:57] (03PS4) 10Filippo Giunchedi: cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [14:36:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/236391 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [14:38:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [14:41:33] (03CR) 10BBlack: "As a general rule, probably most places where we define UDP:53 for DNS, we should define TCP:53 for it as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) (owner: 10Muehlenhoff) [14:41:42] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [14:41:49] (03PS1) 10Jcrespo: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238456 (https://phabricator.wikimedia.org/T112478) [14:48:19] (03CR) 10GWicke: "We should add a similar variable prefix for logstash. In Ansible, I called this the 'cluster':" [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [14:51:06] !log bounce cassandra on test cluster to deploy https://gerrit.wikimedia.org/r/236391 [14:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:01] (03CR) 10Hoo man: [C: 04-1] "Ok, so in that case we only need to CORS part of this. Thanks for letting us know, Chris." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [14:55:52] PROBLEM - Cassandra database on xenon is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [14:57:31] RECOVERY - RAID on db1043 is OK: OK: optimal, 1 logical, 2 physical [14:57:41] PROBLEM - Cassandra database on cerium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [14:59:12] PROBLEM - Cassandra database on praseodymium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [14:59:35] that's me ^ [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150915T1500). Please do the needful. 
[15:00:04] _joe_ aude MatmaRex kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] * aude here [15:00:15] hello jouncebot. [15:00:51] * kart_ here too [15:00:52] 6operations, 7Database: New hardware for production core mysql cluster - https://phabricator.wikimedia.org/T106847#1641597 (10Aklapper) >>! In T106847#1505661, @jcrespo wrote: > We need to spec a hardware replacement to be used for production hosts and send it to @robh Has that happened in the meantime? [15:01:13] okie doke, I can SWAT this morning. looks like the only person we're missing is _joe_ . [15:01:48] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1641598 (10Aklapper) >>! In T107749#1504078, @BBlack wrote: > patch reverted. will tune this better and re-attempt later/tomorrow. @BBlack: Did that happen or did other st... [15:01:51] aude: you saw the comments on your patch I assume? [15:02:00] thcipriani: yes [15:02:05] <_joe_> hi [15:02:06] we will do them in a follow up [15:02:14] aude: kk, going to do that one first then [15:02:20] <_joe_> thcipriani: I'm here of course [15:02:21] this patch already got bigger than the original request and thus got stalled [15:02:38] 6operations: kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756 for analytics1044 and analytics1043 - https://phabricator.wikimedia.org/T107698#1641600 (10Aklapper) >>! In T107698#1503007, @Ottomata wrote: > Bugfix has been backported into Trusty. I'm upgrading the 8 newer nodes (1042-1049) now. We'll... [15:02:41] _joe_: hi! you're up next. [15:02:50] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1641601 (10BBlack) Never happened. 
As it turns out, the only "easy" way to make this work right involves a parameter that's only supported in nginx's commercial Plus varia... [15:02:54] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [15:03:17] (03Merged) 10jenkins-bot: Exclude Flow topic boards and Draft NS from Special:UnconnectedPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229197 (https://phabricator.wikimedia.org/T107927) (owner: 10Aude) [15:05:22] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:06:17] !log thcipriani@tin Synchronized wmf-config/Wikibase.php: SWAT: Exclude Flow topic boards and Draft NS from Special:UnconnectedPages [[gerrit:229197]] (duration: 00m 11s) [15:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:24] ^ aude check please [15:06:27] looks good [15:06:32] kk [15:06:34] thanks [15:07:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238108 (https://phabricator.wikimedia.org/T105378) (owner: 10Giuseppe Lavagetto) [15:07:51] (03Merged) 10jenkins-bot: poolcounter: add connect_timeout in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238108 (https://phabricator.wikimedia.org/T105378) (owner: 10Giuseppe Lavagetto) [15:08:28] (03PS1) 10Filippo Giunchedi: Revert "cassandra: updated gc settings" [puppet] - 10https://gerrit.wikimedia.org/r/238462 (https://phabricator.wikimedia.org/T106619) [15:09:19] (03PS2) 10Filippo Giunchedi: Revert "cassandra: updated gc settings" [puppet] - 10https://gerrit.wikimedia.org/r/238462 (https://phabricator.wikimedia.org/T106619) [15:09:34] !log thcipriani@tin Synchronized wmf-config/PoolCounterSettings-codfw.php: SWAT: poolcounter: add connect_timeout in codfw [[gerrit:238108]] (duration: 00m 12s) [15:09:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:47] ^ _joe_ deployed, check if possible please [15:10:57] <_joe_> thcipriani: ok I'm just checking that the health checks don't blow out [15:11:14] <_joe_> thcipriani: looks good [15:11:36] _joe_: awesome, thanks. Going forward with the eqiad poolcounter patch. [15:12:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238109 (https://phabricator.wikimedia.org/T105378) (owner: 10Giuseppe Lavagetto) [15:12:34] (03Merged) 10jenkins-bot: poolcounter: enable connect_timeout for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238109 (https://phabricator.wikimedia.org/T105378) (owner: 10Giuseppe Lavagetto) [15:12:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "cassandra: updated gc settings" [puppet] - 10https://gerrit.wikimedia.org/r/238462 (https://phabricator.wikimedia.org/T106619) (owner: 10Filippo Giunchedi) [15:14:02] 7Puppet, 10Continuous-Integration-Config: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1641643 (10akosiaris) >>! In T102020#1641403, @zeljkofilipin wrote: > @akosiaris: A quick look at the submodules (searching for `.rb` files) says there are none. Am I missin... [15:14:06] (03PS2) 10Bene: Whitelist m.wikidata.org for central auth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) [15:14:22] !log thcipriani@tin Synchronized wmf-config/PoolCounterSettings-eqiad.php: SWAT: poolcounter: enable connect_timeout for testwiki [[gerrit:238109]] (duration: 00m 19s) [15:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:34] ^ _joe_ deployed, check please. [15:16:17] <_joe_> thcipriani: seems ok, thanks [15:16:29] _joe_: ok, thank you! [15:16:38] MatmaRex: you're up. 
[15:16:47] right here [15:16:53] 6operations, 7Database: New hardware for production core mysql cluster - https://phabricator.wikimedia.org/T106847#1641654 (10jcrespo) Aklapper, 4 months to full failure? that is too much time for operator standards! :-) I designed with Sean's help an initial spec, have to ask for initial quotes still. [15:16:54] <_joe_> thcipriani: if you hear of problems editing testwiki, this might be the cause [15:17:03] <_joe_> my tests are not that professional :P [15:17:08] _joe_: noted :) [15:19:22] RECOVERY - Cassandra database on cerium is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [15:20:52] RECOVERY - Cassandra database on praseodymium is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [15:20:52] RECOVERY - Cassandra database on xenon is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [15:22:45] * MatmaRex grumbles at Wikidata hogging CI [15:23:21] * aude grumbles about not enough CI resources :) [15:23:38] (03PS1) 10Andrew Bogott: toolschecker: fixed the labsdb1005 test. [puppet] - 10https://gerrit.wikimedia.org/r/238464 [15:23:58] (03PS2) 10Andrew Bogott: toolschecker: fixed the labsdb1005 test. [puppet] - 10https://gerrit.wikimedia.org/r/238464 [15:24:00] (03CR) 10JanZerebecki: [C: 031] "Looks good, but we said it should wait until tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/238418 (https://phabricator.wikimedia.org/T109072) (owner: 10BBlack) [15:24:58] (03CR) 10Andrew Bogott: [C: 032] toolschecker: fixed the labsdb1005 test. [puppet] - 10https://gerrit.wikimedia.org/r/238464 (owner: 10Andrew Bogott) [15:25:05] * hashar blames MediaWiki test suite on Zend [15:25:11] akosiaris: we have standup now, but let us know whne you are back. i want to start reinstalling these servers today. 
[15:25:22] need to talk to you about moving them outside of analytics vlan too [15:29:19] thcipriani: (it went through) [15:29:54] MatmaRex: yup, getting it on to tin now :) [15:31:26] !log thcipriani@tin Synchronized php-1.26wmf22/extensions/UploadWizard/resources/jquery/jquery.mwCoolCats.js: SWAT: Do not fail horribly when invalid categories are passed [[gerrit:238421]] (duration: 00m 12s) [15:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:51] ^ MatmaRex sync'd! check please. [15:33:18] thcipriani: checked, works :) [15:33:27] MatmaRex: awesome. Thank you! [15:33:35] kart_: last but not least [15:33:50] time to break testwiki then. [15:34:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237327 (https://phabricator.wikimedia.org/T112498) (owner: 10KartikMistry) [15:34:22] kart_: oh good :) [15:34:38] (03Merged) 10jenkins-bot: CX: Enable suggestion for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237327 (https://phabricator.wikimedia.org/T112498) (owner: 10KartikMistry) [15:34:55] 6operations: kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756 for analytics1044 and analytics1043 - https://phabricator.wikimedia.org/T107698#1641741 (10Ottomata) 5Open>3Resolved No, but there is a backlogged task to audit these. https://phabricator.wikimedia.org/T109834 I will close this one. 
[15:37:34] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable suggestion for testwiki (part 1) [[gerrit:237327]] (duration: 00m 12s) [15:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:09] !log thcipriani@tin Synchronized wmf-config: SWAT: CX: Enable suggestion for testwiki (part 2) [[gerrit:237327]] (duration: 00m 13s) [15:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:21] ^ kart_ okie doke should be live [15:39:04] (03CR) 10Qgil: "According to http://korma.wmflabs.org/browser/scr-backlog.html, this is our second oldest changeset without a review. Is there a chance to" [puppet] - 10https://gerrit.wikimedia.org/r/62955 (owner: 10Faidon Liambotis) [15:39:41] Testing. [15:42:29] thcipriani: any error in log? [15:42:51] thcipriani: I don't see things suppose to be there (ie Suggestions) on dashboard. [15:43:05] * thcipriani looking [15:43:48] thcipriani: oh. Wait. [15:44:05] thcipriani: the code is suppose to go live later with train :/ [15:44:17] that's fine. I will test later. [15:44:19] https://logstash.wikimedia.org/#dashboard/temp/AU_RrTqHifN8qp9j-q1T [15:44:41] oh, ok, that's fine then. SWAT complete! [15:46:08] thcipriani: nothing related to this patch, so we're good. [15:46:28] kart_: kk, thanks for taking a look. 
[15:47:05] (03PS1) 10BBlack: new wmflabs cert [puppet] - 10https://gerrit.wikimedia.org/r/238467 [15:47:21] (03CR) 10BBlack: [C: 032 V: 032] new wmflabs cert [puppet] - 10https://gerrit.wikimedia.org/r/238467 (owner: 10BBlack) [15:49:41] 6operations, 7Graphite, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1641804 (10Krenair) https://gerrit.wikimedia.org/r/#/c/237448/ and then https://gerrit.wikimedia.org/r/#/c/237761/ [15:49:54] 6operations, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1641805 (10Jdforrester-WMF) [15:57:43] 6operations, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1641827 (10Reedy) >>! In T88758#1613381, @fgiunchedi wrote: > afaik this isn't blocked on operations, see https://phabricator.... [16:00:04] YuviPanda _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150915T1600). Please do the needful. [16:00:04] Krenair ostriches: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:24] Actually I don't have one, it already got merged jouncebot. [16:00:29] Get it together, geez. 
[16:00:30] :p [16:00:42] hi [16:00:52] (03PS1) 10Rush: elasticsearch: break up hiera data [puppet] - 10https://gerrit.wikimedia.org/r/238470 [16:01:41] (03PS2) 10Rush: elasticsearch: break up hiera data [puppet] - 10https://gerrit.wikimedia.org/r/238470 [16:05:16] (03PS1) 10JanZerebecki: Also check submodules with rubocop [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) [16:05:34] (03CR) 10JanZerebecki: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [16:07:06] (03PS1) 10BBlack: depool codfw T112639 [dns] - 10https://gerrit.wikimedia.org/r/238472 [16:07:20] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1641871 (10mark) So, this request seems to serve T109715, but that ticket doesn't seem to be about a temporary test. So what exactly is the goal of this... [16:08:30] (03CR) 10BBlack: [C: 032] depool codfw T112639 [dns] - 10https://gerrit.wikimedia.org/r/238472 (owner: 10BBlack) [16:11:13] (03PS3) 10Rush: elasticsearch: break up hiera data [puppet] - 10https://gerrit.wikimedia.org/r/238470 [16:11:31] !log traffic DNS depooled out of codfw for now T112639 [16:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:01] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1641889 (10demon) >>! In T102566#1630396, @BBlack wrote: > So, now we're pending on merge of those 3 and a new sec release of... [16:12:22] _joe_: doing puppetswat? 
[16:12:54] (03CR) 1020after4: [C: 032] Add --environment flag to cli.Application [tools/scap] - 10https://gerrit.wikimedia.org/r/238211 (owner: 10Thcipriani) [16:13:14] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1641895 (10Papaul) apaul, As discussed over the phone, I have updated the RMA information (RMA number R395890-1) It has been resend to logistics for processing and the RMA is already for delivery tomo... [16:14:05] Hi Krenair [16:14:17] <_joe_> YuviPanda: I was waiting for you actually :) [16:14:18] (03PS4) 10Rush: elasticsearch: break up hiera data [puppet] - 10https://gerrit.wikimedia.org/r/238470 [16:14:40] (03Merged) 10jenkins-bot: Add --environment flag to cli.Application [tools/scap] - 10https://gerrit.wikimedia.org/r/238211 (owner: 10Thcipriani) [16:14:43] (03Merged) 10jenkins-bot: Allow full path to hosts file [tools/scap] - 10https://gerrit.wikimedia.org/r/238213 (owner: 10Thcipriani) [16:14:49] _joe_: can you just go ahead this time? I just woke up brain not fully worky... [16:14:59] <_joe_> ok [16:15:12] <_joe_> Krenair: it's only you, I'm looking at the patches right now [16:15:17] ok [16:15:58] _joe_: thanks [16:17:24] (03CR) 10Chad: [C: 032] Generate HTML coverage report [tools/scap] - 10https://gerrit.wikimedia.org/r/238428 (owner: 10Hashar) [16:17:51] (03CR) 10Rush: [C: 032] elasticsearch: break up hiera data [puppet] - 10https://gerrit.wikimedia.org/r/238470 (owner: 10Rush) [16:17:57] (03Merged) 10jenkins-bot: Generate HTML coverage report [tools/scap] - 10https://gerrit.wikimedia.org/r/238428 (owner: 10Hashar) [16:18:29] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Needs rework, see my comment." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [16:18:34] (03CR) 10Rush: [V: 032] elasticsearch: break up hiera data [puppet] - 10https://gerrit.wikimedia.org/r/238470 (owner: 10Rush) [16:19:06] (03PS2) 10JanZerebecki: Also check submodules with rubocop [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) [16:19:16] (03CR) 10Chad: [C: 032] Beginnings of some scap3 documentation [tools/scap] - 10https://gerrit.wikimedia.org/r/238391 (https://phabricator.wikimedia.org/T112554) (owner: 1020after4) [16:19:31] (03Merged) 10jenkins-bot: Beginnings of some scap3 documentation [tools/scap] - 10https://gerrit.wikimedia.org/r/238391 (https://phabricator.wikimedia.org/T112554) (owner: 1020after4) [16:19:52] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, but needs fixing of the preceding change." [puppet] - 10https://gerrit.wikimedia.org/r/236556 (owner: 10Alex Monk) [16:20:03] <_joe_> Krenair: see my comments on https://gerrit.wikimedia.org/r/236555 [16:20:26] yep, dealing with them now [16:21:13] (03CR) 10Giuseppe Lavagetto: [C: 031] beta apache config: fix instances of 'wikibooks' that were copy+pasted everywhere [puppet] - 10https://gerrit.wikimedia.org/r/236563 (owner: 10Alex Monk) [16:21:43] (03PS2) 10Jcrespo: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238456 (https://phabricator.wikimedia.org/T112478) [16:22:11] (03CR) 10Giuseppe Lavagetto: [C: 031] beta apache config: more consistency for wiktionary and wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/236566 (owner: 10Alex Monk) [16:22:48] (03PS3) 10Alex Monk: beta apache config: Move wikipedia and wikibooks out of main.conf into their own files [puppet] - 10https://gerrit.wikimedia.org/r/236555 [16:22:58] (03CR) 10Giuseppe Lavagetto: [C: 031] beta apache config: remove nonsensical rewrites [puppet] - 10https://gerrit.wikimedia.org/r/236567 (owner: 10Alex Monk) [16:23:38] (03PS4) 10Giuseppe 
Lavagetto: beta apache config: Move wikipedia and wikibooks out of main.conf into their own files [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [16:24:03] (03PS3) 10Jcrespo: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238456 (https://phabricator.wikimedia.org/T112478) [16:24:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] beta apache config: Move wikipedia and wikibooks out of main.conf into their own files [puppet] - 10https://gerrit.wikimedia.org/r/236555 (owner: 10Alex Monk) [16:24:22] (03PS4) 10Giuseppe Lavagetto: beta apache config: make wikipedia.conf more consistent with the other files [puppet] - 10https://gerrit.wikimedia.org/r/236556 (owner: 10Alex Monk) [16:24:31] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] beta apache config: make wikipedia.conf more consistent with the other files [puppet] - 10https://gerrit.wikimedia.org/r/236556 (owner: 10Alex Monk) [16:24:50] (03PS3) 10Giuseppe Lavagetto: beta apache config: fix instances of 'wikibooks' that were copy+pasted everywhere [puppet] - 10https://gerrit.wikimedia.org/r/236563 (owner: 10Alex Monk) [16:24:53] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1641947 (10yuvipanda) It is temporary becayse we don't know what the actual hardware requirements in terms of memory will be with a fully replicated ind... 
[16:25:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] beta apache config: fix instances of 'wikibooks' that were copy+pasted everywhere [puppet] - 10https://gerrit.wikimedia.org/r/236563 (owner: 10Alex Monk) [16:25:14] (03PS2) 10Giuseppe Lavagetto: beta apache config: more consistency for wiktionary and wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/236566 (owner: 10Alex Monk) [16:25:20] (03PS4) 10Jcrespo: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238456 (https://phabricator.wikimedia.org/T112478) [16:25:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] beta apache config: more consistency for wiktionary and wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/236566 (owner: 10Alex Monk) [16:25:39] 6operations, 10CirrusSearch, 6Discovery, 7Documentation: Decide on and document the implementation for multi data centre CirrusSearch - https://phabricator.wikimedia.org/T105708#1641949 (10Deskana) [16:25:47] (03PS3) 10Giuseppe Lavagetto: beta apache config: remove nonsensical rewrites [puppet] - 10https://gerrit.wikimedia.org/r/236567 (owner: 10Alex Monk) [16:25:59] (03CR) 10Jcrespo: [C: 032] Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238456 (https://phabricator.wikimedia.org/T112478) (owner: 10Jcrespo) [16:26:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] beta apache config: remove nonsensical rewrites [puppet] - 10https://gerrit.wikimedia.org/r/236567 (owner: 10Alex Monk) [16:26:07] (03PS1) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [16:26:14] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 (owner: 10Rush) [16:26:18] (03PS2) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [16:26:20] 6operations, 6Discovery, 5codfw-rollout: Set up a CirrusSearch 
cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1641954 (10Deskana) [16:26:40] 6operations, 6Discovery: Rollout CirrusSearch to codfw as a backup data centre - https://phabricator.wikimedia.org/T105711#1641956 (10Deskana) [16:27:11] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 for maintenance (duration: 00m 11s) [16:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:25] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1641963 (10JanZerebecki) I missed that submodules are not checked out in that job. [16:27:55] (03CR) 10JanZerebecki: [C: 04-1] "Doesn't help as submodules are not cloned in this job." [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [16:28:42] <_joe_> Krenair: done, you will have to wait a bit to see those on beta though [16:28:53] yep [16:28:54] thanks [16:29:46] !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSuggest.js: Touch file that is serving old version in prod (duration: 00m 12s) [16:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:16] !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/WikimediaEvents.php: touch file that is serving old version in prod (duration: 00m 12s) [16:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:33] 7Puppet, 10Continuous-Integration-Config: also clone submodules in operations/puppet jobs - https://phabricator.wikimedia.org/T112670#1641972 (10JanZerebecki) 3NEW [16:31:38] (03PS1) 10Rush: elasticsearch: codfw initial node stanza [puppet] - 10https://gerrit.wikimedia.org/r/238478 [16:32:08] !log Putting wmf22 versions of Echo and MobileFrontend on mw1017 for testing 
[16:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious [16:32:31] <_joe_> Krenair: there was an issue with your patch in the last version [16:32:51] <_joe_> https://gerrit.wikimedia.org/r/#/c/236555 specifically, there is no wikibooks.conf added [16:32:53] (03CR) 10Rush: [C: 032] elasticsearch: codfw initial node stanza [puppet] - 10https://gerrit.wikimedia.org/r/238478 (owner: 10Rush) [16:32:54] <_joe_> I didn't notice [16:33:27] <_joe_> Krenair: so we need to add it again [16:33:32] oops [16:33:39] <_joe_> do you want me to do it? [16:33:41] (03PS1) 10Ottomata: Use raid1-lvm-ext4.cfg for analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/238479 (https://phabricator.wikimedia.org/T110090) [16:33:56] I don't seem to have a copy locally [16:34:02] yes please [16:34:38] (03PS2) 10Ottomata: Use raid1-lvm-ext4.cfg for analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/238479 (https://phabricator.wikimedia.org/T110090) [16:35:04] (03CR) 10Ottomata: [C: 032 V: 032] Use raid1-lvm-ext4.cfg for analytics1015 [puppet] - 10https://gerrit.wikimedia.org/r/238479 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [16:36:15] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1642002 (10EBernhardson) Yes, the initial question to answer here is if we can actually serve 2.5TB of indices out of a machine with 64GB of memory. In... [16:37:04] Hello - 503 again... [16:37:14] Hope it's ony temporary [16:38:05] ShakespeareFan00, where? 
[16:38:13] en.wikipedia.org [16:38:52] (03PS1) 10Giuseppe Lavagetto: beta apache config: Re-add wikibooks [puppet] - 10https://gerrit.wikimedia.org/r/238480 [16:39:06] (03PS1) 10Jcrespo: Revert "Depool db1055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238481 [16:39:16] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238481 (owner: 10Jcrespo) [16:39:18] <_joe_> Krenair: ^^ [16:39:31] !log reinstalling analytics1015 [16:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:45] (03CR) 10Giuseppe Lavagetto: [C: 032] beta apache config: Re-add wikibooks [puppet] - 10https://gerrit.wikimedia.org/r/238480 (owner: 10Giuseppe Lavagetto) [16:40:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Revert depool db1055 for maintenance (duration: 00m 11s) [16:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:55] ShakespeareFan00, that should fix it [16:41:19] Thanks [16:41:50] Anything more on the weekend incident? Or is that still under investigation? [16:42:39] _joe_, looks good, thanks [16:42:48] only saw 300 errors or so, althoug the depooling must be done [16:47:13] <_joe_> Krenair: the change should be in effect now [16:47:48] yep [16:48:02] I saw wikibooks stop working in beta and then come back up after you re-added the file [16:48:09] other things seem to work ok [16:48:29] 6operations, 10ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1642021 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Rebuild is finished....error has cleared. 
[16:48:52] (03PS1) 10Hashar: nodepool: bump # of instances [puppet] - 10https://gerrit.wikimedia.org/r/238491 [16:50:15] 6operations, 10ops-eqiad, 6Labs, 3Labs-Sprint-114, 3ToolLabs-Goals-Q4: Make certain ports and cables between the labstores and shelves are numbered/named and labeled, and make sure that the diagram(s) reflect that. - https://phabricator.wikimedia.org/T112549#1642029 (10coren) @cmjohnson: what I need is a... [16:52:24] bblack, hi, i was already discussing it with akosiaris earlier - is it worth changing map services ports to match their defaults to keep things simpler and less confusing when debugging? https://gerrit.wikimedia.org/r/#/c/238399/ [16:52:38] its ok to bring it down shortly [16:54:01] (03PS1) 10Jcrespo: Depool es1003, es1004, es1007 and es1010 for decommision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238494 (https://phabricator.wikimedia.org/T105843) [16:54:38] (03CR) 10Jcrespo: [C: 032] Depool es1003, es1004, es1007 and es1010 for decommision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238494 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [16:55:04] (03CR) 10Hashar: [C: 04-1] "Wanna hold cause the image creation might be broken. Have to triple check it is still running properly." [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [16:55:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1003, es1004, es1007 and es1010 for decommision (duration: 00m 12s) [16:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:59] ori: https://www.npmjs.com/package/wikipedia-telnet [17:03:47] (03CR) 10Hashar: [V: 031] "I confirmed it is working by deleting instances with:" [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [17:04:25] has anything changed with our redis servers today? 
[17:04:37] 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1642060 (10JanZerebecki) Updated https://wikitech.wikimedia.org/wiki/Puppet_coding#Roles to reflect this. [17:08:22] <_joe_> legoktm: no, why? [17:08:27] 6operations, 10hardware-requests, 7Database, 5Patch-For-Review: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1642069 (10jcrespo) This is almost completed: we just need to wait for the old server to finish processing dump queries, stop mysqls to confirm we can stop them and cl... [17:08:27] <_joe_> what's the issue you see? [17:08:31] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1642070 (10Eevans) [17:08:40] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1642074 (10jcrespo) [17:09:29] _joe_: CentralAuth's API token feature is broken https://phabricator.wikimedia.org/T112671, we're investigating in #mediawiki-core [17:09:56] <_joe_> so the session redises it would be [17:10:25] (03CR) 10Zfilipin: [C: 031] nodepool: bump # of instances [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [17:10:26] yep [17:10:43] but no one has reported any login or session failures, so it may not be the redises? [17:12:12] we switched over to redis from memcache last week, but people only started reporting that it was broken today [17:12:12] <_joe_> legoktm: I'd say not [17:12:12] (03CR) 10BBlack: "I assume this is because you want to not use 4000 when testing the software on laptops or whatever, and want to keep that consistent? 
Fro" [puppet] - 10https://gerrit.wikimedia.org/r/238399 (owner: 10Yurik) [17:12:17] ok [17:13:01] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1642099 (10Eevans) With the 3 test nodes in codfw up, we are now on step #6 above; The next step is to deploy RESTBase [[https://github.com/wikimedia/restb... [17:13:19] <_joe_> legoktm: I'm going off for now, pretty tired :) page me on the phone if needed [17:13:38] _joe_: o/ I don't think we'll need to :) [17:14:22] RECOVERY - Disk space on labstore1002 is OK: DISK OK [17:16:22] 6operations, 7Graphite, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1642117 (10Gilles) >>! In T93710#1641804, @Krenair wrote: > https://gerrit.wikimedia.org/r/#/c/237448/ and then https://gerrit.wikimedia.org/r/#/c/237761/ Thanks :) @GW... [17:24:36] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1642152 (10mark) Alright, so it sounds like this will likely end up in a future procurement request for hardware then, after this trial period is up? :) [17:25:59] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1642167 (10yuvipanda) Depends on result of this test - assuming we are actually able to get this to work with reasonable performance levels without havi... 
[17:26:57] 6operations, 10Wikimedia-Mailing-lists: Reset/resend xtools@lists.wikimedia.org admin password - https://phabricator.wikimedia.org/T112255#1642179 (10Dzahn) [17:27:36] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1642184 (10EBernhardson) there is also the possibility of using the old lsearchd cluster (but they are 1.5yrs out of warrenty), but i'm not super thrill... [17:31:55] 6operations, 10Wikimedia-Mailing-lists: Reset/resend xtools@lists.wikimedia.org admin password - https://phabricator.wikimedia.org/T112255#1642203 (10Dzahn) Done as requested and sent a new password to the list owner address. [17:32:20] 6operations, 10Wikimedia-Mailing-lists: Reset/resend xtools@lists.wikimedia.org admin password - https://phabricator.wikimedia.org/T112255#1642204 (10Dzahn) 5Open>3Resolved [17:32:54] 6operations, 10ops-eqiad, 6Labs, 3Labs-Sprint-114, 3ToolLabs-Goals-Q4: Make certain ports and cables between the labstores and shelves are numbered/named and labeled, and make sure that the diagram(s) reflect that. - https://phabricator.wikimedia.org/T112549#1642206 (10Cmjohnson) labstore1002 port 0 to l... [17:37:07] (03PS3) 10Dzahn: reprepro: add new distro jessie for mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) [17:44:58] (03CR) 10John F. Lewis: [C: 031] "SSH key looks fine. 
Account is associated with a mediawiki WMF account which is linked with a metawiki account created by "17:32, 24 Augus" [puppet] - 10https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) (owner: 10Dzahn) [17:45:59] (03PS3) 10Dzahn: admin: create shell account for Joshua Minor [puppet] - 10https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) [17:48:25] (03CR) 10Dzahn: [C: 032] admin: create shell account for Joshua Minor [puppet] - 10https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) (owner: 10Dzahn) [17:49:48] (03PS2) 10Muehlenhoff: Exclude DNS requests from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) [17:50:57] (03CR) 10Dzahn: "back on that old ticket it also says bblack wrote a Perl script for this. we should probably add that instead if we want the report again." [puppet] - 10https://gerrit.wikimedia.org/r/237865 (https://phabricator.wikimedia.org/T83158) (owner: 10Dzahn) [17:51:04] (03Abandoned) 10Dzahn: mailman: old maintenance script for list report [puppet] - 10https://gerrit.wikimedia.org/r/237865 (https://phabricator.wikimedia.org/T83158) (owner: 10Dzahn) [17:51:56] (03Abandoned) 10Dzahn: admin: optimized yuvipanda resource [puppet] - 10https://gerrit.wikimedia.org/r/237575 (owner: 10Dzahn) [17:52:33] (03CR) 10Dzahn: "@jcrespo time to re-evaluate?" [puppet] - 10https://gerrit.wikimedia.org/r/237513 (https://phabricator.wikimedia.org/T112135) (owner: 10Dzahn) [17:52:55] (03PS1) 10Rush: Adding search.svc.codfw.wmnet definition [dns] - 10https://gerrit.wikimedia.org/r/238504 [17:54:07] (03CR) 10Muehlenhoff: "I've updated this patch and I've added myself a TODO item to check the other rules." 
[puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) (owner: 10Muehlenhoff) [17:55:16] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1642391 (10DannyH) [17:55:23] (03PS2) 10Dzahn: wikistats: crons for db backup (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/236238 [17:59:18] (03PS1) 10Rush: WIP elastic: define codfw lvs [puppet] - 10https://gerrit.wikimedia.org/r/238507 [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150915T1800). Please do the needful. [18:00:32] I can't reset my mediawiki.org password from the office IP? [18:01:39] ACKNOWLEDGEMENT - Check size of conntrack table on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:01:39] ACKNOWLEDGEMENT - DPKG on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:01:39] ACKNOWLEDGEMENT - Disk space on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:01:39] ACKNOWLEDGEMENT - HTTPS on mendelevium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn T111532 - new OTRS VM [18:01:39] ACKNOWLEDGEMENT - NTP on mendelevium is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn T111532 - new OTRS VM [18:01:39] ACKNOWLEDGEMENT - OTRS SMTP on mendelevium is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn T111532 - new OTRS VM [18:01:39] ACKNOWLEDGEMENT - RAID on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
daniel_zahn T111532 - new OTRS VM [18:01:40] ACKNOWLEDGEMENT - SSH on mendelevium is CRITICAL: Server answer: daniel_zahn T111532 - new OTRS VM [18:01:40] ACKNOWLEDGEMENT - configured eth on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:01:41] ACKNOWLEDGEMENT - dhclient process on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:01:41] ACKNOWLEDGEMENT - puppet last run on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:01:42] ACKNOWLEDGEMENT - salt-minion processes on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn T111532 - new OTRS VM [18:03:02] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1642417 (10yuvipanda) Not sure if we can fit 12 machines in b row (requirement to be in labs subnet) and pretty sure we don't want to :) [18:04:36] cscott: <3 <3 <3 <3 [18:05:42] (03CR) 10Jcrespo: [C: 031] admin: add jminor to research,stats,analytics-priv [puppet] - 10https://gerrit.wikimedia.org/r/238376 (https://phabricator.wikimedia.org/T111872) (owner: 10Dzahn) [18:05:43] cscott: better ascii art: https://en.wikipedia.org/wiki/User:Dispenser [18:05:49] now we just need to find some nice machine in labs to host it [18:06:27] hahaha [18:06:48] we should totally host it on text-lb:23 [18:07:01] yes plz [18:07:14] man that would make a great apr 1st joke [18:07:16] headlines and everything [18:07:29] yes:) cooler than the star wars movie on telnet [18:08:40] (03PS2) 10Dzahn: admin: add jminor to research,stats,analytics-priv [puppet] - 10https://gerrit.wikimedia.org/r/238376 (https://phabricator.wikimedia.org/T111872) [18:08:48] (03PS1) 10Ottomata: Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie [puppet] - 
10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) [18:09:33] (03PS2) 10Ottomata: Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) [18:09:50] (03CR) 10Jcrespo: [C: 031] Revert "phab: disable tools crons" [puppet] - 10https://gerrit.wikimedia.org/r/237513 (https://phabricator.wikimedia.org/T112135) (owner: 10Dzahn) [18:09:52] (03CR) 10Rush: [C: 032] Adding search.svc.codfw.wmnet definition [dns] - 10https://gerrit.wikimedia.org/r/238504 (owner: 10Rush) [18:09:59] (03CR) 10Dzahn: [C: 032] admin: add jminor to research,stats,analytics-priv [puppet] - 10https://gerrit.wikimedia.org/r/238376 (https://phabricator.wikimedia.org/T111872) (owner: 10Dzahn) [18:10:54] (03PS2) 10Dduvall: Rename and simplify some git deploy functions [tools/scap] - 10https://gerrit.wikimedia.org/r/236241 (https://phabricator.wikimedia.org/T109514) [18:11:24] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [18:11:57] (03PS1) 10Krinkle: Reduce use of deprecated $wgStyleSheetPath. Use wgStylePath instead. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/238509 [18:11:59] (03PS1) 10Krinkle: Derive from wgExtensionAssetsPath and wgStylePath from wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238510 [18:12:01] (03PS1) 10Krinkle: Remove hardcoded $wgCanonicalServer from $wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238511 (https://phabricator.wikimedia.org/T112646) [18:12:13] (03PS3) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [18:12:21] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 (owner: 10Rush) [18:12:22] 6operations, 10ops-codfw: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1642449 (10Papaul) 3NEW a:3Papaul [18:12:23] (03PS4) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [18:13:06] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:13:40] (03PS5) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [18:13:42] (03CR) 10Faidon Liambotis: [C: 04-1] "How are the responses going to be accepted in INPUT without connection tracking then?" 
[puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) (owner: 10Muehlenhoff) [18:13:47] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 (owner: 10Rush) [18:13:53] (03PS6) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [18:14:52] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1642479 (10Eevans) [18:16:27] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1642485 (10Dzahn) merged. @jminor you should have access now. i already saw puppet create your new user on the bastion host, bast1001.wikimedia... [18:16:55] 6operations, 6Editing-Department, 6Parsing-Team, 6Services: [DRAFT] Services team goals October - December 2015 (Q2 2015/16) - https://phabricator.wikimedia.org/T111819#1642487 (10Jdforrester-WMF) [18:17:20] (03PS1) 10Andrew Bogott: Nova: remove_unused_base_images=True [puppet] - 10https://gerrit.wikimedia.org/r/238515 [18:17:22] (03PS1) 10Andrew Bogott: Explicitly disable services on labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/238516 [18:17:47] (03PS2) 10Andrew Bogott: Explicitly disable services on labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/238516 [18:17:50] (03PS2) 10Andrew Bogott: Nova: remove_unused_base_images=True [puppet] - 10https://gerrit.wikimedia.org/r/238515 [18:18:15] (03CR) 10Andrew Bogott: [C: 032] Explicitly disable services on labcontrol1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/238516 (owner: 10Andrew Bogott) [18:20:33] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1642495 (10Dzahn) The user has also been created on both stat1002 and stat1003 now. ---- Here's an example config snippet to get on stat1003, fo... [18:20:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1642497 (10Dzahn) 5Open>3Resolved [18:23:16] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [18:23:27] 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1642500 (10dr0ptp4kt) @aklapper, I'm pinging on the thread. [18:27:02] (03CR) 10Jcrespo: "Check these comments." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [18:27:34] PROBLEM - nova-conductor process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [18:29:10] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1642509 (10awight) @ellery: Can you let us know what you think about the impact on statistics? [18:30:21] 6operations, 10ops-codfw: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1642510 (10RobH) The names of the new systems will be restbase2001-2006. We'll want to spread these out across multiple rows/racks (similar to eqiad's deployment.) As such, I'm suggest... 
[18:37:41] (03CR) 10Ottomata: Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [18:38:00] (03PS3) 10Ottomata: Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) [18:38:25] ottomata: https://phabricator.wikimedia.org/T100678 ? [18:38:49] ees failing?! geez. hm. [18:39:32] 6operations, 10Analytics-Cluster, 6Analytics-Kanban: Fix llama user id - https://phabricator.wikimedia.org/T100678#1642547 (10Ottomata) [18:39:42] paravoid: forgot about it. bumping it up to kanban so grace bugs me about it [18:39:55] haha [18:39:57] okay :) [18:45:09] Hi paravoid, wmflabs had an ssl issue earlier today. I was wondering, don't you guys get spam from your ssl provider about renewals? [18:45:15] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - /archive/banner_logs is not accessible: Stale NFS file handle [18:46:13] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 3 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1642576 (10Josve05a) [18:46:13] multichill: they spam the personal email address of whoever happened to request the certificate at the time [18:46:13] (03CR) 10Yurik: "yes, just want to keep everything consistent. It is not "must have" in prod. Downtime is ok if its under an hour. 
If its easy to do, lets " [puppet] - 10https://gerrit.wikimedia.org/r/238399 (owner: 10Yurik) [18:46:17] multichill: but we're fixing it [18:46:23] bblack, ^^ [18:46:24] (03PS4) 10Ottomata: Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) [18:46:39] (03PS1) 1020after4: 1.26wmf23 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238522 [18:46:44] paravoid: https://phabricator.wikimedia.org/T112645 ? [18:46:57] (03CR) 1020after4: [C: 032] 1.26wmf23 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238522 (owner: 1020after4) [18:47:04] (03Merged) 10jenkins-bot: 1.26wmf23 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238522 (owner: 1020after4) [18:47:10] (03PS5) 10Ottomata: Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) [18:47:13] I guess multiple people were banging their head on a desk :P [18:47:22] multichill: that, plus https://phabricator.wikimedia.org/T112521 [18:47:37] multichill: plus https://phabricator.wikimedia.org/T112542 [18:48:09] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1642592 (10awight) @Jgreen mentions that the new log pipeline will be updated in realtime, so we should reconside... [18:48:15] multichill: so essentially we're adding redundancy to the system; we're going to track it both from the technical side of things (icinga alerts for everything), plus track it in our "contract expiry" calendar [18:48:57] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:49:01] paravoid: We use the double approach. 
So we have notification from our CERT provider to a central email address (which produces a ticket) and every website has an SSL check that starts to complain n days before it expires [18:49:16] (03CR) 10Ottomata: [C: 032] Puppetize MySQL/MariaDB server on analytics1015 in prep for moving Hive and Oozie [puppet] - 10https://gerrit.wikimedia.org/r/238508 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [18:49:16] right [18:49:52] For service contracts we use the calendar approach. Works quite well. [18:50:15] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - /archive/banner_logs is not accessible: Stale NFS file handle [18:50:52] Jeff_Green: ^ [18:51:06] Tracking the "main" sites is quite easy, but having a 100% coverage of all the small sites that are just "temporary test sites" is quite had [18:51:08] *hard [18:52:54] (03PS1) 1020after4: Cleanup: delete 1.26wmf12 through 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238524 [18:53:22] paravoid: Do you guys run automatic penetration testing? You can have your pentest do the checks too as an extra measure. [18:53:23] (03CR) 1020after4: [C: 032] Cleanup: delete 1.26wmf12 through 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238524 (owner: 1020after4) [18:53:28] (03Merged) 10jenkins-bot: Cleanup: delete 1.26wmf12 through 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238524 (owner: 1020after4) [18:53:49] 6operations, 10ops-eqiad, 10Traffic, 10netops: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1642611 (10Cmjohnson) I ran the cabling today and ran 1 GigE connection to B8: ge-8/0/46 to lvs1012 eth3. lvs1007:   eth0 -> asw2-a5:8 (home row)   eth1 -> asw-c8:26 3931   eth2... 
[18:54:46] !log twentyafterfour@tin Started scap: sync 1.26wmf23 to testwiki [18:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:15] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - /archive/banner_logs is not accessible: Stale NFS file handle [18:56:27] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [18:56:44] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1642618 (10demon) Working on this. Failing on the usual TLS madness. [18:56:56] ACKNOWLEDGEMENT - check_disk on barium is CRITICAL: DISK CRITICAL - /archive/banner_logs is not accessible: Stale NFS file handle Jeff_Green working on it [18:57:12] :) [18:59:14] 6operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1642625 (10Dzahn) a:3Dzahn [18:59:47] (03CR) 10Chad: [C: 031] Rename and simplify some git deploy functions [tools/scap] - 10https://gerrit.wikimedia.org/r/236241 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [19:00:25] (03PS4) 10Dzahn: reprepro: add new distro jessie for mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) [19:00:35] (03CR) 10Chad: [C: 032] Add some coverage to scap.cdlib [tools/scap] - 10https://gerrit.wikimedia.org/r/238429 (owner: 10Hashar) [19:00:38] (03CR) 10Dzahn: [C: 032] reprepro: add new distro jessie for mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [19:02:08] (03PS5) 10Dzahn: releases: add new distro jessie for mediawiki releases [puppet] - 10https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) [19:02:31] mutante: what's the list we announce to for outages impending? 
[19:03:31] chasemp: wikitech-l, ops and maybe something else depending on service? for example for mailman i added listadmins@list [19:03:38] kk [19:04:31] chasemp: if it's user facing, add the phabricator tag to a ticket and that should automagically make it show up in TechNews :) [19:04:47] #user-notice? [19:04:54] (03Merged) 10jenkins-bot: Add some coverage to scap.cdlib [tools/scap] - 10https://gerrit.wikimedia.org/r/238429 (owner: 10Hashar) [19:05:11] yes [19:05:36] Johan, the new Guillaume, looks at that [19:06:13] k tx -- search guys are going to do something that leaves the server and client side disconnected on translation stuff for a period [19:06:26] advising them to notify these mediums [19:10:25] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [19:11:14] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1642679 (10Ottomata) @akosiaris we need to know: - Is aqs100x ok for a name - What VLAN should I put these in (and how?)
[19:11:15] PROBLEM - DPKG on analytics1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:17:51] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1642698 (10RobH) a:5RobH>3None [19:18:06] (03PS3) 10Andrew Bogott: Nova: remove_unused_base_images=True [puppet] - 10https://gerrit.wikimedia.org/r/238515 [19:20:14] RECOVERY - check_disk on barium is OK: DISK OK - free space: / 25287 MB (47% inode=91%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 842214 MB (29% inode=99%): /boot 179 MB (68% inode=99%): /archive/udplogs 1062501 MB (26% inode=98%) [19:22:02] (03PS1) 10Dzahn: reprepro: switch default_distro to jessie [puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) [19:25:18] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1642734 (10Dzahn) Thanks! So i added the new distro in the releases module. I did not change the default_distro in the reprepro module just yet. [19:27:12] 6operations, 10netops: Set up NTT transit @ eqdfw, eqord - https://phabricator.wikimedia.org/T111274#1642747 (10RobH) The latest cross-connection notice shows completion of the CH2 side of the NTT patching on 2015-09-16. There was a congestion issue on the initial LoA Z side panel, so it had to be updated for... [19:27:16] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [19:27:53] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1642749 (10yuvipanda) If someone is putting machines into vlans can I watch too? 
[19:30:18] (03PS7) 10Rush: elasticsearch: add role/codfw/elasticsearch/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/238476 [19:30:25] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [19:33:16] (03CR) 10Rush: [C: 032] "compiler says it looks good, no substantive change here just shuffling" [puppet] - 10https://gerrit.wikimedia.org/r/238476 (owner: 10Rush) [19:41:16] (03PS1) 10BBlack: dhcp for lvs1012 [puppet] - 10https://gerrit.wikimedia.org/r/238528 (https://phabricator.wikimedia.org/T104458) [19:41:34] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:44:53] 6operations, 6Performance-Team: Define SLAs for media - https://phabricator.wikimedia.org/T112692#1642816 (10Gilles) 3NEW [19:45:00] 6operations, 6Performance-Team: Define SLAs for media - https://phabricator.wikimedia.org/T112692#1642823 (10Gilles) [19:47:24] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:47:35] (03PS1) 10Ottomata: Puppetize MariaDB init.d and bin in role::analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/238529 (https://phabricator.wikimedia.org/T110090) [19:47:43] (03CR) 10jenkins-bot: [V: 04-1] Puppetize MariaDB init.d and bin in role::analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/238529 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [19:47:53] (03PS2) 10Ottomata: Puppetize MariaDB init.d and bin in role::analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/238529 (https://phabricator.wikimedia.org/T110090) [19:48:10] (03PS3) 10Ottomata: Puppetize MariaDB init.d and bin in role::analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/238529 (https://phabricator.wikimedia.org/T110090) [19:48:55] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed [19:50:49] (03CR) 10Ottomata: [C: 032] Puppetize MariaDB init.d and bin in role::analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/238529 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [19:52:19] (03CR) 10Ori.livneh: [C: 032] Reduce use of deprecated $wgStyleSheetPath. Use wgStylePath instead. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238509 (owner: 10Krinkle) [19:52:46] (03Merged) 10jenkins-bot: Reduce use of deprecated $wgStyleSheetPath. Use wgStylePath instead. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238509 (owner: 10Krinkle) [19:54:06] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:54:20] (03CR) 10Ori.livneh: [C: 032] Derive from wgExtensionAssetsPath and wgStylePath from wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238510 (owner: 10Krinkle) [19:54:26] (03Merged) 10jenkins-bot: Derive from wgExtensionAssetsPath and wgStylePath from wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238510 (owner: 10Krinkle) [19:54:36] PROBLEM - puppet last run on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:08] (03CR) 10Ori.livneh: [C: 032] Remove hardcoded $wgCanonicalServer from $wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238511 (https://phabricator.wikimedia.org/T112646) (owner: 10Krinkle) [19:55:15] (03Merged) 10jenkins-bot: Remove hardcoded $wgCanonicalServer from $wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238511 (https://phabricator.wikimedia.org/T112646) (owner: 10Krinkle) [19:55:50] Krinkle: merging and syncing to mw1017 to test [19:56:03] ori: k, was just about to ask :) [19:56:33] Krinkle: k, it's live on mw1017 [19:57:20] btw, the 'outdated mwscript' on that host is fixed [19:57:30] mutante: sweet, thanks [19:57:32] scap scripts from puppet now on canary appservers [19:58:37] Krinkle: load.php urls on testwiki are still not using the mobile domain -- that's not desired, right? [19:58:54] I'm diffing responses from enwiki with the header [19:59:12] - https://en.wikipedia.org/w/load.php [19:59:16] + //en.wikipedia.org/w/load.php [19:59:17] for html [19:59:28] and - https://../static [19:59:32] + /static [20:00:04] checking mobile now [20:00:05] ah [20:00:11] * AaronSchulz wonders if there is any reason mwscript has to use zend [20:00:26] ori: It's worse. 
It used to use en.m.wikipedia.org at least in the html [20:00:39] it goes from https://en.m.wiki to //en.wiki [20:00:46] probably a faulty expandUrl somewhere [20:01:04] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed [20:03:24] ori: It did fix addSource() in startup response.. so all dynamic load.php go over current domain as intended, nice. [20:03:35] now just need to keep the html as it was :/ [20:03:51] Krinkle: I gotta walk into another meeting. Either revert, or find the bug and block deployments until you're done [20:03:55] so this doesn't get synced by accident [20:04:00] Yeah [20:04:10] I'll revert for now and play on testwiki [20:05:06] !log restarted mysql (and oozie) on analytics1027 to start mysql binlogging [20:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:13] (03PS1) 10Krinkle: Revert "Remove hardcoded $wgCanonicalServer from $wgResourceBasePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238532 [20:06:19] (03CR) 10Krinkle: [C: 032] Revert "Remove hardcoded $wgCanonicalServer from $wgResourceBasePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238532 (owner: 10Krinkle) [20:06:26] (03Merged) 10jenkins-bot: Revert "Remove hardcoded $wgCanonicalServer from $wgResourceBasePath" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238532 (owner: 10Krinkle) [20:09:36] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:39] twentyafterfour: There are uncommitted changes in wikiversions.json. Known? [20:11:27] Krinkle: yes, I'm in the middle of syncing to testwiki, I don't commit the changes and push until after testing ...and scap is taking forever today [20:11:39] Okay :) [20:11:46] 6operations, 10Wikimedia-Mailing-lists: Reset/resend xtools@lists.wikimedia.org admin password - https://phabricator.wikimedia.org/T112255#1642871 (10Lixxx235) Thanks Dzahn, appreciate it. [20:12:02] _joe_: what MW related server are still running zend? 
I'm curious what other stuff might run into the problem in https://gerrit.wikimedia.org/r/#/c/238527/ [20:12:28] twentyafterfour: I didn't realise you were still syncing. I just did a sync-common on mw1017 [20:13:13] AaronSchulz: wikitech is on zend. Videoacalers as well. Krenair had a tracking ticket somewhere [20:13:21] Tin and terbium and crons there [20:13:28] I guess zend can just be upgraded [20:13:40] well, not even that, but just the ext [20:15:03] Krinkle: I wonder if that's why my sync seems to have frozen with 35 remaining hosts [20:15:04] snapshots are all still zend [20:15:10] tin and terbium are both zend [20:15:26] silver (wikitech) runs zend but it's 5.5 not 5.3 [20:16:01] eqiad videoscalers (tmh*) are zend 5.3 too [20:16:23] twentyafterfour, proxy failure? [20:17:22] Krenair: perhaps [20:17:39] AaronSchulz, basically see https://phabricator.wikimedia.org/T86081#1636920 [20:17:44] !log twentyafterfour@tin scap aborted: sync 1.26wmf23 to testwiki (duration: 82m 58s) [20:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:58] !log twentyafterfour@tin Started scap: sync 1.26wmf23 to testwiki, again [20:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:28] 6operations: Upgrade phpredis client on zend - https://phabricator.wikimedia.org/T112694#1642886 (10aaron) 3NEW [20:19:22] (03PS1) 10Ori.livneh: Increase Varnish's `shm_reclen` from 1024 to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/238536 (https://phabricator.wikimedia.org/T112002) [20:19:44] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed [20:20:09] SSH_AUTH_SOCK=/run/keyholder/proxy.sock dsh -F 20 -M -g mediawiki-installation -r ssh -o -oUser=mwdeploy -- "php --version | head -n 1" | grep -v HipHop [20:20:15] on tin/mira [20:21:37] 6operations: Upgrade phpredis client on zend - https://phabricator.wikimedia.org/T112694#1642904 (10Krenair) [20:22:58] mw1010 is the proxy that's failing to scap 
[20:23:03] in total it's 9 of four hundred and something [20:24:09] twentyafterfour, can ping, no ssh though? [20:24:35] well, it's hung on /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -lmwdeploy mw1010.eqiad.wmnet /srv/deployment/scap/scap/bin/sync-common --no-update-l10n [20:24:55] PROBLEM - RAID on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:28] oh, I'm in, it's just very slow. [20:27:53] yeah something is going on ... and I can't log in there to see for myself [20:28:32] it's a jobrunner [20:29:13] memory is almost fully in use, lots of rsync processes... [20:30:12] are the ones that are failing the canary appservers? [20:30:16] PROBLEM - SSH on mw1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:48] re: 9 of four hundred [20:31:14] that's a separate discussion mutante [20:31:34] mw1010 is failing and it's a jobrunner. previous discussion was about zend (9 hosts run it) [20:32:55] ok! [20:32:59] looking at mw1010 [20:33:02] thanks [20:33:35] RECOVERY - SSH on mw1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [20:34:34] PROBLEM - DPKG on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:34:37] icinga-wm, you say that.. but .. [20:34:53] ori: https://github.com/cscott/wikipedia-telnet now has some new hotness yoinked from [[User:Dispenser]] [20:35:04] PROBLEM - salt-minion processes on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:04] PROBLEM - nutcracker port on mw1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:05] ori: I've applied https://gerrit.wikimedia.org/r/#/c/238537/ and re-applied https://gerrit.wikimedia.org/r/#/c/238511/ on mw1017. Mobile is now working as expected there I believe. [20:35:13] I see some "Could not update user with ID '5731961'; DB is read-only." 
[20:35:34] certainly something unusual going on in this host mutante - https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [20:36:22] looks like it's been going for about two hours [20:36:37] could not get shell via ssh, tried via mgmt [20:36:38] I'll schedule for the next SWAT since there's stuff going on [20:36:44] RECOVERY - nutcracker port on mw1010 is OK: TCP OK - 0.000 second response time on port 11212 [20:36:45] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:37:18] login timed out [20:37:30] it took me forever to log in but it did eventually work mutante [20:37:50] arr,yea, but in my case only to be disconnected again [20:39:58] (03PS1) 10Krinkle: Remove hardcoded $wgCanonicalServer from $wgResourceBasePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238543 (https://phabricator.wikimedia.org/T106966) [20:40:02] ..now [20:41:15] /var/log/mediawiki/jobrunner.log show a single HTTP 503 response but so does the same file on mw1009 which as far as we know is behaving itself [20:43:54] RECOVERY - RAID on mw1010 is OK: OK: no RAID installed [20:44:37] did you do something mutante? [20:44:44] RECOVERY - DPKG on mw1010 is OK: All packages OK [20:44:53] Uncaught exception: HHVM no longer supports the built-in webserver as [20:45:01] i got on and got to restart hhvm [20:45:32] twentyafterfour, are you still trying to sync? [20:47:08] !log mw1010 - extremely slow,finally got on and attempted to restart hhvm. 
load going down [20:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:13] rsync still running [20:49:16] the uncaught exception thing was just because i used the init script [20:49:25] start: Job is already running: hhvm [20:50:45] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:50:59] even MORE rsync now [20:52:18] but looks better [20:52:33] yes, it does [20:53:19] (03CR) 10Ori.livneh: "Cherry-picked on beta puppetmaster and applied on beta varnishes." [puppet] - 10https://gerrit.wikimedia.org/r/238536 (https://phabricator.wikimedia.org/T112002) (owner: 10Ori.livneh) [20:53:30] bblack: ^ [20:54:02] i can SSH to it relatively normal again, so i'll just let it do its thing [20:54:46] ah, and the rsync seems also done now [20:54:52] load further down. ok. [20:56:00] scap reports sync failed on mw1010 [20:56:20] can you do it again for just mw1010? [20:56:53] now mwdeploy doing python stuff [20:56:59] as opposed to rsync before [20:58:00] I can just run the whole thing again [20:58:26] does it take long? [20:58:37] since that was one of the proxies, 35 mediawikis never got sync'd [20:59:26] i saw the rsync running and then finish,but i guess we have to [21:00:13] didnt kill anything, just the hhvm restart [21:00:22] ori: I really think we have to raise another parameter alongside it, I just need to dig through notes and google search and remember which [21:00:45] one of the ones that controls overall shm workspace, which we've bumped to avoid segfault before and are probably barely-ok on as it is [21:01:58] bblack: right -- https://www.varnish-cache.org/trac/ticket/1055#comment:1 [21:03:55] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:03:56] bblack: it would be very nice to have (it would allow us to collect a few more timing metrics) but we can live without it, so if having to worry about this elevates your overall stress levels, it'd be OK to punt it for now. [21:05:25] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [21:05:47] !log twentyafterfour@tin Finished scap: sync 1.26wmf23 to testwiki, again (duration: 47m 49s) [21:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:00] !log twentyafterfour@tin Started scap: sync 1.26wmf23 to testwiki, once more because mw1010 overloaded [21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:10] (03CR) 10GWicke: [C: 031] reprepro: switch default_distro to jessie [puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [21:10:52] !log twentyafterfour@tin Finished scap: sync 1.26wmf23 to testwiki, once more because mw1010 overloaded (duration: 03m 52s) [21:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:29] (03CR) 10GWicke: "Actually, after grepping the puppet repo I'm not longer sure where the default is overridden for apt.wm.org. 
@Daniel, are you sure that th" [puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [21:12:31] (03PS1) 1020after4: group0 wikis to 1.26wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238604 [21:12:42] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238604 (owner: 1020after4) [21:12:48] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238604 (owner: 1020after4) [21:13:28] (03CR) 10Dzahn: "No, thought pretty much the same and that's why i made it a separate change from adding the repo in the releases module and wanted to wait" [puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [21:13:39] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf23 [21:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:35] train all finished, finally everything worked. 1.26wmf23 is now live on mediawiki.org and testwiki [21:15:08] Krenair: things are back to normal and synced ^ [21:15:52] !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/: touch files edited in I0cb6fe37e and re-sync to cluster (duration: 00m 13s) [21:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:54] (03PS1) 10Rush: elastic: specify site for rack regex [puppet] - 10https://gerrit.wikimedia.org/r/238606 [21:25:35] (03CR) 10Rush: [C: 032] elastic: specify site for rack regex [puppet] - 10https://gerrit.wikimedia.org/r/238606 (owner: 10Rush) [21:25:46] saving edits on mediawiki.org seems to be very slow at the moment [21:27:28] (03CR) 10Hashar: "The rubocop job is made to ignore submodules. 
For linting we can consider them as random third party libs that do not adhere to the same " [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [21:38:37] (03PS1) 10Rush: elastic: add row/rack designations for codfw [puppet] - 10https://gerrit.wikimedia.org/r/238612 [21:39:15] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:35] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [21:41:04] (03CR) 10Rush: [C: 032] elastic: add row/rack designations for codfw [puppet] - 10https://gerrit.wikimedia.org/r/238612 (owner: 10Rush) [21:41:17] 6operations, 10ops-codfw, 5Patch-For-Review: rack & initial setup of elastic2001-2024 - https://phabricator.wikimedia.org/T111080#1643169 (10chasemp) [21:52:00] twentyafterfour: legoktm asked you yesterday to pin Echo to wmf21 but that doesn't seem to have happened? [21:55:39] RoanKattouw: I guess I forgot? [21:56:39] OK [21:57:00] No worries [21:57:15] I'm a bit swamped right now but maybe legoktm can help when he's back later [21:57:44] Basically what we need is for Echo to be rolled back to the wmf21 version and for lego's patch at the top of MobileFrontend wmf22 to be cherry-picked to wmf23 [21:59:53] (03PS11) 10Dzahn: move mediawiki maintenance scripts to module [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) [22:00:34] (03PS1) 10Rush: elasticsearch: apply elasticsearch::server role to codfw [puppet] - 10https://gerrit.wikimedia.org/r/238616 [22:01:42] (03PS12) 10Dzahn: move mediawiki maintenance scripts to module [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) [22:05:17] (03PS2) 10Rush: WIP elasticsearch: apply elasticsearch::server role to codfw [puppet] - 10https://gerrit.wikimedia.org/r/238616 [22:08:51] 6operations, 10ops-codfw, 5Patch-For-Review: rack & initial setup of elastic2001-2024 - 
https://phabricator.wikimedia.org/T111080#1593759 (10chasemp) [22:08:52] 6operations, 6Discovery, 5codfw-rollout: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1643261 (10chasemp) [22:08:54] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1643256 (10chasemp) 5Open>3Resolved requested and answered see T111080 [22:11:17] hi mehrrit-wm [22:11:33] (03CR) 10Dzahn: "checked in compiler, here you can see the diff is just resource names: http://puppet-compiler.wmflabs.org/891/terbium.eqiad.wmnet/ misc:" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [22:11:34] (03CR) 10Dzahn: "checked in compiler, here you can see the diff is just resource names: http://puppet-compiler.wmflabs.org/891/terbium.eqiad.wmnet/ misc:" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [22:11:41] haha [22:11:47] wondeerrrful [22:11:55] (mehrrit-wm is grrrit-wm running in a docker container locally) [22:12:01] now to make that happen on k8s [22:12:11] bleh, git deploy sync takes ~5 min to sync 4 servers (( [22:12:20] ~lolrrit@ :) [22:13:11] apergos, is there a reason deploy takes so long now? [22:13:11] RoanKattouw: I can do that [22:13:27] twentyafterfour: OK, thanks [22:15:16] RoanKattouw: echo should be wmf21 not 22? [22:16:08] 6operations, 3Discovery-Maps-Sprint: Kartotherian git deploy service restart failed with perm error - https://phabricator.wikimedia.org/T112707#1643288 (10Yurik) 3NEW a:3akosiaris [22:16:43] looks like wmf22 and wmf21 are essentially the same [22:16:46] twentyafterfour: Yeah. 
If you run git log in php-1.26wmf22 you should see Kunal's commit that rolls it back to wmf21 [22:16:49] Yup exactly [22:16:51] where do we set $::mediawiki::users::web [22:18:26] (03PS1) 10Yuvipanda: k8s: Make docker service require flannel service [puppet] - 10https://gerrit.wikimedia.org/r/238620 [22:18:27] (03PS1) 10Yuvipanda: k8s: Make docker service require flannel service [puppet] - 10https://gerrit.wikimedia.org/r/238620 [22:18:29] woo [22:21:38] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1643318 (10GWicke) We have had this discussion a few times now. The basic issue of local logging being blocking (and thus running the risk of taking out a service when the disk fills up) has not been resolv... [22:22:29] RoanKattouw: that commit doesn't cherry-pick cleanly: there is a path conflict on resources/mobile.special.notifications.scripts/notifications.js [22:22:37] (the commit on MobileFrontend) [22:23:25] (03CR) 10Dzahn: [C: 032] "finally :) confirmed in compiler, checking on terbium, only applied there. 
killing a big thing from "misc"" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [22:26:17] Oh, great [22:26:25] Let me check [22:28:44] (03CR) 10Dzahn: "the crontabs of mwdeploy and www-data are identical before and after" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [22:29:44] ^ had that waiting since 2014, hope you like [22:30:08] instead of one huge maintenance.pp in misc it's now in the module and one file per job [22:30:52] way less manifests/misc/ [22:31:24] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/238625/1 [22:31:51] I think the conflict is ignorable [22:31:58] mutante: nice [22:32:00] Thanks [22:32:03] Yes, I just figured that out [22:32:08] The file was deleted between wmf22 and wmf23 [22:32:15] And changes to it can just be ignored [22:33:30] 6operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1643358 (10Dzahn) I merged this: https://gerrit.wikimedia.org/r/178873 compiler result was here: http://puppet-compiler.wmflabs.org/891/terbium.eqiad.wmnet/ also checked the run on terbium and compa... 
[22:34:50] RoanKattouw: oh crap, I forgot to ask twentyafterfour to do that, sorry :< [22:34:59] (03PS1) 10Dzahn: delete misc/maintenance.pp, now empty [puppet] - 10https://gerrit.wikimedia.org/r/238626 [22:35:32] Oh lol [22:35:44] twentyafterfour: Sorry for accusing you of forgetting, looks like you were never asked in the first place [22:36:11] lol you convinced me that I must have forgot ;) [22:36:21] wouldn't be too surprising [22:37:07] https://gerrit.wikimedia.org/r/#/c/238621/ [22:37:19] (03PS2) 10BBlack: Increase Varnish's `shm_reclen` from 1024 to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/238536 (https://phabricator.wikimedia.org/T112002) (owner: 10Ori.livneh) [22:40:36] 6operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1643372 (10Dzahn) all the maintenance jobs are here now: ``` /puppet/modules/mediawiki/manifests/maintenance$ ls cleanup_upload_stash.pp purge_checkuser.pp translationnotifications.pp updatetransl... [22:43:12] (03PS2) 10Dzahn: delete misc/maintenance.pp, now empty [puppet] - 10https://gerrit.wikimedia.org/r/238626 (https://phabricator.wikimedia.org/T88597) [22:44:05] (03PS3) 10Dzahn: delete misc/maintenance.pp, now empty [puppet] - 10https://gerrit.wikimedia.org/r/238626 (https://phabricator.wikimedia.org/T88597) [22:44:32] (03CR) 10Dzahn: [C: 032] delete misc/maintenance.pp, now empty [puppet] - 10https://gerrit.wikimedia.org/r/238626 (https://phabricator.wikimedia.org/T88597) (owner: 10Dzahn) [22:45:48] 6operations, 5Patch-For-Review: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1643392 (10Dzahn) 5Open>3Resolved [22:46:03] 6operations: move misc mw maintenance scripts into mw puppet module - https://phabricator.wikimedia.org/T88597#1015891 (10Dzahn) [22:46:17] RoanKattouw: want me to deploy that? 
[22:46:24] Yes please [22:46:27] Along with the Echo-rollback [22:46:30] (The two need each other) [22:46:31] ok :) [22:57:51] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1643453 (10GWicke) Nice work, @fgiunchedi! [22:57:59] (03PS1) 10Dduvall: Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) [22:58:22] (03CR) 10jenkins-bot: [V: 04-1] Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) (owner: 10Dduvall) [22:59:01] (03PS2) 10Dduvall: Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) [22:59:24] (03CR) 10jenkins-bot: [V: 04-1] Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) (owner: 10Dduvall) [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150915T2300). [23:00:04] Krinkle James_F MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:42] (03PS3) 10Dduvall: Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) [23:01:04] * MaxSem is here [23:01:22] hey [23:03:18] Krinkle, one of your patches is not merged on master... [23:03:19] (03PS1) 10Dzahn: misc/scripts: remove 'scheduledowntime' [puppet] - 10https://gerrit.wikimedia.org/r/238633 [23:03:30] Krenair: Yeah, I'm aware. 
[23:04:04] (03CR) 10Mobrovac: [C: 031] Execute distinct stages of deployment separately (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) (owner: 10Dduvall) [23:04:39] (03CR) 10Dzahn: [C: 032] misc/scripts: remove 'scheduledowntime' [puppet] - 10https://gerrit.wikimedia.org/r/238633 (owner: 10Dzahn) [23:05:31] (03CR) 10Dduvall: Execute distinct stages of deployment separately (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) (owner: 10Dduvall) [23:07:29] (merged https://gerrit.wikimedia.org/r/#/c/238543/ - what's up with grrrit-wm?) [23:08:09] it got a twin [23:08:46] /last mehrrit [23:08:47] (03CR) 10Mobrovac: [C: 031] Rename and simplify some git deploy functions [tools/scap] - 10https://gerrit.wikimedia.org/r/236241 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [23:09:05] mutante: I'm still working on it :) [23:09:07] Krenair: Beware that config patch must go after the backports in core are out [23:09:24] mutante: need to figure out a proper way to make it read ssh key without NFS [23:09:27] yeah, just realised that :/ [23:09:35] and then I'll switch grrrit-wm to run from kubernetes [23:10:11] YuviPanda: yep,it was more to let Krenair know [23:12:25] anyone deploying? I've got the Echo and MobileFrontend stuff staged and ready [23:12:29] I am [23:12:41] I hope you haven't touched the deployment branches, twentyafterfour [23:13:26] Krenair: just merged those patches on 1.26wmf23 [23:13:27] I will have some late CA changes for swat [23:13:38] twentyafterfour, why? [23:14:28] Krenair: because RoanKattouw asked ^^ [23:14:55] did I mess you up somehow? 
[23:16:35] !log krenair@tin Synchronized php-1.26wmf23/extensions/EventLogging/modules/ext.eventLogging.core.js: https://gerrit.wikimedia.org/r/#/c/238513/ (duration: 00m 12s) [23:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:48] Krinkle, ^ [23:17:00] thx, checking [23:17:23] if that's fine I'll run it on wmf22 [23:18:15] (03PS1) 10Dzahn: moved 'upgrade-helper' script over from puppet repo [software] - 10https://gerrit.wikimedia.org/r/238639 [23:18:24] Krenair: confirmed [23:18:37] great [23:19:06] (03PS1) 10Dzahn: files/misc: delete upgrade-helper script [puppet] - 10https://gerrit.wikimedia.org/r/238640 [23:19:09] What's this? [23:19:14] Project: mediawiki/extensions/WikidataPageBanner 4f9ee787bc6321d19f955be9811173b43db47cc4 [23:19:21] twentyafterfour, do you know anything about that? [23:20:23] (03CR) 10Dzahn: [C: 032] "just moved around for right now" [software] - 10https://gerrit.wikimedia.org/r/238639 (owner: 10Dzahn) [23:20:40] twentyafterfour: Krenair : It's this one https://gerrit.wikimedia.org/r/#/c/237302/ [23:20:44] It wasn't deployed? [23:21:04] (03CR) 10Dzahn: [C: 032] files/misc: delete upgrade-helper script [puppet] - 10https://gerrit.wikimedia.org/r/238640 (owner: 10Dzahn) [23:21:09] No [23:21:17] I'm going to merge it onto the branch on tin, twentyafterfour [23:21:29] Next time though, I will revert every single undeployed commit I can find on a deployment branch. [23:21:32] Krenair: yeah I was gonna deploy that also but my window ran out and swat started [23:21:41] (/me didn't realize it was so late) [23:21:45] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[23:21:59] Krenair: merge it yes please [23:22:55] !log krenair@tin Synchronized php-1.26wmf22/extensions/EventLogging/modules/ext.eventLogging.core.js: https://gerrit.wikimedia.org/r/#/c/238512/ (duration: 00m 12s) [23:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:01] Krinkle, ^ [23:23:27] (03CR) 10Dzahn: [V: 032] moved 'upgrade-helper' script over from puppet repo [software] - 10https://gerrit.wikimedia.org/r/238639 (owner: 10Dzahn) [23:23:32] I need to sync Echo and MobileFrontend (together) when you're all done [23:23:36] ok [23:23:37] Krenair: confirmed on enwiki [23:23:50] ok, moving on to your core patch and then afterwards we'll do the config change [23:23:56] wmf23 first [23:24:10] !log deployed kartotherian & tilerator [23:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:20] Well. [23:24:26] When jenkins finishes... [23:25:34] (03PS6) 10Dzahn: Revert "phab: disable tools crons" [puppet] - 10https://gerrit.wikimedia.org/r/237513 (https://phabricator.wikimedia.org/T112135) [23:26:46] (03CR) 10Dzahn: [C: 032] Revert "phab: disable tools crons" [puppet] - 10https://gerrit.wikimedia.org/r/237513 (https://phabricator.wikimedia.org/T112135) (owner: 10Dzahn) [23:29:33] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1643544 (10Dzahn) re-enabled the 2 Bugzilla related crons: Notice: /Stage[main]/Phabricator::Tools/Cron[bz_comment_update]/ensure: created Notice: /Stage[main]/P... [23:31:12] While I wait I should probably be getting the VE-MW submodule update ready... 
[23:36:47] ok [23:38:54] !log krenair@tin Synchronized php-1.26wmf23/includes/resourceloader/ResourceLoader.php: https://gerrit.wikimedia.org/r/#/c/238545/ (duration: 00m 11s) [23:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:49] looks fine [23:40:42] !log krenair@tin Synchronized php-1.26wmf22/includes/resourceloader/ResourceLoader.php: https://gerrit.wikimedia.org/r/#/c/238544/ (duration: 00m 11s) [23:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:55] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:42:07] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/238543/ (duration: 00m 12s) [23:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:33] !log krenair@tin Synchronized wmf-config/mobile.php: https://gerrit.wikimedia.org/r/#/c/238543/ (duration: 00m 14s) [23:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:47] Krinkle, ^ please check [23:42:56] checking now [23:43:07] already tested earlier on mw1017 but maybe missed something [23:43:45] (03Abandoned) 10Dzahn: add script to flush all iptables rules for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/228137 (owner: 10Dzahn) [23:43:57] (03PS1) 10Yuvipanda: k8s: Default ssl usergroup to be root than kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/238644 [23:43:59] (03PS1) 10Yuvipanda: tools: Add k8s bastion class [puppet] - 10https://gerrit.wikimedia.org/r/238645 [23:44:13] (03CR) 10Dzahn: "@Muehlenhoff should we still add something like this?" [puppet] - 10https://gerrit.wikimedia.org/r/228137 (owner: 10Dzahn) [23:44:20] Krenair: Looks fine, though I see I missed an edge case. I'll push a fix later. 
[23:44:27] ok [23:44:51] (03Restored) 10Dzahn: add script to flush all iptables rules for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/228137 (owner: 10Dzahn) [23:45:44] (03PS2) 10Yuvipanda: k8s: Make docker service require flannel service [puppet] - 10https://gerrit.wikimedia.org/r/238620 [23:45:52] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Make docker service require flannel service [puppet] - 10https://gerrit.wikimedia.org/r/238620 (owner: 10Yuvipanda) [23:46:01] (03PS2) 10Yuvipanda: k8s: Default ssl usergroup to be root than kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/238644 [23:46:09] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Default ssl usergroup to be root than kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/238644 (owner: 10Yuvipanda) [23:46:18] (03PS2) 10Yuvipanda: tools: Add k8s bastion class [puppet] - 10https://gerrit.wikimedia.org/r/238645 [23:46:27] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add k8s bastion class [puppet] - 10https://gerrit.wikimedia.org/r/238645 (owner: 10Yuvipanda) [23:48:06] !log krenair@tin Synchronized php-1.26wmf23/extensions/WikimediaEvents/modules/ext.wikimediaEvents.geoFeatures.js: https://gerrit.wikimedia.org/r/#/c/238618/ (duration: 00m 12s) [23:48:08] MaxSem, ^ please check [23:48:57] thx, but I probably need to wait a bit for caches to expire... [23:51:14] !log krenair@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/modules/ext.wikimediaEvents.geoFeatures.js: https://gerrit.wikimedia.org/r/#/c/238617/ (duration: 00m 12s) [23:54:40] (03PS2) 10Dzahn: add script to flush all iptables rules for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/228137 [23:55:59] (03Abandoned) 10Dzahn: Revert "contint: Don't include base firewall by default" [puppet] - 10https://gerrit.wikimedia.org/r/223234 (https://phabricator.wikimedia.org/T104939) (owner: 10Dzahn) [23:58:39] Krenair: are you done deploying? 
[23:58:49] no [23:59:15] still waiting for jenkins to merge the last patch [23:59:27] started 8 minutes ago :(