[00:01:32] (03PS2) 10Dzahn: gerrit: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) [00:01:49] (03CR) 10Dzahn: [C: 032] gerrit: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [00:04:01] (03PS3) 10Dzahn: deactivate webhostingwikipedia.com [dns] - 10https://gerrit.wikimedia.org/r/243970 [00:04:26] (03CR) 10Dzahn: [C: 032] deactivate webhostingwikipedia.com [dns] - 10https://gerrit.wikimedia.org/r/243970 (owner: 10Dzahn) [00:06:13] (03CR) 10Dzahn: "@JanZerebecki everything needs to be https-only :)" [dns] - 10https://gerrit.wikimedia.org/r/244103 (owner: 10Dzahn) [00:07:34] (03CR) 10Dzahn: "re: "not paying for it anymore" i don't know, that's a question for legal since we don't own the budget. i know in some cases it's not rea" [dns] - 10https://gerrit.wikimedia.org/r/244103 (owner: 10Dzahn) [00:10:30] (03CR) 10Dzahn: "could you do the manual rebase? was about to merge and looked good in compiler but something changed meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/244699 (owner: 10John F. Lewis) [00:11:38] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1737094 (10Dzahn) made this an access-request [00:24:41] (03PS2) 10Dzahn: wikitech: add SSL cert expiry monitoring [puppet] - 10https://gerrit.wikimedia.org/r/244610 (https://phabricator.wikimedia.org/T114059) [00:28:43] (03CR) 10Dzahn: wikitech: add SSL cert expiry monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244610 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [00:30:57] (03CR) 10Dzahn: "eh.. why don't i see this created in icinga yet. 
it should" [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [00:31:04] (03CR) 10Dzahn: "eh.. why don't i see this created in icinga yet. it should" [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [00:31:27] krrrit-wm1: shhh, you are a dupe [00:32:39] YuviPanda, ^ [00:33:00] (03CR) 10Dzahn: "here it is:" [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [00:33:34] yeah I'm trying to make it pull latest docker image [00:33:37] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1737168 (10Dzahn) check forgerrit cert added: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ytterbium&service=HTTPS "SSL OK - Certificate gerrit... [00:33:37] it's not co-operating [00:34:48] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1737169 (10Dzahn) a:3Dzahn [00:35:57] 6operations, 5Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1737183 (10Dzahn) p:5Triage>3Normal group added, but needs to be added to hosts. 
and there is the question on https://gerrit.wikimedia.org/r/#/c/246850/ [00:36:24] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1737186 (10Dzahn) p:5Triage>3Normal [00:37:13] legoktm: try now [00:37:43] YuviPanda: no notifications from https://gerrit.wikimedia.org/r/#/c/247034/7 at all [00:37:53] legoktm: wait [00:37:54] yeah [00:37:56] missed something [00:38:28] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1737189 (10Dzahn) @VBaranetsky Hi, any response from Doneva yet? [00:41:45] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [00:46:54] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [00:59:17] (03PS1) 10Yurik: Switch graphoid to the local restbase proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247494 [01:07:28] 6operations, 7Regression: [Regression] 404 Not Found: https://en.wikipedia.org/apple-touch-icon.png - https://phabricator.wikimedia.org/T115965#1737218 (10Krinkle) 3NEW [01:08:03] mutante|away: ^ [01:09:32] (03PS1) 10Ori.livneh: Tiny tweak to Grafana header text [puppet] - 10https://gerrit.wikimedia.org/r/247495 [01:09:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Tiny tweak to Grafana header text [puppet] - 10https://gerrit.wikimedia.org/r/247495 (owner: 10Ori.livneh) [01:12:24] (03PS1) 10Krinkle: grafana: Remove hardcoded text color from home page [puppet] - 10https://gerrit.wikimedia.org/r/247496 [01:12:50] (03CR) 10Ori.livneh: [C: 032 V: 032] "Well, OK :D" [puppet] - 10https://gerrit.wikimedia.org/r/247496 (owner: 10Krinkle) [01:15:21] Krinkle: {{done}} [01:18:39] Krinkle: do you know if we have a generic wikimedia e-mail account that has a gravatar set? 
[01:18:55] hello [01:18:56] I want the gravatar for the anonymous account on grafana to be the WMF logo, but root@wikimedia.org does not have a gravatar [01:19:07] ori: maybe noc@ ? [01:19:12] * ori tries [01:19:13] If not, we can fix it to have that gravatar. [01:19:38] nope [01:20:23] gerritadmin@wikimedia.org has it [01:21:58] Ha, OK [01:24:18] hello ori [01:24:22] hello Krinkle [01:24:51] hi cortex [01:28:00] i have one question [01:28:17] Please volunteer to host a mirror if you have access to sufficient storage and bandwidth. [01:28:27] how much storage and bandwidth? [01:28:48] https://dumps.wikimedia.org/ [01:29:54] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps [01:29:55] or better [01:31:19] apergos ^ [01:31:19] because the location is only in USA or brazil [01:31:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [01:33:29] we need to find something in europe [01:33:31] also [01:33:58] for faster download [01:36:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [01:41:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [01:46:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [02:33:11] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1737270 (10ellery) Is this task being tracked on two tickets? Anyway, you can make the change as far as I'm conce... 
[02:38:57] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 08m 20s) [02:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:54] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-20 02:43:53+00:00 [02:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:46:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [02:48:15] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/Elastica/: Bring phase0 and phase1 inline with phase2 (duration: 00m 21s) [02:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:48:43] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch/: Bring phase0 and phase1 inline with phase2 (duration: 00m 18s) [02:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:56:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [03:01:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 2 below the confidence bounds [03:06:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 3 below the confidence bounds [03:09:22] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 16s) [03:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:11:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 3 below the confidence bounds [03:14:15] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 
2015-10-20 03:14:15+00:00 [03:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:16:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 3 below the confidence bounds [03:23:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [03:30:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [03:37:32] 6operations, 6Release-Engineering-Team: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1737288 (10faidon) Well as those puppet runs and logs prove, user trebuchet belongs in group wikidev only for about 1 minute every 30. I t... [04:11:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:16:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [04:19:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [04:19:34] 6operations, 10Traffic, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1737303 (10Dzahn) a:3Dzahn [04:40:15] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (10998 100000s) [06:14:02] (03PS1) 10KartikMistry: CX: Enable ContentTranslation suggestion in all Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247515 [06:53:27] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1737383 (10ArielGlenn) 5Resolved>3Open space there because 
it was reinstalled with a larger partition. sorry, I should have closed this. doing so now. [06:57:17] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1737388 (10ArielGlenn) 5Open>3Resolved [07:18:45] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [07:22:13] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [07:23:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 20 07:23:56 UTC 2015 (duration 23m 55s) [07:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:25:07] (03Abandoned) 10Alexandros Kosiaris: WIP: just testing something [puppet] - 10https://gerrit.wikimedia.org/r/237412 (owner: 10Alexandros Kosiaris) [07:26:32] (03CR) 10Muehlenhoff: "Yes, this patch has been superceded by c3bc5370dc92cea603c66198d416c9381f6f6a58" [puppet] - 10https://gerrit.wikimedia.org/r/247201 (owner: 10Muehlenhoff) [07:26:46] (03Abandoned) 10Muehlenhoff: iridium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247201 (owner: 10Muehlenhoff) [07:27:57] (03PS3) 10Muehlenhoff: Mark multatuli as spare [puppet] - 10https://gerrit.wikimedia.org/r/246831 [07:28:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Mark multatuli as spare [puppet] - 10https://gerrit.wikimedia.org/r/246831 (owner: 10Muehlenhoff) [07:29:14] (03CR) 10Nemo bis: "This one may use a Turkish speaker's opinion." 
[dns] - 10https://gerrit.wikimedia.org/r/244082 (owner: 10Dzahn) [07:32:58] (03CR) 10Alexandros Kosiaris: [C: 031] "This will change /etc/ldap/ldap.conf on:" [puppet] - 10https://gerrit.wikimedia.org/r/246242 (owner: 10Alexandros Kosiaris) [07:33:19] (03PS2) 10Muehlenhoff: Add salt grains for ocg [puppet] - 10https://gerrit.wikimedia.org/r/246955 [07:40:43] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for ocg [puppet] - 10https://gerrit.wikimedia.org/r/246955 (owner: 10Muehlenhoff) [07:41:26] (03CR) 10Alexandros Kosiaris: [C: 031] admin: let kartotherian and tilerator admins read logs [puppet] - 10https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: 10Dzahn) [07:43:41] akosiaris: by the way, i liked your homelist idea, but i thought naming the tag 'home' would be better, so i went with that [07:43:58] i converted all the homelist tag to home as well [07:44:19] (03Abandoned) 10Muehlenhoff: Mark analytics1021 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/247211 (owner: 10Muehlenhoff) [07:56:14] (03PS2) 10Muehlenhoff: Add salt grains for sca [puppet] - 10https://gerrit.wikimedia.org/r/246956 [07:56:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for sca [puppet] - 10https://gerrit.wikimedia.org/r/246956 (owner: 10Muehlenhoff) [08:05:17] (03PS2) 10Muehlenhoff: Add salt grains for scb [puppet] - 10https://gerrit.wikimedia.org/r/246957 [08:05:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for scb [puppet] - 10https://gerrit.wikimedia.org/r/246957 (owner: 10Muehlenhoff) [08:09:42] (03PS2) 10Muehlenhoff: Add salt grains for videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/246958 [08:12:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/246958 (owner: 10Muehlenhoff) [08:44:10] 6operations, 10Mathoid, 10RESTBase, 6Services: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1737538 
(10mobrovac) [08:50:11] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1737542 (10jcrespo) @Dzahn I do not see anything in a critical state - I suppose you meant the "lag" between the servers. Did you or andrew did... [08:55:34] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [08:57:14] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [09:00:05] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [09:00:24] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 34.89 ms [09:00:45] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1737574 (10jcrespo) 3NEW a:3jcrespo [09:10:52] (03CR) 10Hashar: "Lame nitpick to have each role on its own line." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247192 (owner: 10Muehlenhoff) [09:12:19] 6operations: Old salt grains not removed if a role changes - https://phabricator.wikimedia.org/T115983#1737598 (10MoritzMuehlenhoff) 3NEW [09:13:34] (03CR) 10Muehlenhoff: gallium: Use the role keyword (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247192 (owner: 10Muehlenhoff) [09:15:25] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Puppet has 1 failures [09:15:40] hashar, who is the right person to ask questions about CI changes? [09:15:42] 6operations, 10Traffic, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1737605 (10hashar) Do we really care of having `status.wikimedia.org` to be served over TLS? I am not sure it is worth it (and the price of a host cert), so I would rather disab... 
[09:20:15] jynus: good morning (still) [09:20:22] morning [09:20:39] jynus: anyone with a +voice in #wikimedia-releng , but usually during European time that would be either me or zeljkof :D [09:20:50] what is going on? [09:21:00] doesn't need to be in European time [09:23:08] I just need some changes done that potentially will break CI for some people, will discuss in your channel later the best way to proceed [09:24:20] alternatively you can bring it up to the QA mailing list (fairly low traffic) [09:24:31] ah, great [09:24:44] or via a Phabricator task :-} [09:25:13] we have a checkin this afternoon at 2pm UTC / 4pm CET over hangout if you want to talk about it with Jan (WMDE), Zeljko and I [09:26:20] (PS2) Muehlenhoff: gallium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247192 [09:26:50] well, I do not want to steal a lot of your time, but I will send you an email before that time if you want to comment on it [09:26:55] (CR) jenkins-bot: [V: -1] gallium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247192 (owner: Muehlenhoff) [09:27:01] thank you, hashar [09:32:07] actually, I've realized there is another blocker for what I want to do, so it will have to wait [09:32:27] jynus: just file a task against #continuous-integration-infrastructure , this way it doesn't get lost and others can look at it [09:32:35] jynus: I am terrible at managing direct emails [09:32:38] yes [09:32:47] the tasks actually are already filed [09:32:49] and even if there is a blocker, it is still worth filing :-} [09:32:51] oh [09:32:54] I will add you [09:33:19] moritzm: I like the 'new' role() semantic :-} [09:33:25] (CR) Hashar: [C: 1] gallium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247192 (owner: Muehlenhoff) [09:34:27] hashar, T108255 [09:35:56] !task [09:35:57] !help task [09:35:57] want docs? ask for "!wm-bot". all keywords? 
try "@regsearch .*" [09:36:16] !task is https://phabricator.wikimedia.org/$1 [09:36:17] Key was added [09:36:26] task T108255 [09:36:28] !task T108255 [09:36:28] https://phabricator.wikimedia.org/T108255 [09:37:11] instead of doing CI first, they proposed enabling the warnings on production first [09:37:37] today I learned that MySQL/MariaDB is sometimes as annoying as PHP [09:39:29] on what? [09:40:13] on not producing errors? on not being standard? They are very similar on that [09:40:13] it has an optional strict mode ? :D [09:40:19] yep [09:40:30] operations, Beta-Cluster-Infrastructure, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1737694 (Chmarkine) >>! In T50501#1669896, @Chmarkine wrote: > [[ https://letsencrypt.org/ | Let's Encrypt ]] provides free tru... [09:40:35] which has created several security errors in the past [09:40:45] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:41:01] not anymore, 5.7 has it by default [09:41:16] so we better fix our code soon :-) [09:41:21] I am writing down some summary of mysql versions being used on CI / beta [09:41:54] are you ok with being more strict on CI than on production? [09:49:10] (PS1) Muehlenhoff: argon: Move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/247522 [09:50:32] jynus: I am commenting on the task, but CI basically runs mysql-server 5.5.44 from Ubuntu .. [09:50:51] the version doesn't matter [09:51:18] if code can be conditional (which means code has to be changed) [09:53:56] jynus: basic summary of versions being used is at https://phabricator.wikimedia.org/T108255#1737707 [09:54:08] the thing is that enabling strict on CI will impact all changes [09:55:02] (CR) Hashar: "Damn puppet-lint, don't bother with multiple lines so. sorry!" 
[puppet] - 10https://gerrit.wikimedia.org/r/247192 (owner: 10Muehlenhoff) [09:58:28] (03PS1) 10Muehlenhoff: Mark calcium as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/247523 [10:01:29] (03PS3) 10Muehlenhoff: gallium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247192 [10:09:56] (03PS4) 10Muehlenhoff: gallium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247192 [10:15:55] (03PS1) 10Muehlenhoff: Assign salt grains for dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/247528 [10:41:52] (03PS1) 10Muehlenhoff: Add salt grains for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/247538 [10:41:54] (03PS1) 10Muehlenhoff: Assign salt grains for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/247539 [10:41:56] (03PS1) 10Muehlenhoff: Assign salt grains for lists [puppet] - 10https://gerrit.wikimedia.org/r/247540 [11:05:49] (03PS2) 10Muehlenhoff: Assign salt grains for dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/247528 [11:06:02] (03PS5) 10Muehlenhoff: gallium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247192 [11:07:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] gallium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247192 (owner: 10Muehlenhoff) [11:12:09] 6operations, 7Database: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1737846 (10jcrespo) So, I have some questions here for the TLS experts: * Recommended cipher and key length (I suppose 2048), that we use for other production services (I assume `ssl_cipher=TLSv1.2`, which l... 
[11:13:56] 6operations, 7Database: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1737850 (10Reedy) [11:15:30] (03PS3) 10Muehlenhoff: Assign salt grains for dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/247528 [11:16:41] !log mathoid deploying 8e1a3327 [11:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/247528 (owner: 10Muehlenhoff) [11:22:58] (03PS1) 10Jcrespo: [WIP] Script to genereate openssh TLS keys for mysql replication [software] - 10https://gerrit.wikimedia.org/r/247542 (https://phabricator.wikimedia.org/T111654) [11:24:00] genereate <> generate, it is a new verb I invented [11:43:11] (03PS1) 10Muehlenhoff: Move to separate server group [puppet] - 10https://gerrit.wikimedia.org/r/247544 [11:43:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move to separate server group [puppet] - 10https://gerrit.wikimedia.org/r/247544 (owner: 10Muehlenhoff) [11:44:33] (03PS2) 10Muehlenhoff: Add salt grains for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/247538 [11:45:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/247538 (owner: 10Muehlenhoff) [11:49:24] (03PS1) 10Aude: Temporarily disable 'item-merge' right on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247546 [11:54:31] (03PS2) 10Muehlenhoff: Assign salt grains for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/247539 [11:54:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/247539 (owner: 10Muehlenhoff) [11:55:31] (03CR) 10JanZerebecki: [C: 031] Temporarily disable 'item-merge' right on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247546 (owner: 10Aude) [11:55:50] * aude needs to deploy quick config change for wikidata 
[11:56:01] as long as no one else is deploying now... [11:56:12] (03CR) 10Reedy: [C: 031] Temporarily disable 'item-merge' right on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247546 (owner: 10Aude) [11:56:36] aude: It's not set in InitialiseSettings or anything is it? [11:56:48] just in Wikidata default config-y stuff? [11:56:59] (03PS2) 10Muehlenhoff: Assign salt grains for lists [puppet] - 10https://gerrit.wikimedia.org/r/247540 [11:57:03] no, it's wikibase specific and only set if $wgUseWikibaseRepo = true [11:57:11] yeah, should be fine then :) [11:57:18] same as property-create :) [11:57:35] * aude hopes for a real fix this afternoon, though the code is somewhat complex [11:57:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for lists [puppet] - 10https://gerrit.wikimedia.org/r/247540 (owner: 10Muehlenhoff) [11:58:31] (03CR) 10Aude: [C: 032] "verified locally and works similar to property-create, so confident this is unproblematic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247546 (owner: 10Aude) [11:58:37] (03Merged) 10jenkins-bot: Temporarily disable 'item-merge' right on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247546 (owner: 10Aude) [12:01:21] !log aude@tin Synchronized wmf-config/Wikibase.php: Temporarily disallow item-merge until T115892 is resolved (duration: 00m 19s) [12:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:32] (03PS5) 10Alexandros Kosiaris: Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 [12:09:34] (03PS7) 10Alexandros Kosiaris: etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [12:09:36] (03PS5) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [12:12:25] (03PS6) 10Alexandros Kosiaris: Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 
[12:12:26] (03PS8) 10Alexandros Kosiaris: etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 [12:12:29] (03PS6) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [12:13:07] (03CR) 10Alexandros Kosiaris: [C: 032] Specify etherpad.wikimedia.org logging [puppet] - 10https://gerrit.wikimedia.org/r/220086 (owner: 10Alexandros Kosiaris) [12:17:07] (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: Log the incoming request's IP address [puppet] - 10https://gerrit.wikimedia.org/r/220087 (owner: 10Alexandros Kosiaris) [12:19:22] (03PS1) 10Muehlenhoff: Assign salt grains for ganeti [puppet] - 10https://gerrit.wikimedia.org/r/247549 [12:34:35] (03PS1) 10Muehlenhoff: Assign salt grains for bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/247550 [12:34:37] (03PS1) 10Muehlenhoff: Assign salt grains for otrs [puppet] - 10https://gerrit.wikimedia.org/r/247551 [12:34:39] (03PS1) 10Muehlenhoff: Assign salt grains for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/247552 [12:35:02] (03PS7) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [12:36:07] (03PS1) 10Muehlenhoff: mark graphite1002 as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/247553 [12:38:24] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for ganeti [puppet] - 10https://gerrit.wikimedia.org/r/247549 (owner: 10Muehlenhoff) [12:42:42] 6operations: Category lag on commons - https://phabricator.wikimedia.org/T116001#1738013 (10Steinsplitter) 3NEW [12:43:29] 6operations: Category lag on commons - https://phabricator.wikimedia.org/T116001#1738021 (10Steinsplitter) [12:44:29] !log Update cxserver to 6452b68 [12:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:44:40] 6operations, 7Performance: Category 
lag on commons - https://phabricator.wikimedia.org/T116001#1738013 (10Steinsplitter) [12:45:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for bastion hosts [puppet] - 10https://gerrit.wikimedia.org/r/247550 (owner: 10Muehlenhoff) [12:46:04] (03PS8) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [13:05:04] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:13:49] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1738059 (10jcrespo) p:5Triage>3Low [13:18:44] (03PS1) 10KartikMistry: Apertium: Add apertium-isl and apertium-isl-eng packages [puppet] - 10https://gerrit.wikimedia.org/r/247562 (https://phabricator.wikimedia.org/T114988) [13:24:05] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1738083 (10jcrespo) According to https://gerrit.wikimedia.org/r/#/c/246689/1/wmf-config/InitialiseSettings.php, this should have been enabled on all production wikis. About to perform... [13:26:59] 6operations, 10Wikimedia-General-or-Unknown, 7Performance: Category lag on commons - https://phabricator.wikimedia.org/T116001#1738092 (10Nemo_bis) [13:30:47] akosiaris: https://gerrit.wikimedia.org/r/#/c/247562/ and then watch for any segfault. [13:30:54] was it in kernel logs? 
[13:31:32] kart_: yup [13:31:57] (03CR) 10Alexandros Kosiaris: [C: 032] Apertium: Add apertium-isl and apertium-isl-eng packages [puppet] - 10https://gerrit.wikimedia.org/r/247562 (https://phabricator.wikimedia.org/T114988) (owner: 10KartikMistry) [13:37:36] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1738113 (10jcrespo) There are 30 million rows on these tables (on enwiki, fewer on the others). This makes this a slightly more complex issue due to potential impact on the 5.5 masters... [13:48:30] !log backing up and renaming user_daily_contribs table from all wikis as a previous step for its deletion [13:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:43] 6operations, 10Wikimedia-Mailing-lists: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1738131 (10fgiunchedi) FWIW you should be able to obtain the same with a diamond collector and export data into graphite for graphing (and possibly alerting) [13:50:07] (03PS1) 10Ottomata: Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) [13:50:56] there is a channel for the sysadmin? [13:51:27] cortex_, you are on it [13:51:50] ok [13:52:53] apergos: have you seen my snapshot changes? [13:52:55] I put you as a reviewer [13:54:24] (03CR) 10Faidon Liambotis: [C: 04-1] "This commit is correct, however we should just seize the opportunity and convert both to require_package." [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [13:54:24] paravoid: I saw that they are there, I have only looked at the elimination of the common role [13:54:30] which is fine [13:54:33] hey apergos [13:54:37] cortex_: [13:54:45] have you read on wikimedia-tech? 
[13:54:52] I saw your comments about the mirror in the backscroll here [13:54:58] thanks [13:55:03] what do you think [13:55:05] no I haven't been paying attention to the conversations there, what's up? [13:55:17] apergos: can you review them on gerrit? [13:55:21] yes we do need a mirror with good bandwidth and capacity in europe [13:55:24] paravoid: yes [13:55:28] ok [13:55:34] apergos: i have one idea [13:55:45] for some reasons I thought seeing the email subjects coming through that they had already been, my bad [13:56:11] (PS3) Faidon Liambotis: wikitech: add SSL cert expiry monitoring [puppet] - https://gerrit.wikimedia.org/r/244610 (https://phabricator.wikimedia.org/T114059) (owner: Dzahn) [13:56:12] apergos: but how much bandwidth and space? [13:56:15] cortex_: let's hear it [13:56:23] (CR) Faidon Liambotis: [C: 2] wikitech: add SSL cert expiry monitoring [puppet] - https://gerrit.wikimedia.org/r/244610 (https://phabricator.wikimedia.org/T114059) (owner: Dzahn) [13:56:40] apergos: my idea is to make another mirror in europe [13:56:48] well, if this is going to be the primary european mirror, it would be good if it could support I would say two months worth in europe [13:56:51] apergos: for speed purpose [13:56:56] which is two full runs plus two runs without full history [13:57:14] what's the speed issue? [13:57:28] downloading more faster from europe [13:57:36] i mean [13:58:00] (PS2) Ottomata: Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) [13:58:03] because at the moment [13:58:14] you have one in brasil [13:58:24] and one in usa right? [13:58:42] why would it be faster to download from europe though? [13:58:43] I know they don't have a lot of bandwidth to throw at it from brazil but the usa mirror is pretty good [13:58:58] do you have any indications that there is congestion between european ISPs and our US network? 
[13:59:14] oh sorry [13:59:15] 3 in usa [13:59:20] so th [13:59:35] and 1 brazil [13:59:44] e one in the usa I am thinking of is the one that mirrors everything\ [14:00:22] (03PS4) 10Dzahn: admin: add bd808 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) [14:01:07] The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps. Please volunteer to host a mirror if you have access to sufficient storage and bandwidth. [14:01:14] (03CR) 10Dzahn: [C: 032] admin: add bd808 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) (owner: 10Dzahn) [14:01:22] anyways cortex_ if you have proposals for institutions or providers that would willing to put up mirrors, we are always looking [14:01:22] i read this [14:01:27] always. [14:01:34] (03PS2) 10Muehlenhoff: Enable ferm on analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/243151 [14:01:39] can we have a problem statement please? [14:01:48] is the problem speed? availability of multiple copies? [14:01:54] (03PS5) 10Dzahn: admin: add bd808 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) [14:02:06] disaster recovery in case the wikimedia foundation dies as an organization? 
[14:02:17] mmm [14:02:19] maybe [14:02:20] :P [14:02:24] if it's speed, it's interesting from a network perspective as well [14:02:46] if there is evidence of network congestion between european ISP and the US network, this is something we care about irrespective of dumps [14:03:14] (03PS3) 10Ottomata: Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) [14:03:21] and you have users in asia too [14:03:44] that maybe wants to download a dump [14:03:59] and oceania [14:04:07] again, problem statement please [14:04:21] is there a problem and if so, what is it [14:04:33] paravoid: the speed always. [14:04:49] what does that mean? [14:04:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/243151 (owner: 10Muehlenhoff) [14:05:00] (03PS3) 10Muehlenhoff: Enable ferm on analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/243151 [14:05:08] (03CR) 10Muehlenhoff: [V: 032] Enable ferm on analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/243151 (owner: 10Muehlenhoff) [14:05:09] paravoid: have you limited the speed? [14:05:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1738184 (10Dzahn) on stat1002: Notice: /Stage[main]/Admin/Admin::Hashuser[bd808]/Admin::User[bd808]/User[bd808]/ensure: created .. [14:05:23] of each dumps? 
[14:05:36] that's for apergos, but I think yes [14:05:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1738187 (10Dzahn) 5Open>3Resolved [stat1002:~] $ id bd808 uid=3518(bd808) gid=500(wikidev) groups=500(wikidev),731(analytics-privatedata-users) [14:05:43] not because of network congestion, though [14:05:51] as we say on the download page for our website, we limit speed and number of simultaneous connections per IP [14:06:00] from us. [14:06:02] yes [14:06:14] why? [14:06:17] (03PS4) 10Ottomata: Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) [14:06:38] for bandwidth limited? [14:07:01] because we have had users that ate all available bandwidth at the expense of others [14:07:37] which bandwidth apergos? I/O or network? [14:07:47] network initially, actually [14:07:52] some greedy aws users [14:07:56] the server's bandwidth I suppose? [14:08:03] some users want to download at maxium speed apergos [14:08:10] it's 10G now isn't it? [14:08:20] (03PS5) 10Ottomata: Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) [14:08:21] yes they do. well, for speed they really should use a mirror. [14:08:22] (03PS2) 10Dzahn: argon: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/247522 (owner: 10Muehlenhoff) [14:08:24] *maximum [14:09:02] 500 GB is enough ? [14:09:08] for a mirror ? [14:09:20] why should they use a mirror for speed? [14:09:22] or you need better? 
[14:09:24] I don't get that [14:09:38] (03CR) 10Dzahn: [C: 032] argon: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/247522 (owner: 10Muehlenhoff) [14:09:53] because we have at least one mirror that has much better capacity bandwidth wise and disk wise than we do [14:10:12] bd808: your user has been created on stat1002 [14:10:20] and it's also not the host where the dumps are produced (I'd really like to not serve from that host someday) [14:10:24] i can offer my help [14:10:26] if you want [14:10:46] found one dedicated [14:11:12] bandwidth wise we have a 10G NIC connected to a high capacity uncongested network [14:11:14] for mirror [14:11:23] so I don't think this is an issue, is it? [14:11:34] if it is, I'm interested with my network eng hat on :) [14:11:37] (03PS2) 10Filippo Giunchedi: mark graphite1002 as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/247553 (owner: 10Muehlenhoff) [14:11:44] Oh I believe it would be if the disks could keep up [14:11:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mark graphite1002 as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/247553 (owner: 10Muehlenhoff) [14:11:48] last time we checked, there were some punctual congestion for some hours when there was a fresh dump, but it was *not related to network bandwidth*, afaik [14:11:57] (03PS6) 10Ottomata: Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) [14:12:06] (03CR) 10Ottomata: [C: 032 V: 032] Deploy VarnishReqstats diamond collector on remaining cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/247564 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [14:12:23] right so the problem is that disks can't keep up, right? 
[14:12:29] !log deployed varnishreqstats diamond collector to remaining varnish caches [14:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:41] which in turn partially is because the machine has tiny amounts of memory/pagecache [14:12:42] as soon as the en full history dumps come out there are lots of folks that want to pull them, they would happily download at top speed with as many simultaneous connections as they could [14:12:46] disk/threads/etc. [14:13:09] 10 G apergos [14:13:16] dedicated [14:13:35] :) [14:13:42] so it looks like there are two discussions happening here [14:14:08] one is a mirror offer, and the other is the disk i/o issue [14:14:09] 6operations, 10Beta-Cluster-Infrastructure: Beta Cluster no longer listens for HTTPS - https://phabricator.wikimedia.org/T70387#1738192 (10Krenair) I made a certificate for beta on deployment-puppetmaster and replaced the star.wmflabs.org cert with it there (also had a mess around with some other settings to g... [14:14:37] well yes, but they are sort of interrelated [14:15:13] not so much; we have the list of things we'd like for a mirror, we work out the rsync deals with the institution providing the mirror, and off it goes [14:15:36] we are getting a mirror offer because we offer a limited service that doesn't satisfy demand [14:15:44] the i/o issue though, if that could be resolved by adding memory that would be nice [14:15:52] I would encourage mirrors regardless [14:15:59] and you need to tell me how much bandwidth 100 TB is enough? [14:16:02] (03CR) 10Dzahn: "needs a manual rebase pls" [puppet] - 10https://gerrit.wikimedia.org/r/246964 (owner: 10Muehlenhoff) [14:16:12] we offer a crappy service and people are trying to help us offer a better service [14:16:32] it's commendable and very constructive to actually offer a mirror instead of just complaining [14:16:46] is there a but here? 
[14:16:52] but regardless, it's our job to also make our service better so that people don't feel the need to offer mirrors :) [14:17:28] I would like: our service to be better, people to want to offer mirrors so there are multiple copies in the world; to have so many downloaders that those mirrors are all well used [14:17:44] nothing against mirrors and you're right, this should be responded separately -- but I would like to see a task tracking the issues we're having that have forced us to place limits [14:18:00] sure [14:18:24] I hope you can weigh in on possible fixes too [14:18:26] the discussion above mentioned network issues as well for example which are interesting to me from a network capacity planning perspective [14:18:42] I think this proved to be a red herring but I'd like to be sure about that :) [14:18:45] yes, promise I will! [14:18:50] yay! [14:19:09] (03PS1) 10Ottomata: Remove unused role::cache::logging::eventlistener, add comments describing different varnish loggers [puppet] - 10https://gerrit.wikimedia.org/r/247566 [14:19:24] I'm not always just complaining :P [14:19:33] cortex_: in case it wasn't linkd yet, https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps has answers for the questions you asked a few dozens lines ago [14:19:34] cortex_: what I would want from you is ... are you the contact person for the institution that would be providing the mirror? [14:19:46] sometimes I can be constructive too, cf. 
my snapshot patches in gerrit ;-) [14:19:47] (03PS3) 10Alex Monk: Rename mediawiki::web::sites to mediawiki::web::prod_sites to make room for a new generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/244228 (https://phabricator.wikimedia.org/T86644) [14:20:13] (03PS2) 10Alex Monk: Begin to merge production and beta apache config, starting with nonexistent.conf [puppet] - 10https://gerrit.wikimedia.org/r/244237 (https://phabricator.wikimedia.org/T86644) [14:20:26] (03PS3) 10Alex Monk: Begin to merge production and beta apache config, starting with nonexistent.conf [puppet] - 10https://gerrit.wikimedia.org/r/244237 (https://phabricator.wikimedia.org/T86644) [14:20:41] paravoid: :-P and btw I am enough of a complainer for the two of us [14:20:43] yes apergos but i'm not an institution [14:20:46] ok [14:20:48] just a private [14:20:51] (03PS2) 10Dzahn: tor: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247228 (owner: 10Muehlenhoff) [14:20:52] well here is what I would like you to do [14:20:53] (03CR) 10Filippo Giunchedi: "sounds good to me, see also my comment on PS4 about sudo" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [14:21:26] what do you think apergos ? [14:21:32] is it possible? [14:21:39] go ahead and dscribe in the email what you could provide, and what your limitations would be (to the address on the web page) [14:22:04] and we can follow up from there. 
[14:22:10] ok [14:22:15] do mark the subject like as it says in the email [14:22:34] yes your 10gb would be a fine addition [14:22:43] rename table is ongoing, please alert if you see regressions on the logs/http requests failing [14:23:22] fwiw, I think as the foundation we have enough resources to be able to resource this service properly [14:24:15] (03PS2) 10Muehlenhoff: bromine: Move to using the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246964 [14:24:39] (03CR) 10Muehlenhoff: "The patch has been manually rebased." [puppet] - 10https://gerrit.wikimedia.org/r/246964 (owner: 10Muehlenhoff) [14:25:23] (03CR) 10Ottomata: [C: 032] Remove unused role::cache::logging::eventlistener, add comments describing different varnish loggers [puppet] - 10https://gerrit.wikimedia.org/r/247566 (owner: 10Ottomata) [14:26:49] there has been a 33% very stable load increase on enwiki in the last 2 hours [14:27:35] on reads, since 12:25 UTC aprox [14:27:39] paravoid: resources in existing hardware/network, or budgetary resources that could be used to acquire hardware? [14:27:48] (03PS3) 10Dzahn: tor: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247228 (owner: 10Muehlenhoff) [14:27:57] both, broadly :) [14:28:04] not even talking about this year's budget [14:28:28] but it's not a huge problem to solve based on our size [14:28:39] (03PS4) 10Dzahn: tor: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247228 (owner: 10Muehlenhoff) [14:31:09] RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 16%, RTA = 2.07 ms [14:31:35] (03CR) 10Dzahn: [C: 032] tor: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247228 (owner: 10Muehlenhoff) [14:31:52] 6operations, 10ops-eqiad, 5Patch-For-Review: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1738270 (10Cmjohnson) Swapped the fiber and sfp+'s for Juniper 10G copper cable. 
[14:32:11] what would be a good way to try to correlate with http requests, I am lost with all the graphana dashboards [14:33:19] I think I am going "old fashion" [14:33:25] (03PS3) 10Dzahn: bromine: Move to using the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246964 (owner: 10Muehlenhoff) [14:33:42] !log removing tele2(patchid 2953) from dmarc panel @eqiad [14:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:25] (03CR) 10Dzahn: [C: 032] bromine: Move to using the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246964 (owner: 10Muehlenhoff) [14:35:18] 6operations, 10netops, 10procurement: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1738277 (10Cmjohnson) [14:35:20] 6operations, 10ops-eqiad, 10netops: remove tele2(patchid 2953) from dmarc panel - https://phabricator.wikimedia.org/T115921#1738275 (10Cmjohnson) 5Open>3Resolved Removed [14:39:02] 6operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011#1738291 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [14:39:13] 6operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011#1738299 (10MoritzMuehlenhoff) p:5Triage>3Normal [14:39:14] (03PS7) 10Rush: Phabricator: Fetch all references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:39:42] 6operations, 10netops, 10procurement: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1738306 (10faidon) Confirmed that the patch was successfully removed: ``` faidon@re1.cr1-eqiad> show interfaces descriptions | match tele2 xe-5/2/2 up down Transit: 6operations: reclaim tmh2* as spares or into mw* pool - https://phabricator.wikimedia.org/T115950#1738309 (10Cmjohnson) [14:40:05] 6operations, 10ops-eqiad: relabel tmh1001/mw1259 & tmh1002/mw1260 - https://phabricator.wikimedia.org/T115952#1738307 (10Cmjohnson) 5Open>3Resolved done [14:40:15] (03CR) 10Rush: [C: 032 V: 032] Phabricator: Fetch all references 
in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:40:58] 6operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011#1738311 (10faidon) Yeah that's a good idea. If we do, should we consider userspace logging with ulogd instead of spamming dmesg? That way we could potentially collect it in the future as well and e.g. detect anomalies. [14:41:27] chasemp, twentyafterfour: https://phabricator.freedesktop.org/diffusion/GITPHAB/ is interesting [14:41:29] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:42:06] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1738312 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=silver&service=HTTPS SSL OK - Certificate wikitech.wikimedia.org valid until 2016... [14:42:56] (03PS2) 10EBernhardson: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247478 [14:42:58] (03PS2) 10Ottomata: Add --percent-lost flag to refinery data check [puppet] - 10https://gerrit.wikimedia.org/r/246412 (https://phabricator.wikimedia.org/T113255) (owner: 10Mforns) [14:43:02] (03CR) 10Ottomata: [C: 032] Add --percent-lost flag to refinery data check [puppet] - 10https://gerrit.wikimedia.org/r/246412 (https://phabricator.wikimedia.org/T113255) (owner: 10Mforns) [14:43:07] upstream did the basic work to get ref=>diff stuff but I haven't seen it in action [14:43:08] (03CR) 10Ottomata: [V: 032] Add --percent-lost flag to refinery data check [puppet] - 10https://gerrit.wikimedia.org/r/246412 (https://phabricator.wikimedia.org/T113255) (owner: 10Mforns) [14:44:18] (03PS1) 10Dzahn: icinga: move ssl cert install to role [puppet] - 10https://gerrit.wikimedia.org/r/247570 [14:45:20] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: 
puppet fail [14:45:47] (03PS1) 10Rush: phab: bad package name for dep [puppet] - 10https://gerrit.wikimedia.org/r/247571 [14:46:01] (03CR) 10Ottomata: "This is all that is necessary, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) (owner: 10Dzahn) [14:47:20] (03CR) 10Rush: [C: 032] phab: bad package name for dep [puppet] - 10https://gerrit.wikimedia.org/r/247571 (owner: 10Rush) [14:47:43] 6operations: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011#1738359 (10MoritzMuehlenhoff) Yes, my initial debug rules simply used syslog, but for a more complete fleet-wide solution I was thinking of ulogd(2). [14:47:48] (03CR) 10Dzahn: "ok, cool, thank you" [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) (owner: 10Dzahn) [14:47:49] Question: do operations team have access to visits logs with indications on robots? (specifically Googlebot user agent?) [14:47:50] (03CR) 10Ottomata: [C: 031] stat1001: Fully use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247222 (owner: 10Muehlenhoff) [14:48:15] (03CR) 10DCausse: [C: 031] Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [14:48:23] (03CR) 10Ottomata: [C: 031] gadolinium: Use the role keyword (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247191 (owner: 10Muehlenhoff) [14:48:37] 6operations, 10ops-eqiad, 5Patch-For-Review: cp1059 has network issues - https://phabricator.wikimedia.org/T114870#1738367 (10BBlack) Looks fixed, will leave it alone for a day or so to see if icinga state stabilizes, then un-downtime it for a day or two and see if we get alerts, then repool. 
[14:48:41] (03CR) 10Ottomata: [C: 031] statistics::cruncher: Move standard and base::firewall includes into the role [puppet] - 10https://gerrit.wikimedia.org/r/247223 (owner: 10Muehlenhoff) [14:48:51] (03CR) 10Dzahn: "@Filippo let me move the cert install into the role first then and solve it that way .. -> https://gerrit.wikimedia.org/r/#/c/247570/" [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [14:50:19] (03PS2) 10Muehlenhoff: lithium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247210 [14:50:33] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1738380 (10BBlack) 5Open>3Resolved Seems ok now [14:50:40] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:00] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [14:52:19] (03PS1) 10Dzahn: planet: move to role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247576 [14:52:40] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [14:54:34] (03PS2) 10Dzahn: planet: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247576 [14:54:40] (03PS1) 10Dzahn: silver/wikitech: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247577 [14:56:08] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1738399 (10jcrespo) I have renamed the table on all wikis to delete_user_daily_contribs; I will leave it as is for some time, will delete it afterwards after checking that no code is r... 
[14:56:18] (03PS1) 10Dzahn: protactinium: mark as role spare [puppet] - 10https://gerrit.wikimedia.org/r/247578 [14:56:23] (03PS3) 10Filippo Giunchedi: lithium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247210 (owner: 10Muehlenhoff) [14:56:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lithium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247210 (owner: 10Muehlenhoff) [14:58:45] (03PS1) 10Dzahn: nitrogen/ipv6relay: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247579 [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151020T1500). Please do the needful. [15:00:05] mutante: no need to create new "role" patches, I made a complete rundown of patches [15:00:09] (03PS1) 10Dzahn: neon/icinga: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247580 [15:00:24] planet, silver, protactinium and nitrogen are all duplicates [15:00:24] moritzm: ooh..ok. /me stops [15:00:29] be back soon [15:00:38] I can add you as reviewer for the remaining ones [15:00:56] It seems that google and other search engines don't index pages containing apostrophe in title (https://phabricator.wikimedia.org/T112425) [15:01:14] moritzm: ok [15:01:50] can someone with access to visitors logs can try to shed light on this issue [15:01:59] who can SWAT? [15:02:09] I can SWAT. MatmaRex yurik kart_ ebernhardson ping for SWAT (bot is...not pinging) [15:02:21] thcipriani, pog [15:02:24] pogn [15:02:24] pong [15:02:27] hi. [15:02:29] )) [15:02:29] jouncebot: next [15:02:30] In 0 hour(s) and 57 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151020T1600) [15:02:36] hm. [15:02:39] joal: help [15:02:43] argh. [15:02:46] jouncebot: help [15:03:02] maybe someone confused with that ircnikc [15:03:21] why is the bloody bot sending notices. 
[15:03:25] anyway. [15:03:36] jouncebot: refresh [15:03:38] I refreshed my knowledge about deployments. [15:03:42] thcipriani: pong [15:03:48] * bd808 has been meaning to fix the notice thing [15:04:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246703 (owner: 10Bartosz Dziewoński) [15:04:29] (03Merged) 10jenkins-bot: Move ForeignUploadTargets config to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246703 (owner: 10Bartosz Dziewoński) [15:04:44] https://github.com/mattofak/jouncebot/commit/c6188c9b3c0f0abf7f15e947ab9db38bb0c5d7b3 [15:06:09] bd808: I fixed ircnick ;) [15:06:16] too late though. [15:07:36] (03PS1) 10ArielGlenn: monitor should use the new dblists location just like rest of dumps [puppet] - 10https://gerrit.wikimedia.org/r/247582 [15:07:36] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1738448 (10Cmjohnson) 3NEW a:3Cmjohnson [15:08:06] sweet, swat! :) [15:08:23] (03CR) 10ArielGlenn: [C: 032] monitor should use the new dblists location just like rest of dumps [puppet] - 10https://gerrit.wikimedia.org/r/247582 (owner: 10ArielGlenn) [15:10:34] MatmaRex: I'm going to sync out InitialiseSettings then CommonSettings, which is backwards of how it is normally done, but the change seems to need that. Does that seem right to you? [15:11:51] thcipriani: i've never deployed anything, so don't trust me. 
but yeah, it seems reasonable that this way would work correctly, and the other not [15:12:06] (03PS3) 10Jcrespo: Add pt-heartbeat start & execution script to mariadb [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244651 (https://phabricator.wikimedia.org/T114752) [15:12:56] MatmaRex: heh, okie doke, with that caveat, I'll stick with the plan to roll out InitialiseSettings first :) [15:14:12] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Move ForeignUploadTargets config to production [[gerrit:246703]] part I (duration: 00m 18s) [15:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:40] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Move ForeignUploadTargets config to production [[gerrit:246703]] part 2 (duration: 00m 17s) [15:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:08] ^ MatmaRex check please [15:15:37] doing [15:17:22] thcipriani: all in order. thanks [15:17:33] MatmaRex: thanks for checking! 
[15:17:55] (03PS2) 10Milimetric: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) [15:18:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247494 (owner: 10Yurik) [15:18:39] yeppii [15:18:53] (03Merged) 10jenkins-bot: Switch graphoid to the local restbase proxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247494 (owner: 10Yurik) [15:19:17] (03PS2) 10Muehlenhoff: Assign salt grains for otrs [puppet] - 10https://gerrit.wikimedia.org/r/247551 [15:20:01] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for otrs [puppet] - 10https://gerrit.wikimedia.org/r/247551 (owner: 10Muehlenhoff) [15:20:39] (03PS2) 10Muehlenhoff: Assign salt grains for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/247552 [15:21:33] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Switch graphoid to the local restbase proxy [[gerrit:247494]] (duration: 00m 17s) [15:21:41] ^ yurik check please [15:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:22] thcipriani, works! [15:22:30] yurik: neat. Thanks! [15:23:26] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247515 (owner: 10KartikMistry) [15:23:49] (03Merged) 10jenkins-bot: CX: Enable ContentTranslation suggestion in all Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247515 (owner: 10KartikMistry) [15:23:57] Krenair: Reedy i want to deploy https://gerrit.wikimedia.org/r/#/c/246696/ today (WikimediaMessages) [15:24:13] do i still / should i still backport that to the deployment branches? [15:24:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/247552 (owner: 10Muehlenhoff) [15:24:35] localisation update should have run with this stuff from master afaik [15:25:57] bblack, hey. 
does *.*.beta.wmflabs.org not match deployment.wikimedia.beta.wmflabs.org ? [15:26:38] multiple levels of wildcards aren't supported in X.509 certificates [15:27:05] * aude suspects i might need to and then run scap, but vaguely remember backporting to WikimediaMessages can cause problem [15:27:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable ContentTranslation suggestion in all Wikipedia [[gerrit:247515]] (duration: 00m 17s) [15:27:38] ^ kart_ check please [15:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:50] (what was to Krenair) [15:28:04] Ugh. [15:28:06] 6operations, 6Analytics-Backlog: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1738546 (10Ottomata) Hm, no idea. Can we just do gid => 30001 for file_mover group in role::logging::systemusers in role/logging.pp? [15:28:24] I guess I should figure out how to generate certs with subject alternative names then [15:29:03] I guess? dunno what you're trying to do :) [15:29:05] thcipriani: sure [15:29:21] paravoid, make beta SSL work [15:29:27] with a self-signed cert [15:29:43] We have a task about buying real certs possibly though. [15:29:50] Indeed [15:30:32] 6operations, 6Analytics-Backlog: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1738560 (10Ottomata) Hm, actually, let's make the gid for the file_mover user use the name rather than the gid. file_mover group exists as gid 997 on erbium. [15:31:06] well then yeah, using the current url scheme you need to either do SANs, or SNI [15:31:07] (03PS1) 10Ottomata: Set file_mover primary gid by name rather than number [puppet] - 10https://gerrit.wikimedia.org/r/247584 (https://phabricator.wikimedia.org/T115943) [15:31:30] Krenair: in general, SSL cert wildcards cannot be multi-level [15:31:32] thcipriani: looks great. Thanks! [15:31:43] kart_: cool, thanks! 
[15:31:47] oh I'm late heh :) [15:32:32] ebernhardson: looking through this patch now. trying to figure out ordering for sync. [15:32:58] thcipriani: i'm not sure there is a great order for it, i keep pushing for us to find a way to do atomic deploys [15:33:05] but i'm getting off-point :) [15:33:20] (03CR) 10Ottomata: [C: 032] Set file_mover primary gid by name rather than number [puppet] - 10https://gerrit.wikimedia.org/r/247584 (https://phabricator.wikimedia.org/T115943) (owner: 10Ottomata) [15:33:47] thcipriani: can scap *not* touch InitialiseSettings.php when syncing? if so you could sync everything but [15:33:59] thcipriani: then finally sync InitialiseSettings.php which busts the cache [15:34:55] (03CR) 10BryanDavis: [C: 031] Log OOM rate and HHVM-non-OOM error rate in statds for graphing [puppet] - 10https://gerrit.wikimedia.org/r/246409 (owner: 10Chad) [15:34:57] heya Jeff_Green [15:34:58] fyi [15:35:00] https://gerrit.wikimedia.org/r/#/c/247584/ [15:35:11] something was wrong on erbium [15:35:12] not sure why [15:35:13] Notice: /Stage[main]/Role::Logging::Systemusers/User[file_mover]/ensure: created [15:35:14] Notice: /Stage[main]/Role::Logging::Systemusers/File[/var/lib/file_mover]/group: group changed '30001' to 'file_mover' [15:35:14] Notice: /Stage[main]/Role::Logging::Udp2log::Erbium/File[/a/log/fundraising]/group: group changed '30001' to 'file_mover' [15:35:14] Notice: /Stage[main]/Role::Logging::Udp2log::Erbium/File[/a/log/fundraising/logs]/group: group changed '30001' to 'file_mover' [15:35:43] i'm going to chgrp all the files in /a/log/fundraising to file_mover [15:35:47] as gid 30001 doesn't exist [15:35:49] dunno why this happened [15:36:30] RECOVERY - puppet last run on erbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:36:35] thcipriani: err, sorry no i'm wrong that doesn't help :S [15:37:39] thcipriani: if you sync CirrusSearch-* first it should throw some warnings about non-existent variables, but not 
cause any particular issues with user queries or updates [15:37:40] ebernhardson: How about: -production -common commonsettings then initialise ? it doesn't look like that'd break anything. [15:38:09] (03PS1) 10Alex Monk: Change star.wmflabs.org to beta certificate [puppet] - 10https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T70387) [15:38:13] thcipriani: yea i think that will work [15:38:22] okie doke, lets give it a shot. [15:38:49] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247478 (owner: 10EBernhardson) [15:39:13] (03Merged) 10jenkins-bot: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247478 (owner: 10EBernhardson) [15:39:29] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Puppet has 1 failures [15:39:40] (03PS2) 10Alex Monk: beta: Use new self-signed SSL certificate [puppet] - 10https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T70387) [15:39:46] 6operations, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Beta Cluster no longer listens for HTTPS - https://phabricator.wikimedia.org/T70387#1738608 (10Krenair) a:3Krenair Using a real trusted certificate is covered in T50501, T75919 and T97593. [15:39:59] (03PS1) 10Ottomata: Remove uid setting from file_mover user. enforce-users-groups-cleanup was removing this [puppet] - 10https://gerrit.wikimedia.org/r/247589 (https://phabricator.wikimedia.org/T115943) [15:40:00] jynus: based on job queue lag on Commons, it's more likely that DB load be caused by jobqueue items (wild guess) [15:41:23] yes, it is not http requests [15:43:56] (03CR) 10Ottomata: [C: 032] Remove uid setting from file_mover user. 
enforce-users-groups-cleanup was removing this [puppet] - 10https://gerrit.wikimedia.org/r/247589 (https://phabricator.wikimedia.org/T115943) (owner: 10Ottomata) [15:43:59] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1738629 (10fgiunchedi) taking another look at this, I'm going to block it with {T101141} about fixing inbound udp errors on graphite first sinc... [15:44:06] 6operations, 7Graphite, 5Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1330890 (10fgiunchedi) [15:44:11] 6operations, 7Graphite, 7HHVM, 7Monitoring: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#1738634 (10fgiunchedi) [15:44:56] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [[gerrit:247478]] Part I (duration: 00m 17s) [15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:34] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [[gerrit:247478]] Part II (duration: 00m 17s) [15:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:00] akosiaris, aronud? 
[15:46:09] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [[gerrit:247478]] Part III (duration: 00m 17s) [15:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:37] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [[gerrit:247478]] Part IV (duration: 00m 17s) [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:04] ottomata: thanks for fixing that! so enforce-users-groups-cleanup ran recently and that's why there was no obvious change causing that [15:47:06] ^ ebernhardson well, not quite. check please. [15:47:07] ? [15:47:13] hm, mutante i think this may be a problem [15:47:13] my fix [15:47:22] it looks like fr_archive is NFS moutned from NetApp [15:47:24] but you fixed the puppet run [15:47:24] i see [15:47:27] owned by 30001 [15:47:35] so, uids need to match up for NFS, right? [15:47:47] eh, i suppose so [15:47:48] thcipriani: looks like the only warning was wmgCirrusSearchWriteClusters, which means write everywhere but the jobs will gracefully fail since the clusters dont have the right schemas yet [15:47:59] !log bump netdev_max_backlog to 10000 on graphite1001, T101141 [15:48:00] i'm not sure how that uid is set on that dir [15:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:07] don't know much about NetApp [15:48:09] thcipriani: looks like its still climbing? 
I've had to touch and re-sync InitialiseSettings.php before [15:48:14] i can't change it on erbium [15:48:16] i dont think i puppetized something that was on netapp before [15:48:35] no, which is why i didn't think it would be aproblem [15:48:38] i think its not in puppet maybe [15:48:42] i didn't see any other references to this [15:48:50] ebernhardson: fatalmonitor on fluorine is climbing, but logstash looks ok. [15:48:53] i'm pretty sure i just broke some FR things, buuuut, it may have already been broken? [15:48:57] i just wondering why it happened at the time it happened and was ok before [15:49:01] yeah, dunno [15:49:09] (03CR) 10Jcrespo: [C: 032] Add pt-heartbeat start & execution script to mariadb [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244651 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [15:49:12] thcipriani: could you touch and re-sync? wmgCirrusSearchWriteClusters is 100% defined in InitialiseSettings.php [15:49:15] yup [15:49:19] paravoid: got any tips for us, Jeff_Green isn't here. need to change a uid on a netapp mounted dir [15:49:19] that was that enforce-users-groups-cleanup ran, but then why did it not run before [15:49:41] not sure who else to ask about netapp stuff [15:49:47] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [[gerrit:247478]] Part V (duration: 00m 18s) [15:49:49] ^ ebernhardson done [15:49:52] chown should work no? [15:50:03] looks to have stopped climbing [15:50:29] err, no there it goes :S [15:50:39] why no data here? https://gdash.wikimedia.org/dashboards/jobq/ :( [15:50:42] [root@erbium:/a/log/fundraising/logs] # chown file_mover:file_mover /a/log/fundraising/logs/fr_archive [15:50:42] chown: changing ownership of `/a/log/fundraising/logs/fr_archive': Operation not permitted [15:50:50] maybe if i umount and mount? [15:50:51] hm. [15:50:54] dunno. 
[15:51:05] i wonder if the climbing now is just old reports coming in, i think it is [15:51:12] well it's probably root-squashed [15:51:14] ebernhardson: there's no more index missing exceptions in logs [15:51:35] yea everything looks sane in terms of logs coming into logstash [15:51:58] is that something i can change on the netapp somehow? [15:52:07] thcipriani: looks good, thanks! [15:52:16] probably is, I don't remember... [15:52:18] ebernhardson: thanks for watching it—appreciated! [15:52:20] akosiaris might [15:53:04] 6operations, 7Database, 5Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1738659 (10jcrespo) [15:53:21] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:36] trying this [15:53:37] https://wikitech.wikimedia.org/wiki/NetApp [15:53:38] no luck [15:54:04] oh maybe not netappname [15:54:05] duh [15:54:35] in [15:55:15] (03Abandoned) 10Dzahn: planet: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247576 (owner: 10Dzahn) [15:56:05] ebernhardson: the thing "climbing" in your statements is the jobqueue, right? 
[15:56:24] greg-g: no, in fatalmonitor [15:56:28] 92 ; Netapps [15:56:28] 93 nas1001-a 1H IN A 10.64.16.4 [15:56:28] 94 nas1001-b 1H IN A 10.64.16.5 [15:56:31] ottomata: ^ [15:56:49] ebernhardson: ah [15:56:52] greg-g: due to the order of syncing files out, we had these which would have inserted a few hundred jobs (all to cirrusSearchElasticaWrite): 309266 Undefined variable: wmgCirrusSearchWriteClusters in /srv/mediawiki/wmf-config/CirrusSearch-common.php on line 27 [15:57:07] ebernhardson: the jobqueue was/is increasing, was curious if it was related to you :) [15:57:08] mutante: aye, i'm in -a now [15:57:13] not sure what to do though, am googling stuff [15:57:22] ebernhardson: ahh, you got pinged in -tech, you're looking :) [15:57:23] greg-g: a thousand jobs, probably, but not 1.6M :) [15:57:33] ottomata: yea,. ehm, never logged in myself [15:57:46] ebernhardson: cool, carry on :) [15:58:15] akosiaris: help! :) [15:58:23] (03PS3) 10Aude: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups (test wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246782 (https://phabricator.wikimedia.org/T115653) [15:58:25] (03PS1) 10Aude: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247593 (https://phabricator.wikimedia.org/T115653) [15:59:28] (03PS2) 10Dzahn: planet: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247219 (owner: 10Muehlenhoff) [15:59:45] mutante: i'm into the mgmt interface [15:59:54] not sure if there is an usual shell interface [16:00:04] _joe_ andrewbogott: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151020T1600). Please do the needful. [16:00:05] SMalyshev: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:05] aude: Dear anthropoid, the time has come. 
Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151020T1600). [16:00:06] mgmt interface does not match anything i am googling [16:00:47] ottomata: looking at DNS i see actually 3 names per device: [16:01:03] so, any questions about my puppet SWAT patches? [16:01:07] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1738690 (10mmodell) [16:01:12] ottomata: nas1001-a.eqiad.wmnet , e0M.nas1001-a.eqiad.wmnet, nas1001-a.mgmt.eqiad.wmnet [16:01:22] so [16:01:24] I'm seeing [16:01:27] 2015-10-20 15:48:59 mw1010 enwiki redis ERROR: Redis exception on server "10.64.0.201" {"redis_server":"10.64.0.201"} [16:01:35] [Exception RedisException] (:) read error on connection [16:01:53] (03PS1) 10Aude: Bump cache epoch on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247595 [16:01:59] <_joe_> SMalyshev: I am out for the week, I don't know why I was scheduled for puppetSWAT tbh, sorry [16:02:02] legoktm: a lot? [16:02:11] _joe_: me neither, but I can probably do it [16:02:18] _joe_: ok, np [16:02:32] aude: there's a bunch in redis.log [16:02:41] <_joe_> andrewbogott: thanks :) [16:02:46] andrewbogott: thanks! [16:02:49] legoktm@fluorine:/a/mw-log$ grep "Redis exception on server" -c redis.log [16:02:49] 411 [16:03:25] (03CR) 10Dzahn: [C: 032] planet: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247219 (owner: 10Muehlenhoff) [16:03:31] * aude is supposed to deploy, but if there is a problem then i should wait [16:03:40] the commons job queue is spiking [16:03:45] andrewbogott: Mine's pretty trivial too, I already have it testing in beta. [16:04:03] should wikidata and puppetswat deploy windows overlap like they do today? 
[16:04:05] (03PS2) 10Jcrespo: Enabling Async IO on newer kernels and, selectively, P_S [puppet] - 10https://gerrit.wikimedia.org/r/244713 [16:04:58] 6operations, 10netops, 10procurement: Zayo eqiad-codfw link implementation tracking - https://phabricator.wikimedia.org/T116028#1738703 (10RobH) 3NEW a:3RobH [16:05:32] SMalyshev: I’d like you to resolve your -1s (at least Daniel’s) on https://gerrit.wikimedia.org/r/#/c/240888/ before adding it to a swat window. [16:06:02] enwp is also increasing... [16:06:31] andrewbogott: I'm not sure what it is about, unfortunately. [16:07:22] (03PS2) 10Jcrespo: Add the posibility of enabling the performance_schema engine [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244710 [16:07:39] legoktm: perhaps you can grab a `monitor` log from the redis server for a few seconds and then grep that to figure out what it is? [16:07:56] SMalyshev: so ask him :) He’s ‘mutante’ on irc. [16:07:59] it will be spammy, but the info will be there somewhere [16:08:13] (03CR) 10Jcrespo: [C: 032] Add the posibility of enabling the performance_schema engine [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244710 (owner: 10Jcrespo) [16:08:20] (03PS4) 10Andrew Bogott: Log OOM rate and HHVM-non-OOM error rate in statds for graphing [puppet] - 10https://gerrit.wikimedia.org/r/246409 (owner: 10Chad) [16:09:00] mutante: could you comment on https://gerrit.wikimedia.org/r/#/c/240888/? [16:09:01] jobs should be removed (spikes of Received job sendData for unwritable cluster labsearch 1325s ) [16:09:46] 6operations, 10netops, 10procurement: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1738726 (10RobH) p:5High>3Low I've emailed our billing contact at eq to ensure that d/c of the tele2 cross-connect is all thats needed to cease billing. (It should be, but I've not processed a d/c via t... 
[16:09:58] (03PS4) 10Aude: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups (test wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246782 (https://phabricator.wikimedia.org/T115653) [16:10:21] (03PS3) 10Jcrespo: Enabling Async IO on newer kernels and, selectively, P_S [puppet] - 10https://gerrit.wikimedia.org/r/244713 [16:10:41] (03CR) 10Andrew Bogott: [C: 032] Log OOM rate and HHVM-non-OOM error rate in statds for graphing [puppet] - 10https://gerrit.wikimedia.org/r/246409 (owner: 10Chad) [16:10:46] 2015-10-20 16:10:31 mw1011 frwikisource JobQueueRedis INFO: Could not acknowledge refreshLinksPrioritized job 94d2a3cb95ca423883b3400f6b228ca1. [16:10:46] 2015-10-20 16:10:31 mw1014 frwikisource JobQueueRedis INFO: Could not acknowledge refreshLinksPrioritized job 5210af906aec49d6a153e8ec1eb19160. [16:10:47] 2015-10-20 16:10:31 mw1004 wikidatawiki JobQueueRedis INFO: Could not acknowledge refreshLinksPrioritized job f5aadc86deca4b2cad5329dbfc637a81. [16:10:48] etc. [16:10:54] so it's not able to mark jobs as done?? [16:12:07] andrewbogott: what about the second one? [16:12:11] (03PS4) 10Jcrespo: Enabling Async IO on newer kernels and, selectively, P_S [puppet] - 10https://gerrit.wikimedia.org/r/244713 [16:12:16] legoktm: these errors can be ignored I think unless we have a lot more. Last I checked I could see them in logs from July [16:13:11] SMalyshev: that one is weird enough that I don’t feel comfortable merging. I will review it though. [16:13:12] (03CR) 10Jcrespo: [C: 032] Enabling Async IO on newer kernels and, selectively, P_S [puppet] - 10https://gerrit.wikimedia.org/r/244713 (owner: 10Jcrespo) [16:13:18] hmm [16:13:25] (03CR) 10Dzahn: "i said -1 basically because of https://phabricator.wikimedia.org/T110070#1736940 and because there seems to be no consensus on ticket. 
but" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [16:13:27] andrewbogott: so who could merge it? [16:13:39] SMalyshev: i removed the -1 and abstain [16:13:52] dcausse: the frequency has definitely increased [16:14:31] JobQueueRedis.log-20151019.gz has 3961 [16:14:38] JobQueueRedis.log-20151020.gz has 2052440 [16:14:41] damn [16:14:46] SMalyshev: sorry, I don’t mean to be difficult — I’m assigned to swat today by mistake, my flight was delayed until 3AM last night and I’m also in another meeting right now. [16:14:46] so far today we're at 971771 [16:15:04] i dunno mutante not making progress [16:15:16] not sure what to do, i need to use uid 30001 [16:15:18] but puppet won't let me [16:15:23] chasemp: ideas? [16:15:33] andrewbogott: sure, np, I just wanted to know who I could talk to, since this one is sitting there for a long time already [16:15:33] (03CR) 10Nuria: "@BBlack Please let me know if this needs more changes. 
Talked to @ori about this and we agreed on pursuing these changes for the time com" [puppet] - 10https://gerrit.wikimedia.org/r/244626 (https://phabricator.wikimedia.org/T114370) (owner: 10Nuria) [16:15:59] ottomata: I've missed the meat of what your doing [16:16:00] andrewbogott: also looks like Daniel withdrew his objection to https://gerrit.wikimedia.org/r/#/c/240888/ [16:16:05] k [16:16:05] so [16:16:12] there is a legacy file_mover user [16:16:14] on erbium [16:16:23] that was manually set to have uid and gid 30001 [16:16:29] ottomata: what does it break currently if you nothing (since the puppet error is gone) [16:16:30] there is a NetApp mount [16:16:42] *you do nothing [16:16:43] mutante: nothing can access the fr_archive dir [16:16:49] which will cause fundraising banner things to break [16:16:52] 6operations, 10netops, 10procurement: Zayo eqiad-codfw link implementation tracking - https://phabricator.wikimedia.org/T116028#1738746 (10faidon) ``` faidon@re1.cr2-eqiad> show interfaces descriptions | match codfw xe-5/2/3 up down Core: << cr2-codfw:xe-5/0/1 HOLD FOR Zayo T116028 {#?} [10Gbps... [16:16:55] SMalyshev: it doesn’t look like anyone has expertise in that area besides you :) If you could get _joe_ or chasemp to +1 that would be enough for me. [16:17:25] so, chasemp yeah the NetApp mount somehow forces this to be owned by uid and gid 30001 [16:17:52] i don't know what mad ethis change [16:17:56] but mutante and i were fixing it in puppet [16:18:03] removing the file_mover uid => 30001 manual setting [16:18:13] chasemp: could you take a look on https://gerrit.wikimedia.org/r/#/c/243883/ ? 
[16:18:28] because the user/group puppet management was removing the user with uid 30001 [16:18:39] since it is a system user [16:18:42] i just saw icinga report the puppet breakage on erbium all of a sudden, then the error i pasted, but there was no change to logging.pp and nobody had logged in on that server [16:18:48] legoktm: i pulled a monitor log from redis, the jobs are almost entirely "Badtitle/EnqueueJob" [16:18:51] puppet change worked nicely- but I will wait half an hour to proclaim victory [16:19:00] chasemp: if I remove system => true, will it let me set uid and gid? [16:19:02] to 30001? [16:19:13] ebernhardson: those jobs just enqueue other jobs [16:19:21] and they should be very fast [16:19:32] it's intentionally "Badtitle" ? [16:19:34] SMalyshev: in a few? I'm currently inundated but offhand I'm not sure what protocol is for the -1 and overriding [16:19:38] * legoktm checks [16:19:39] but I also have no personal issues w/ it [16:19:46] ottomata: let me poke here for a second then respond [16:19:48] k [16:19:50] Title::makeTitle( NS_SPECIAL, 'Badtitle/' . __CLASS__ ), [16:19:51] yeah [16:19:54] chasemp: ok, when you have time [16:20:00] interesting, ok will look closer [16:20:14] can you tell what jobs it's queueing up? 
[16:20:14] chasemp: this is how it started: https://phabricator.wikimedia.org/T115943 [16:20:36] legoktm: yes, they look roughly like this: "a:8:{s:4:\"type\";s:7:\"enqueue\";s:9:\"namespace\";i:-1;s:5:\"title\";s:19:\"Badtitle/EnqueueJob\";s:6:\"params\";a:1:{s:10:\"jobsByWiki\";a:1:{s:11:\"commonswiki\";a:1:{i:0;a:4:{s:4:\"type\";s:12:\"refreshLinks\";s:6:\"params\";a:2:{s:15:\"isOpportunistic\";b:1;s:16:\"rootJobTimestamp\";s:14:\"20151020161801\";}s:4:\"opts\";a:1:{s:16:\"removeDuplicates\";b:1;}s:5:\"title\";a:2:{s:2:\" [16:20:43] also, I think it has something to do with the "Could not acknowledge refreshLinksPrioritized job" errors [16:20:54] 6operations, 7Graphite, 5Patch-For-Review: mediawiki should send statsd metrics in batches - https://phabricator.wikimedia.org/T116031#1738764 (10fgiunchedi) 3NEW a:3fgiunchedi [16:20:55] ok, those are refreshLinks jobs [16:21:49] !log re-populated sites table on metawiki, mediawikiwiki and specieswiki with https protocol links [16:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:00] legoktm: this is internal to the jobqueue, not sure why it fails :/ [16:22:57] ottomata: the way I read the original ticket is that the error was puppet trying to create the user w/ that group but the group wasn't created first [16:23:04] legoktm: Now that springle is gone, who is good for getting db-related code review? [16:23:13] in a few places where we do a system => true we manually create the group above it in puppet [16:23:16] 6operations, 7Graphite: diamond should send statsd metrics in batches - https://phabricator.wikimedia.org/T116033#1738795 (10fgiunchedi) 3NEW a:3fgiunchedi [16:23:24] andrewbogott: My stats have started trickling in. Thank you! 
[16:23:25] (03PS1) 10Legoktm: Temporarily increase redis logging to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247603 [16:23:27] 6operations, 7Graphite: mediawiki should send statsd metrics in batches - https://phabricator.wikimedia.org/T116031#1738807 (10fgiunchedi) [16:23:30] (03PS1) 10John F. Lewis: mailman: increase out queue to 300 check [puppet] - 10https://gerrit.wikimedia.org/r/247604 (https://phabricator.wikimedia.org/T114861) [16:23:34] (03PS1) 10Faidon Liambotis: Assign IPs for cr2-codfw<->cr2-eqiad link [dns] - 10https://gerrit.wikimedia.org/r/247605 (https://phabricator.wikimedia.org/T116028) [16:23:37] kaldari: on the ops side or MW? [16:23:40] i thought puppet auto required that stuff though [16:23:42] MW [16:23:48] will try reverting and adding require [16:23:59] (03CR) 10Legoktm: [C: 032] Temporarily increase redis logging to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247603 (owner: 10Legoktm) [16:24:00] mutante: https://gerrit.wikimedia.org/r/#/c/247604/ merge please :) [16:24:06] (03Merged) 10jenkins-bot: Temporarily increase redis logging to debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247603 (owner: 10Legoktm) [16:24:13] (03CR) 10Faidon Liambotis: [C: 032] Assign IPs for cr2-codfw<->cr2-eqiad link [dns] - 10https://gerrit.wikimedia.org/r/247605 (https://phabricator.wikimedia.org/T116028) (owner: 10Faidon Liambotis) [16:24:37] 6operations, 10netops, 10procurement: Zayo eqiad-codfw link implementation tracking - https://phabricator.wikimedia.org/T116028#1738825 (10faidon) [16:24:54] legoktm: the other job i'm seeing alot of is wikibase-addUsagesForPage [16:24:58] (inside enqueue job) [16:25:01] (03PS1) 10Ottomata: Revert previous changes to make sure file_mover has uid and gid 30001 [puppet] - 10https://gerrit.wikimedia.org/r/247606 (https://phabricator.wikimedia.org/T115943) [16:25:03] ottomata: sort of, in the case of a specified group that is not the same as the user name (so the gid) it won't 
autocreate just because it's referenced by the user stanza [16:25:10] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Temporarily increase redis logging to debug (duration: 00m 17s) [16:25:16] now the range of the id is another thing [16:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:20] ebernhardson: i can't imagine anything out of ordinary for wikibase jobs [16:25:28] but looking [16:25:39] (03PS2) 10Ottomata: Revert previous changes to make sure file_mover has uid and gid 30001 [puppet] - 10https://gerrit.wikimedia.org/r/247606 (https://phabricator.wikimedia.org/T115943) [16:25:41] (03CR) 10jenkins-bot: [V: 04-1] Revert previous changes to make sure file_mover has uid and gid 30001 [puppet] - 10https://gerrit.wikimedia.org/r/247606 (https://phabricator.wikimedia.org/T115943) (owner: 10Ottomata) [16:25:55] (03PS1) 10Legoktm: Revert "Temporarily increase redis logging to debug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247607 [16:26:01] (03CR) 10Dzahn: [C: 032] mailman: increase out queue to 300 check [puppet] - 10https://gerrit.wikimedia.org/r/247604 (https://phabricator.wikimedia.org/T114861) (owner: 10John F. Lewis) [16:26:03] (03CR) 10Legoktm: [C: 032] Revert "Temporarily increase redis logging to debug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247607 (owner: 10Legoktm) [16:26:10] (03Merged) 10jenkins-bot: Revert "Temporarily increase redis logging to debug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247607 (owner: 10Legoktm) [16:26:15] (03PS3) 10Ottomata: Revert previous changes to make sure file_mover has uid and gid 30001 [puppet] - 10https://gerrit.wikimedia.org/r/247606 (https://phabricator.wikimedia.org/T115943) [16:26:16] could this be related to higher mysql load (not related to higher http requests)? 
[16:26:27] JohnFLewis: :) "data driven approach" [16:26:28] over a 10s sample i see 746 wikibase-addUSagesForPage, 424 refreshLinks, and 2 flaggedrevs_CacheUpdate [16:26:30] it is like 50% higher than usual [16:26:54] uhh [16:26:55] but not spiky, consistently high [16:26:56] via the rather unscientific: cat redis.monitor | grep '"hSet"' | grep ":jobqueue:enqueue:h-data" | cut -d ' ' -f 7- | cut -d '"' -f 21 | sort | uniq -c | sort -rn [16:27:01] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Revert Temporarily increase redis logging to debug (duration: 00m 18s) [16:27:03] (03CR) 10Ottomata: [C: 032] Revert previous changes to make sure file_mover has uid and gid 30001 [puppet] - 10https://gerrit.wikimedia.org/r/247606 (https://phabricator.wikimedia.org/T115943) (owner: 10Ottomata) [16:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:08] if the job runners are working harder, that might increase mysql load [16:27:08] ugh [16:27:32] (03PS1) 10Milimetric: Alert about the status of pageview and projectview [puppet] - 10https://gerrit.wikimedia.org/r/247608 [16:27:38] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1738843 (10mmodell) icinga has paged me, and opsen, on multiple occasions when phabricator was down. I'm pretty sure that it's working. [16:27:39] sadly I do not have yet production machines with fine-grain monitoring to be helpful [16:27:40] (03PS2) 10Dzahn: neon: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247213 (owner: 10Muehlenhoff) [16:27:44] ottomata: I guess technically in that case you are conflicting system => true w/ the UID etc. I guess it probably only matters on creation in which case the most specific thing wins? and then we have to worry about our own cleanup logic [16:27:47] do we have graphs of number of jobs by type? 
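The shell pipeline above tallies which job types are being wrapped inside `enqueue` jobs by cutting fixed fields out of a redis `monitor` capture. A minimal Python sketch of the same tally follows. It assumes each captured line contains the `:jobqueue:enqueue:h-data` key and a PHP-serialized payload with backslash-escaped quotes, as in the sample pasted earlier in the channel. The name `tally_wrapped_jobs` and the regex are illustrative only and are not part of MediaWiki or redis.

```python
import re
from collections import Counter

# Matches a PHP-serialized `s:4:"type";s:N:"<name>"` pair, with or without the
# backslash-escaped quotes that appear in the pasted monitor output.
TYPE_RE = re.compile(r's:4:\\?"type\\?";s:\d+:\\?"([^"\\]+)\\?"')

# Assumption: writes to the enqueue queue can be recognised by this key suffix.
ENQUEUE_KEY = ":jobqueue:enqueue:h-data"


def tally_wrapped_jobs(lines):
    """Count the job types wrapped inside enqueue jobs in a monitor capture."""
    counts = Counter()
    for line in lines:
        if ENQUEUE_KEY not in line:
            continue  # not a write to the enqueue queue
        # The outer wrapper reports type "enqueue"; everything else is a wrapped job.
        counts.update(t for t in TYPE_RE.findall(line) if t != "enqueue")
    return counts
```

Passing the lines of `redis.monitor` to this function and calling `.most_common()` on the result should give a listing comparable to `sort | uniq -c | sort -rn`, without depending on exact field positions.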
[16:28:15] * aude looks on graphite [16:28:22] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1738853 (10hashar) 5Open>3Resolved a:3hashar Based on our experience we have good enough monitoring for either Gerrit or Phabricator. The critical bits are monitored via Icinga... [16:28:28] so assuming that works we could either change course and set a system user range uid/gid or we could let 30001 pass and put an allownce in enforce-users-groups [16:28:29] I guess [16:28:29] I see some large SELECTS from LinkHolderArray::replaceInternal [16:28:57] mutante: maybe data driven! [16:29:02] but those are not from job queues [16:29:16] * JohnFLewis unsilencing icinga check and closes ticket because bad luck if it occurs again [16:29:19] (03CR) 10Dzahn: [C: 04-1] "it says it's for neon but actually changes netmon" [puppet] - 10https://gerrit.wikimedia.org/r/247213 (owner: 10Muehlenhoff) [16:29:20] rats no chasemp: [16:29:21] Notice: /Stage[main]/Admin/Exec[enforce-users-groups-cleanup]/returns: /usr/local/sbin/enforce-users-groups removing user/id: file_mover/30001 [16:29:26] puppet is fine though [16:29:30] ok we can fix that [16:29:38] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1738858 (10JohnLewis) 5Open>3Resolved Above commit will resolve this. Unsilenced icinga check. [16:29:51] let's make the existing work as PoC and then we can make an issue to fix it in a better way? [16:30:04] ok [16:30:16] JohnFLewis: well, data-driven but not very long :p agreed though, yes please [16:30:39] took a second 10s sample from redis `monitor`, same distribution of things being added to the `enqueue` job type. 
878 wikibase-addUsagesForPage, 454 refreshLinks, 2 flagedrevs_CacheUpdate [16:30:51] 6operations, 10Wikimedia-Mailing-lists: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1738867 (10Dzahn) [16:30:53] mutante: I'll keep the cron alive until let's say Friday and then we can kill it. as we'll have 5 days worth of data over an hourly period [16:31:37] hmm [16:32:56] uhh [16:32:57] 2015-10-20 16:28:17 terbium wikidatawiki JobQueueFederated INFO: Redis server error: protocol error, got '' as reply-type byte [16:33:08] is that someone doing something manually on terbium? [16:33:16] (03PS1) 10Rush: admin: allowance for file_mover UID [puppet] - 10https://gerrit.wikimedia.org/r/247609 [16:33:37] ottomata: that will stop puppet and cleanup from fighting^ [16:33:43] no, it's the dispatchChanges.php cronjob [16:33:58] ahh cool [16:33:59] ok [16:34:00] JohnFLewis: ok, sounds good [16:34:08] (03CR) 10Rush: [C: 032] admin: allowance for file_mover UID [puppet] - 10https://gerrit.wikimedia.org/r/247609 (owner: 10Rush) [16:34:24] urnning puppet [16:34:37] kaldari: is this about the Gadgets patch? [16:34:37] 6operations, 6Analytics-Backlog, 5Patch-For-Review: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1738888 (10chasemp) [16:35:19] 6operations, 10netops, 10procurement: implement new zayo wave connection for ulsfo-codfw - https://phabricator.wikimedia.org/T116036#1738889 (10RobH) 3NEW a:3RobH [16:36:05] IF this is related to the db issues (which we do not know yet), I can give you a starting time: 12:23 PM UTC [16:36:11] ottomata: arg syntax error give me a sec [16:36:40] andrewbogott: so can we move forward with https://gerrit.wikimedia.org/r/#/c/240888/? 
[16:36:42] ok, chasemp file_mover exists now, so things look ok there [16:36:47] i need food before next meeting [16:36:51] am around but downstairs [16:38:46] sure, fixing syntax and then let me know later if all seems well [16:38:55] (03PS1) 10Rush: admin: file_move exception remove ()'s and add ticket [puppet] - 10https://gerrit.wikimedia.org/r/247610 [16:38:59] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:00] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:11] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:20] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:29] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:30] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:33] ^this is probably me I'm fixing now [16:39:40] (03CR) 10Rush: [C: 032] admin: file_move exception remove ()'s and add ticket [puppet] - 10https://gerrit.wikimedia.org/r/247610 (owner: 10Rush) [16:39:40] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:41] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:41] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:50] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:50] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:51] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:51] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:58] (03PS2) 10Dzahn: krypton: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247203 (owner: 10Muehlenhoff) [16:39:59] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Puppet has 1 failures 
[16:39:59] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet has 1 failures [16:40:01] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has 1 failures [16:40:04] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:40:14] i stopped the bot for a moment [16:40:15] ^enforce-users-groups-cleanup failed [16:40:29] is this something you are working on? [16:40:40] should be fixed now, confirming [16:41:11] ok, no problem, just to check that [16:41:47] mutante: well we codified status quo there for file_mover but the why of the sudden appearance I do not understand [16:42:06] * aude looking at # of jobs (by type) on graphite [16:42:34] (03PS2) 10Dzahn: silver: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247221 (owner: 10Muehlenhoff) [16:43:39] don't see a spike or anything for addUsages (compared with the last month or day(s)) [16:43:47] chasemp: that's the odd part, yes. it started out of nowhere [16:44:22] did enforce-users-groups-cleanup run but not run before? [16:44:55] aude, do you have a link? [16:45:15] jynus: https://phabricator.wikimedia.org/F2743935 [16:45:53] addusages would happen on edit (from wikidata, or i think wikipedias also) [16:46:16] SMalyshev: Is political rather than technical :( I’ll read through the phab task after this meeting and try to form an opinion. 
[16:46:41] i see a spike in refreshlinks but perhaps not so much out of ordinary [16:47:14] (03Abandoned) 10Dzahn: silver/wikitech: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247577 (owner: 10Dzahn) [16:47:27] probably normal [16:47:52] (03CR) 10Andrew Bogott: [C: 031] silver: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247221 (owner: 10Muehlenhoff) [16:47:54] legoktm: Yes [16:48:29] let me profile for a second one host on enwiki , that may help us a bit [16:48:38] ok [16:48:45] legoktm: mainly wondering how expensive it will be to run something like "select up_property, COUNT(*) from user_properties where up_property LIKE 'gadget-%' GROUP BY up_property;" against en.wiki [16:49:34] andrewbogott: :) i'll take that but waiting until the global puppet issue is cleared up [16:49:48] kaldari: i see a key on up_property [16:49:55] but i'm not expert at this [16:50:17] !log enabling query profiling on a sample of queries on db1072 [16:50:23] yeah, it does look like it's indexed well, but I'm paranoid of running anything against user_properties :) [16:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:31] heh [16:50:36] !log temp. disabled ircecho / neon puppet [16:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:51] Anyone around who could pick me up for the office at maybe 10:15? [16:51:22] kaldari: is this something you can try on analytics slave or something? [16:51:29] legoktm ^ [16:51:42] chasemp: numbers in icinga going down again now [16:51:46] hoo: I'm at home today :( [16:51:58] :( [16:52:02] * legoktm asks in -staff [16:52:46] one thing I can tell you- slow queries is not the issue, 50% more SELECTS is [16:52:59] hoo: Just tell the front-desk; they'll take you up to reception. [16:53:07] hoo: dangit, I will miss you this week, when do you leave? 
[16:53:13] (03CR) 10Dzahn: [C: 032] silver: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247221 (owner: 10Muehlenhoff) [16:53:17] no apreciable extra writes, but those are in general low [16:53:38] greg-g: On Saturday already [16:53:43] :( [16:54:01] I'm 50/50 on coming in on THursday, we'll see [16:54:09] legoktm, jynus: also I have no idea what the threshold is for marking a special page as isExpensive. [16:54:46] hoo: same here :) if you are there til Friday [16:54:50] who's the right person to get that sort of assessment from? [16:55:10] kaldari: I'll take a look in a bit, Aaron is also a good person to have review it. [16:55:29] James_F: nice [16:55:31] yeah, I added Aaron [16:55:40] thanks! [16:55:41] kaldari: if it takes longer than a few seconds to generate, it's probably expensive. But it can be turned into a cached page then [16:56:45] andrewbogott: silver change already done. no-op [16:56:53] great [16:57:45] (03PS3) 10Dzahn: krypton: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247203 (owner: 10Muehlenhoff) [16:58:29] this is what I see: https://phab.wmfusercontent.org/file/data/hm57f7cgxu2lshzbbot2/PHID-FILE-vx4h5h24ucoeqsfyimlx/gwnbatpibmdpl75h/Screenshot_from_2015-10-20_18%3A53%3A53.png [16:58:36] legoktm: I suspect it will definitely take more than a few seconds. Plus, I think it would be fine to mark this page as isExpensive since it doesn't really need to be regenerated often. 
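The per-gadget preference count that kaldari asks about above can be tried on toy data first. This is a minimal sketch, assuming an in-memory SQLite table as a stand-in for the production MariaDB `user_properties` table. Only the query text comes from the discussion; the column layout (`up_user`, `up_property`, `up_value`), the index name, the sample rows and the helper names `make_sample_db` and `gadget_counts` are hypothetical. It says nothing about how expensive the query would be against en.wiki.

```python
import sqlite3


def make_sample_db():
    """Build a tiny in-memory stand-in for user_properties (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_properties "
        "(up_user INTEGER, up_property TEXT, up_value TEXT)"
    )
    # The discussion mentions a key on up_property; this index name is made up.
    conn.execute("CREATE INDEX up_property_idx ON user_properties (up_property)")
    conn.executemany(
        "INSERT INTO user_properties VALUES (?, ?, ?)",
        [
            (1, "gadget-HotCat", "1"),
            (2, "gadget-HotCat", "1"),
            (3, "gadget-Navpop", "1"),
            (1, "skin", "vector"),  # non-gadget row, must be filtered out
        ],
    )
    return conn


def gadget_counts(conn):
    """Run the GROUP BY query from the discussion and return {property: count}."""
    rows = conn.execute(
        "SELECT up_property, COUNT(*) FROM user_properties "
        "WHERE up_property LIKE 'gadget-%' "
        "GROUP BY up_property"
    ).fetchall()
    return dict(rows)
```

Calling `gadget_counts(make_sample_db())` returns one entry per gadget preference. If the real query turns out to be slow, the thread's suggestion is to mark the special page as expensive and serve a cached result.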
[17:00:01] top queries are: Revision::fetchFromConds and LinkCache::addLinkObj [17:00:45] which are not precisely unexpected [17:01:09] (03CR) 10Dzahn: [C: 032] krypton: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247203 (owner: 10Muehlenhoff) [17:01:19] compared to normal load: https://wikitech.wikimedia.org/wiki/MariaDB/query_performance/coreproduction [17:04:28] jynus: so that's Revision::fetchRevision [17:05:07] (03PS1) 10Chad: Logstash: track apache2 syslog error rate in statsd [puppet] - 10https://gerrit.wikimedia.org/r/247613 (https://phabricator.wikimedia.org/T81030) [17:05:18] or other things [17:05:43] Revision::newFromTitle, newFromPageId, etc [17:06:08] 6operations, 10netops: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1739021 (10Cmjohnson) fe-0/0/5 up up Transit: (03PS2) 10Dzahn: tendril: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247226 (owner: 10Muehlenhoff) [17:06:27] there is also ResourceLoaderWikiModule::getTitleInfo, which is higher than normal (Gadget-switcher.js) [17:06:49] (03CR) 10Dzahn: [C: 032] tendril: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247226 (owner: 10Muehlenhoff) [17:07:17] but that is only a 5% of the queries [17:14:23] (03Abandoned) 10Dzahn: nitrogen/ipv6relay: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247579 (owner: 10Dzahn) [17:14:49] (03PS2) 10Dzahn: nitrogen: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247214 (owner: 10Muehlenhoff) [17:15:08] restarting icinga bot , it's all recovered [17:16:09] this is a summary: https://wikitech.wikimedia.org/wiki/MariaDB/query_performance/coreproduction-20151020 [17:16:17] (03CR) 10Dzahn: [C: 032] nitrogen: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247214 (owner: 10Muehlenhoff) [17:18:07] (03PS2) 10Dzahn: neon/icinga: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247580 [17:18:44] (03CR) 10Dzahn: 
[C: 032] neon/icinga: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247580 (owner: 10Dzahn) [17:19:55] (03CR) 10Dzahn: "icinga done in https://gerrit.wikimedia.org/r/#/c/247580/" [puppet] - 10https://gerrit.wikimedia.org/r/247213 (owner: 10Muehlenhoff) [17:20:12] (03PS3) 10Dzahn: netmon1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247213 (owner: 10Muehlenhoff) [17:20:29] (03PS4) 10Dzahn: netmon1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247213 (owner: 10Muehlenhoff) [17:25:13] Krenair: You online? [17:25:23] yes [17:25:27] you looking at mw.o rc? [17:25:50] Krenair: yes, but i need to go soon. some sort of bot/troll is crating a big mess https://www.mediawiki.org/wiki/Special:RecentChanges [17:26:10] and i don't know how abusefilter works with flow [17:26:27] I saw AbuseFilter block many of those flow spambot acounts, but not all. [17:26:33] you have to set up a separate abusefilter with the group=flow option I think [17:26:58] filter 41 blocked the bots [17:27:03] Nemo_bis: There is a human behinde this. He also adds comments. [17:27:17] (03CR) 10Dzahn: [C: 032] netmon1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247213 (owner: 10Muehlenhoff) [17:27:24] Deleting topics is pretty useless because of https://phabricator.wikimedia.org/T60725#1737375 [17:28:02] Steinsplitter: you think so, really? :/ I did not check in depth. [17:28:11] yes [17:28:25] he added a comment on a page with instructions how to create accounts. etc. [17:28:42] he also circumvited the abusefilter when he was not hidden. l [17:35:32] (03CR) 10Andrew Bogott: [C: 031] "MZ, applying this patch to Beta seems like an appropriate answer to your concerns. 
It will give the discovery folks an opportunity to dev" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [17:38:19] (03CR) 10Chad: [C: 031] "Releng is fine with letting this be tested out on beta as long as it's temporary :)" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [17:39:03] (03PS2) 10Dzahn: iron: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247202 (owner: 10Muehlenhoff) [17:39:13] (03PS5) 10Andrew Bogott: Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [17:40:20] (03CR) 10Dzahn: [C: 032] iron: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247202 (owner: 10Muehlenhoff) [17:41:16] (03PS1) 10Jcrespo: Enabling performance schema experimentally on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/247615 (https://phabricator.wikimedia.org/T99485) [17:41:38] (03PS6) 10Andrew Bogott: Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [17:42:01] (03PS2) 10Dzahn: carbon: Move to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246966 (owner: 10Muehlenhoff) [17:42:46] (03CR) 10Dzahn: [C: 032] carbon: Move to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246966 (owner: 10Muehlenhoff) [17:42:54] (03CR) 10Andrew Bogott: [C: 032] Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [17:43:22] (03PS7) 10Andrew Bogott: Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [17:43:28] andrewbogott: thanks! 
[17:44:34] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1739238 (10jcrespo) [17:44:39] (03PS2) 10Dzahn: californium: Move to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246965 (owner: 10Muehlenhoff) [17:47:57] (03CR) 10Dzahn: [C: 032] californium: Move to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246965 (owner: 10Muehlenhoff) [17:48:43] 6operations, 10netops, 10procurement: audit juniper hardware locations for support coverage - https://phabricator.wikimedia.org/T116051#1739262 (10RobH) 3NEW a:3RobH [17:49:34] (03CR) 10Dzahn: "do you want me to check all these are cisco? vs. racktables?" [dns] - 10https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: 10Papaul) [17:49:35] 6operations, 10netops, 10procurement: audit juniper hardware locations for support coverage - https://phabricator.wikimedia.org/T116051#1739274 (10RobH) [17:50:54] (03PS2) 10Dzahn: icinga: move ssl cert install to role [puppet] - 10https://gerrit.wikimedia.org/r/247570 [17:52:04] (03PS3) 10Dzahn: icinga: move ssl cert install to role [puppet] - 10https://gerrit.wikimedia.org/r/247570 [17:52:20] andrewbogott: I've got a 3rd stat I'd like to track too, if you've got a second. [17:52:30] (logstash/statsd thing, like earlier) [17:52:44] (03CR) 10Dzahn: [C: 032] icinga: move ssl cert install to role [puppet] - 10https://gerrit.wikimedia.org/r/247570 (owner: 10Dzahn) [17:53:14] 6operations, 10netops, 10procurement: implement new zayo wave connection for ulsfo-codfw - https://phabricator.wikimedia.org/T116036#1739300 (10faidon) ``` faidon@cr2-ulsfo> show interfaces descriptions xe-1/3/0 Interface Admin Link Description xe-1/3/0 up down << cr1-codfw:xe-5/0/2 HOLD FOR... [17:53:30] robh: ^ [17:53:45] ostriches: sure, have a patch? 
[17:53:58] paravoid: cool, thanks =] [17:54:15] andrewbogott: https://gerrit.wikimedia.org/r/#/c/247613/ [17:56:36] (03PS2) 10Andrew Bogott: Logstash: track apache2 syslog error rate in statsd [puppet] - 10https://gerrit.wikimedia.org/r/247613 (https://phabricator.wikimedia.org/T81030) (owner: 10Chad) [17:56:48] only curious, do those match in fall through for multiple or match then pass for other guard conditions ostriches? [17:57:09] 6operations, 10netops, 10procurement: Zayo eqiad-codfw link implementation tracking - https://phabricator.wikimedia.org/T116028#1739335 (10RobH) Note that while I'll put in individual RT tickets for invoice tracking, their order has been pre-approved by @faidon via IRC conversation. (So there is no need to... [17:59:19] (03CR) 10Andrew Bogott: [C: 032] Logstash: track apache2 syslog error rate in statsd [puppet] - 10https://gerrit.wikimedia.org/r/247613 (https://phabricator.wikimedia.org/T81030) (owner: 10Chad) [17:59:26] chasemp: good question. I imagine match any. [18:00:09] (03PS3) 10Dzahn: icinga: add cert expiry check for icinga itself [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) [18:02:21] (03CR) 10Dzahn: [C: 032] "@Filippo moved to role" [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [18:02:26] (03PS4) 10Dzahn: icinga: add cert expiry check for icinga itself [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) [18:08:04] (03PS2) 10Dzahn: hooft: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247199 (owner: 10Muehlenhoff) [18:11:34] (03CR) 10Dzahn: [C: 032] hooft: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247199 (owner: 10Muehlenhoff) [18:13:25] (03CR) 10MarcoAurelio: "Thank you Alex." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/247253 (https://phabricator.wikimedia.org/T115841) (owner: 10MarcoAurelio) [18:17:04] (03CR) 10Papaul: "yes that will be great if you can check that all those are Cisco. Thanks." [dns] - 10https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: 10Papaul) [18:19:18] (03PS2) 10Dzahn: install2001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247200 (owner: 10Muehlenhoff) [18:21:03] (03CR) 10Dzahn: [C: 032] install2001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247200 (owner: 10Muehlenhoff) [18:21:51] akosiaris, any updates on https://gerrit.wikimedia.org/r/#/c/244436/ ? [18:22:10] (03PS5) 10Yurik: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [18:24:59] Is there a ticket for the (ongoing?) redis/ job queue troubles? [18:25:24] yes [18:25:33] !log re-enabling icinga notifications for icinga (neon) itself that were disabled for some reason though all OK [18:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:27:13] (03PS3) 10John F. 
Lewis: mw_rc_irc: rename module to standard naming [puppet] - 10https://gerrit.wikimedia.org/r/244699 [18:27:22] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint: Deploy TileratorUI service - https://phabricator.wikimedia.org/T116062#1739490 (10Yurik) 3NEW a:3akosiaris [18:27:36] (03PS6) 10Yurik: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [18:28:47] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: Deploy TileratorUI service - https://phabricator.wikimedia.org/T116062#1739505 (10Yurik) [18:29:13] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint, 5Patch-For-Review: Deploy TileratorUI service - https://phabricator.wikimedia.org/T116062#1739490 (10Yurik) [18:29:18] hoo, https://phabricator.wikimedia.org/T116001 [18:29:49] thanks [18:34:04] 6operations: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#1739522 (10RobH) 3NEW a:3RobH [18:34:53] (03PS1) 10Cmjohnson: Adding dns entries for labvirt1010 -11 [dns] - 10https://gerrit.wikimedia.org/r/247628 [18:37:48] 6operations: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#1739581 (10RobH) [18:38:21] 6operations, 5Patch-For-Review: provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#1739592 (10RobH) [18:38:23] 6operations: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#1739522 (10RobH) [18:38:57] 6operations: alternatives to racktables ? 
- https://phabricator.wikimedia.org/T84001#1739598 (10RobH) [18:38:58] 6operations: Migrate racktables to servermon - https://phabricator.wikimedia.org/T88424#1739597 (10RobH) [18:39:00] 6operations: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#1739522 (10RobH) [18:39:19] 6operations: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#1739522 (10RobH) [18:39:37] yea thats not annoying ;] [18:40:33] robh: all you're doing is automating our hatred to your tasks :) [18:41:05] yea but my manger loves my OCD adherence to making tasks [18:41:11] so im cool with no one else liking it ;] [18:41:48] everyone should instead focus their hatred on a bot that echos a busy phab tag [18:43:05] 6operations, 6Phabricator, 6Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1739612 (10RobH) [18:43:08] 6operations, 6Phabricator: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1739611 (10RobH) [18:44:37] (03CR) 10Rush: [C: 04-1] "I really think we should consider whether this is sane behavior. What are the reasons we can't ask clients to make valid requests?" 
[puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [18:47:29] (03CR) 10Smalyshev: "We can't ask clients because we have no idea who makes these SPARQL tools and how to contact them, and they probably would not modify thei" [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [18:52:10] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1739669 (10awight) Confirmed that the campaign is intact. All the pipeline does is store URLs in a file, the ban... [18:52:55] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1739673 (10awight) Also, for the record we are now talking about beaconImpressions files, not bannerImpressions.... [18:53:17] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for labvirt1010 -11 [dns] - 10https://gerrit.wikimedia.org/r/247628 (owner: 10Cmjohnson) [19:15:25] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: gdash reports for php/apache errors - https://phabricator.wikimedia.org/T81030#1739822 (10demon) 5Open>3Resolved a:3demon Apache syslog error rate, MW debug log error rates, HHVM error rates and OOMs all tracked via [[ https://grafana-admin.wikimedi... [19:17:15] papaul: are all the cisco servers shut down? [19:17:32] no [19:17:38] there are stay up [19:17:47] doing the wipe [19:17:55] but you dont need mgmt to do that? 
[19:18:00] no [19:18:03] i don't [19:18:05] alright,ok [19:18:08] reviewing now [19:18:14] thanks [19:18:17] (03CR) 10BBlack: [C: 04-1] "See ticket" [puppet] - 10https://gerrit.wikimedia.org/r/243883 (https://phabricator.wikimedia.org/T112151) (owner: 10Smalyshev) [19:18:21] i already checked like half, they are all matching so far [19:18:56] mutante:cool [19:19:30] mutante: I have in total 18 of those [19:24:22] (03CR) 10Dzahn: [C: 04-1] "checked all against racktables data, matches except that WMF 5700 and 5701 are removed in one file but not the other. all others look good" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: 10Papaul) [19:24:52] papaul: all good except the comment above ^ [19:26:34] mutante Thanks [19:28:50] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1739909 (10Dzahn) and let's also have meta monitoring. icinga itself should have a working cert :) added: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type... [19:30:12] ori: How does one go about setting a dashboard to being featured? [19:30:18] 6operations, 6Release-Engineering-Team: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1739925 (10thcipriani) So, currently, it doesn't matter if the `trebuchet` user is in the `wikidev` group, this has only been the case sin... [19:30:21] ostriches: add the 'home' tag [19:31:06] Sweetness, thx! [19:32:03] ori: I made graphs so we know when people push buggy code! 
:P [19:33:03] (03PS1) 10Ori.livneh: grafana: use "featured" tag to feature dashboards, rather than "home" [puppet] - 10https://gerrit.wikimedia.org/r/247642 [19:33:20] ostriches: ^ [19:33:34] i could probably mass-rename `home` to `featured` in sqlite [19:33:42] makes sense [19:34:04] (03PS2) 10Ori.livneh: grafana: use "featured" tag to feature dashboards, rather than "home" [puppet] - 10https://gerrit.wikimedia.org/r/247642 [19:34:11] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana: use "featured" tag to feature dashboards, rather than "home" [puppet] - 10https://gerrit.wikimedia.org/r/247642 (owner: 10Ori.livneh) [19:39:53] (03CR) 10Dzahn: [C: 031] "the changes are only the resource names:" [puppet] - 10https://gerrit.wikimedia.org/r/244699 (owner: 10John F. Lewis) [19:41:24] ostriches: {{done}} (both the config change and the mass tag-rename) [19:41:44] okie dokie [19:42:10] (03PS4) 10Dzahn: mw_rc_irc: rename module to standard naming [puppet] - 10https://gerrit.wikimedia.org/r/244699 (owner: 10John F. Lewis) [19:43:49] (03PS1) 10EBernhardson: Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247643 [19:43:56] (03CR) 10Dzahn: "eh, except the site.pp part changed now" [puppet] - 10https://gerrit.wikimedia.org/r/244699 (owner: 10John F. 
Lewis) [19:43:58] (03CR) 10EBernhardson: [C: 032] Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247643 (owner: 10EBernhardson) [19:44:34] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247643 (owner: 10EBernhardson) [19:45:21] !log ebernhardson@tin Synchronized wmf-config/: (no message) (duration: 00m 18s) [19:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:48] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 18s) [19:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:01] !log previous syncs was of I46200d4edb3: Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad""" [19:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:28] (03PS5) 10Dzahn: mw_rc_irc: rename module to standard naming [puppet] - 10https://gerrit.wikimedia.org/r/244699 (owner: 10John F. Lewis) [19:49:54] (03Abandoned) 10Ori.livneh: Rename php_ini() to ini() [puppet] - 10https://gerrit.wikimedia.org/r/245496 (owner: 10Ori.livneh) [19:50:53] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1024/argon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/244699 (owner: 10John F. Lewis) [19:53:37] JohnFLewis: puppet doesnt like module renamings in the first attempt, but after 3 runs it's all good. fairly normal :) [19:54:12] just cause resource names change and dependencies but then it's all good and done . thanks [19:56:11] aude: what was the wikidata change? [19:56:11] hmm. there might be an issue left, looking [19:56:18] the wikidata jobs look huge [19:57:15] legoktm: do you know? 
[19:57:48] I'm not sure what the change was [19:59:54] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Puppet has 1 failures [20:00:45] mutante: when you fix it, you have to say "the errors 'argon'" ;) [20:01:30] JohnFLewis: :haha [20:01:59] errorbegone, why is it looking for the old resource still [20:02:33] (03PS1) 10Muehlenhoff: Allow access to the Yarn manager UI from the entire internal network [puppet] - 10https://gerrit.wikimedia.org/r/247644 [20:03:37] (03CR) 10Ottomata: [C: 031] Allow access to the Yarn manager UI from the entire internal network [puppet] - 10https://gerrit.wikimedia.org/r/247644 (owner: 10Muehlenhoff) [20:04:21] !log uninstalling hadoop packages on analytics1017 [20:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:34] JohnFLewis: it's in a cycle indeed :p sigh [20:04:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Allow access to the Yarn manager UI from the entire internal network [puppet] - 10https://gerrit.wikimedia.org/r/247644 (owner: 10Muehlenhoff) [20:04:40] mutante: https://gerrit.wikimedia.org/r/#/c/247305/ hehe :) [20:05:05] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:06:28] ostriches: ok, really soon:) still checking the last one on argon..hmm [20:06:39] it goes back and forth each puppet run.,.grrr [20:07:40] tries puppet stored config clean for this node on both masters [20:09:42] !log errors argon' on argon [20:09:43] JohnFLewis: seems to be ok now, 3 runs in a row without warnings or errors [20:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:02] (03PS3) 10Chad: contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [20:10:05] haha 'argon'' [20:10:10] maybe the stored config clean, maybe just waiting [20:10:18] you told me to :p [20:10:18] (03CR) 
10Chad: [C: 031] contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [20:11:56] gwicke: so you don't use dsh to restart parsoid anymore, right [20:12:26] can we remove that group file called parsoid? [20:13:14] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:13:39] uhm [20:14:31] cassandra is running [20:14:34] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [20:15:07] !log starting restbase on restbase1001 [20:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:42] otto and gwicke, are you working on something there [20:16:50] well, it was stopped and now it's started again [20:16:56] see services, I made a booboo, gwicke is fixing i think [20:17:00] per "service restbase status" [20:17:06] apparently deploying AQS is tied to deploying restbase [20:17:07] interesting.... [20:17:11] all i did was start it.ok [20:17:26] mutante: I stopped it again [20:17:32] ottomata deployed old code [20:17:39] ok [20:17:42] steps back [20:20:19] (03CR) 10Dzahn: "are we sure it's not used by parsoid anymore?" 
[puppet] - 10https://gerrit.wikimedia.org/r/247305 (owner: 10Dzahn) [20:22:55] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.007 second response time [20:23:16] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [20:23:40] !log reverted restbase deploy on restbase1001 to a4c55e40 [20:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:04] 6operations, 6Release-Engineering-Team: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1740032 (10faidon) >>! In T115760#1739925, @thcipriani wrote: > It seems like the Right Thing™ would be to make `wikidev` the primary grou... [20:25:30] (03CR) 10Subramanya Sastry: [C: 04-1] "We use this for parsoid till we can confirm that git deploy service restart works reliably." [puppet] - 10https://gerrit.wikimedia.org/r/247305 (owner: 10Dzahn) [20:27:30] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: gdash reports for php/apache errors - https://phabricator.wikimedia.org/T81030#1740044 (10faidon) >>! In T81030#1739822, @demon wrote: > Apache syslog error rate, MW debug log error rates, HHVM error rates and OOMs all tracked via [[ https://grafana.wikim... [20:28:16] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1740045 (10Dzahn) progress tracking on etherpad now: https://etherpad.wikimedia.org/p/T114059 [20:28:59] ostriches: ^ .. so still used [20:29:10] Hrm [20:29:22] the open ticket is what keeps them from using salt [20:29:25] if i got it right [20:29:53] tl;dr trebuchet doesn't always do what you ask it to do [20:30:03] so we have dsh hacks [20:30:48] ok.hmm. we were hoping to kill dsh a little bit more [20:31:15] easy! just fix salt/trebuchet :) [20:31:28] or wait for scap3 [20:31:53] greg-g: Are deployments still blocked? 
[20:32:06] ok:) will there be a new flying pig logo in scap3? [20:32:37] I think it has a smiling pig logo (or did at one point) [20:32:51] hoo: no [20:33:15] greg-g: In that case, I'll do what aude wanted to do earlier [20:33:30] bd808: yea, that's the one i meant :) [20:33:53] She's ok with that [20:36:43] greg-g: Update the schedule... will start at 2pm [20:36:47] * Updated [20:36:56] mutante: thcipriani, ostriches and the releng folks are making scap3. Conceptually it's trebuchet without salt (plus some other fun stuff). [20:37:59] bd808: alright, *nod* [20:39:16] also, fwiw, new flying-pig logo as yet undecided. I think bd808 took issue with the one I threw out: http://tyler.zone/scap.gif called it more flash than substance if I recall :P [20:39:21] (03Abandoned) 10Dzahn: dsh: delete most remaining group files [puppet] - 10https://gerrit.wikimedia.org/r/247305 (owner: 10Dzahn) [20:39:58] thcipriani: aaw, that's pretty cool :) [20:40:03] I was just sad about the loss of wings and speedlines ;) [20:40:15] i always like the .gifs too [20:40:55] I'm just going to throw this out there: RelEng team members make the best gifs [20:41:03] also, they're useful [20:41:46] thcipriani has more l33t ascii skills than I do. [20:43:58] I present https://gist.github.com/thcipriani/7221243 as proof [20:46:20] oh good. I'll make sure to add a line to my résumé: "* L337 ansi-art/gif skillz" [20:46:46] ori pointed it out as a plus while we were interviewing you ;) [20:47:41] so, ok, the jobqueue is growing and we don't know why [20:47:42] ? [20:47:50] zangief is awesome [20:48:05] can someone tell me why https://gdash.wikimedia.org/dashboards/jobq/ no longer loads? 
[20:48:18] it would be super helpful to see what has been going on with the jobqueue (I would hope), but it's empty [20:48:54] greg-g: i made https://phabricator.wikimedia.org/F2743935 (graphs some of the jobs, by type and count) [20:49:05] it would be nice to have more of that and more jobs [20:49:09] https://phabricator.wikimedia.org/T62105 [20:49:16] to view graphs of [20:49:31] gwicke: "better" == "what we had before before shit broke?" [20:49:50] that task doesn't help me answer why these graphs show "no data" [20:50:04] and we're experiencing a jobqueue .... clusterfuck? right now so, we need something [20:50:30] the urls looks jacked up. I'm going to look at the definition file [20:50:35] so far, I have worked around this with mwscript on tin [20:50:40] mwscript showJobs.php --wiki=commonswiki --group [20:50:51] (03PS1) 10Dzahn: dumps: move ssl cert install to role [puppet] - 10https://gerrit.wikimedia.org/r/247700 [20:51:14] gwicke: right, but that doesn't give us historical data to see where something might have gone wrong [20:51:31] greg-g: you can use wikiapiary-ish [20:51:50] this is what I needed: http://graphite.wikimedia.org/render/?from=-2d&width=1054&height=487&_salt=1445369534.106&target=MediaWiki.jobqueue.abandons.deleteLinks.count&target=MediaWiki.jobqueue.inserts.enqueue.count&hideLegend=false [20:51:54] thanks chasemp [20:52:06] greg-g: looks like some of the metric names have changed and the dashabard hasn't been updated [20:52:12] gotcha [20:52:44] aude: that spike is aroud the same time as https://tools.wmflabs.org/sal/log/AVCFIArV1oXzWjit5_jE [20:52:55] that spike being that spike in the graphite graph I linked [20:54:05] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:25] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:57] 1 day https://phabricator.wikimedia.org/F2746432 [20:56:01] (03CR) 10Dzahn: "i'll move it to role after 
https://gerrit.wikimedia.org/r/#/c/247700/" [puppet] - 10https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [20:56:18] 7 days https://phabricator.wikimedia.org/F2746445 [20:56:42] 1 month https://phabricator.wikimedia.org/F2746452 [20:56:49] so... what you're showing is that it isn't that crazy? :) [20:57:08] chasemp: ^ [20:57:13] from what i can tell [20:57:15] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.016 second response time [20:57:24] nothing so out of the ordinary just now, but do see spikes [20:57:31] https://phabricator.wikimedia.org/file/upload/ (3 months) [20:58:00] almost :) [20:58:01] count and ack rate of queues seems mostly stable [20:58:30] but enqueue is bumped, assuming as the normal work is more or less sane and whatever causes things to be deferred is choking on jobs and offloading them [20:59:04] my graphs are just count (pop) [20:59:12] i'm not sure what else to look at [21:00:19] inserts http://graphite.wikimedia.org/render/?from=-7d&width=1054&height=487&_salt=1445369534.106&target=MediaWiki.jobqueue.abandons.deleteLinks.count&target=MediaWiki.jobqueue.inserts.*.count [21:00:54] well, that is weird [21:01:18] but not unheard of: http://graphite.wikimedia.org/render/?from=-28d&width=1054&height=487&_salt=1445369534.106&target=MediaWiki.jobqueue.abandons.deleteLinks.count&target=MediaWiki.jobqueue.inserts.*.count [21:01:22] (28 days) [21:02:03] chasemp: interesting [21:02:37] wikibase ones would happen on edit (afaik) on wikipedias, like refresh links [21:02:45] so expected to see similar patterns [21:02:45] (03PS1) 10BryanDavis: gdash: Update jobq dashboard for new metric names [puppet] - 10https://gerrit.wikimedia.org/r/247720 [21:02:55] one thing that tempers this is the last jobqueue death on fillup was basically memory exhaustion that lead to AOS disk crapout etc [21:02:59] and we are not near that seemingly 
[21:03:04] or other such actions [21:04:29] (03CR) 10Ori.livneh: "Thanks for doing this. Any chance you could be persuaded to make this a Grafana dashboard instead, given that it's an open task to migrate" [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [21:05:15] (03CR) 10BryanDavis: "yeah I can make it in grafana. I messed some stuff up here anyhow (1day rollups)." [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [21:05:36] (03CR) 10Greg Grossmeier: "Can we just merge this now as gdash is currently broken/showing 'no data' for these graphs?" [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [21:06:53] (03CR) 10Ori.livneh: "I'd prefer seeing it ported to Grafana once and for all but not militant about it. Whatever bd808 wants to do is fine by me." [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [21:11:23] !log ori@tin Synchronized php-1.27.0-wmf.2/includes/page/WikiPage.php: I5d0440588d: Make triggerOpportunisticLinksUpdate() directly use RefreshLinks (T116001) (duration: 00m 18s) [21:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:56] !log ori@tin Synchronized php-1.27.0-wmf.3/includes/page/WikiPage.php: I5d0440588d: Make triggerOpportunisticLinksUpdate() directly use RefreshLinks (T116001) (duration: 00m 17s) [21:11:58] (03PS1) 10Thcipriani: Remove trebuchet user from wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/247721 (https://phabricator.wikimedia.org/T115760) [21:11:59] etherpad server seems to be down [21:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:12:05] so, I have to run to a workers comp drs appt, looks like ori is now on the jobqueue thing, so please do respond to him or chasemp to get this resolved before moving on with other deploys [21:12:28] back now [21:12:42] (03CR) 10BryanDavis: "https://grafana-admin.wikimedia.org/dashboard/db/job-queue" 
[puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [21:12:49] kaldari: icinga alerted on it (and recovered it) a little bit ago [21:13:53] (03CR) 10Ori.livneh: "Nice! I think we can remove this dashboard now" [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [21:19:25] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: Connection timed out [21:20:46] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:27:45] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.004 second response time [21:34:40] * ori pronounces deployments unblocked (cc greg-g, hoo) [21:34:47] I'm going to occupy tin for the next bit, then [21:34:50] thanks ori [21:35:00] Will do some tests with mw1017 and if that's ok scap stuff to the real world [21:35:05] +1 [21:35:15] PROBLEM - check google safe browsing for wikimedia.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikimedia.org/ - 688 bytes in 0.047 second response time [21:35:20] * aude notes the deployment calendar is inacurate [21:35:25] PROBLEM - check google safe browsing for wiktionary.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=wiktionary.org/ - 690 bytes in 0.107 second response time [21:35:34] regarding the train [21:35:48] You could try fixing it yourself, it's a wiki in the end [21:36:01] but twentyafterfour probably wants to be informed [21:36:04] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... 
not found on http://www.google.com:80/safebrowsing/diagnostic?site=mediawiki.org/ - 688 bytes in 0.079 second response time [21:36:06] PROBLEM - check google safe browsing for wikisource.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikisource.org/ - 690 bytes in 0.084 second response time [21:36:06] PROBLEM - check google safe browsing for wikibooks.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikibooks.org/ - 688 bytes in 0.061 second response time [21:36:21] does that google safe browsing monitor need to be changed to https? [21:36:22] ? [21:36:23] hoo: true, but also greg-g [21:36:35] twentyafterfour: group1 already has wmf3 [21:36:41] so nothing to do tomorrow? [21:37:45] RECOVERY - check google safe browsing for mediawiki.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3782 bytes in 0.037 second response time [21:37:46] hmm, trying to remember where things got frozen... I don't think I ever finished group1 entirely [21:38:02] * aude is/was a bit confused [21:38:23] but tin + wikiversions + special:version agrees that wmf3 is on group1 [21:38:36] Krenair: probably [21:38:41] and makes sense with new bugs appearing on wikidata recentlyish [21:38:44] hmm, I guess somewhere in the chaos it got sync'd ? weird [21:38:56] it's ok... nothing terrible [21:39:02] accidential deployments, yeah [21:39:12] SAL log says nothing though [21:39:21] Reminds me of the accidental schema change... [21:39:21] helping feed confusion [21:39:25] PROBLEM - check google safe browsing for wikiquotes.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... 
not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikiquotes.org/ - 690 bytes in 0.039 second response time [21:39:33] I never scapped the change [21:39:39] so I don't know how it got out [21:39:45] twentyafterfour: guess it went otu [21:39:49] maybe yesterday? [21:39:53] twentyafterfour: l10n update scaps [21:40:16] the one bug we fixed for wikidata was reported yesterday [21:40:20] The document has moved [21:40:20] here. [21:40:26] * aude thinks users might have noticed before, but not sure [21:40:58] (03PS2) 10BryanDavis: gdash: Remove jobq dashboard [puppet] - 10https://gerrit.wikimedia.org/r/247720 [21:41:10] I really wanted to sync it anyway but I was told absolutely no more deployments so I left everything in half-finished state like I was told to do :-/ [21:41:23] and then l10n update came along :D [21:41:25] :( [21:41:30] yeah I guess that must be it [21:41:33] in the end, it think it's ok [21:41:39] at least this time :) [21:41:41] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: gdash reports for php/apache errors - https://phabricator.wikimedia.org/T81030#1740219 (10demon) Done :) https://grafana.wikimedia.org/dashboard/db/production-logging [21:42:25] PROBLEM - check google safe browsing for wikipedia.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikipedia.org/ - 688 bytes in 0.052 second response time [21:42:27] yeah, I'll remind people about l10nupdate if this comes up again ;) [21:42:30] (03CR) 10Hashar: [C: 031] contint: stop gerrit replication to gallium [puppet] - 10https://gerrit.wikimedia.org/r/244498 (https://phabricator.wikimedia.org/T86661) (owner: 10Hashar) [21:42:33] thanks aude and hoo. [21:43:15] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... 
not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikinews.org/ - 686 bytes in 0.036 second response time [21:43:38] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1740227 (10intracer) >>! In T106517#1722718, @Tgr wrote: > See [[ https://github.com/wikimedia/mediawiki/blob/a2d6ecc4539e60501803155990ec365... [21:43:44] PROBLEM - check google safe browsing for wikiversity.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=wikiversity.org/ - 692 bytes in 0.031 second response time [21:44:34] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=mediawiki.org/ - 688 bytes in 0.031 second response time [21:46:39] interesting... https://www.google.com/transparencyreport/safebrowsing/diagnostic/?#url=AS14907 (WIKIMEDIA) shows safe browsing status overview for our entire AS. That might be a lot easier to monitor than the individual domains [21:47:23] "For example, the following website on this network has been dangerous over the last 90 days: wikimedia.org." useful [21:47:29] except that it's a dynamic page and doesn't respond to a normal non-hashed url [21:48:04] "Some pages on this website install malware on visitors' computers." 
[21:48:21] it recognises Automattic's AS for wikimedia.org as well [21:50:20] twentyafterfour, you can get https://www.google.com/safebrowsing/diagnostic?output=jsonp&site=wikiquote.org instead [21:50:59] (03PS2) 10Dzahn: dumps: move ssl cert install to role [puppet] - 10https://gerrit.wikimedia.org/r/247700 [21:53:52] (03Abandoned) 10Addshore: wgRCWatchCategoryMembership false for commons & wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/235467 (https://phabricator.wikimedia.org/T109707) (owner: 10Addshore) [21:53:55] Krenair: that doesn't search by AS number though [21:54:58] (03PS2) 10Dzahn: dumps: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) [21:55:12] fine, https://www.google.com/safebrowsing/diagnostic?output=jsonp&site=AS%3A14907 [21:56:54] twentyafterfour: hoo just for the record, i see "15:42 logmsgbot: an-omie@tin Started scap: SWAT: Add a change tag to cross-wiki uploads" [21:56:57] from yesterday [21:57:11] That might have also been it [21:57:13] that would match the timestamps on special:version (afaik) and the bug reports [21:57:14] whatever came first [21:57:19] doesn't matter now [21:57:36] i don't think localization update does scap, but it rebuilds localisation cache and some stuff [21:57:58] I thought we killed the separate command for that and changed it to use plain scape [21:58:00] * scap [21:58:07] or maybe that only happened in my mind [21:58:25] i am not sure (would be good to know for sure) [21:58:37] (03PS3) 10Dzahn: dumps: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) [21:59:12] it doesn't use scap [21:59:14] (03CR) 10Dzahn: [C: 032] dumps: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [21:59:22] modules/scap/files/l10nupdate-1 in the puppet repo [22:03:09] i think it's still not the full scap 
[22:03:17] for l10update [22:03:22] It's not, no [22:03:27] It's just part of scap [22:03:31] but it's not scap [22:03:33] yep [22:03:36] I must have misremembered that [22:03:47] anyway.... [22:04:51] * aude would have been the next person to run scap if it wasn't done yesterday [22:05:07] or *could [22:07:49] 6operations, 6Parsing-Team: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1740321 (10ssastry) 3NEW [22:07:54] (03PS3) 10Dzahn: dumps: move ssl cert and config to role [puppet] - 10https://gerrit.wikimedia.org/r/247700 [22:08:02] legoktm: hmm, wikibugs is dead [22:08:15] that sucks. [22:10:23] 6operations, 6Parsing-Team: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1740331 (10ssastry) [22:12:44] (03PS1) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247744 (https://phabricator.wikimedia.org/T114059) [22:13:21] (03PS2) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247744 (https://phabricator.wikimedia.org/T114059) [22:14:05] (03PS3) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247744 (https://phabricator.wikimedia.org/T114059) [22:14:59] (03CR) 10Dzahn: [C: 032] "one from the list on https://etherpad.wikimedia.org/p/T114059" [puppet] - 10https://gerrit.wikimedia.org/r/247744 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [22:15:00] Running sync-common on mw1017 to update Wikibase to wmf3b [22:15:08] will also activate merging there [22:15:13] (but only for test) [22:15:28] aude: ^ [22:17:04] done [22:19:21] https://test.wikidata.org/w/index.php?title=Q1778&action=history merging works [22:19:29] (03PS1) 10Dzahn: RT: replace HTTPS monitoring with new method [puppet] - 
10https://gerrit.wikimedia.org/r/247745 [22:19:46] (03PS3) 10Ori.livneh: gdash: Remove jobq dashboard [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [22:19:54] (03CR) 10Ori.livneh: [C: 032 V: 032] gdash: Remove jobq dashboard [puppet] - 10https://gerrit.wikimedia.org/r/247720 (owner: 10BryanDavis) [22:20:09] basic editing looks good as well [22:20:22] (03PS2) 10Dzahn: RT: replace HTTPS monitoring with new method [puppet] - 10https://gerrit.wikimedia.org/r/247745 [22:20:46] (03CR) 10Dzahn: [C: 032] RT: replace HTTPS monitoring with new method [puppet] - 10https://gerrit.wikimedia.org/r/247745 (owner: 10Dzahn) [22:21:33] hoo: great [22:22:14] Will update WikimediaMessages and the scap everything together [22:22:20] after that, we can re-enable merging [22:22:25] ok [22:22:28] and then continue with sitelinks [22:22:28] (03PS1) 10Ori.livneh: gdash: remove ve and frontend dashboards; unmaintained [puppet] - 10https://gerrit.wikimedia.org/r/247746 [22:22:34] also cache epoch [22:22:41] oh, right, yes [22:22:46] will do that post-scap [22:22:53] k [22:23:04] (03PS2) 10Ori.livneh: gdash: remove ve and frontend dashboards; unmaintained [puppet] - 10https://gerrit.wikimedia.org/r/247746 [22:23:13] (03CR) 10Ori.livneh: [C: 032 V: 032] gdash: remove ve and frontend dashboards; unmaintained [puppet] - 10https://gerrit.wikimedia.org/r/247746 (owner: 10Ori.livneh) [22:23:46] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [22:24:49] (03PS1) 10Dzahn: tendril: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247747 [22:25:26] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [22:26:00] (03PS2) 10Dzahn: tendril: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247747 [22:26:08] (03CR) 10Dzahn: [C: 032] tendril: add cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247747 (owner: 10Dzahn) [22:26:57] 6operations, 10Traffic, 
10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1740383 (10BBlack) Definitely not varnish! [22:27:30] ori: merging yours, k? [22:28:21] done [22:28:30] mutante: eep thanks [22:31:20] !log hoo@tin Started scap: Update Wikibase to wmf3b and add messages for sitelinks to MediaWiki, Meta-Wiki and Wikispecies [22:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:31:41] I also reverted my live hack to allow merging on testwikidata before pushing that [22:33:07] (03PS1) 10Dzahn: librenms: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247749 [22:33:24] (03PS2) 10Dzahn: librenms: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247749 [22:37:22] (03PS1) 10Hoo man: Revert "Temporarily disable 'item-merge' right on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247751 [22:39:56] (03CR) 10Dzahn: [C: 032] librenms: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247749 (owner: 10Dzahn) [22:49:20] (03PS1) 10Dzahn: tendril: rename monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/247752 [22:49:24] (03CR) 10jenkins-bot: [V: 04-1] tendril: rename monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/247752 (owner: 10Dzahn) [22:49:32] (03PS2) 10Dzahn: tendril: rename monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/247752 [22:50:02] (03CR) 10Dzahn: [C: 032] tendril: rename monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/247752 (owner: 10Dzahn) [22:50:35] (03CR) 10Dzahn: "fixes "cannot redeclare" puppet sadness" [puppet] - 10https://gerrit.wikimedia.org/r/247752 (owner: 10Dzahn) [22:52:12] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [22:53:31] ACKNOWLEDGEMENT - puppet last run on neon is CRITICAL: CRITICAL: puppet fail daniel_zahn https://gerrit.wikimedia.org/r/#/c/247752/ [22:55:19] RECOVERY - puppet last 
run on neon is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:55:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [22:56:50] 6operations: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1740512 (10Dzahn) 3NEW [22:57:42] CUSTOM - Host google is UP: PING OK - Packet loss = 0%, RTA = 9.61 ms [22:57:58] PROBLEM - HTTPS on planet1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [22:59:12] ACKNOWLEDGEMENT - HTTPS on planet1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn cause this is a special case [22:59:48] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:59:48] what's up with the "Safe Browsing" stuff.. meh Google? [22:59:59] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151020T2300). [23:00:05] Krenair: you pasted something earlier [23:00:10] mutante, I think they decided to move the page or something [23:00:23] 6operations: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1740543 (10Dzahn) 16:00 < icinga-wm> CUSTOM - Host google is UP: PING OK - Packet loss = 0%, RTA = 9.61 ms 16:02 < Krenair> mutante, I think they decided to move the page or something [23:00:25] it was HTTP 302 Found, so... 
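A minimal sketch of why the safe browsing alerts read "302 Found - string ... not found": a string-match HTTP check that does not follow redirects only sees the short redirect body, so the expected text can never match. This is an illustration in Python, not the actual Icinga plugin; `classify_response` and `EXPECTED` are hypothetical names, and `EXPECTED` holds only the prefix of the string that is visible in the alerts above (the rest is elided there).

```python
from typing import Tuple

# Visible prefix of the expected string as quoted in the alert text; the
# remainder is elided in the log, so only this prefix is used here.
EXPECTED = "This site is not currently"


def classify_response(status: int, reason: str, body: str,
                      expected: str = EXPECTED) -> Tuple[str, str]:
    """Return (state, message) for a string-match HTTP check.

    Assumption: the check inspects only the body of the first response
    and does not follow redirects.
    """
    status_line = f"HTTP/1.1 {status} {reason}"
    if expected in body:
        return "OK", f"HTTP OK: {status_line}"
    detail = f"string {expected!r} not found"
    if 300 <= status < 400:
        # A redirect body is a stub page, not the page being monitored.
        detail += " (redirect response, body is not the target page)"
    return "CRITICAL", f"HTTP CRITICAL: {status_line} - {detail}"


if __name__ == "__main__":
    # The stub body below mirrors the "The document has moved here." text
    # pasted in the channel.
    print(classify_response(302, "Found", "The document has moved here."))
```

Under these assumptions the check stays CRITICAL until it is pointed at the final URL or the expected string is changed, which matches the direction of the patches discussed in the channel.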
[23:00:29] Krenair: ok ^ [23:00:37] ack, see the CUSTOM icinga bot message [23:01:11] if you read further up the chat log I messed around with it a little bit [23:01:17] (03PS1) 10MaxSem: Update safe browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/247754 (https://phabricator.wikimedia.org/T116099) [23:01:18] MatmaRex, you tested this in beta right? [23:01:21] mutante, ^ [23:01:24] hi [23:01:28] yes [23:01:36] MaxSem: aah :) [23:01:37] ACKNOWLEDGEMENT - check google safe browsing for mediawiki.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string This site is not currently... not found on http://www.google.com:80/safebrowsing/diagnostic?site=mediawiki.org/ - 688 bytes in 0.070 second response time daniel_zahn https://phabricator.wikimedia.org/T116099 [23:02:17] Krenair: i was just going to say that you can see that it works at http://commons.wikimedia.beta.wmflabs.org/w/index.php?title=Special:RecentChanges&tagfilter=cross-wiki-upload [23:02:22] but beta is fucked again [23:02:28] that's a new notice [23:02:41] [528e9f31] /w/index.php?title=Special:RecentChanges&tagfilter=cross-wiki-upload MWException from line 141 of /srv/mediawiki/php-master/includes/FormOptions.php: Invalid option hidecategorization [23:02:46] Jamesofur, what notice? [23:02:48] (03PS2) 10Dzahn: Update safe browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/247754 (https://phabricator.wikimedia.org/T116099) (owner: 10MaxSem) [23:02:56] Krenair: I don't remember the google safe browsing notice [23:03:01] it's been around for ages [23:03:05] Max uploaded the fix :) merging [23:03:08] just rarely complains. 
today it broke [23:03:15] see the diff there [23:03:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:03:37] (03CR) 10Dzahn: [C: 032] Update safe browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/247754 (https://phabricator.wikimedia.org/T116099) (owner: 10MaxSem) [23:03:50] MatmaRex, any idea what's wrong there? [23:04:02] Oh, RecentChanges completely broken in beta. I see. [23:04:17] Oh, I might know who's responsible for this. [23:04:25] Krenair: looks like somebody merged the categorization changes in RC commit again :P [23:04:41] indeed [23:04:43] "somebody" :P [23:05:32] MatmaRex: can take a look [23:05:38] think i see the problem [23:05:48] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1740598 (10Dzahn) dumps: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=dataset1001&service=HTTPS OTRS: https://icinga.wikimedia.org/cgi-bin/icinga... 
[23:06:37] https://gerrit.wikimedia.org/r/#/c/239065/ [23:06:38] legoktm [23:06:43] and addshore [23:07:59] yeah this looks suspect: https://gerrit.wikimedia.org/r/#/c/239065/15/includes/specials/SpecialRecentchanges.php [23:08:10] (03PS1) 10Dzahn: Revert "planet: add ssl cert expiry check" [puppet] - 10https://gerrit.wikimedia.org/r/247755 [23:08:45] (03CR) 10Dzahn: [C: 032] "this one is a special case that doesn't work like the others because it has its own cert on misc-web" [puppet] - 10https://gerrit.wikimedia.org/r/247755 (owner: 10Dzahn) [23:08:52] have a fix [23:09:19] (03PS2) 10Dzahn: Revert "planet: add ssl cert expiry check" [puppet] - 10https://gerrit.wikimedia.org/r/247755 [23:09:29] hi [23:09:40] loking [23:09:44] looking even [23:10:06] https://gerrit.wikimedia.org/r/#/c/247756/ [23:10:08] legoktm: ^ [23:10:19] verified manually with the setting both ways [23:10:22] +2'd, thanks [23:10:33] sure [23:10:33] I only tested with it on >.< [23:11:41] does that really fix the issue...? [23:11:48] it does not look convincing [23:12:57] $opts[$key] is the same as $opts->offsetGet( $key ) ? [23:13:50] Oh, ArrayAccess. Okay. [23:16:18] !log mw1232: restarted hhvm [23:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:24] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 65225 bytes in 0.327 second response time [23:17:05] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.079 second response time [23:17:48] 6operations, 7Icinga: ganeti: PROCS CRITICAL: 2 processes ... - https://phabricator.wikimedia.org/T116111#1740647 (10Dzahn) 3NEW [23:18:28] 6operations, 7Icinga: ganeti: PROCS CRITICAL: 2 processes ... 
- https://phabricator.wikimedia.org/T116111#1740658 (10Dzahn) [23:18:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [23:20:05] !log hoo@tin Finished scap: Update Wikibase to wmf3b and add messages for sitelinks to MediaWiki, Meta-Wiki and Wikispecies (duration: 48m 44s) [23:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:12] aude: ^ [23:20:42] (03PS1) 10Hoo man: Bump the cache epoch for (test)wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247757 [23:21:12] (03CR) 10Hoo man: [C: 032] Bump the cache epoch for (test)wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247757 (owner: 10Hoo man) [23:21:18] (03Merged) 10jenkins-bot: Bump the cache epoch for (test)wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247757 (owner: 10Hoo man) [23:21:38] (03CR) 10Dzahn: [C: 031] labsdb100[1-3]: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247208 (owner: 10Muehlenhoff) [23:21:58] (03CR) 10Dzahn: [C: 031] pc100*: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247218 (owner: 10Muehlenhoff) [23:22:19] (03CR) 10Dzahn: [C: 031] Use role keyword for dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/247230 (owner: 10Muehlenhoff) [23:22:21] hoo: \o/ [23:22:22] MatmaRex, legoktm, aude: ok, beta RC is working again, thanks [23:22:35] !log hoo@tin Synchronized wmf-config/: Bump the cache epoch for (test)wikidata (duration: 00m 18s) [23:22:35] yay [23:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:52] (03CR) 10Dzahn: [C: 031] holmium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [23:23:00] yes, that looks good [23:23:09] (03CR) 10Dzahn: [C: 031] hafnium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247195 (owner: 10Muehlenhoff) [23:23:13] (03CR) 10Hoo man: [C: 032] 
Revert "Temporarily disable 'item-merge' right on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247751 (owner: 10Hoo man) [23:23:15] (03CR) 10jenkins-bot: [V: 04-1] Revert "Temporarily disable 'item-merge' right on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247751 (owner: 10Hoo man) [23:23:29] merge conflict with myself [23:23:30] awesome [23:23:40] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:24:40] (03PS2) 10Hoo man: Revert "Temporarily disable 'item-merge' right on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247751 [23:24:42] (03CR) 10Dzahn: [C: 031] graphite2001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247194 (owner: 10Muehlenhoff) [23:24:55] (03CR) 10Hoo man: [C: 032] Revert "Temporarily disable 'item-merge' right on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247751 (owner: 10Hoo man) [23:25:02] (03Merged) 10jenkins-bot: Revert "Temporarily disable 'item-merge' right on Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247751 (owner: 10Hoo man) [23:25:07] !log aaron@tin Synchronized php-1.27.0-wmf.3/includes/deferred: 2a1e1d7dd88a62aba9 (duration: 00m 17s) [23:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:23] (03PS1) 10Nuria: Enabling mapjoins in hive by default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 [23:25:36] (03CR) 10Dzahn: [C: 031] Move the authdns servers to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247235 (owner: 10Muehlenhoff) [23:26:10] (03CR) 10Dzahn: [C: 031] Use the role keyword for puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/247232 (owner: 10Muehlenhoff) [23:26:39] !log hoo@tin Synchronized wmf-config/: Revert "Temporarily disable "item-merge" right on Wikidata" (duration: 00m 18s) [23:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, 
Master [23:26:45] (03CR) 10Dzahn: [C: 031] terbium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247227 (owner: 10Muehlenhoff) [23:27:06] (03CR) 10Dzahn: [C: 031] potassium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247220 (owner: 10Muehlenhoff) [23:27:58] (03CR) 10Dzahn: [C: 031] graphite1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247193 (owner: 10Muehlenhoff) [23:28:55] (03CR) 10Dzahn: [C: 04-1] "hmm, does this work at all even before this change? per wikitech docs, when using the role keyword, they all have to be on _the same line_" [puppet] - 10https://gerrit.wikimedia.org/r/246979 (owner: 10Muehlenhoff) [23:30:45] (03CR) 10Dzahn: "eh, i guess it's just about not repeating the role keyword. https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Role-based_lookup ok" [puppet] - 10https://gerrit.wikimedia.org/r/246979 (owner: 10Muehlenhoff) [23:31:59] (03CR) 10Dzahn: [C: 031] "+ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/246971 (owner: 10Muehlenhoff) [23:32:26] (03CR) 10Dzahn: [C: 031] db1069: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246969 (owner: 10Muehlenhoff) [23:33:04] (03CR) 10Dzahn: [C: 031] db1047: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246968 (owner: 10Muehlenhoff) [23:33:32] (03CR) 10Dzahn: [C: 031] conf*: Convert to fully use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246967 (owner: 10Muehlenhoff) [23:34:04] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1740677 (10Cmjohnson) Racked, cabled and ILO setup. 
DNS completed ge-3/0/15 labvirt1010 wmf4713 10.65.3.236 ge-3/0/16 labvirt1011 wmf4714 10.65.3.237 [23:35:11] aude: I don't see any problems on wikidata right now [23:35:22] so will go on with sitelinks to test [23:35:30] well, enable linking the wikis *from* test [23:36:36] MatmaRex, sync'd [23:36:42] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/WikimediaEventsHooks.php: https://gerrit.wikimedia.org/r/#/c/247650/ (duration: 00m 17s) [23:36:48] ok [23:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:51] Krenair: thanks [23:36:57] please test [23:37:09] i can't really [23:37:17] MaxSem: it doesn't find the new string either :p [23:37:21] Krenair: unless you want me to upload a file from mw.org to prod commons [23:37:43] i'll do that when i have a useful file to upload. [23:37:51] ok [23:39:33] (03PS2) 10Dzahn: Restrict access to deployment redis to internal plus silver [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff) [23:40:05] Krenair: Do you need tin for a longer period or can I continue? [23:40:09] hahahahaha check_http_url_for_string!www.google.com!/safebrowsing/diagnostic?site=wikiquotes.org/!\'Safe Browsing has not recently seen malicious content\' [23:40:17] (03CR) 10Dzahn: "confirmation from bd808 would be great" [puppet] - 10https://gerrit.wikimedia.org/r/245876 (owner: 10Muehlenhoff) [23:40:32] you can go [23:40:34] im done [23:40:38] Ok :) [23:41:35] MaxSem: how about "Not dangerous" as the string :) [23:41:51] https://www.google.com/transparencyreport/safebrowsing/diagnostic/index.html#url=mediawiki.org [23:42:01] mutante, it's barfing about a HTTPS redirect, apparently [23:42:07] the /transparencyreport/ part is missing? 
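The check definition pasted above uses the `!`-separated form in which the command name is followed by its arguments. A minimal sketch, assuming a plain split on `!` is good enough (it ignores escaped `\!` sequences), shows how the monitored domain can be pulled out of such a definition for inspection. `parse_check_command` and `site_from_path` are hypothetical helper names, and the quoting of the last argument is simplified compared with the pasted text.

```python
from typing import List, Tuple
from urllib.parse import parse_qs, urlsplit


def parse_check_command(definition: str) -> Tuple[str, List[str]]:
    """Split 'command!arg1!arg2!...' into the command name and its arguments."""
    name, *args = definition.split("!")
    return name, args


def site_from_path(path: str) -> str:
    """Extract the 'site' query parameter from a request path."""
    query = urlsplit(path).query
    return parse_qs(query)["site"][0].rstrip("/")


# Definition as pasted in the channel, with plain quotes around the string.
DEFINITION = (
    "check_http_url_for_string!www.google.com!"
    "/safebrowsing/diagnostic?site=wikiquotes.org/!"
    "'Safe Browsing has not recently seen malicious content'"
)

if __name__ == "__main__":
    name, args = parse_check_command(DEFINITION)
    print(name, args[0], site_from_path(args[1]))
```

Printing the extracted domain makes a mismatch such as `wikiquotes.org` versus the project's real domain easy to spot when reviewing a list of checks.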
[23:42:28] as evidenced by a 302 status code [23:43:08] yea, and that redirects me to the above [23:43:21] with that extra part in the URL [23:43:26] (03PS1) 10MaxSem: Switch safe browsing checks to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/247760 (https://phabricator.wikimedia.org/T116099) [23:43:29] (03CR) 10jenkins-bot: [V: 04-1] Switch safe browsing checks to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/247760 (https://phabricator.wikimedia.org/T116099) (owner: 10MaxSem) [23:43:30] aude: Did you forget about wmgWikibaseSiteGroup? [23:43:39] In your patch [23:43:43] where? [23:43:53] https://gerrit.wikimedia.org/r/246782 [23:43:54] maybe [23:44:10] eh, wikimedia.org : Some pages on this website install malware on visitors' computers. [23:44:12] Do you want to amend it really quick [23:44:20] but it's "not dangerous" [23:44:26] mutante: ssshhh [23:44:37] that's a client setting [23:44:37] :D [23:44:49] i'm not enabling the client yet in this patch [23:45:04] (03PS1) 10MaxSem: Fix Wikiquote check to check the real domain [puppet] - 10https://gerrit.wikimedia.org/r/247761 [23:45:07] (03CR) 10jenkins-bot: [V: 04-1] Fix Wikiquote check to check the real domain [puppet] - 10https://gerrit.wikimedia.org/r/247761 (owner: 10MaxSem) [23:45:11] me tried these settings locally and they were fine [23:45:14] because "AUTOMATTIC" is acounted s one of our ASes [23:45:16] because blog [23:45:16] :p [23:45:19] * aude tried* [23:45:42] aude: Oh right [23:45:48] the other site groups are in the DB already [23:45:50] allright [23:45:54] (03PS2) 10MaxSem: Switch safe browsing checks to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/247760 (https://phabricator.wikimedia.org/T116099) [23:45:57] Fine, then [23:46:08] (03PS2) 10MaxSem: Fix Wikiquote check to check the real domain [puppet] - 10https://gerrit.wikimedia.org/r/247761 [23:46:12] (03CR) 10Hoo man: [C: 032] Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups (test wikis) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/246782 (https://phabricator.wikimedia.org/T115653) (owner: 10Aude) [23:46:40] (03Merged) 10jenkins-bot: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups (test wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246782 (https://phabricator.wikimedia.org/T115653) (owner: 10Aude) [23:47:10] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [23:47:16] mutante, ^^ two commits flying your way [23:47:25] make sure to set wmgWikibaseEnableData = false for the new client wikis [23:47:29] hoo: ^ [23:47:46] aude: Good point [23:47:52] !log hoo@tin Synchronized wmf-config/: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups (testwikidata) (duration: 00m 18s) [23:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:19] MaxSem: isn't that still going to be a redirect though? [23:48:30] try it in browser [23:48:52] hoo: test.wikidata looks good :) [23:49:11] aude: Edit conflict :D [23:49:25] adding "mediawiki" is a little odd [23:49:30] I wanted to link CatScan, as meta doesn't have a page about kittens [23:49:41] the "wiki" gets stripped so it's "media" [23:49:45] think that' sthe gadget [23:50:31] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [23:50:43] (03CR) 10Dzahn: [C: 04-1] "
302 Moved
" [puppet] - 10https://gerrit.wikimedia.org/r/247760 (https://phabricator.wikimedia.org/T116099) (owner: 10MaxSem) [23:51:17] aude: Clearly https://species.wikimedia.org/wiki/Felis_silvestris_catus would have been the better match :'D [23:51:17] (03CR) 10Dzahn: [C: 032] Fix Wikiquote check to check the real domain [puppet] - 10https://gerrit.wikimedia.org/r/247761 (owner: 10MaxSem) [23:51:26] Yeah, that's the gadget [23:51:42] heh [23:52:08] mutante, finita la comedia: location:https://www.google.com/transparencyreport/safebrowsing/diagnostic/index.html#url=wikibooks.org/ [23:52:31] Ok, let's enable sitelinks for real [23:52:45] (03PS2) 10Hoo man: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247593 (https://phabricator.wikimedia.org/T115653) (owner: 10Aude) [23:53:23] (03CR) 10Hoo man: [C: 032] Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247593 (https://phabricator.wikimedia.org/T115653) (owner: 10Aude) [23:53:29] (03Merged) 10jenkins-bot: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247593 (https://phabricator.wikimedia.org/T115653) (owner: 10Aude) [23:54:18] MaxSem: indeed, but that's not what is in the change, i can just amend [23:54:18] !log hoo@tin Synchronized wmf-config/: Add MediaWiki, Meta-Wiki and Wikispecies to Wikibase special site groups (duration: 00m 18s) [23:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:47] (03CR) 10Hoo man: [C: 04-1] "No longer relevant" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247595 (owner: 10Aude) [23:55:08] mutante, you can't easily amend it to execure JS to do the #hash stuff [23:55:42] did we bump cache epoch yet? 
[23:55:51] aude: I did after the scap [23:56:10] Sorry, only saw your page now [23:56:22] nevermind, i'm viewing a semi-protected item [23:56:30] * aude logs in [23:56:43] was wondering where the edit links went [23:57:35] (03CR) 10Dzahn: [C: 032] Switch safe browsing checks to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/247760 (https://phabricator.wikimedia.org/T116099) (owner: 10MaxSem) [23:57:37] (03Abandoned) 10Aude: Bump cache epoch on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247595 (owner: 10Aude) [23:57:42] so the actual thingie is https://www.google.com/safebrowsing/diagnostic?output=jsonp&site=wikimedia.org [23:58:03] * hoo waits for the sitesmodule stuff to be update [23:58:04] d [23:58:18] * aude too [23:58:42] Already valid for hte api: https://www.wikidata.org/w/api.php?action=help&modules=wbsetsitelink [23:59:35] Here we go: https://www.wikidata.org/w/index.php?title=Q4115189&type=revision&diff=260514930&oldid=260353290 [23:59:36] :)