[00:04:28] (PS6) Dzahn: tor::relay: add configurable thirdparty APT source [puppet] - https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701)
[00:04:30] (CR) Dzahn: [C: 2] tor::relay: add configurable thirdparty APT source [puppet] - https://gerrit.wikimedia.org/r/456056 (https://phabricator.wikimedia.org/T196701) (owner: Dzahn)
[00:07:58] Operations, Traffic, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702 (bd808)
[00:16:15] (PS2) Dzahn: smokeping: replace radon with dnsauth1001 as a target [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040)
[00:16:54] (PS3) Dzahn: smokeping: replace radon with authdns1001 as a target [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040)
[00:20:55] (PS4) Dzahn: smokeping: replace radon with deploy1001 as a target in C4 [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040)
[00:22:08] (PS5) Dzahn: smokeping: replace radon with deploy1001 as a target in C4 [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040)
[00:22:10] (CR) Dzahn: [C: 2] "since the point is testing connectivity of each rack, radon should be replaced with something else in the same rack, not what replaced the" [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040) (owner: Dzahn)
[00:23:31] (CR) Dzahn: "ah, no.. that isn't in wikimedia.org of course.."
[puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040) (owner: Dzahn)
[00:27:20] (PS6) Dzahn: smokeping: replace radon with cobalt as a target in C4 [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040)
[00:28:14] (PS7) Dzahn: smokeping: replace radon with cobalt as a target in C4 [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040)
[00:29:20] (CR) Dzahn: [C: 2] "ok, finally. cobalt has a public IP and is also in C4" [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040) (owner: Dzahn)
[00:31:18] Operations, ops-eqiad, Traffic, decommission, Patch-For-Review: Decommission radon - https://phabricator.wikimedia.org/T202040 (Dzahn) @cmjohnson You should be unblocked now
[00:33:30] (CR) Dzahn: [C: 2] "applied on netmon1002, netmon2001" [puppet] - https://gerrit.wikimedia.org/r/456320 (https://phabricator.wikimedia.org/T202040) (owner: Dzahn)
[00:44:28] !log netmon1002 - restarted smokeping, removed radon as target (unblock decome of former dns server), added cobalt instead as a target also in C4
[00:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:57] (PS1) Dzahn: rsync::server: add parameter to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/456522
[00:56:18] (CR) Dzahn: [C: 1] "this could make this potentially a little nicer: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/456522/" [puppet] - https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[00:57:43] (CR) Dzahn: [C: 1] profile::archiva: allow rsync to bind to IPv6 interfaces (1 comment) [puppet] - https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[01:02:36] (CR) Smalyshev: Create wikidata ntriples dump from ttl dump (2 comments) [puppet] -
https://gerrit.wikimedia.org/r/447922 (https://phabricator.wikimedia.org/T144103) (owner: Smalyshev)
[01:11:08] (CR) Alex Monk: Validate challenges before pushing them to the ACME directory (4 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[01:17:10] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [70.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[01:41:20] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[01:56:47] (PS6) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[01:57:54] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[01:58:50] (CR) Mathew.onipe: "> Patch Set 5:" (11 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[02:12:14] Operations, Wikimedia-Planet, Patch-For-Review: en.planet hasn't updated since July 25 - https://phabricator.wikimedia.org/T203055 (Legoktm) Confirmed, I now have 70+ posts in my feed reader :) Filed {T203208} as follow-up for adding monitoring.
[02:22:28] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sde1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[02:26:57] (PS7) Mathew.onipe: Elasticsearch module is coming up.
[software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[02:28:06] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[02:32:38] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 161058504
[02:33:39] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 917504
[03:27:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 863.82 seconds
[03:31:04] Operations, Traffic, User-Urbanecm: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Bawolff) Are people going to be directed to this page via a CentralNotice banner? If so we already know the language and could just add uselang to the url in the banner...
[03:50:28] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 238.95 seconds
[03:50:40] PROBLEM - High load average on labstore1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [70.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[04:06:49] Operations, Traffic, User-Urbanecm: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (BBlack) >>! In T203179#4547170, @Legoktm wrote: > As far as I can tell, ULS outputs no header when relying upon Accept-Language, it sounds like you're saying that it sh...
[04:14:29] RECOVERY - High load average on labstore1004 is OK: OK: Less than 50.00% above the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[04:45:18] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 53.64 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:52:39] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 82.58 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:06:08] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0
[05:06:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0
[05:25:58] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 54.64 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:31:58] (PS7) Elukey: profile::archiva: allow rsync to bind to IPv6 interfaces [puppet] - https://gerrit.wikimedia.org/r/456156 (https://phabricator.wikimedia.org/T192639)
[05:32:18] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.85 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:33:55] !log restart pdfrender on scb1003
[05:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:29] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[05:35:09] (CR) Elukey: [C: 2] "Going to merge this to finish the task and then I'll pick up Daniel's code review to make it better :)" [puppet] - https://gerrit.wikimedia.org/r/456156
(https://phabricator.wikimedia.org/T192639) (owner: Elukey)
[05:52:28] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:52:30] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:52:59] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:53:09] PROBLEM - Hadoop NodeManager on analytics1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[05:54:01] ehm this is me sorry --^
[05:54:38] downtimed
[05:54:57] !log resumed the Hadoop workers reboots for kernel upgrades
[05:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:18] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 55.35 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:12:18] RECOVERY - Hadoop NodeManager on analytics1048 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:12:38] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:13:18] RECOVERY - Hadoop NodeManager on analytics1047 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:21:19] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 72.13 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:29:18] PROBLEM - puppet last run on labstore1003 is CRITICAL:
CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:30:49] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/root/.screenrc]
[06:48:29] PROBLEM - High lag on wdqs2003 is CRITICAL: 3601 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:50:39] PROBLEM - High lag on wdqs2003 is CRITICAL: 3653 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[06:56:09] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:59:39] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:00:29] PROBLEM - High lag on wdqs2003 is CRITICAL: 3663 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:07:31] Operations, Analytics: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (elukey)
[07:20:48] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 48.04 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:27:19] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 81.46 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:41:39] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 40.89 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:47:18] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK:
(C)60 le (W)70 le 84.09 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:53:27] (PS1) Arturo Borrero Gonzalez: cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569
[07:54:20] (CR) jerkins-bot: [V: -1] cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569 (owner: Arturo Borrero Gonzalez)
[07:57:09] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 55.99 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:59:13] (PS2) Arturo Borrero Gonzalez: cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569
[08:00:00] (CR) jerkins-bot: [V: -1] cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569 (owner: Arturo Borrero Gonzalez)
[08:00:57] (PS3) Arturo Borrero Gonzalez: cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569
[08:01:36] (CR) jerkins-bot: [V: -1] cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569 (owner: Arturo Borrero Gonzalez)
[08:02:40] (PS4) Arturo Borrero Gonzalez: cloudvps: rabbitmq: create monitoring user [puppet] - https://gerrit.wikimedia.org/r/456569 (https://phabricator.wikimedia.org/T203177)
[08:04:27] (CR) Arturo Borrero Gonzalez: [C: 2] "Compiler seems happy: https://puppet-compiler.wmflabs.org/compiler02/12308/" [puppet] - https://gerrit.wikimedia.org/r/456569 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[08:09:18] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 85.06 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:14:31] (CR) Gehel: "Thanks for the rework!
I already have a few comments :) I'll add more once those are resolved. Don't worry, it is perfectly expected to do" (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[08:14:35] (CR) Gehel: [C: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[08:15:18] (PS1) Arturo Borrero Gonzalez: cloudvps: rabbitmq: monitoring user is administrator [puppet] - https://gerrit.wikimedia.org/r/456572 (https://phabricator.wikimedia.org/T203177)
[08:15:59] (CR) jerkins-bot: [V: -1] cloudvps: rabbitmq: monitoring user is administrator [puppet] - https://gerrit.wikimedia.org/r/456572 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[08:16:57] !log repair sde1 on ms-be2041 - T199198
[08:16:59] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 51.95 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:03] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[08:21:21] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (aaron) perf-roots seems appropriate. If anything extra is needed, that can always be discuss...
[08:22:48] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[08:24:48] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 102.9 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:27:31] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (MoritzMuehlenhoff) @aaron: To clarify/confirm: You don't need cluster-wide root access anymo...
[08:29:43] (PS2) Arturo Borrero Gonzalez: cloudvps: rabbitmq: monitoring user is administrator [puppet] - https://gerrit.wikimedia.org/r/456572 (https://phabricator.wikimedia.org/T203177)
[08:30:51] Operations, Wikimedia-Mailing-lists: Mailing list for Wiki Indaba Steering Committee - https://phabricator.wikimedia.org/T203222 (Vikoula5)
[08:33:29] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 24.8 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:35:22] (CR) Gehel: [C: -1] "No emergency on this, but you could start to address the errors reported by jenkins (https://integration.wikimedia.org/ci/job/tox-docker/3" [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[08:36:33] (PS2) Gehel: wdqs: redirect stderr from cron jobs to log file [puppet] - https://gerrit.wikimedia.org/r/456345
[08:36:44] Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (MoritzMuehlenhoff) I worked on a backport of the driver 4.9 and I got to the point where the driver loaded along with the firmware, but there were runtime issues which caused connection fail...
[08:37:54] (CR) Gehel: [C: 2] wdqs: redirect stderr from cron jobs to log file [puppet] - https://gerrit.wikimedia.org/r/456345 (owner: Gehel)
[08:39:14] (CR) Arturo Borrero Gonzalez: [C: 2] "The compiler verifies this:" [puppet] - https://gerrit.wikimedia.org/r/456572 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[08:39:21] (PS3) Arturo Borrero Gonzalez: cloudvps: rabbitmq: monitoring user is administrator [puppet] - https://gerrit.wikimedia.org/r/456572 (https://phabricator.wikimedia.org/T203177)
[08:42:18] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.3 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:43:01] Operations, Wikimedia-Mailing-lists: Mailing list for Wiki Indaba Steering Committee - https://phabricator.wikimedia.org/T203222 (Aklapper) Hi @Vikoula5. Please see https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list for required information. Thanks!
[08:44:22] (PS1) Gilles: Preserve EXIF ImageDescription instead of XMP Description [puppet] - https://gerrit.wikimedia.org/r/456575 (https://phabricator.wikimedia.org/T20871)
[08:44:59] (PS1) Gehel: logstash: move elasticsearch data directory [puppet] - https://gerrit.wikimedia.org/r/456576 (https://phabricator.wikimedia.org/T198351)
[08:45:37] godog: data move for the second half of the logstash cluster ^
[08:45:44] if you have a minute to review (trivial)
[08:46:09] gehel: sure, taking a look
[08:46:32] (CR) Filippo Giunchedi: [C: 1] logstash: move elasticsearch data directory [puppet] - https://gerrit.wikimedia.org/r/456576 (https://phabricator.wikimedia.org/T198351) (owner: Gehel)
[08:47:08] godog: thansk!
[08:52:49] (CR) Gehel: [C: 2] logstash: move elasticsearch data directory [puppet] - https://gerrit.wikimedia.org/r/456576 (https://phabricator.wikimedia.org/T198351) (owner: Gehel)
[08:53:13] Operations, Dumps-Generation: Reboots of dumps/snapshot hosts for L1TF/microcode updates - https://phabricator.wikimedia.org/T202623 (ArielGlenn) Open>Resolved p:Triage>Normal
[08:57:04] !log elasticsearch data directory migration on all logstash nodes
[08:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:20] godog: soo much easier to restart elasticsearch when there is no data on the node :)
[08:57:38] (PS1) Gilles: Use jessie-backports version of haproxy [puppet] - https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765)
[09:01:14] (PS1) Muehlenhoff: Also remove obsolete Hiera host file for silver [puppet] - https://gerrit.wikimedia.org/r/456579 (https://phabricator.wikimedia.org/T191357)
[09:02:44] (CR) Muehlenhoff: [C: 2] Also remove obsolete Hiera host file for silver [puppet] - https://gerrit.wikimedia.org/r/456579 (https://phabricator.wikimedia.org/T191357) (owner: Muehlenhoff)
[09:18:48] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 39.84 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:25:28] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 81.61 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:32:03] (PS1) Muehlenhoff: Switch cloudvirt1023 temporarily to stretch for kernel tests [puppet] - https://gerrit.wikimedia.org/r/456580
[09:35:46] (CR) Filippo Giunchedi: [C: 1] "Adding Jaime/Manuel since they also use haproxy, no concerns using the jessie-backports version from me."
[puppet] - https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765) (owner: Gilles)
[09:36:28] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 53.93 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:37:43] (CR) Jcrespo: [C: 1] "We don't have any jessie proxy- I would strongly recommend to avoid jessie for haproxy unless necessary (but this doesn't affect me)." [puppet] - https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765) (owner: Gilles)
[09:39:49] (CR) Jcrespo: [C: 1] "Unrelated and out of scope, but the create /run/haproxy bellow should be done by tempfile or systemd on >= buster, not on puppet." [puppet] - https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765) (owner: Gilles)
[09:43:02] I've silenced the traffic ulsfo alert for three hours btw
[09:43:47] (CR) Muehlenhoff: [C: 2] Switch cloudvirt1023 temporarily to stretch for kernel tests [puppet] - https://gerrit.wikimedia.org/r/456580 (owner: Muehlenhoff)
[09:43:57] (PS2) Filippo Giunchedi: Use jessie-backports version of haproxy [puppet] - https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765) (owner: Gilles)
[09:45:08] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.95 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:46:08] !log installing libx11 security updates on trusty
[09:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:34] (CR) Filippo Giunchedi: [C: 1] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765) (owner: Gilles)
[09:46:37] (CR) Filippo Giunchedi: [C: 2] Use jessie-backports version of haproxy [puppet] -
https://gerrit.wikimedia.org/r/456578 (https://phabricator.wikimedia.org/T187765) (owner: Gilles)
[09:51:12] (PS4) Fdans: Add druid snapshot removal cron job [puppet] - https://gerrit.wikimedia.org/r/455605 (https://phabricator.wikimedia.org/T197889)
[10:00:16] (PS1) Volans: confctl: add set_and_verify() method [software/spicerack] - https://gerrit.wikimedia.org/r/456584 (https://phabricator.wikimedia.org/T199079)
[10:00:18] (PS1) Volans: mediawiki: refactor to use confctl set_and_verify [software/spicerack] - https://gerrit.wikimedia.org/r/456585 (https://phabricator.wikimedia.org/T199079)
[10:00:20] (PS1) Volans: dnsdisc: add a pool() and depool() methods [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079)
[10:02:12] Operations, Wikimedia-Mailing-lists: Mailing list for Wiki Indaba Steering Committee - https://phabricator.wikimedia.org/T203222 (Vikoula5)
[10:03:47] (CR) Elukey: Add druid snapshot removal cron job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/455605 (https://phabricator.wikimedia.org/T197889) (owner: Fdans)
[10:04:35] Operations, Wikimedia-Mailing-lists: Mailing list for Wiki Indaba Steering Committee - https://phabricator.wikimedia.org/T203222 (Vikoula5) hi @Aklapper. I edit my task.
I hope it's okay now
[10:05:22] (PS5) Fdans: Add druid snapshot removal cron job [puppet] - https://gerrit.wikimedia.org/r/455605 (https://phabricator.wikimedia.org/T197889)
[10:06:31] (PS6) Elukey: Add druid snapshot removal cron job [puppet] - https://gerrit.wikimedia.org/r/455605 (https://phabricator.wikimedia.org/T197889) (owner: Fdans)
[10:08:08] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/12311/analytics1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/455605 (https://phabricator.wikimedia.org/T197889) (owner: Fdans)
[10:25:00] (PS1) Volans: sre.switchdc.mediawiki: add Phase 5 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079)
[10:27:24] (PS2) Volans: sre.switchdc.mediawiki: add Phase 5 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079)
[10:28:01] (CR) Volans: sre.switchdc.mediawiki: add Phase 5 cookbooks (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:30:55] (PS1) Volans: sre.switchdc.mediawiki: add Phase 6 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079)
[10:33:36] Operations, cloud-services-team, netops: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (MoritzMuehlenhoff) @ayounsi : I can still reproduce this with an installation of cloudvirt1023, I can see in syslog that atftpd is serving lpxelinux.0 to 10...
[10:40:35] (PS1) Volans: sre.switchdc.mediawiki: add Phase 7 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079)
[10:46:32] (CR) Jcrespo: [C: 1] "Nothing to review here" [cookbooks] - https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:49:40] !log installing libgd2 security updates on trusty
[10:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:32] Operations, Services: Create nodejs 10 packages - https://phabricator.wikimedia.org/T203239 (MoritzMuehlenhoff)
[11:24:10] (CR) Elukey: [C: 1] mediawiki::web::prod_sites: make includes explicit in more wikis [puppet] - https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[11:25:49] Operations, Services: Create nodejs 10 packages - https://phabricator.wikimedia.org/T203239 (MoritzMuehlenhoff) p:Triage>Normal a:MoritzMuehlenhoff
[11:26:02] Operations, TCB-Team, WMDE-QWERTY-Team, wikidiff2, WMDE-QWERTY-Sprint-2018-08-29: Release and deploy wikidiff2 v1.7.3 - https://phabricator.wikimedia.org/T202301 (WMDE-Fisch) a:MoritzMuehlenhoff
[11:34:29] Operations, TCB-Team, WMDE-QWERTY-Team, wikidiff2, WMDE-QWERTY-Sprint-2018-08-29: Release and deploy wikidiff2 v1.7.3 - https://phabricator.wikimedia.org/T202301 (WMDE-Fisch) Since the deployment of the v1.8.0 is a bit more time consuming and should happen with a bit more precaution due to th...
[11:35:18] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 54.98, 34.00, 22.21
[11:50:55] (PS1) Gilles: Increase per-original thumbnail throttle for prerender [puppet] - https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135)
[11:53:05] (CR) Elukey: [C: 1] "Looks good to me, even if I wasn't able to get a good pcc diff (https://puppet-compiler.wmflabs.org/compiler02/12312/mw1269.eqiad.wmnet/)." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[11:54:16] (CR) Gehel: [C: 1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/456584 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[11:57:53] (CR) Gehel: [C: -1] "minor comments inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[12:00:09] (CR) Gehel: [C: 1] "LGTM (very minor comment inline, feel free to ignore, or merge directly after correction)."
(1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456585 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[12:03:16] (PS2) Volans: sre.switchdc.mediawiki: add Phase 2 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/456503 (https://phabricator.wikimedia.org/T199079)
[12:03:46] (PS2) Volans: sre.switchdc.mediawiki: add Phase 7 cookbook [cookbooks] - https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079)
[12:04:38] PROBLEM - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100%
[12:05:16] (CR) Volans: mediawiki: refactor to use confctl set_and_verify (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456585 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[12:05:24] (CR) Gehel: [C: -1] dnsdisc: add a pool() and depool() methods (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[12:08:08] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 10.27, 12.94, 23.67
[12:09:07] (CR) Volans: "replies inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[12:11:26] (PS1) Jcrespo: mariadb-backups: Provide backup file metadata information [puppet] - https://gerrit.wikimedia.org/r/456608 (https://phabricator.wikimedia.org/T198987)
[12:15:05] (CR) Elukey: [C: 1] "Looks good to me, I checked all the includes replacements and they are correct as far as I can see.
One nit: I saw that you respected each" [puppet] - 10https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [12:16:01] (03CR) 10Jcrespo: "Example output so far: https://phabricator.wikimedia.org/T198987#4548356" [puppet] - 10https://gerrit.wikimedia.org/r/456608 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [12:23:38] (03CR) 10Elukey: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [12:30:48] (03PS1) 10Jcrespo: mariadb-backups: Calculate total backup size [puppet] - 10https://gerrit.wikimedia.org/r/456613 (https://phabricator.wikimedia.org/T198987) [12:34:01] (03PS2) 10Volans: dnsdisc: add a pool() and depool() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) [12:34:15] (03CR) 10Volans: dnsdisc: add a pool() and depool() methods (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:40:58] (03CR) 10Gehel: sre.switchdc.mediawiki: add Phase 5 cookbooks (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:44:13] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10elukey) p:05Triage>03Normal [12:47:08] (03CR) 10Elukey: [C: 031] mediawiki::web::prod_sites: expand the includes in sites in main.conf (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [12:51:04] (03CR) 10Gehel: [C: 031] dnsdisc: add a pool() and depool() methods (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:52:20] (03CR) 10Gehel: [C: 031] mediawiki: refactor to use 
confctl set_and_verify (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456585 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:52:47] (03CR) 10Volans: [C: 032] confctl: add set_and_verify() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/456584 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:53:52] (03Merged) 10jenkins-bot: confctl: add set_and_verify() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/456584 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:54:10] (03CR) 10Volans: [C: 032] mediawiki: refactor to use confctl set_and_verify [software/spicerack] - 10https://gerrit.wikimedia.org/r/456585 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:55:11] (03Merged) 10jenkins-bot: mediawiki: refactor to use confctl set_and_verify [software/spicerack] - 10https://gerrit.wikimedia.org/r/456585 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:56:01] (03CR) 10Volans: dnsdisc: add a pool() and depool() methods (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:57:29] (03CR) 10Gehel: "lgtm, trivial enough" [cookbooks] - 10https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [12:58:22] (03PS3) 10Volans: dnsdisc: add a pool() and depool() methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) [13:00:37] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T196507 (10Bstorm) Aaaand, it's throwing an error again. Says no battery. The battery doesn't show when I ask the controller for status. for instance: ``` Smart Array P840 in Slot 1 Controller Status: OK Cach... 
[13:04:57] (PS1) Volans: mediawiki: add check_cronjobs_disabled() method [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079)
[13:14:47] (CR) Elukey: "I left some comments but I am not sure if they are wrong or right, pcc is still not really clear when comparing the vhost diffs, so I'd wa" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/452323 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[13:20:55] (CR) Gehel: [C: 1] mediawiki: add check_cronjobs_disabled() method (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:29:53] (PS1) Thcipriani: Beta: remove npm from deployment master [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561)
[13:38:26] (CR) Gehel: [C: 1] dnsdisc: add a pool() and depool() methods [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:46:40] (CR) Volans: [C: 2] dnsdisc: add a pool() and depool() methods [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:47:44] (Merged) jenkins-bot: dnsdisc: add a pool() and depool() methods [software/spicerack] - https://gerrit.wikimedia.org/r/456586 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:51:11] (PS19) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[13:51:48] (CR) Volans: mediawiki: add check_cronjobs_disabled() method (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[13:51:52] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[13:52:00] (PS2) Volans: mediawiki: add check_cronjobs_disabled() method [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079)
[13:52:17] (PS1) Volans: mysql: add get_dbs() method [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079)
[13:52:57] (CR) Alex Monk: "Great. Need to remember to drop the "Try to fix npm package on deployment-deploy01" cherry-pick (currently 47c986fdd3a74808e725fdd7aad404c" [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[13:53:14] (CR) Alex Monk: "and actually, probably absent the stuff it added" [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[13:57:12] (PS20) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[13:57:27] Operations, Maps-Sprint, Traffic, Maps (Tilerator), and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway) https://github.com/kartotherian/tilerator/pull/43
[13:57:49] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[13:57:57] (CR) Gehel: [C: 1] mediawiki: add check_cronjobs_disabled() method (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:00:14] (CR) Gehel: "minor comments inline" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:01:57] (PS21) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[14:02:41] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[14:05:33] (PS22) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[14:06:14] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[14:06:22] dammit jenkins
[14:08:59] (CR) Muehlenhoff: "You can abandon this patch: microcode is now enabled for Wikimedia, but the patch which was eventually used is different, it's only being " [puppet] - https://gerrit.wikimedia.org/r/312714 (owner: Matanya)
[14:09:53] (PS23) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[14:10:12] (PS1) Jgreen: create frdeploy/README, clean up comments and typos [software] - https://gerrit.wikimedia.org/r/456637
[14:11:30] operations-puppet-tests-docker SUCCESS in 17s
[14:11:31] yay
[14:14:09] (PS2) Jgreen: create frdeploy/README, clean up comments and typos [software] - https://gerrit.wikimedia.org/r/456637
[14:16:09] Krenair: nice :D
[14:16:21] PROBLEM - High lag on wdqs2003 is CRITICAL: 3606 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:16:54] I may have cheated tiny little bit with a lint:ignore
[14:17:03] (CR) Jgreen: [C: 2] create frdeploy/README, clean up comments and typos [software] - https://gerrit.wikimedia.org/r/456637 (owner: Jgreen)
[14:17:21] I believe it to be reasonable but I guess we'll see what the reviewers think later
[14:17:44] vgutierrez, so I think we also said we'd make a service user for it as part of the package?
[14:17:51] (Merged) jenkins-bot: create frdeploy/README, clean up comments and typos [software] - https://gerrit.wikimedia.org/r/456637 (owner: Jgreen)
[14:18:05] Krenair: indeed
[14:18:12] gotta figure that bit out next
[14:18:25] oh and push my certcentral.git changes for packaging
[14:19:04] yep
[14:19:26] (PS1) Volans: sre.switchdc.mediawiki: add Phase 8 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079)
[14:19:51] (CR) Volans: [C: 2] mediawiki: add check_cronjobs_disabled() method [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:20:51] (Merged) jenkins-bot: mediawiki: add check_cronjobs_disabled() method [software/spicerack] - https://gerrit.wikimedia.org/r/456620 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:22:18] Krenair: hmm I see you're referring to /usr/local in the puppet code, as certcentral is being deployed through a package it should live in /usr/lib.. and CLI entry points in /usr/bin
[14:22:23] (PS1) Jgreen: frdeploy/README typos [software] - https://gerrit.wikimedia.org/r/456641
[14:23:02] vgutierrez, hm yep that's gonna break :/
[14:23:05] well spotted
[14:23:15] haven't done a CLI entry point yet
[14:24:32] (CR) Jgreen: [C: 2] frdeploy/README typos [software] - https://gerrit.wikimedia.org/r/456641 (owner: Jgreen)
[14:24:51] (PS24) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[14:25:04] (CR) Volans: mysql: add get_dbs() method (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:26:09] Krenair: we should create two entry points, one to spawn CertCentral.run() and one to do common actions like create an ACME Account, revoke a certificate...
[14:26:35] yep right now I just have a comment in the hieradata explaining how to create that by hand with the library :D
[14:26:51] Operations, Data-Services, Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083 (Bstorm)
[14:26:52] (CR) Gehel: [C: 1] mysql: add get_dbs() method (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:27:10] (PS9) Gehel: extract reporting from BaseEventHandler [software/cumin] - https://gerrit.wikimedia.org/r/451080
[14:27:28] volans: finally your turn to criticise my code ^
[14:27:31] Operations, ops-eqdfw: unrack/decom cr1-eqdfw - https://phabricator.wikimedia.org/T202700 (ayounsi)
[14:27:35] \o/
[14:27:50] (PS10) Gehel: extract reporting from BaseEventHandler [software/cumin] - https://gerrit.wikimedia.org/r/451080
[14:28:12] (PS2) Volans: mysql: add get_dbs() method [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079)
[14:29:46] (CR) Volans: [C: 2] mysql: add get_dbs() method [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:30:51] (Merged) jenkins-bot: mysql: add get_dbs() method [software/spicerack] - https://gerrit.wikimedia.org/r/456630 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:31:05] (CR) Volans: sre.switchdc.mediawiki: add Phase 4 cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/456511 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:32:57] (PS1) Vgutierrez: Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644
[14:34:24] (CR) jerkins-bot: [V: -1] Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (owner: Vgutierrez)
[14:35:37] (CR) Filippo Giunchedi: "+1 on the rationale, I'll merge this on Monday" [puppet] - https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: Gilles)
[14:37:15] (PS3) Volans: sre.switchdc.mediawiki: add Phase 5 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079)
[14:37:26] (CR) Volans: "replies inline" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[14:38:32] ACKNOWLEDGEMENT - High lag on wdqs1005 is CRITICAL: 1.316e+04 ge 3600 Gehel catching up after data reimport https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:38:33] ACKNOWLEDGEMENT - High lag on wdqs1010 is CRITICAL: 1.146e+04 ge 3600 Gehel catching up after data reimport https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[14:44:35] (PS10) Vgutierrez: Validate challenges before pushing them to the ACME directory [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711)
[14:44:37] (PS3) Vgutierrez: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711)
[14:44:39] (PS2) Vgutierrez: Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711)
[14:45:14] (CR) Vgutierrez: Validate challenges before pushing them to the ACME directory (4 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:45:42] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[14:45:56] (CR) jerkins-bot: [V: -1] ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:45:58] (CR) jerkins-bot: [V: -1] Validate challenges before pushing them to the ACME directory [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:46:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0
[14:46:27] (CR) jerkins-bot: [V: -1] Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:46:33] (PS3) Vgutierrez: Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711)
[14:47:34] (PS1) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[14:48:01] (CR) jerkins-bot: [V: -1] Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:48:16] (PS11) Vgutierrez: Validate challenges before pushing them to the ACME directory [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711)
[14:48:19] (PS4) Vgutierrez: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711)
[14:48:21] (PS4) Vgutierrez: Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711)
[14:48:28] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[14:49:38] (CR) jerkins-bot: [V: -1] ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:49:40] (CR) jerkins-bot: [V: -1] Validate challenges before pushing them to the ACME directory [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:51:45] (CR) Vgutierrez: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:52:03] (CR) Vgutierrez: "recheck" [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:57:51] (CR) Alex Monk: Validate challenges before pushing them to the ACME directory (2 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[14:58:11] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0
[14:58:12] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0
[14:58:37] Operations, Services: Create nodejs 10 packages - https://phabricator.wikimedia.org/T203239 (MoritzMuehlenhoff) One notable change which is to be expected from moving to 10: Some node modules ship binary blobs in their modules and the official node packages are built against OpenSSL 1.0.2. nodejs 10 onl...
[15:02:10] Operations, Maps-Sprint, Traffic, Maps (Tilerator), and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (Mholloway) This is ready to deploy to beta as soon as the deploy config template and puppet changes land.
[15:03:06] (CR) Alex Monk: [C: 2] Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:07:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0
[15:07:38] vgutierrez, so what's the normal mechanism for having a package create system users?
[15:07:55] add in some script that just does adduser etc.?
[15:08:24] best to add it in postinst, let me find an example
[15:09:01] e.g. https://github.com/wikimedia/operations-debs-prometheus-rabbitmq-exporter/blob/master/debian/postinst
[15:10:50] thanks moritzm :)
[15:11:01] RECOVERY - Memory correctable errors -EDAC- on scb1002 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops
[15:11:33] (CR) Gehel: "minor comment inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:14:06] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (Imarlier) @MoritzMuehlenhoff Based on Aaron's comment, you are correct in your understanding...
[15:14:52] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0
[15:20:26] (CR) Gehel: sre.switchdc.mediawiki: add Phase 5 cookbooks (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:23:17] (PS2) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:23:25] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:28:44] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Balazs Pocze - https://phabricator.wikimedia.org/T202521 (MoritzMuehlenhoff) Balazs has been added to pwstore.
[15:28:53] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Balazs Pocze - https://phabricator.wikimedia.org/T202521 (MoritzMuehlenhoff)
[15:30:19] (CR) Bstorm: "Anomie, if I can get a review on this one, it would be awesome. I think using this approach would improve performance on the main views, " [puppet] - https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm)
[15:31:50] Operations, Cloud-Services, Patch-For-Review: nfs-manage failover script needs to be tested with real load and fixed - https://phabricator.wikimedia.org/T169570 (Bstorm) a:Bstorm Just to put an assignee on this one, since I'm thinking about it a lot.
[15:32:55] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (Bstorm) Please @Cmjohnson! :)
[15:33:53] (PS12) Alex Monk: Validate challenges before pushing them to the ACME directory [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:33:55] (PS5) Alex Monk: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:33:57] (PS5) Alex Monk: Provide logging [software/certcentral] - https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[15:34:00] (PS3) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:34:42] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:36:01] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[15:36:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0
[15:37:21] (PS2) Volans: sre.switchdc.mediawiki: add Phase 8 cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079)
[15:37:33] (CR) Volans: "done" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:38:32] (PS4) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:38:32] PROBLEM - Filesystem available is greater than filesystem size on ms-be2043 is CRITICAL: cluster=swift device=/dev/sdk1 fstype=xfs instance=ms-be2043:9100 job=node mountpoint=/srv/swift-storage/sdk1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops
[15:39:24] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:45:23] (PS5) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:46:11] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:46:15] Operations, Mail: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (faidon) p:Triage>Normal
[15:48:06] (CR) Volans: sre.switchdc.mediawiki: add Phase 5 cookbooks (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[15:55:45] (PS6) Alex Monk: Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646
[15:56:54] (CR) jerkins-bot: [V: -1] Packaging stuff and readme [software/certcentral] - https://gerrit.wikimedia.org/r/456646 (owner: Alex Monk)
[15:57:26] (PS3) Faidon Liambotis: Kill wiki-mail.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/143762
[15:58:36] Is it OK if I do an emergency deploy for an UBN – T203213?
[15:58:36] T203213: Visual Editor works as LTR in all RTL Wikimedia wikis - https://phabricator.wikimedia.org/T203213
[16:04:05] (Taking silence as assent; I have the conch.)
[16:08:14] (PS1) BBlack: cache_text: inject Vary:AL for fixcopyrightwiki [puppet] - https://gerrit.wikimedia.org/r/456656 (https://phabricator.wikimedia.org/T203179)
[16:10:04] Operations, netops: cr2-eqdfw (MX204) vhclient log noise - https://phabricator.wikimedia.org/T203261 (faidon) p:Triage>Normal
[16:12:18] Operations, Mail, User-herron: Outdated TLS config for MXes - https://phabricator.wikimedia.org/T203260 (herron)
[16:27:19] (CR) Anomie: [C: 1] "Sorry, I haven't had a chance to catch up on the backlog of review requests since my vacation for the first half of August, since I came b" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm)
[16:30:16] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: Hot-deploy of I38eda4aac48 to fix T203213 (duration: 00m 54s)
[16:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:23] T203213: Visual Editor works as LTR in all RTL Wikimedia wikis - https://phabricator.wikimedia.org/T203213
[16:32:03] I surrender the conch.
[16:52:47] (PS1) Dzahn: admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910)
[16:55:22] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1186 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:56:55] (PS2) Dzahn: admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910)
[17:05:41] (CR) Cwhite: [C: 1] admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910) (owner: Dzahn)
[17:05:58] (PS2) Dzahn: noc: Add Cache-Control with short max-age for noc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/456206 (https://phabricator.wikimedia.org/T202734) (owner: Krinkle)
[17:06:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:10:32] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:10:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0
[17:15:02] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[17:17:53] marlier: ssh to .. for example mc1019.eqiad.wmnet , does it work for you? that would confirm we can close T202657
[17:17:53] T202657: request to add imarlier to perf-roots - https://phabricator.wikimedia.org/T202657
[17:19:14] it should, i see you have a home and the group membership
[17:22:16] I'm going to need to do another (entirely unrelated) emergency deploy, I'm afraid.
[17:26:26] mutante: it works!
[17:26:48] Operations, SRE-Access-Requests, Patch-For-Review: request to add imarlier to perf-roots - https://phabricator.wikimedia.org/T202657 (Imarlier) Thumbs up!
[17:27:07] marlier: cool :) thx
[17:27:29] Thank you! (And Ariel!)
[17:28:04] you're welcome, closing the ticket
[17:28:30] Operations, SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (Dzahn)
[17:28:33] Operations, SRE-Access-Requests, Patch-For-Review: request to add imarlier to perf-roots - https://phabricator.wikimedia.org/T202657 (Dzahn) Open>Resolved 13:17 < mutante> marlier: ssh to .. for example mc1019.eqiad.wmnet , does it work for you? that would confirm we can close T202657 13:26 <...
[17:35:54] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (sguebo_WMF) Hi @ArielGlenn, the access works just fine. Thanks!
[17:37:20] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (Dzahn) Open>Resolved a:Dzahn Thanks for confirming.
[17:37:36] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (Dzahn) a:Dzahn>ArielGlenn
[17:39:48] !log repooled wdqs1005
[17:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:23] Operations, SRE-Access-Requests: Please add everyone on the performance team to perf-roots - https://phabricator.wikimedia.org/T202648 (Dzahn)
[17:53:31] Operations, SRE-Access-Requests, Patch-For-Review: Please add aaron to perf-team - https://phabricator.wikimedia.org/T202650 (Dzahn) Open>Resolved This has been answered on T202910#4544447 ff This patch will remove him from global root: https://gerrit.wikimedia.org/r/456663 We can close th...
[17:55:29] @seen phedenskog
[17:55:29] mutante: Last time I saw phedenskog they were joining the channel, they are still in the channel #wikimedia-overflow at 8/24/2018 2:30:12 AM (7d15h25m16s ago)
[17:56:02] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 16 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops
[17:57:45] !log depooled wtp2020 because icinga reported memory errors
[17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:18] (CR) Paladox: [C: 1] "Tested locally and works for me. Just need to get ib3 packaged as a deb then we can install with apt." [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[18:10:08] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.19/includes/Title.php: Hot-deploy of I05eea553c58 to let users edit [[Copyright]] again (duration: 00m 50s)
[18:10:13] OK, deployed.
[18:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:20] Conch up for grabs again.
[18:13:46] Operations: wtp2020 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T203265 (Dzahn)
[18:14:22] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 16 ge 4 daniel_zahn https://phabricator.wikimedia.org/T203265 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops
[18:14:29] James_F: thanks for announcing this here
[18:14:56] Always.
[18:15:25] Ideally we'd have a "real" bot-managed conch for prod deploys rather than rely on human eyeballs.
[18:16:52] yes, a jouncebot command could do it, true
[18:17:43] It's a perennially popular HackerNews topic, maybe we could steal one rather than build it (yet) again.
[18:17:48] might be worth a ticket to add the feature
[18:17:55] * James_F nods.
[18:18:30] well, the current ones are already using a framework to make bots afaict
[18:18:47] but i also have this idea to puppetize eggdrop, the oldest IRC bot around, rock stable :PP
[18:18:52] and then we write TCL scripts :p
[18:19:01] Ha.
[18:19:12] i have an actual gerrit change but sitting there forever as WIP
[18:19:44] https://github.com/eggtcl/eggtcl
[18:20:57] Filed as T203267.
[18:20:58] T203267: Consider having an IRC bot to manage the deployment "conch" for out-of-band (and scheduled?) deploys - https://phabricator.wikimedia.org/T203267
[18:20:59] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/320698/3/modules/eggdrop/templates/eggdrop.conf.erb
[18:21:11] Operations, Traffic, Patch-For-Review, User-Urbanecm: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (BBlack) ^ I'm going to merge this up shortly. It's pretty un-dangerous to other traffic and it ensures the ULS Accept-Language stuff won't have an...
[18:21:19] ok, cool
[18:22:21] Operations, Traffic, Patch-For-Review, User-Urbanecm: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (CCicalese_WMF) That's great news! Thank you!
[18:37:30] (CR) BBlack: [C: 2] cache_text: inject Vary:AL for fixcopyrightwiki [puppet] - https://gerrit.wikimedia.org/r/456656 (https://phabricator.wikimedia.org/T203179) (owner: BBlack)
[18:44:37] (CR) RobH: [C: 1] admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910) (owner: Dzahn)
[18:59:25] Operations, Scap, Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.5-1 - https://phabricator.wikimedia.org/T203271 (thcipriani) p:Triage>Normal
[18:59:43] Operations: cp3038, cp3039 - power supply redundancy failure - https://phabricator.wikimedia.org/T203272 (Dzahn)
[19:00:16] Operations, ops-esams: cp3038, cp3039 - power supply redundancy failure - https://phabricator.wikimedia.org/T203272 (Dzahn)
[19:01:02] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3039 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] daniel_zahn rhttps://phabricator.wikimedia.org/T203272
[19:01:13] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3038 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] daniel_zahn rhttps://phabricator.wikimedia.org/T203272
[19:05:33] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be2043 is CRITICAL: cluster=swift device=/dev/sdk1 fstype=xfs instance=ms-be2043:9100 job=node mountpoint=/srv/swift-storage/sdk1 site=codfw daniel_zahn https://phabricator.wikimedia.org/T199198 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops
[19:06:49] Operations, media-storage,
User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (Dzahn) It broke today on ms-be2043 I ACKed https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be2043&service=Filesystem+available+is+grea...
[19:09:46] Operations, Scap, Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.5-1 - https://phabricator.wikimedia.org/T203271 (thcipriani) a:thcipriani>None
[19:12:17] !log ms-be2043 - following instructions at https://wikitech.wikimedia.org/wiki/Graphite#Repair_xfs_misreporting_free_space to repair xfs misreporting free space (T199198), fixing docs, icinga-downtime doesn't want fqdn but short name
[19:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:23] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[19:18:26] (CR) Alex Monk: "well the idea of doing it this way is to avoid having to get ops to sort us out a deb" [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[19:19:53] (CR) Paladox: [C: 1] "> well the idea of doing it this way is to avoid having to get ops to" [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[19:21:38] James_F, you mean the scap lock is not enough? :p
[19:22:04] Indeed.
[19:22:49] (CR) Gehel: [C: 1] sre.switchdc.mediawiki: add Phase 8 cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[19:23:46] ACKNOWLEDGEMENT - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T203244
[19:26:23] (CR) Gehel: [C: 1] sre.switchdc.mediawiki: add Phase 5 cookbooks (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[19:26:32] ACKNOWLEDGEMENT - Device not healthy -SMART- on cloudelastic1002 is CRITICAL: cluster=misc device=sdb instance=cloudelastic1002:9100 job=node site=eqiad daniel_zahn https://phabricator.wikimedia.org/T194186 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudelastic1002&var-datasource=eqiad%2520prometheus%252Fops
[19:27:43] Operations, Cloud-VPS, cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (Dzahn) icinga reports that on cloudelastic1002 device sdb is not healthy per SMART cluster=misc device=sdb instance=cloudelastic1002:9100 job=node sit...
[19:28:15] (PS8) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[19:28:47] (CR) Alex Monk: "no it doesn't" [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[19:28:49] cookbooks? spicerack?
[19:28:51] PROBLEM - Host scb2005 is DOWN: PING CRITICAL - Packet loss = 100%
[19:29:19] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up.
[software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[19:30:01] Operations, ops-codfw, netops: Switch port configuration for backup2001 - https://phabricator.wikimedia.org/T196782 (Dzahn) icinga is reporting that on backup2010 there is "enp59s0f1 reporting no carrier." since about 11h 9m https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=backup...
[19:30:21] RECOVERY - Host scb2005 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[19:31:09] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (Dzahn) icinga is reporting that on backup2001 there is "enp59s0f1 reporting no carrier." since about 11h 9m https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=backup2...
[19:31:42] ACKNOWLEDGEMENT - configured eth on backup2001 is CRITICAL: enp59s0f1 reporting no carrier. daniel_zahn https://phabricator.wikimedia.org/T196477
[19:32:04] (CR) Legoktm: "It would be nice to mention the license in the README and have an explicit COPYING file." [cookbooks] - https://gerrit.wikimedia.org/r/454559 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[19:32:14] (CR) Mathew.onipe: "> Patch Set 7:" (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[19:39:01] RECOVERY - Filesystem available is greater than filesystem size on ms-be2043 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops
[19:39:10] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Dzahn) Icinga is reporting that the new proton endpoints are not healthy, since about 2d 7h, on both proton1001 and proton2001. The reason gi...
[19:39:42] ACKNOWLEDGEMENT - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICA
[19:39:42] ar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T186748#4549439
[19:39:52] ACKNOWLEDGEMENT - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICA
[19:39:52] ar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200) daniel_zahn https://phabricator.wikimedia.org/T186748#4549439
[19:41:55] Operations, media-storage, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (Dzahn) < icinga-wm> RECOVERY - Filesystem available is greater than filesystem size on
ms-be2043 is OK: All metrics within thresholds. but command still runn...
[19:43:52] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Pchelolo) @Dzahn no need to worry - this is not used in prod yet. @pmiazga could you please silence the icinga for the time being?
[19:46:29] !log right when it was fixed on ms-be2043 it also broke on ms-be2040. following the same instructions to fix xfs in a root screen (T199198)
[19:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:34] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[19:46:41] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdi1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdi1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[19:47:22] ACKNOWLEDGEMENT - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdi1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdi1 site=codfw daniel_zahn https://phabricator.wikimedia.org/T199198 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[19:50:06] (CR) Gehel: "And a few more comments! This is close to being ready. Once those comments are implemented, we can start on writing tests."
(6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[19:50:16] (CR) Paladox: [C: 1] "> no it doesn't" [puppet] - https://gerrit.wikimedia.org/r/455277 (https://phabricator.wikimedia.org/T48254) (owner: Alex Monk)
[19:55:58] (CR) Volans: "> Patch Set 4:" [cookbooks] - https://gerrit.wikimedia.org/r/454559 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[19:56:43] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Dzahn) >>! In T186748#4549452, @Pchelolo wrote: > @Dzahn no need to worry - this is not used in prod yet. Yep, thanks for confirming. > @pm...
[19:56:54] (CR) Hashar: [C: 1] "Note npm (and maybe nodejs as well) would need to be purged manually." [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[20:00:11] Operations, Availability (MediaWiki-MultiDC), Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (aaron) >>! In T196378#4254123, @jcrespo wrote: > > The main blocker right now is to decide on a tunneling technology, as most seem to...
[20:02:23] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Pchelolo) @Dzahn thank you for taking care of it! > Except now we have to remember to remove that again once this goes production. Now we...
[20:02:54] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Pchelolo) And, @pmiazga would be great to find out why is it failing the checks
[20:10:44] (PS2) Andrew Bogott: openstack eqiad1: Run dns_floating_ip_updater [puppet] - https://gerrit.wikimedia.org/r/445310 (https://phabricator.wikimedia.org/T199374) (owner: Alex Monk)
[20:10:49] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (Papaul) @Dzahn you can disable those alerts @MoritzMuehlenhoff is running some test on that server.
[20:14:12] (CR) Andrew Bogott: [C: 2] openstack eqiad1: Run dns_floating_ip_updater [puppet] - https://gerrit.wikimedia.org/r/445310 (https://phabricator.wikimedia.org/T199374) (owner: Alex Monk)
[20:14:48] (PS1) Rush: admin: script to rush home directory [puppet] - https://gerrit.wikimedia.org/r/456690
[20:15:27] (CR) jerkins-bot: [V: -1] admin: script to rush home directory [puppet] - https://gerrit.wikimedia.org/r/456690 (owner: Rush)
[20:17:01] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:18:21] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:19:45] (PS1) Bstorm: block_sync: Small improvement to the drbd backup script [puppet] - https://gerrit.wikimedia.org/r/456740 (https://phabricator.wikimedia.org/T171394)
[20:22:02] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 320 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:22:09] (CR) Gehel: [C: -1] Elasticsearch module is coming up.
(1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:23:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:29:38] Operations, Analytics, ContentTranslation, SRE-Access-Requests: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704 (Petar.petkovic)
[20:29:58] (PS9) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[20:30:50] (CR) Mathew.onipe: "> Patch Set 8: Code-Review-1" (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:31:13] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:32:06] Operations, ops-codfw, Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (Dzahn) Yep, thanks Papaul. I realized after making the comment here. Done.
[20:34:33] Operations, Analytics, ContentTranslation, SRE-Access-Requests: Add amire80 to analytics-privatedata-users group - https://phabricator.wikimedia.org/T122524 (Petar.petkovic)
[20:35:01] (PS3) Dzahn: admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910)
[20:37:56] (PS10) Andrew Bogott: Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374)
[20:38:46] (CR) Andrew Bogott: [C: 2] Delegate 185.15.56.0/24 to labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/445303 (https://phabricator.wikimedia.org/T199374) (owner: Andrew Bogott)
[20:41:01] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:46:02] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:46:51] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[21:05:42] * Krinkle staging on deployment/mwdebug1002
[21:11:31] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (pmiazga) I think (I didn't verify it yet) that it fails because of introduced restbase checks: @Pchelolo @Dzahn I assume this error happens...
[21:11:55] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/includes/jobqueue/jobs/: Id2852d73d00 (1/2) (duration: 00m 52s)
[21:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:54] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/includes/deferred/: Id2852d73d00 (2/2) (duration: 00m 55s)
[21:14:57] * Krinkle releases deploy handle
[21:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:45] (CR) Gehel: "We're getting close! Inline comments are really minor." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[21:43:21] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:47:51] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:01:16] (CR) Aaron Schulz: [C: 1] admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910) (owner: Dzahn)
[22:02:32] (CR) Dzahn: [C: 2] admins: remove aaron from ops [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910) (owner: Dzahn)
[22:07:37] (CR) Dzahn: "confirmed working as expected.
key got removed cleanly for example on planet1001 but access on mwdebug1001 unchanged due to perf-roots, on" [puppet] - https://gerrit.wikimedia.org/r/456663 (https://phabricator.wikimedia.org/T202910) (owner: Dzahn)
[22:08:23] Operations, SRE-Access-Requests, Patch-For-Review, Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (Dzahn) Open>Resolved a:Dzahn This should conclude the ticket. Please reopen if a...
[22:23:14] Operations, Electron-PDFs, Proton, Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Dzahn) >>! In T186748#4549704, @pmiazga wrote: > @Dzahn could you provide me the full URL the service checker is requesting for those two cal...
[22:41:06] Operations, Wikimedia-Mailing-lists: Mailing list for Wiki Indaba Steering Committee - https://phabricator.wikimedia.org/T203222 (Dzahn) You have successfully created the mailing list wiscom and notification has been sent to the list owner vikoula5@yahoo.fr. You can now: [[ https://lists.wikimedia.org/m...
[22:42:01] Operations, Wikimedia-Mailing-lists: Mailing list for Wiki Indaba Steering Committee - https://phabricator.wikimedia.org/T203222 (Dzahn) Open>Resolved a:Dzahn Note there is now the special email address wiscom-owner@lists.wikimedia.org to reach all the admins at once. Also see the "list run...
[22:44:00] Operations, SRE-Access-Requests: Access to restbase servers (including sudo) for Imarlier - https://phabricator.wikimedia.org/T202563 (Dzahn) a:VColeman>ArielGlenn
[22:51:51] (PS1) Dzahn: admins: add ccicalese to analytics-privatedata-admins [puppet] - https://gerrit.wikimedia.org/r/456763 (https://phabricator.wikimedia.org/T203182)
[22:53:40] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (Dzahn) analytics-privatedata-users sounds right per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive...
[22:55:51] Operations, SRE-Access-Requests, Patch-For-Review, User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (Dzahn) a:gabriel-wmde
[22:56:58] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (Dzahn) a:Thargrovewmf
[22:57:26] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (Dzahn) a:Joe>jijiki
[22:57:54] * Krinkle staging on deployment/mwdebug1002
[22:58:13] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Balazs Pocze - https://phabricator.wikimedia.org/T202521 (Dzahn) a:Marostegui>Banyek @Banyek Is this resolved from your point of view?
[22:58:48] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (Dzahn) a:Kalliope
[22:59:12] Operations, SRE-Access-Requests, Patch-For-Review: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (Dzahn) a:Peter
[23:00:34] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/WikidataPageBanner/: I9dc9a4c1fb62c4 - T199855 (duration: 00m 51s)
[23:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:40] T199855: WikidataPageBanner breaks Vector page subtitle - https://phabricator.wikimedia.org/T199855
[23:00:51] Operations, SRE-Access-Requests, Release-Engineering-Team (Watching / External): Add contint-roots to releases{1,2}001 - https://phabricator.wikimedia.org/T201470 (Dzahn) a:ArielGlenn
[23:01:10] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (Dzahn) a:Dzahn
[23:02:46] Operations, Performance-Team, Wikimedia-Mailing-lists, User-herron: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (Dzahn) What's the blocker on this one? Running disable_list is probably good enough. The archives would still exist. h...
[23:07:35] (CR) Alex Monk: [C: 1] "we should talk about the PS9 and PS11 comments - haven't merged yet mainly because otherwise I will forget about it" [software/certcentral] - https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: Vgutierrez)
[23:10:14] (PS1) Dzahn: backup::host: update comment on ::base::firewall [puppet] - https://gerrit.wikimedia.org/r/456769
[23:11:37] (CR) Dzahn: [C: 2] "just a comment cleanup" [puppet] - https://gerrit.wikimedia.org/r/456769 (owner: Dzahn)
[23:12:50] (CR) Dzahn: "you can now safely abandon this. done in" [puppet] - https://gerrit.wikimedia.org/r/383519 (owner: Giuseppe Lavagetto)
[23:13:39] (CR) Dzahn: "ping paladox. bump" [puppet] - https://gerrit.wikimedia.org/r/423794 (owner: Chad)
[23:13:58] (CR) Dzahn: [C: -1] "ping paladox, bump" [puppet] - https://gerrit.wikimedia.org/r/434605 (owner: Paladox)
[23:20:57] (PS2) Dzahn: Beta: remove npm from deployment master [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[23:21:46] (CR) Dzahn: [C: 2] Beta: remove npm from deployment master [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[23:22:31] (CR) Dzahn: [C: 2] "as hashar said, please manually purge them. (alternative would have been to set them to absent here, but i think it's fine this way)" [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[23:24:20] (PS3) Dzahn: mediawiki: Remove unneeded file decleration on wikidata maintenance script [puppet] - https://gerrit.wikimedia.org/r/454543 (owner: Ladsgroup)
[23:24:44] (CR) Dzahn: [C: 2] "just a log file on maintenance that was already absented" [puppet] - https://gerrit.wikimedia.org/r/454543 (owner: Ladsgroup)
[23:26:04] (CR) Dzahn: [C: 2] "yep.
cannot open `/var/log/wikidata/rebuildTermSqlIndex.log' (No such file or directory)" [puppet] - https://gerrit.wikimedia.org/r/454543 (owner: Ladsgroup)
[23:28:17] (PS5) Dzahn: quarry::database: Use mariadb module instead of mysql module [puppet] - https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: Zhuyifei1999)
[23:29:02] (CR) jerkins-bot: [V: -1] quarry::database: Use mariadb module instead of mysql module [puppet] - https://gerrit.wikimedia.org/r/454481 (https://phabricator.wikimedia.org/T181205) (owner: Zhuyifei1999)
[23:32:52] Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (Dzahn)
[23:33:16] ACKNOWLEDGEMENT - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 76 hours old. daniel_zahn https://phabricator.wikimedia.org/T203290
[23:35:37] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Balazs Pocze - https://phabricator.wikimedia.org/T202521 (Banyek) Yes, it is resolved
[23:35:51] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/includes/Html.php: I67ceb34eabf2f - T200506 (duration: 00m 50s)
[23:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:02] T200506: Previewing a non-style-only gadget that you already have enabled causes a syntax error - https://phabricator.wikimedia.org/T200506
[23:36:17] (CR) Alex Monk: "tidied up" [puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[23:37:14] Operations, SRE-Access-Requests, Patch-For-Review: Onboarding Balazs Pocze - https://phabricator.wikimedia.org/T202521 (Dzahn) Open>Resolved :)
[23:37:47] (CR) Dzahn: [C: 2] "awesome, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/456625 (https://phabricator.wikimedia.org/T192561) (owner: Thcipriani)
[23:39:01] Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (Dzahn) 19:34 < nacc> that's probably a question for the canonical channel(s) mutante 19:35 < nacc> mutante: #canonical-sysadmin, i think? 19:35 -!- Irssi: Join to #canonical-sysadmin was synced in 2 secs 19:35 < mutante> we...
[23:40:11] Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (Dzahn) ` For help, please use RT ..| Although we idle here, please mail requests to rt@ubuntu.com `
[23:45:34] (CR) Krinkle: [C: 1] Preserve EXIF ImageDescription instead of XMP Description [puppet] - https://gerrit.wikimedia.org/r/456575 (https://phabricator.wikimedia.org/T20871) (owner: Gilles)
[23:46:01] (CR) Krinkle: [C: 1] Increase per-original thumbnail throttle for prerender [puppet] - https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: Gilles)
[23:47:55] (CR) Krinkle: [C: 1] Increase per-original thumbnail throttle for prerender (1 comment) [puppet] - https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: Gilles)