[00:43:45] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 0 hours old. [02:04:40] (03PS1) 10Legoktm: planet: Add to en [puppet] - 10https://gerrit.wikimedia.org/r/457225 [02:29:24] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:35:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:36:52] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.19) (duration: 14m 26s) [02:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:24] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [02:42:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [02:47:07] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Sep 3 02:47:07 UTC 2018 (duration 10m 15s) [02:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 836.99 seconds [03:48:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.10 seconds [05:10:35] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [05:11:45] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK 
- 206 bytes in 3.650 second response time [05:13:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457287 [05:16:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457287 (owner: 10Marostegui) [05:18:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457287 (owner: 10Marostegui) [05:19:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 00m 52s) [05:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:17] !log Deploy alter table on db1121 with replication, this will generate lag on labs:s4 [05:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457287 (owner: 10Marostegui) [05:40:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [05:45:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [05:57:34] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install analytics-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10elukey) [06:28:41] (03PS1) 10Volans: sre.switchdc.mediawiki: add Phase 0 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) [06:29:05] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [06:30:44] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mtail/varnishxcache.mtail] [06:31:45] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:32:05] mmmh checking [06:32:47] 500s from the puppetmasters, it seems [06:41:12] !log uploaded wikidiff 1.7.3 to apt.wikimedia.org [06:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:33] (03CR) 10Marostegui: [C: 031] sre.switchdc.mediawiki: add Phase 4 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [06:47:14] https://lists.wikimedia.org/pipermail/wikimedia-l/2018-September/thread.html is 404. I wonder if something is wrong or if really nobody has emailed wikimedia-l in the last 3 days [06:49:57] bawolff: no emails in my folder [06:50:23] Ah... I guess it's really just all quiet then [06:50:42] "Ending: Fri Aug 31 05:54:19 UTC 2018" [06:50:57] the last email I have is 08/30/2018 10:54 PM PDT [06:52:25] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:34] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.655 second response time [06:56:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:56:05] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:04] !log upgrading mwdebug* to wikidiff 1.7.3 [06:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:14] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on
einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:59:44] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:34] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:07:04] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.742 second response time [07:11:25] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:11:39] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10MW-1.32-release-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), and 4 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Arrbe... [07:16:05] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:16:54] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.931 second response time [07:17:48] 10Operations, 10Performance-Team (Radar), 10User-Elukey: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 (10elukey) Stretch is now packaging 1.4.33, meanwhile the last version tested in this task was 1.4.28. Release notes between the two: * https://github.com/m... 
[07:20:15] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:20:19] mobrovac: FYI pdfrender seems to be flaky again ^^^ [07:20:58] !log upgrading mw1238-mw1258 to wikidiff 1.7.3 [07:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:30] <_joe_> volans: I would just depool it on scb1004 and leave it around for mobrovac to take a look [07:26:28] (03PS1) 10Marostegui: filtered_tables: Add constraint_id column [puppet] - 10https://gerrit.wikimedia.org/r/457363 (https://phabricator.wikimedia.org/T189101) [07:34:29] !log Deploy schema change on s3.testwikidatawiki on codfw master - T189101 [07:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:34] T189101: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101 [07:37:04] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.148 second response time [07:38:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): add performance team members to webserver_misc_static servers to maintain sitemaps - https://phabricator.wikimedia.org/T202910 (10ArielGlenn) 05Open>03Resolved I got no new emails from the account checker so it looks g...
[07:39:43] !log Deploy schema change on s8.wikidatawiki on codfw master - T189101 [07:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:49] T189101: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101 [07:40:24] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:09] !log upgrading mw1221-mw1235 to wikidiff 1.7.3 [07:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:34] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.737 second response time [07:44:29] (03CR) 10Gehel: [C: 04-1] "Minor comments inline." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [07:45:19] (03CR) 10Filippo Giunchedi: "+Petr" [puppet] - 10https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: 10Gilles) [07:45:54] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:45:58] (03PS2) 10Filippo Giunchedi: Preserve EXIF ImageDescription instead of XMP Description [puppet] - 10https://gerrit.wikimedia.org/r/456575 (https://phabricator.wikimedia.org/T20871) (owner: 10Gilles) [07:47:08] (03CR) 10Filippo Giunchedi: [C: 032] Preserve EXIF ImageDescription instead of XMP Description [puppet] - 10https://gerrit.wikimedia.org/r/456575 (https://phabricator.wikimedia.org/T20871) (owner: 10Gilles) [07:49:32] !log depooled pdfrender on scb1004 to let the service owners debug it [07:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:00] (03CR) 10Filippo Giunchedi: [C: 032] Increase per-original thumbnail throttle for prerender [puppet] - 10https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: 10Gilles) [07:50:08] (03PS2) 10Filippo Giunchedi: Increase per-original 
thumbnail throttle for prerender [puppet] - 10https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: 10Gilles) [07:50:10] (03CR) 10Gilles: Increase per-original thumbnail throttle for prerender (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456604 (https://phabricator.wikimedia.org/T203135) (owner: 10Gilles) [07:52:45] 10Operations, 10SRE-Access-Requests, 10wikidiff2, 10Patch-For-Review, 10User-Addshore: Give thiemowmde permission to upload wikidiff2 releases (releasers-wikidiff2) - https://phabricator.wikimedia.org/T202476 (10ArielGlenn) Hey @thiemowmde please let us know that uploads work for you, and we'll close thi... [07:53:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10ArielGlenn) Hey @Thargrovewmf please let us know that access to the logs and to mwmaint servers works, and... [07:55:04] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10ArielGlenn) Hey @Kalliope please let us know that access to the logs and the mwmaint servers work... [07:55:46] !log roll restart thumbor to apply latest config changes - T203135 T20871 [07:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:53] T20871: Include at least some EXIF metadata in resized pictures - https://phabricator.wikimedia.org/T20871 [07:55:53] T203135: ThumbnailRender job fails with 429 errors - https://phabricator.wikimedia.org/T203135 [07:56:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (10ArielGlenn) Hey @Peter, as soon as you verify that you have access to, say, the app servers, we can close this ticket. Thanks! 
[07:56:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10elukey) Adding also a reference to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User... [07:56:23] !log restarted pdfrender on scb1003 [07:56:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10elukey) Adding also a reference to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsib... [07:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:14] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [07:58:29] (03PS1) 10Filippo Giunchedi: varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457365 (https://phabricator.wikimedia.org/T201858) [07:58:32] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Effie Mouzeli - https://phabricator.wikimedia.org/T201816 (10ArielGlenn) @jijiki The only thing left on this list is the pwstore, so get us that GPG key whenever you're ready. Then you'll be all done! 
[07:59:11] (03CR) 10Elukey: "Nit: group name in commit message not correct :)" [puppet] - 10https://gerrit.wikimedia.org/r/456763 (https://phabricator.wikimedia.org/T203182) (owner: 10Dzahn) [08:00:24] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [08:01:07] !log restarted pdfrender on scb1004 and repooled it [08:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:03] godog: FYI ms-be1041 has the xfs issue, in case it was missed in the weekend [08:04:23] (03CR) 10Giuseppe Lavagetto: [C: 031] sre.switchdc.mediawiki: add Phase 0 cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:04:47] (03CR) 10Vgutierrez: [C: 04-1] Packaging stuff and readme (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:07:45] !log depooling mw1293-mw1296,mw1299,mw1318 for reboot to address L1TF once pending transcoding jobs have been completed [08:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:24] (03PS1) 10Volans: mediawiki: improve stop_cronjobs() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079) [08:11:34] volans: thanks! 
yeah I'll take care of that shortly [08:14:45] (03CR) 10Gehel: sre.switchdc.mediawiki: add Phase 0 cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:15:33] (03CR) 10Giuseppe Lavagetto: [C: 031] sre.switchdc.mediawiki: add Phase 1 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456502 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:16:53] (03CR) 10Marostegui: [C: 031] "The update-tendril.py looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:18:13] !log rebooting mw2163-mw2189 for kernel security updates [08:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:32] (03CR) 10Volans: "replies inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:21:22] !log Deploy schema change on s7 codfw master (this will generate lag on s7 codfw) [08:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:57] 10Operations: syncing Ubuntu mirror fail - https://phabricator.wikimedia.org/T203290 (10ArielGlenn) p:05Triage>03High [08:23:08] 10Operations, 10ops-esams: cp3038, cp3039 - power supply redundancy failure - https://phabricator.wikimedia.org/T203272 (10ArielGlenn) p:05Triage>03Normal [08:23:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457369 [08:23:51] 10Operations, 10WMF-JobQueue, 10Patch-For-Review: Dismantle most of the old jobqueue infrastructure - https://phabricator.wikimedia.org/T197003 (10ArielGlenn) p:05Triage>03Normal [08:25:10] (03PS7) 10Vgutierrez: Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:25:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3317
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/457369 (owner: 10Marostegui) [08:25:35] (03CR) 10Vgutierrez: "I've took the liberty of fixing the imports stuff on PS7 myself." [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:25:47] 10Operations: wtp2020 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T203265 (10ArielGlenn) p:05Triage>03Normal [08:26:08] 10Operations, 10ops-eqiad, 10Traffic: cp1080 - kernel / bnxt_en failures - https://phabricator.wikimedia.org/T203194 (10ArielGlenn) p:05Triage>03Normal [08:26:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457369 (owner: 10Marostegui) [08:26:43] (03CR) 10jerkins-bot: [V: 04-1] Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:26:54] (03CR) 10Giuseppe Lavagetto: [C: 031] sre.switchdc.mediawiki: add Phase 2 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456503 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:27:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 (duration: 00m 52s) [08:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:58] (03PS1) 10Filippo Giunchedi: varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457370 (https://phabricator.wikimedia.org/T201858) [08:29:04] (03CR) 10Vgutierrez: Packaging stuff and readme (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:32:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457369 (owner: 10Marostegui) [08:32:40] !log fix xfs on ms-be1041 sde - T199198 [08:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:46] T199198: Some swift filesystems reporting negative disk usage - 
https://phabricator.wikimedia.org/T199198 [08:33:30] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) @Dzahn thanks a lot! Also I realized I've pasted the fix instructions on the wrong page (graphite vs swift). The right location is https://wikitech... [08:34:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "see comment inline, I think setting the target dc databases to read-only is redundant and unnecessary." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:36:18] (03CR) 10Giuseppe Lavagetto: "The change looks good to me; I was wondering if it wouldn't make sense to merge this one-liner into phase 3's task: "set readonly and wait" [cookbooks] - 10https://gerrit.wikimedia.org/r/456511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:37:56] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, and 3 others: Phabricator outbound email seems to have a SPOF of mx1001 - https://phabricator.wikimedia.org/T196916 (10ArielGlenn) Does this need more review/commentary before moving forward? [08:41:55] RECOVERY - Filesystem available is greater than filesystem size on ms-be1041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [08:42:16] (03CR) 10Vgutierrez: [C: 04-1] Packaging stuff and readme (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:46:26] 10Operations, 10ops-codfw, 10DBA: db2042 RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) 05Open>03stalled p:05Triage>03Normal I am marking this as Stalled and if no one objects I think we should proceed with T202051#4541285 leaving the RAID controller with WB enforced. 
[08:47:16] (03PS2) 10Gehel: elasticsearch: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456137 (https://phabricator.wikimedia.org/T198351) [08:51:36] (03PS8) 10Vgutierrez: Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:51:38] (03PS1) 10Vgutierrez: Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) [08:53:06] (03CR) 10jerkins-bot: [V: 04-1] Packaging stuff and readme [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:53:12] (03CR) 10jerkins-bot: [V: 04-1] Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [08:53:39] (03CR) 10Legoktm: [C: 04-1] "I only reviewed the packaging stuff, HTH :)" (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [08:57:15] thx legoktm <3 [08:58:14] 10Operations, 10DNS, 10Mail, 10User-herron: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065 (10ArielGlenn) At this point might it be useful to poke the greenhouse support folks again? [08:58:53] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Vgutierrez) With the two users approach (certcentral / www-data) we just stop nginx from writing in /etc/certcentral. We should also co... 
[08:59:24] :)) [09:00:25] !log starting rolling restart of elasticsearch / cirrus / codfw for various updates and data directory migration - T198351 [09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:30] T198351: Refactor puppet to support multiple elasticsearch instances on same node - https://phabricator.wikimedia.org/T198351 [09:03:59] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 (10ArielGlenn) [09:04:49] (03CR) 10Gehel: [C: 032] elasticsearch: move elasticsearch data directory [puppet] - 10https://gerrit.wikimedia.org/r/456137 (https://phabricator.wikimedia.org/T198351) (owner: 10Gehel) [09:08:42] !log depooling mw1334-mw1338 for reboot to address kernel security updates once pending transcoding jobs have been completed [09:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:36] 10Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821 (10ArielGlenn) [09:12:34] !log upgrading mw1339-mw1348 to wikidiff 1.7.3 [09:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:32] (03CR) 10Gilles: [C: 031] varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457370 (https://phabricator.wikimedia.org/T201858) (owner: 10Filippo Giunchedi) [09:14:00] (03Abandoned) 10Filippo Giunchedi: varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457365 (https://phabricator.wikimedia.org/T201858) (owner: 10Filippo Giunchedi) [09:14:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please verify if "rendering" is still needed in the switch-traffic checks - I think it doesn't and should be removed." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:19:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] sre.switchdc.mediawiki: add Phase 5 cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:20:04] (03CR) 10Giuseppe Lavagetto: [C: 031] sre.switchdc.mediawiki: add Phase 6 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:22:37] (03CR) 10Giuseppe Lavagetto: [C: 031] sre.switchdc.mediawiki: add Phase 7 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:24:08] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457390 [09:24:31] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457390 [09:33:16] !log rebooting mw2135-mw2147 for kernel security updates [09:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:57] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457390 (owner: 10Marostegui) [09:36:15] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457390 (owner: 10Marostegui) [09:36:37] (03PS2) 10Marostegui: filtered_tables: Add constraint_id column [puppet] - 10https://gerrit.wikimedia.org/r/457363 (https://phabricator.wikimedia.org/T189101) [09:37:25] (03CR) 10Marostegui: [C: 032] filtered_tables: Add constraint_id column [puppet] - 10https://gerrit.wikimedia.org/r/457363 (https://phabricator.wikimedia.org/T189101) (owner: 10Marostegui) [09:38:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 01m 55s) 
[09:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:02] moritzm: for those mw servers you are rebooting, will you pull the last config before getting them back in production? I just pushed the above change and they failed of course [09:41:07] yeah, I run "scap pull" before repooling for servers in eqiad [09:41:38] at least when there's logged activity via SAL while they were off [09:44:31] moritzm: great! thanks :) [09:45:51] (03PS3) 10Marostegui: filtered_tables: Remove unused columns [puppet] - 10https://gerrit.wikimedia.org/r/450934 (https://phabricator.wikimedia.org/T51191) [09:48:35] PROBLEM - High CPU load on API appserver on mw2147 is CRITICAL: Return code of 255 is out of bounds [09:49:44] PROBLEM - Host mw2146 is DOWN: PING CRITICAL - Packet loss = 100% [09:50:04] RECOVERY - Host mw2146 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [09:50:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457390 (owner: 10Marostegui) [09:50:37] ^ silenced [09:51:55] RECOVERY - High CPU load on API appserver on mw2147 is OK: OK - load average: 4.16, 1.37, 0.48 [09:53:47] (03CR) 10Giuseppe Lavagetto: [C: 031] "overall LGTM; I think we might get to the point where we don't need to restart parsoid anymore before the switchover, though." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:00:05] !log depool codfw mathoid for kubernetes upgrade [10:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:52] !log installing tomcat8 security updates [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:07] (03CR) 10Volans: "Reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:18:35] (03PS1) 10Gehel: maps: migrate maps2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/457408 (https://phabricator.wikimedia.org/T198622) [10:18:39] (03PS1) 10Gehel: maps: change partitioning scheme for new SSDs in maps2004 [puppet] - 10https://gerrit.wikimedia.org/r/457409 (https://phabricator.wikimedia.org/T195285) [10:21:33] (03CR) 10Zfilipin: "The commit can not be deployed as-is during SWAT because it changes files in two root folders (static and wmf-config). Please split the co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [10:21:40] (03CR) 10Zfilipin: "The commit can not be deployed as-is during SWAT because it changes files in two root folders (static and wmf-config). Please split the co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [10:26:15] (03CR) 10Volans: "> Patch Set 3:" [cookbooks] - 10https://gerrit.wikimedia.org/r/456511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:27:55] 10Operations, 10Wikimedia-Logstash, 10Goal, 10Patch-For-Review: Audit log producers across the infrastructure and plan their transition to centralized logging. - https://phabricator.wikimedia.org/T198756 (10fgiunchedi) As for producers **already in logstash** here's what we have for the last 24 hours: |ty... 
[10:30:05] jan_drewniak: #bothumor I ♥ Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180903T1030). [10:30:38] !log installing ruby2.1 security updates [10:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:39] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457414 (https://phabricator.wikimedia.org/T128546) [10:34:14] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.5-1 - https://phabricator.wikimedia.org/T203271 (10fgiunchedi) a:03fgiunchedi [10:35:28] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457414 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:35:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10mark) Yes, this can be merged once Nuria approves. [10:36:46] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457414 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:59] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457414 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:06] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:457414|Bumping portals to master (T128546)]] (duration: 00m 51s) [10:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:12] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:39:56] (03CR) 10Odder: "Why not? I have had countless patches adding logos merged that way."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [10:39:56] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:457414|Bumping portals to master (T128546)]] (duration: 00m 49s) [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:31] (03PS3) 10Volans: sre.switchdc.mediawiki: add Phase 3 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) [10:52:07] (03CR) 10Zfilipin: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [10:57:55] !log installing libx11 security updates on jessie [10:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:57] (03PS4) 10Volans: sre.switchdc.mediawiki: add Phase 4 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) [10:59:10] (03PS2) 10Volans: sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079) [10:59:22] (03PS3) 10Volans: sre.switchdc.mediawiki: add Phase 6 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079) [10:59:36] (03PS3) 10Volans: sre.switchdc.mediawiki: add Phase 7 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180903T1100). [11:00:05] odder: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:33] I can SWAT today [11:00:56] odder: around for SWAT? [11:02:40] (03PS2) 10Odder: Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) [11:05:16] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: add Phase 0 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:05:35] (03PS1) 10Odder: Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457418 (https://phabricator.wikimedia.org/T203343) [11:05:54] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add Phase 0 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:07:04] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: add Phase 1 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456502 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:08:40] odder: around for SWAT? [11:11:18] (03PS2) 10Odder: Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342) [11:11:22] zeljkof: Yes, just need one more minute. [11:11:44] odder: let me know when you're ready [11:14:13] (03PS2) 10Volans: sre.switchdc.mediawiki: add Phase 1 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456502 (https://phabricator.wikimedia.org/T199079) [11:15:05] (03CR) 10Vgutierrez: Validate challenges before pushing them to the ACME directory (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [11:15:33] (03PS13) 10Mathew.onipe: Elasticsearch module is coming up. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) [11:15:39] !log installing libcgroup security updates [11:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:52] (03CR) 10Zfilipin: [C: 031] Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:16:13] (03PS1) 10Odder: Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) [11:16:38] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe) [11:16:45] (03CR) 10jerkins-bot: [V: 04-1] Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:18:18] (03CR) 10Zfilipin: [C: 031] Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:18:36] zeljkof: Ready now, thanks for waiting up for me [11:18:45] odder: no problem [11:19:06] odder: please add all commits you want deployed to the calendar [11:19:13] I can see only two at the moment [11:19:21] Oh, sure. 
[11:19:24] I'll start with the ones in the calendar [11:19:27] In the meantime, https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/457419/ failed a test [11:19:59] (03CR) 10Zfilipin: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:20:15] hm, looks like a CI problem, re-running the tests [11:20:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:21:26] !log repool codfw mathoid. Kubernetes cluster upgraded to 1.10.6 [11:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:35] !log depool eqiad mathoid for kubernetes upgrade [11:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:54] (03Merged) 10jenkins-bot: Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:23:41] !log zfilipin@deploy1001 Synchronized static/images/project-logos: SWAT: [[gerrit:457081|Add high-density logos for the Russian Wikisource (T203343)]] (duration: 00m 50s) [11:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:46] T203343: Create HiDPI logos for Russian Wikisource - https://phabricator.wikimedia.org/T203343 [11:23:59] odder: 457081 is deployed, please check [11:25:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:26:01] (03PS3) 10Volans: sre.switchdc.mediawiki: add Phase 2 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456503 (https://phabricator.wikimedia.org/T199079) [11:26:33] (03Merged) 10jenkins-bot: Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457084 
(https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:26:57] zeljkof: Looks okay [11:27:06] (03Abandoned) 10Volans: sre.switchdc.mediawiki: add Phase 4 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456511 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:27:47] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: add Phase 2 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456503 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:27:52] !log zfilipin@deploy1001 Synchronized static/images/project-logos: SWAT: [[gerrit:457084|Add high-density logos for Commons (T203342)]] (duration: 00m 49s) [11:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:57] T203342: Create HiDPI logos for Wikimedia Commons - https://phabricator.wikimedia.org/T203342 [11:28:25] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add Phase 2 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456503 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:28:50] (03CR) 10Zfilipin: "Purged: T203342#4553145" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:29:16] (03PS6) 10Muehlenhoff: Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 (https://phabricator.wikimedia.org/T127825) [11:29:40] odder: 457084 is deployed, please check [11:30:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10ArielGlenn) I have sent mail to Nuria, given that phab notifications aren't always the first thing people see when check... 
[11:30:49] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457418 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:31:07] zeljkof: Hm, still seeing the low-density logo on Commons [11:31:26] odder: I've just uploaded the images, I'm merging changes to use them now [11:31:42] it's a two step process now [11:31:56] (03Merged) 10jenkins-bot: Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457418 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:32:20] (03PS4) 10Volans: sre.switchdc.mediawiki: add Phase 3 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) [11:33:45] (03PS2) 10Zfilipin: Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:33:59] (03CR) 10Volans: "done" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:34:05] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:457418|Add high-density logos for the Russian Wikisource (T203343)]] (duration: 00m 50s) [11:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:11] T203343: Create HiDPI logos for Russian Wikisource - https://phabricator.wikimedia.org/T203343 [11:34:13] odder: 457418 is deployed, please check [11:34:48] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:34:52] (03CR) 10jenkins-bot: Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457081 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:34:54] (03CR) 10jenkins-bot: Add high-density logos for Commons [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/457084 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:34:56] (03CR) 10jenkins-bot: Add high-density logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457418 (https://phabricator.wikimedia.org/T203343) (owner: 10Odder) [11:36:12] (03Merged) 10jenkins-bot: Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:36:36] 10Operations, 10Traffic: prometheus-varnish-exporter@frontend.service: Unit entered failed state - invalid character 'C' - https://phabricator.wikimedia.org/T203191 (10ema) p:05Triage>03Normal [11:37:42] (03CR) 10Gehel: Elasticsearch module is coming up. (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: 10Mathew.onipe) [11:38:03] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:457419|Add high-density logos for Commons (T203342)]] (duration: 00m 49s) [11:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:09] T203342: Create HiDPI logos for Wikimedia Commons - https://phabricator.wikimedia.org/T203342 [11:38:30] odder: 457419 is deployed, that is the last one, please check and thanks for deploying with #releng! 
:) [11:38:41] !log EU SWAT finished [11:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:13] (03CR) 10Volans: "Some replies inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/456588 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:39:45] (03PS3) 10Volans: sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079) [11:40:44] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:41:01] (03CR) 10Muehlenhoff: [C: 032] Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [11:41:25] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add Phase 5 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456589 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:41:36] zeljkof: Can confirm Commons logo looks OK, the Wikisource one is ugly. [11:41:38] (03PS4) 10Volans: sre.switchdc.mediawiki: add Phase 6 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079) [11:42:50] odder: :D additional deploys needed for wikisource? something went wrong? or is the logo just ugly? [11:43:33] zeljkof: At least for me, the logo gets cut off. But I just checked and the 2x Commons logo is even biger, so I think they must be doing something with their local CSS? 
[11:43:38] Will show you a screenshot in a moment [11:43:44] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: add Phase 6 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:44:21] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add Phase 6 cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/456592 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:44:48] (03CR) 10Volans: "> Patch Set 2: Code-Review+1" [cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:45:26] (03PS4) 10Volans: sre.switchdc.mediawiki: add Phase 7 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) [11:46:36] zeljkof: http://twkozlowski.com/files/ruwikisource.png [11:46:54] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: add Phase 7 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:47:02] Commons, on the other hand, looks perfect... [11:47:33] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: add Phase 7 cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/456639 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:48:23] 10Operations, 10Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825 (10MoritzMuehlenhoff) 05Open>03Resolved Microcode is now enabled on all baremetal servers with an Intel CPU and we haven't seen any issues so far. Closing the task. [11:49:39] odder: hm, ruwikisource look ok to me https://usercontent.irccloud-cdn.com/file/hmL4BC1w/ruwikisource.png [11:49:41] (03CR) 10jenkins-bot: Add high-density logos for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457419 (https://phabricator.wikimedia.org/T203342) (owner: 10Odder) [11:50:06] But that's the low-density logo, yes? 
[11:50:47] odder: good point, didn't check, how do I force hi-res logos? [11:51:46] yes, that's https://ru.wikisource.org/static/images/project-logos/ruwikisource.png [11:51:54] Unless you have a Retina display or some similar, I have no idea :) [11:52:45] I don't think so :/ [11:52:56] well, add screenshot to the task [11:53:04] somebody will know what to do [11:53:39] I just had a look and it seems all Wikisource projects have the same problem [11:57:06] https://vi.wikisource.org/w/index.php?title=MediaWiki:Common.css&diff=53051&oldid=53050 [11:57:10] Hmmm... [11:59:10] (03PS1) 10Ladsgroup: ores: add PoolCounter nodes settings [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) [12:03:59] !log repool eqiad mathoid for kubernetes upgrade [12:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:07] _joe_: 1.10.6 :-) [12:05:08] zeljkof: KMN, I see where the problem is now. [12:05:31] zeljkof: The other logos have empty space/margins *inside the actual files*. [12:06:04] So I'll update the Russian Wikisource one in the next SWAT window [12:08:35] (03CR) 10Ladsgroup: [C: 04-1] "It fails, trying again. 
https://puppet-compiler.wmflabs.org/compiler02/12321/ores1001.eqiad.wmnet/change.ores1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [12:09:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:09:24] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:11:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:13:40] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Krenair) That makes sense, so we're preferring one of these: * Set the group of the files to be www-data, chmod the files 640. * Put ww... 
[12:17:52] (03CR) 10Alex Monk: Packaging stuff and readme (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [12:18:24] (03CR) 10Alex Monk: [C: 032] Rename certcentral_api to just api [software/certcentral] - 10https://gerrit.wikimedia.org/r/457378 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [12:19:43] olasd: sounds good [12:20:47] (03CR) 10Alex Monk: Validate challenges before pushing them to the ACME directory (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [12:32:34] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:35:18] <_joe_> akosiaris: neat! [12:37:04] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:37:52] (03PS1) 10Elukey: profile::analytics::refinery::job: add systemd timer template [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) [12:42:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:45:58] (03PS2) 10Elukey: profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) [12:48:16] (03PS3) 10Elukey: profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 
(https://phabricator.wikimedia.org/T172532) [12:49:00] (03CR) 10jerkins-bot: [V: 04-1] profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:52:48] (03PS4) 10Elukey: profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) [12:54:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:58:48] (03PS5) 10Elukey: profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) [13:00:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:02:22] zeljkof: o/ [13:02:41] I took a look to the mw exceptions --^ [13:02:44] elukey: \o [13:03:03] and from the graph there seems to be an increase from ~11:40 UTC onward [13:03:17] elukey: uh oh, I've caused them? [13:03:43] not sure! But the timing is suspicious, so I wanted to double check with you [13:04:05] I've just deployed a few trivial config changes during swat, updating logos [13:04:29] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180903T1100 [13:05:04] ah okok logos [13:05:14] but the last deployment is at 11:38 :/ [13:05:40] yes, it's unlikely I've caused it, but I have been wrong before :D [13:06:04] PROBLEM - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:06:10] PROBLEM - designate-pool-manager process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [13:06:17] (03PS2) 10Ladsgroup: ores: add PoolCounter nodes settings [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) [13:06:26] uh oh [13:06:34] arturo: ^ known ? [13:06:39] arturo, andrewbogott [13:06:46] oh, late :) [13:07:22] ACKNOWLEDGEMENT - Check systemd state on cloudservices1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott This is me troubleshooting [13:07:27] ACKNOWLEDGEMENT - designate-pool-manager process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager andrew bogott This is me troubleshooting [13:07:46] sorry for the pages, that's me fussing with things [13:08:06] ok [13:08:25] zeljkof: do you have access to https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors? Timing is really suspicious [13:08:29] ack [13:08:33] most of them are for commons, and then enwiki [13:08:35] afaics [13:08:51] (can also anybody with a bit of time check mw errors ?) [13:09:13] elukey: I do, looking... 
[13:09:31] elukey: well, there was a commons related patch [13:09:32] super [13:10:39] (03CR) 10Ladsgroup: [C: 04-1] "Still not working https://puppet-compiler.wmflabs.org/compiler02/12325/ores1001.eqiad.wmnet/change.ores1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [13:10:57] I've just received my page now [13:11:07] I may have a lot of delay [13:11:25] elukey: there were two patches with just new/updated logos, those are highly unlikely to cause any trouble [13:11:57] elukey: there were another two patches that use the new logos, those are also unlikely to cause trouble, but I could revert those and see if it helps? [13:12:30] I'm looking at the logs, but there are so many errors, is there a specific one you are referring to? [13:12:32] (03CR) 10MSantos: "LGTM, just like the plan. Are we going to switch masters after that? What I mean is, maps2001 will stop being the master or this is just t" [puppet] - 10https://gerrit.wikimedia.org/r/457408 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [13:13:38] zeljkof: I checked the most recurrent one and it seems related to db perf exceptions, but I am a bit ignorant about it [13:13:54] I agree that the new logos shouldn't cause this but.. 
[13:14:11] (03CR) 10Gehel: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/457408 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [13:14:30] elukey: all I can do is revert what I did :/ [13:15:04] zeljkof: if it is not a big trouble it would be great to check if anything changes or not [13:16:43] elukey: sure, reverting, just to check if there's another deployment on at the moment [13:16:44] (03PS6) 10Elukey: profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) [13:18:52] (03PS3) 10Ladsgroup: ores: add PoolCounter nodes settings [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) [13:22:27] (03PS7) 10Elukey: profile::analytics::refinery::job::data_check: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) [13:22:39] (03CR) 10Vgutierrez: Packaging stuff and readme (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [13:23:23] zeljkof: wait a sec, they are more stable now [13:23:36] not sure what happened but the revert is probably not needed [13:23:57] I didn't realize that they have been stable for 20 mins now [13:24:13] zeljkof: There is a swat going on? [13:24:35] marostegui: we were wondering if rollback or not, due to mw failures [13:24:45] they aligned with the last deployment, but now they are gone [13:24:49] marostegui: no [13:24:54] elukey: ok, great [13:25:02] Thanks, I will wait till you guys decide what to do to push my changes :) [13:25:05] sorry for the ping, thanks a lot for the help! [13:25:06] I was thinking about revering a couple of logo updates [13:25:22] elukey: all good, I don't need to do anything? 
[13:25:45] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/12329/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/457431 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:26:03] zeljkof: exactly :) [13:26:06] just ignore me! [13:26:23] elukey: will do ;) [13:26:47] * marostegui will do too [13:26:49] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia Community User Group Albania mailing list request - https://phabricator.wikimedia.org/T201670 (10Sidorela) Hi @herron one of the IP's is: 79.106.255.42 I also tried to subscribe to other mailing-lists but still happens the same error. [13:27:16] (03CR) 10Alex Monk: Packaging stuff and readme (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/456646 (owner: 10Alex Monk) [13:27:47] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 [13:28:09] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 [13:29:37] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 (owner: 10Marostegui) [13:30:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 (owner: 10Marostegui) [13:30:37] uh? [13:30:57] (03CR) 10Marostegui: [C: 032] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 (owner: 10Marostegui) [13:31:44] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:32:34] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 (owner: 10Marostegui) [13:33:23] (03CR) 10Ladsgroup: "I'm done fighting this thing, leave it to the expert" [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [13:33:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 (duration: 00m 56s) [13:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457447 [13:36:14] (03PS1) 10Elukey: profile::analytics::systemd_timer: fix require [puppet] - 10https://gerrit.wikimedia.org/r/457448 (https://phabricator.wikimedia.org/T172532) [13:36:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457447 (owner: 10Marostegui) [13:37:44] (03CR) 10Elukey: [C: 032] profile::analytics::systemd_timer: fix require [puppet] - 10https://gerrit.wikimedia.org/r/457448 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:38:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457447 (owner: 10Marostegui) [13:39:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 (duration: 00m 49s) [13:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:46] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457443 (owner: 10Marostegui) [13:41:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457447 (owner: 10Marostegui) [13:41:55] RECOVERY - puppet last run on analytics1003 is OK: OK: 
Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:51:36] (03CR) 10Marostegui: sre.switchdc.mediawiki: add Phase 3 cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/456510 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:54:10] (03PS3) 10Gehel: Add variables for map tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/456463 (https://phabricator.wikimedia.org/T109776) (owner: 10Mholloway) [13:54:59] (03CR) 10Gehel: [C: 032] Add variables for map tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/456463 (https://phabricator.wikimedia.org/T109776) (owner: 10Mholloway) [13:56:50] (03PS1) 10Odder: Update logos for the Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457455 (https://phabricator.wikimedia.org/T203343) [14:02:17] 10Operations, 10Traffic: http-01 challenge checking on *all* working backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) p:05Triage>03Low [14:02:27] 10Operations, 10Traffic: certcentral: http-01 challenge checking on *all* working backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) [14:02:58] hej zeljkof [14:03:37] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/457455/ fixes those Russian Wikisource logos [14:03:49] Any chance this could be deployed without me present? 
I'm travelling all day tomorrow [14:03:52] 10Operations, 10Traffic: certcentral: http-01 challenge checking on *all* working backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) [14:04:09] 10Operations, 10Traffic: certcentral: http-01 challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) [14:05:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: add PoolCounter nodes settings (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [14:07:45] 10Operations, 10Traffic: certcentral: http-01 challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) btw, checking all backend servers it means that cercentral must have network access to port :80 (or whatever the port is) on every backend server t... [14:11:20] (03PS1) 10Elukey: systemd::timer: use RandomizedDelaySec only when splay is defined [puppet] - 10https://gerrit.wikimedia.org/r/457458 (https://phabricator.wikimedia.org/T172532) [14:13:01] (03PS1) 10Ladsgroup: Add centralauth.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) [14:13:45] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:14:35] (03CR) 10jerkins-bot: [V: 04-1] Add centralauth.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [14:15:22] (03PS1) 10Andrew Bogott: designate: open up the mdns port firewall for all pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/457460 [14:18:06] odder: sure, I'll deploy the patch [14:18:17] odder: just please add it to the calendar [14:18:34] (03CR) 10Ema: [C: 031] varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457370 (https://phabricator.wikimedia.org/T201858) (owner: 10Filippo Giunchedi) [14:18:34] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:19:26] (03CR) 10Elukey: [C: 032] "No op :)" [puppet] - 10https://gerrit.wikimedia.org/r/457458 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:21:04] (03CR) 10Filippo Giunchedi: [C: 032] varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457370 (https://phabricator.wikimedia.org/T201858) (owner: 10Filippo Giunchedi) [14:21:12] (03PS2) 10Filippo Giunchedi: varnish: switch swift_thumbs to active/active [puppet] - 10https://gerrit.wikimedia.org/r/457370 (https://phabricator.wikimedia.org/T201858) [14:22:05] (03PS1) 10Urbanecm: Two throttle rules for SMEX editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457464 (https://phabricator.wikimedia.org/T203392) [14:22:27] elukey: merged your change too btw [14:22:28] (03CR) 10Ladsgroup: ores: add PoolCounter nodes settings (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [14:22:36] godog: thanks! 
[14:22:38] (03PS4) 10Ladsgroup: ores: add PoolCounter nodes settings [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) [14:22:54] (03PS2) 10Andrew Bogott: designate: open up the mdns port firewall for all pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/457460 [14:22:58] (03CR) 10MarcoAurelio: "Computed list 'centralauth.dblist' must end its name with '-computed.dblist'." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [14:23:07] (03PS1) 10Elukey: profile::analytics::systemd_timer: explicitly set splay to undef [puppet] - 10https://gerrit.wikimedia.org/r/457466 (https://phabricator.wikimedia.org/T172532) [14:23:25] !log switch swift thumbnails active/active - T201858 T199073 [14:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:32] T199073: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 [14:23:32] T201858: Push thumbnails to both data centers - https://phabricator.wikimedia.org/T201858 [14:24:04] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:26:27] !log lvs1016: re-enable puppet [14:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:36] (03PS3) 10Andrew Bogott: designate: open up the mdns port firewall for all pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/457460 [14:27:25] (03CR) 10Andrew Bogott: [C: 032] designate: open up the mdns port firewall for all pdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/457460 (owner: 10Andrew Bogott) [14:27:46] (03CR) 10Elukey: [C: 032] profile::analytics::systemd_timer: explicitly set splay to undef [puppet] - 10https://gerrit.wikimedia.org/r/457466 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:27:53] (03PS2) 10Elukey: profile::analytics::systemd_timer: explicitly set splay to undef [puppet] - 10https://gerrit.wikimedia.org/r/457466 (https://phabricator.wikimedia.org/T172532) [14:27:57] 10Operations, 10Traffic, 10Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (10Urbanecm) [14:28:52] (03CR) 10Ladsgroup: "Works fine now: https://puppet-compiler.wmflabs.org/compiler03/12336/ores1001.eqiad.wmnet/ We can deploy this now as the ores change is no" [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [14:29:05] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:29:44] RECOVERY - Check systemd state on cloudservices1003 is OK: OK - running: The system is fully operational [14:29:51] RECOVERY - designate-pool-manager process on cloudservices1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [14:30:27] (03PS1) 10Urbanecm: Add .bollywoodhungama.in to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457469 (https://phabricator.wikimedia.org/T203363) [14:32:15] (03PS1) 10Urbanecm: add Radlines.org to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457474 (https://phabricator.wikimedia.org/T203219) [14:38:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:38:23] (03CR) 10Elukey: [C: 032] "I thought that simply setting 'undef' to any system::timer's splay would have done the trick, but since splay is set as Integer with 0 as " [puppet] - 10https://gerrit.wikimedia.org/r/457458 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:42:14] (03PS8) 10Gehel: Add health check for categories endpoint without lag check [puppet] - 10https://gerrit.wikimedia.org/r/456187 (owner: 10Smalyshev) [14:48:14] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:48:45] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:46] !log running basic mw warmup script on codfw as a test [14:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] (03PS2) 10Ladsgroup: Add centralauth.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) [14:54:45] (03CR) 10Reedy: [C: 04-1] "Might aswell expose this on noc too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [14:55:45] (03CR) 10jerkins-bot: [V: 04-1] Add centralauth.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457459 (https://phabricator.wikimedia.org/T201009) (owner: 10Ladsgroup) [14:56:58] (03PS1) 10Elukey: systemd::timer: add a variable to ensure compatibility with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/457478 (https://phabricator.wikimedia.org/T172532) [14:57:17] (03PS1) 10ArielGlenn: correctly find revsion history content files for monthly dump stats [puppet] - 10https://gerrit.wikimedia.org/r/457479 (https://phabricator.wikimedia.org/T203381) [14:57:57] (03CR) 10jerkins-bot: [V: 04-1] systemd::timer: add a variable to ensure compatibility with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/457478 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:58:28] (03CR) 10ArielGlenn: [C: 032] correctly find revsion history content files for monthly dump stats [puppet] - 10https://gerrit.wikimedia.org/r/457479 (https://phabricator.wikimedia.org/T203381) (owner: 10ArielGlenn) [15:03:14] (03PS5) 10Alexandros Kosiaris: ores: add PoolCounter nodes settings [puppet] - 10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [15:03:18] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ores: add PoolCounter nodes settings [puppet] - 
10https://gerrit.wikimedia.org/r/457422 (https://phabricator.wikimedia.org/T160692) (owner: 10Ladsgroup) [15:08:43] (03CR) 10Alex Monk: [C: 032] Validate challenges before pushing them to the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:10:09] (03Merged) 10jenkins-bot: Validate challenges before pushing them to the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:10:11] (03Merged) 10jenkins-bot: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - 10https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:10:13] (03CR) 10jerkins-bot: [V: 04-1] Provide logging [software/certcentral] - 10https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:11:27] (03CR) 10jenkins-bot: Validate challenges before pushing them to the ACME directory [software/certcentral] - 10https://gerrit.wikimedia.org/r/455159 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:11:39] (03CR) 10jenkins-bot: ACMERequests: Remove orders/challenges after a non-recoverable error [software/certcentral] - 10https://gerrit.wikimedia.org/r/456110 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:13:21] (03CR) 10Vgutierrez: "recheck" [software/certcentral] - 10https://gerrit.wikimedia.org/r/456644 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:13:30] !log depooling mw1300-mw1309 for reboot to address kernel security updates once pending transcoding jobs have been completed [15:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:33] (03CR) 10jenkins-bot: Provide logging [software/certcentral] - 10https://gerrit.wikimedia.org/r/456644 
(https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [15:16:48] 10Operations, 10Traffic: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) [15:17:48] 10Operations, 10Traffic: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) After some discussion in -traffic with @vgutierrez and @bblack I've expanded the scope of this ticket to include dns-01 [15:18:45] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:19:04] !log rebooting mw2190-mw2219 for kernel security updates [15:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:14] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.453 second response time [15:20:50] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.14 [keeping static files] (duration: 02m 13s) [15:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:04] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10fdans) p:05Normal>03Triage [15:24:10] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10fdans) p:05Triage>03Normal [15:24:42] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.15 [keeping static files] (duration: 03m 44s) [15:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:23] (03PS1) 10Vgutierrez: README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) [15:26:49] (03CR) 10jerkins-bot: [V: 04-1] README: provide configuration file examples [software/certcentral] - 10https://gerrit.wikimedia.org/r/457485 (https://phabricator.wikimedia.org/T199711) (owner: 
10Vgutierrez) [15:27:00] !log reedy@deploy1001 Pruned MediaWiki: 1.32.0-wmf.16 [keeping static files] (duration: 01m 48s) [15:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:22] (03CR) 10Elukey: "This one seems working https://puppet-compiler.wmflabs.org/compiler02/12338/" [puppet] - 10https://gerrit.wikimedia.org/r/457478 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [15:29:10] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457487 [15:30:23] (03PS1) 10Andrew Bogott: designate: add main_and_eqiad1_pool_config.yml [puppet] - 10https://gerrit.wikimedia.org/r/457489 [15:30:25] <_joe_> jynus: I have some treats for you [15:30:40] oh [15:30:46] (03PS1) 10Giuseppe Lavagetto: conftool: add class for writing to state to file [puppet] - 10https://gerrit.wikimedia.org/r/457490 [15:30:46] you intrige me [15:30:48] (03PS1) 10Giuseppe Lavagetto: realm.pp: drop mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/457491 [15:30:50] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::maintenance: depend on mediawiki config, not hiera [puppet] - 10https://gerrit.wikimedia.org/r/457492 [15:31:07] (03CR) 10Andrew Bogott: [C: 032] designate: add main_and_eqiad1_pool_config.yml [puppet] - 10https://gerrit.wikimedia.org/r/457489 (owner: 10Andrew Bogott) [15:31:33] <_joe_> I basically adopted the same approach as one of your patches, but avoided using the "shell out to confctl from each puppet run" approach I wrote before and I hated :P [15:31:44] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.121 second response time [15:31:51] interesting [15:31:58] I am about to look at how it works [15:31:59] <_joe_> there is still some details to iron out ofc [15:32:27] ah, so the same but you "cache" it for a while? 
[15:32:28] <_joe_> the way it works is: confd drops a yaml file with the variables we want in /etc/conftool-state/mediawiki.yaml on the puppetmasters [15:32:45] <_joe_> we have 1 watcher that doesn't even need to be cached [15:32:46] <_joe_> confd [15:32:53] and when you refresh that, on "deploy"? [15:32:57] ah, cool [15:33:12] so technically it is always in sync [15:33:14] <_joe_> whenever etcd changes => the file on the puppetmaster changes [15:33:28] <_joe_> although it could fail to sync ofc [15:33:35] we have to make sure it is not abused [15:33:41] <_joe_> yes [15:33:46] as in things that should be calling it directly [15:33:53] <_joe_> well the mediawiki::state function is conveniently buried [15:33:54] because are important [15:33:57] he he [15:33:57] (03CR) 10jerkins-bot: [V: 04-1] conftool: add class for writing to state to file [puppet] - 10https://gerrit.wikimedia.org/r/457490 (owner: 10Giuseppe Lavagetto) [15:34:11] <_joe_> as you can see, I still have things to fix [15:34:21] it looks cool [15:34:32] <_joe_> it's hacky and duct-tapey [15:34:41] So we can maybe keep the variable more or less as we have it now? [15:34:45] <_joe_> but better than having another thing integrate directly into etcd, tbh [15:34:57] <_joe_> jynus: yes, see the second patch [15:34:59] (for non important stuff like monitoring) [15:35:13] non important == can be handled by puppet [15:35:14] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:25] let me see, I was still on the first one [15:35:29] <_joe_> https://gerrit.wikimedia.org/r/457491 <== here i use the mediawiki::state function [15:35:40] <_joe_> to fetch mw_primary from file [15:36:12] wasn't there another usage? 
[15:36:17] in discussion [15:36:23] <_joe_> yes, cronjobs [15:36:26] "configuring the maintenace server" [15:36:28] that one [15:36:28] <_joe_> but they were moved to use hiera [15:36:41] <_joe_> see the last patch of the series [15:36:44] <_joe_> :P [15:36:47] ok :-) [15:36:56] <_joe_> https://gerrit.wikimedia.org/r/457492 [15:37:17] <_joe_> so, I'd really appreciate if you would take a look and give your feedback [15:37:25] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.925 second response time [15:37:41] <_joe_> I think this is the least-ugly thing we can do without re-thinking radically how we do icinga config [15:37:58] I am ok with this [15:38:05] less code changes on "client code" [15:38:08] <_joe_> I thought of building some intelligence into naggen, but that's as ugly as this [15:38:29] how would react the puppet compiler to this? [15:38:43] will it work? [15:39:03] or at least it can be faked [15:39:05] <_joe_> uhm [15:39:14] <_joe_> I was sure I submitted changes to it as well [15:39:37] <_joe_> https://gerrit.wikimedia.org/r/c/operations/puppet/+/457490/1/modules/puppet_compiler/manifests/init.pp [15:39:43] sorry, maybe you did, but I have a delay on reading [15:39:44] <_joe_> just forgot to also add the file :D [15:40:45] I like this [15:40:55] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:16] I would document it well to clarify what should or shouldn't be used for [15:41:49] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457487 (owner: 10Marostegui) [15:41:51] and we will not need a lots of tests- the existing script for mw monitoring should be quite reliable even on failure [15:42:01] I meant master monitoring [15:42:31] I had stalled also this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450228/ [15:42:36] which was read only monitoring [15:43:09] and after that it should work 
mostly unchanged [15:43:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457487 (owner: 10Marostegui) [15:43:36] _joe_: give me some time to give it a proper review [15:44:01] and I will comment if I see something I don't like or strange [15:44:12] _joe_: thanks a lot for working on that [15:45:17] I may think about solving other dynamic sources of truth that are not mission-critical in the same way [15:45:39] e.g. sections and roles and prometheus configuration [15:51:02] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457487 (owner: 10Marostegui) [15:52:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 (duration: 00m 50s) [15:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457497 [15:55:31] 10Operations, 10ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172 (10MoritzMuehlenhoff) @Joe , @elukey : Any objections? Otherwise I'll turn this into a decom ticket. 
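Editor's note on the 15:32-15:45 exchange above: the approach _joe_ describes (a single watcher writes the variables to a file on the puppetmaster, and a lookup function reads from that file instead of querying etcd on every run) can be sketched roughly as follows. This is a minimal illustration in Python, not the Puppet code from the patches. The flat `key: value` layout, the key name `primary_dc`, and the helper names `load_state` and `state_lookup` are all assumptions made for the example; the real file is YAML and its actual contents are not shown in the log.

```python
from pathlib import Path


def load_state(path):
    """Parse a flat 'key: value' state file into a dict.

    Assumption: the file holds only top-level scalar keys, so a tiny
    hand-rolled parser is enough here. Real YAML needs a real parser.
    """
    state = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blanks, the YAML document marker, and anything without a key.
        if not line or line == "---" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip().strip("'\"")
    return state


def state_lookup(path, key, default=None):
    """Return one value from the state file, or `default`.

    The fallback covers the case raised in the discussion where the
    file failed to sync or does not exist yet.
    """
    try:
        return load_state(path).get(key, default)
    except OSError:
        return default
```

A caller would use something like `state_lookup("/etc/conftool-state/mediawiki.yaml", "primary_dc", default="eqiad")`, where the path is the one quoted in the log and the key and default are hypothetical. The explicit default matches the point made in the conversation that this path is meant for non-critical consumers such as monitoring configuration, which can tolerate a stale or missing file.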
[15:56:02] (03PS1) 10Jcrespo: mysql-prometheus-exporter: Fix deleted x1 instance from dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/457499 [15:56:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457497 (owner: 10Marostegui) [15:57:19] !log rebooting mw2200-mw2239 for kernel security updates [15:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457497 (owner: 10Marostegui) [15:59:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 00m 48s) [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I was planning to do a test of switchover the deployment server without going through the switching the entire DC thing. Can we stall this" [puppet] - 10https://gerrit.wikimedia.org/r/457492 (owner: 10Giuseppe Lavagetto) [16:07:19] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/457497 (owner: 10Marostegui) [16:07:26] ACKNOWLEDGEMENT - HP RAID on elastic2012 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T203404 [16:07:31] 10Operations, 10ops-codfw: Degraded RAID on elastic2012 - https://phabricator.wikimedia.org/T203404 (10ops-monitoring-bot) [16:08:22] (03CR) 10Muehlenhoff: "I think we can ignore this, systemd handles it gracefully and jessie will grow out over time. 
Maybe add a comment to systemd::timer that s" [puppet] - 10https://gerrit.wikimedia.org/r/457478 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:08:38] !log Deploy schema change on db1086 [16:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:19] (03PS4) 10Andrew Bogott: VPS puppet ENC: change max prefix size to 255 [puppet] - 10https://gerrit.wikimedia.org/r/456174 (https://phabricator.wikimedia.org/T203104) [16:10:13] (03CR) 10Andrew Bogott: [C: 032] VPS puppet ENC: change max prefix size to 255 [puppet] - 10https://gerrit.wikimedia.org/r/456174 (https://phabricator.wikimedia.org/T203104) (owner: 10Andrew Bogott) [16:10:48] (03CR) 10Marostegui: [C: 031] mysql-prometheus-exporter: Fix deleted x1 instance from dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/457499 (owner: 10Jcrespo) [16:11:18] gehel: if you're wandering what's wrong on T203404 it's the battery count [16:11:19] T203404: Degraded RAID on elastic2012 - https://phabricator.wikimedia.org/T203404 [16:14:41] (03PS1) 10Ema: ATS: fix people.w.o never-cache rule [puppet] - 10https://gerrit.wikimedia.org/r/457500 (https://phabricator.wikimedia.org/T199720) [16:15:11] (03PS2) 10Ema: ATS: fix people.w.o never-cache rule [puppet] - 10https://gerrit.wikimedia.org/r/457500 (https://phabricator.wikimedia.org/T199720) [16:16:56] volans: thanks! That one is supposed to be replaced soon, so let's just forget about it atm [16:18:07] 10Operations, 10ops-codfw: Degraded RAID on elastic2012 - https://phabricator.wikimedia.org/T203404 (10Gehel) 05Open>03Resolved a:03Gehel elastic2012 is scheduled to be replaced soon (see T198169), so let's not do anything at the moment and not waste our DC ops time. [16:19:24] gehel: ack, only thing is that most likely the write policy has changed accordingly and it might have slower performances [16:19:28] bare that in mind [16:19:48] volans: I would not have thought about that, thanks! 
[16:21:03] gehel: you can force the write back anyways, as long as you're ok to accept data loss in case of a host failure [16:21:39] volans: any docs as to how to do that (since you obviously seem to be the expert :) [16:21:50] (03PS2) 10Jcrespo: mysql-prometheus-exporter: Fix deleted x1 instance from dbstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/457499 [16:22:34] not from memory, as it's an HP controller [16:22:59] volans: ok, I'll have a look [16:23:03] mmh it has the HPE SSD Smart Path, so maybe not affected [16:23:27] (03Abandoned) 10Elukey: systemd::timer: add a variable to ensure compatibility with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/457478 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:24:53] gehel: if you want to have fun, this is last time I had to look into smart path: https://phabricator.wikimedia.org/T178177#3818600 [16:25:27] (03CR) 10Ema: [C: 032] ATS: fix people.w.o never-cache rule [puppet] - 10https://gerrit.wikimedia.org/r/457500 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [16:28:13] (03PS1) 10Elukey: systemd::timer: add a comment about RandomizedDelaySec [puppet] - 10https://gerrit.wikimedia.org/r/457505 (https://phabricator.wikimedia.org/T172532) [16:29:59] gehel: so I'd check on a similar host with: sudo hpssacli controller slot=0 show detail [16:30:08] and see the diffs [16:30:15] (03CR) 10Muehlenhoff: [C: 031] "Looks good, two typos" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/457505 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:30:39] for things like [16:30:39] No-Battery Write Cache: Disabled [16:30:40] SSD Caching RAID5 WriteBack Enabled: True [16:31:21] (03PS2) 10Elukey: systemd::timer: add a comment about RandomizedDelaySec [puppet] - 10https://gerrit.wikimedia.org/r/457505 (https://phabricator.wikimedia.org/T172532) [16:31:45] (03CR) 10Elukey: [C: 032] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/457505 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:32:44] !log powercycling mw1300, stuck in reboot [16:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:59] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/456187 (owner: 10Smalyshev) [16:35:53] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, 10Release-Engineering-Team (Watching / External): Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) Wikitech is current only m5- however, on switchover to codfw, it will point to db2037. However, m5-master w... [16:39:29] (03PS1) 10Elukey: profile::analytics::refinery::job::data_check: add more timers [puppet] - 10https://gerrit.wikimedia.org/r/457508 (https://phabricator.wikimedia.org/T172532) [16:40:35] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::data_check: add more timers [puppet] - 10https://gerrit.wikimedia.org/r/457508 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:41:11] !log depooling mw1310-mw1311 for reboot to address kernel security updates once pending transcoding jobs have been completed [16:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:55] 10Operations: in Commons, some PDFs are failing to render thumbnails. 
- https://phabricator.wikimedia.org/T203402 (10Paladox) [16:45:47] (03PS1) 10Elukey: profile::analytics::systemd_timer: fix systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/457511 (https://phabricator.wikimedia.org/T172532) [16:48:16] (03CR) 10Elukey: [C: 032] profile::analytics::systemd_timer: fix systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/457511 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [16:48:34] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.354 second response time [16:51:55] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10Nuria) @CCicalese_WMF The data is also on the mysql data store (to which you already have access) so you can verify ther... [16:59:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to EventLogging in Hive (analytics-privatedata-users) for Cicalese - https://phabricator.wikimedia.org/T203182 (10CCicalese_WMF) @Nuria, right, but as we discussed, I'm looking for the geocoded data. Before this goes live, I want to v... [17:00:04] gehel: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180903T1700). 
[17:01:17] jouncebot: o/ [17:02:33] !log gehel@deploy1001 Started deploy [wdqs/wdqs@68d9cab]: new version of wdqs GUI (wdqs1009 only) [17:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:07] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@68d9cab]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 33s) [17:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:21] !log gehel@deploy1001 Started deploy [wdqs/wdqs@68d9cab]: new version of wdqs GUI [17:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:45] (03PS1) 10Gehel: maps: configure maps-test for new tile invalidation variables [puppet] - 10https://gerrit.wikimedia.org/r/457512 (https://phabricator.wikimedia.org/T109776) [17:10:57] (03CR) 10Gehel: [C: 032] maps: configure maps-test for new tile invalidation variables [puppet] - 10https://gerrit.wikimedia.org/r/457512 (https://phabricator.wikimedia.org/T109776) (owner: 10Gehel) [17:14:47] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@68d9cab]: new version of wdqs GUI (duration: 10m 26s) [17:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:24] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:27:35] RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:27:35] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:13] 10Operations, 10Thumbor: in Commons, some PDFs are failing to render thumbnails. - https://phabricator.wikimedia.org/T203402 (10Aklapper) @Jan.Kamenicek: Pardon! I obviously hadn't read the initial description closely enough. 
[17:36:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 27 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[17:41:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[17:46:32] legoktm: It sounds from https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/456646/ like the instructions need some changes
[17:50:35] (CR) Krinkle: sre.switchdc.mediawiki: add Phase 0 cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[18:01:13] (PS1) Volans: sre.switchdc.mediawiki: improve readability [cookbooks] - https://gerrit.wikimedia.org/r/457519 (https://phabricator.wikimedia.org/T199079)
[18:01:46] (CR) Volans: sre.switchdc.mediawiki: add Phase 0 cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/457328 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[18:02:18] (CR) Volans: "Addressing comment in:" [cookbooks] - https://gerrit.wikimedia.org/r/457519 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[18:04:52] (PS1) Volans: Add licence and copyright note [cookbooks] - https://gerrit.wikimedia.org/r/457521 (https://phabricator.wikimedia.org/T199079)
[18:06:24] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.209 second response time
[18:09:45] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:23:04] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.560 second response time
[18:26:24] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:32:18] (PS1) Muehlenhoff: Remove now obsolete Hiera setting profile::base::enable_microcode [puppet] - https://gerrit.wikimedia.org/r/457532
[18:57:34] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.754 second response time
[18:57:46] Operations, SRE-Access-Requests, Patch-For-Review: request to add phendeskog to perf-roots - https://phabricator.wikimedia.org/T202658 (Peter) Ah sorry totally missed this, will get back ASAP.
[19:01:04] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:18:41] Operations, Thumbor: in Commons, some PDFs are failing to render thumbnails. - https://phabricator.wikimedia.org/T203402 (Jan.Kamenicek) One more thing: I thought it might be just a problem of displaying the thumbnails and otherwise the files could work well, so I tried to use them at Wikisource. However...
[19:19:12] (CR) ArielGlenn: "While I like the idea of moving these values out, I would suggest they not go into the xml dumps config, because they really are unrelated" [puppet] - https://gerrit.wikimedia.org/r/456439 (owner: Smalyshev)
[19:39:58] bah I forgot it was a US holiday
[19:41:20] (PS8) Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114)
[19:51:19] what's now?
[19:51:54] Labour Day?
[19:52:03] oh, sorry, "Labor" :P
[19:53:21] Ja
[19:53:45] Arbeitstag
[19:55:25] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.321 second response time
[19:56:33] * Hauskatze updating npm dev deps, is *boring*
[19:56:56] but I was tired of seeing floods of "npm WARN deprecated"
[19:58:54] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:17:44] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.594 second response time
[20:19:51] (PS14) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079)
[20:20:50] (CR) jerkins-bot: [V: -1] Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:20:55] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:29:37] Krenair: seems like it
[20:30:15] legoktm, I should probably also include a systemd service file in my package right?
[20:30:34] (CR) Mathew.onipe: Elasticsearch module is coming up. (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[20:32:17] I think so
[20:34:19] Krenair: see https://manpages.debian.org/stretch/debhelper/dh_systemd_enable.1.en.html and https://manpages.debian.org/stretch/debhelper/dh_systemd_start.1.en.html
[20:36:34] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.890 second response time
[20:37:25] and also an example config
[20:39:54] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:53] legoktm, right now I'm more concerned about how to get the file there
[20:41:27] er, what do you mean?
[20:41:56] well I have to somehow get the systemd service file into the package
[20:42:09] and have it extract to the correct location on the installing machine
[20:43:43] Krenair: just create debian/certcentral.service, and then it should just work
[20:43:57] cool
[20:44:23] you might need to add --with systemd to the dh invocation, I forget
[20:49:45] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark]
[20:52:39] Krenair: to confirm, you're targeting stretch right?
[20:52:47] legoktm, yeah
[20:53:06] these will be fresh machines/VMs specifically for this service
[20:53:13] so they'll be stretch
[20:54:59] I'm reading the wikitech docs
[20:55:04] and the first thing it says is
[20:55:07] "Faidon says not to use stdeb"
[20:55:49] I saw that too
[21:01:12] with debhelper compat level 10 and when using dh it's automatic: https://wiki.debian.org/Teams/pkg-systemd/Packaging
[21:01:22] compat level 10 is fine if only stretch is targeted
[21:01:30] jessie has only 9
[21:03:57] Operations, TechCom-RFC, Traffic, Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Krinkle) @mobrovac I think as a first step we should: * Standardise the name of the header (for services that can/do set a head...
[21:16:03] (CR) Smalyshev: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/456439 (owner: Smalyshev)
[21:16:22] * Krinkle staging on mwdebug1002/deployment
[21:16:34] (CR) Gehel: Elasticsearch module is coming up. (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T199079) (owner: Mathew.onipe)
[21:17:05] (CR) Krinkle: [C: 1] sre.switchdc.mediawiki: improve readability [cookbooks] - https://gerrit.wikimedia.org/r/457519 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[21:19:37] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.19/extensions/WikimediaEvents/: I920127efb3c4 (duration: 00m 51s)
[21:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:55] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.472 second response time
[21:25:24] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:36:24] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.593 second response time
[21:39:45] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:43:35] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:45:54] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.384 second response time
[21:46:05] (CR) Krinkle: [C: 1] mediawiki: improve stop_cronjobs() method [software/spicerack] - https://gerrit.wikimedia.org/r/457367 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[21:49:14] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:55:01] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[21:55:37] uh, thumbor
[21:55:37] got paged, looking
[21:55:58] godog: I'm here if needed
[21:56:01] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.004 second response time
[21:56:37] yup, sadly already recovered
[21:57:14] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.483 second response time
[21:57:36] yep
[21:57:56] that'd be nginx I believe waiting too long to reply "sometimes"
[21:58:13] which would be fixed by T187765
[21:58:14] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765
[21:58:26] apologies for the page though
[21:58:41] np for me
[21:58:43] other than that I can't see anything wrong
[21:58:47] ack
[22:00:34] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:08:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.629 second response time
[22:14:04] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:21:44] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.572 second response time
[22:24:25] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.321 second response time
[22:25:04] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:27:54] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:48:16] legoktm, so if I want to include a config example file... how do I tell it where to put that?
[22:49:11] uh, debian/certcentral.files
[22:49:15] lemme find an example
[22:49:39] actually it's .install
[22:49:41] Krenair: see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/debian/+/master/debian/mediawiki.install
[22:49:50] ty
[22:52:14] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.745 second response time
[22:55:35] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:56:35] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.629 second response time
[22:59:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.183 second response time
[23:00:05] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:04:51] (PS1) Volans: Add redis_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079)
[23:04:54] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:09:40] (CR) Volans: "Tomorrow I'll send the patch for Puppet (config file) and the related cookbook. It should be decided if it could go into Phase4 or needs a" [software/spicerack] - https://gerrit.wikimedia.org/r/457711 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[23:17:34] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:40:24] legoktm, is it supposed to say "E: certcentral changes: bad-distribution-in-changes-file stretch-wikimedia" ?
[23:40:29] it continues anyway but I see that in the log
[23:41:51] yes-ish
[23:42:16] there's a way to teach lintian that stretch-wikimedia is real I think, I've never done that before
[23:42:21] Operations, Thumbor: in Commons, some PDFs are failing to render thumbnails. - https://phabricator.wikimedia.org/T203402 (Ronhjones) Maybe a time / space thing? The 550 uncompressed extracted TIFFs took a while to extract and take up a total of 4.17GB of disk space.
[23:42:43] alright well I'll leave it be I think
[23:53:44] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.810 second response time
[23:57:04] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
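
Editor's sketch (not part of the channel log): the packaging advice given to Krenair in the 20:30 to 21:01 and 22:48 to 22:49 exchanges could look roughly like the layout below. It assumes a stretch-only target with debhelper compat level 10, where `dh` picks up `debian/certcentral.service` on its own. The unit contents, the `/usr/bin/certcentral` path, and the `config.example.yaml` file name are hypothetical placeholders and are not taken from the certcentral repository.

```
# debian/compat
10

# debian/rules  (the recipe line must start with a tab)
#!/usr/bin/make -f
%:
	dh $@

# debian/certcentral.service  (hypothetical unit body)
[Unit]
Description=certcentral (placeholder description)

[Service]
ExecStart=/usr/bin/certcentral

[Install]
WantedBy=multi-user.target

# debian/certcentral.install  (source path, then destination directory)
config.example.yaml usr/share/doc/certcentral/examples
```

If jessie also had to be supported, compat level 9 would apply instead, and, as suggested at 20:44:23, the invocation would likely need to become `dh $@ --with systemd` with `dh-systemd` added to the build dependencies.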