[00:13:32] !log volker-e@deploy1001 Started deploy [design/style-guide@efc240b]: Deploy design/style-guide:
[00:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:39] !log volker-e@deploy1001 Finished deploy [design/style-guide@efc240b]: Deploy design/style-guide: (duration: 00m 07s)
[00:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:02] Operations, Performance-Team, Traffic, observability: Ensure graphs used by Performance account for Varnish-to-ATS migration - https://phabricator.wikimedia.org/T233474 (Krinkle) a:ema It looks like the Apache Backend-Timing graphs dried up.
(PS1) Bmansurov: Recommendation API: upgrade node to version 10 [puppet] - https://gerrit.wikimedia.org/r/560454 (https://phabricator.wikimedia.org/T241230)
[02:14:23] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1067.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:45:25] !log volker-e@deploy1001 Started deploy [design/style-guide@8b2eda6]: Deploy design/style-guide:
[02:45:33] !log volker-e@deploy1001 Finished deploy [design/style-guide@8b2eda6]: Deploy design/style-guide: (duration: 00m 07s)
[02:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:34:45] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:10:49] PROBLEM - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2019-12-20 05:04:29 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:20:12] (CR) Muehlenhoff: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/560416 (owner: Volans)
[06:32:42] (CR) Ammarpad: [C: +1] "LGTM also." [mediawiki-config] - https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9)
[07:21:36] Operations, ops-codfw, DBA: codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (jcrespo)
[07:30:07] Operations, SRE-Access-Requests, Patch-For-Review: Restore access for bmansurov - https://phabricator.wikimedia.org/T241089 (jcrespo) a:jcrespo I need to ask internally the reasons for removal (expiration, inactivity, other). Knowing the original access request would expedite handling this ticket.
[07:36:35] (CR) ArielGlenn: "I'm not totally excited about some of the formatting but I can live with it, as far as the changes to the dumps and snapshot modules in th" [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[07:42:48] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (jcrespo) a:Nuria This needs @nuria approval (in addition to @leila) as service owner. I haven't checked the other information giv...
[07:46:25] Operations, SRE-Access-Requests, Patch-For-Review: Restore access for bmansurov - https://phabricator.wikimedia.org/T241089 (MoritzMuehlenhoff) >>! In T241089#5762004, @jcrespo wrote: > I need to ask internally the reasons for removal (expiration, inactivity, other). Knowing the original access reques...
[07:51:21] Operations, SRE-Access-Requests, Patch-For-Review: Restore access for bmansurov - https://phabricator.wikimedia.org/T241089 (jcrespo) I found the original onboarding ticket: T113069. Sorry for the delay in handling this, these are bad dates. Will proceed as per procedure after the 3 business day delay.
[08:14:49] RECOVERY - Maps - OSM synchronization lag - codfw on icinga1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 2.969e+04 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[08:23:44] (PS1) Muehlenhoff: Add a define to install a package from a repository component (WIP) [puppet] - https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324)
[08:23:45] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 4.955e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[08:24:26] (CR) jerkins-bot: [V: -1] Add a define to install a package from a repository component (WIP) [puppet] - https://gerrit.wikimedia.org/r/560458 (https://phabricator.wikimedia.org/T240324) (owner: Muehlenhoff)
[08:28:46] (PS1) Mathew.onipe: maps: Enable osm replication after state file update. [puppet] - https://gerrit.wikimedia.org/r/560459 (https://phabricator.wikimedia.org/T239728)
[08:37:43] Operations, ops-eqiad, DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (jcrespo) Open→Resolved ` megacli -PDRbld -ShowProg -PhysDrv [32:9] -aALL Device(Encl-32 Slot-9) is not in rebuild process Exit Code: 0x00 ` `MegaRAID OK: opt...
[08:39:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.023e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[08:52:19] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 324 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[08:57:24] ACKNOWLEDGEMENT - snapshot of s7 in eqiad on db1115 is CRITICAL: snapshot for s7 at eqiad taken more than 4 days ago: Most recent backup 2019-12-20 05:04:29 Jcrespo retrying now, should be fixed soon https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[08:59:24] Operations, SRE-Access-Requests: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (jcrespo) Open→Resolved Because no feedback has been given for a while, this is considered as resolved. Please reopen if yo...
[09:02:01] Operations, Puppet, Patch-For-Review: puppet-merge can't accept an explicit SHA1 for an --ops merge - https://phabricator.wikimedia.org/T241277 (jcrespo) @CDanis Is this something you plan to work on? Otherwise, who do you need help with? I am trying to triage the importance of this ticket.
[09:03:48] Operations, serviceops, Patch-For-Review: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (jcrespo) I believe this is a known issue tracked on another ticket (parse...
[09:05:14] Operations, serviceops, Patch-For-Review: PHP Fatal error: Allowed memory size of 524288000 bytes exhausted (tried to allocate 20480 bytes) in /var/www/php-monitoring/lib.php on line 35 - https://phabricator.wikimedia.org/T240824 (jcrespo) T230076 I believe.
[09:11:47] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (jcrespo) CC @MoritzMuehlenhoff (Re: get operations to use a newer versio...
[09:16:09] (PS1) Jcrespo: Revert "Increase nginx limits on http resp hdr block size" [puppet] - https://gerrit.wikimedia.org/r/560514
[09:16:20] (PS1) Jcrespo: Revert "varnish: temporarily allow more response headers" [puppet] - https://gerrit.wikimedia.org/r/560515
[09:18:29] (PS2) Jcrespo: Revert "varnish: temporarily allow more response headers" [puppet] - https://gerrit.wikimedia.org/r/560515
[09:18:48] (PS2) Jcrespo: Revert "Increase nginx limits on http resp hdr block size" [puppet] - https://gerrit.wikimedia.org/r/560514
[09:19:53] Operations, Core Platform Team, MediaWiki-extensions-CentralAuth, TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (jcrespo) p:High→Normal > Yes, I think, once those...
[09:22:07] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (MoritzMuehlenhoff) Providing Imagemagick 7 is non-trivial given that Deb...
[09:22:51] Operations, SRE-swift-storage, serviceops, Patch-For-Review, and 2 others: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (jcrespo) What is the right followup after a month? "I don't know" is an ok answer, I just want to clarify the...
[09:24:48] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (jcrespo) @Bawolff Is the answer clarifying enough? Aiming for Bullseye (...
[09:27:02] Operations, Gerrit, serviceops, Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (jcrespo) Open→Stalled p:High→Normal This seems to be stalled due to concerns raised at T211139#4798560, to be revisited later.
[09:34:47] Operations, SRE-Access-Requests, Patch-For-Review: Restore access for bmansurov - https://phabricator.wikimedia.org/T241089 (jcrespo) p:Triage→High
[09:35:33] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (jcrespo) p:Triage→High
[09:51:48] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (TheDJ) So what bawolff quotes: > identify-im6.q16: cache resources exh...
[10:00:36] Operations, Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (jcrespo) There have been multiple mx1001 issues lately (even if that is unreliable, it is worth noting). My suggestion would be, at least initially, to detect the sam...
[10:05:04] Operations, Mail: CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2 - https://phabricator.wikimedia.org/T240906 (jcrespo) p:Normal→High I am going to mark this as high, as we now have daily alerts, assuming those are real.
[10:06:55] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (TheDJ) This is likely the same issue as {T124662}
[10:08:49] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (TheDJ) Interestingly enough tiffinfo can be used in place of identify.....
[10:09:45] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 27992816 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:15:07] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 897664 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:29:00] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, minor comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/560373 (https://phabricator.wikimedia.org/T241348) (owner: Andrew Bogott)
[10:30:53] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (TheDJ) It was [[ https://github.com/wikimedia/operations-mediawiki-confi...
[10:35:55] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.64e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[10:43:59] (PS1) Andrew Bogott: nova firstboot script: disable 'growpart' in cloud-config [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322)
[10:48:06] (CR) Arturo Borrero Gonzalez: nova firstboot script: disable 'growpart' in cloud-config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322) (owner: Andrew Bogott)
[10:49:58] (CR) Andrew Bogott: nova firstboot script: disable 'growpart' in cloud-config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322) (owner: Andrew Bogott)
[10:51:50] (CR) Andrew Bogott: nova firstboot script: disable 'growpart' in cloud-config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322) (owner: Andrew Bogott)
[10:52:32] (PS2) Andrew Bogott: nova firstboot script: disable 'growpart' in cloud-config [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322)
[10:55:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:06:41] (PS1) Andrew Bogott: nova firstboot: move the serial tty logic out of the base image [puppet] - https://gerrit.wikimedia.org/r/560517 (https://phabricator.wikimedia.org/T181375)
[11:30:32] (PS3) Andrew Bogott: nova firstboot script: disable 'growpart' in cloud-config [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322)
[11:30:34] (PS2) Andrew Bogott: nova firstboot: move the serial tty logic out of the base image [puppet] - https://gerrit.wikimedia.org/r/560517 (https://phabricator.wikimedia.org/T181375)
[11:30:36] (PS1) Andrew Bogott: openstack nova: rename firstboot.sh to userdata.txt [puppet] - https://gerrit.wikimedia.org/r/560520
[11:36:28] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (Reedy) >>! In T240455#5762118, @TheDJ wrote: > It was [[ https://github....
[11:36:32] (PS1) Reedy: Revert "Remove $wgUseImageResize as same as default" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521
[11:36:50] (CR) jerkins-bot: [V: -1] Revert "Remove $wgUseImageResize as same as default" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521 (owner: Reedy)
[11:37:17] (PS2) Reedy: Revert "Remove $wgTiffUseTiffinfo because it doesn't exist" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521
[11:37:21] (PS3) Reedy: Revert "Remove $wgTiffUseTiffinfo because it doesn't exist" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521
[11:37:42] (PS4) Reedy: Revert "Remove $wgTiffUseTiffinfo because it doesn't exist" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521 (https://phabricator.wikimedia.org/T240455)
[11:40:59] (CR) Reedy: [C: +2] Revert "Remove $wgTiffUseTiffinfo because it doesn't exist" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521 (https://phabricator.wikimedia.org/T240455) (owner: Reedy)
[11:42:03] (Merged) jenkins-bot: Revert "Remove $wgTiffUseTiffinfo because it doesn't exist" [mediawiki-config] - https://gerrit.wikimedia.org/r/560521 (https://phabricator.wikimedia.org/T240455) (owner: Reedy)
[11:43:45] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: use TiffInfo again T240455 (duration: 01m 07s)
[11:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:53] T240455: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455
[11:50:49] RECOVERY - snapshot of s7 in eqiad on db1115 is OK: snapshot for s7 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-12-24 10:11:43 from db1116.eqiad.wmnet:3317 (894 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[12:07:50] (CR) Volans: [C: +2] images: fix authentication [software/debmonitor] - https://gerrit.wikimedia.org/r/560416 (owner: Volans)
[12:10:30] (Merged) jenkins-bot: images: fix authentication [software/debmonitor] - https://gerrit.wikimedia.org/r/560416 (owner: Volans)
[12:17:12] (PS1) Volans: Release v0.2.2 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/560525
[12:18:49] (CR) Volans: [V: +2 C: +2] Release v0.2.2 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/560525 (owner: Volans)
[12:20:20] !log volans@deploy1001 Started deploy [debmonitor/deploy@39ad186]: Release v0.2.2 - T241206
[12:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:27] T241206: Report image metadata to debmonitor - https://phabricator.wikimedia.org/T241206
[12:21:00] !log volans@deploy1001 Finished deploy [debmonitor/deploy@39ad186]: Release v0.2.2 - T241206 (duration: 00m 40s)
[12:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:49] Operations, SRE-tools, docker-pkg, serviceops, Patch-For-Review: Report image metadata to debmonitor - https://phabricator.wikimedia.org/T241206 (Volans) The issue for the `DELETE` has been fixed, I've successfully deleted the image `docker-registry.wikimedia.org/python3-build-stretch:0.0.2`...
[12:54:00] (PS1) Elukey: airflow: fix hdfs fuse mountpoint check [puppet] - https://gerrit.wikimedia.org/r/560527
[12:55:59] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[12:57:00] (CR) Elukey: [C: +2] airflow: fix hdfs fuse mountpoint check [puppet] - https://gerrit.wikimedia.org/r/560527 (owner: Elukey)
[13:38:49] PROBLEM - Disk space on wdqs1006 is CRITICAL: DISK CRITICAL - free space: /srv 53349 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1006&var-datasource=eqiad+prometheus/ops
[13:46:20] ^gehel: something that could be deleted there?
[13:54:04] jynus: looking. A data reload is probably the only solution
[13:54:19] I was creating a ticket
[13:54:23] At least the only short term solution
[13:54:26] I was about to tune2fs -m0
[13:54:31] to get some gigabytes
[13:54:46] ok with that?
[13:56:04] (PS3) Andrew Bogott: Add initial config for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560372 (https://phabricator.wikimedia.org/T241347)
[13:56:08] (PS3) Andrew Bogott: Openstack Designate: add manifests for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560373 (https://phabricator.wikimedia.org/T241348)
[13:56:10] (PS2) Andrew Bogott: keystone/pike: remove obsolete filter from paste.ini [puppet] - https://gerrit.wikimedia.org/r/560375 (https://phabricator.wikimedia.org/T241347)
[13:56:12] (PS2) Andrew Bogott: nova/pike: update policy.json for new Pike policy changes [puppet] - https://gerrit.wikimedia.org/r/560376 (https://phabricator.wikimedia.org/T241347)
[13:56:21] (CR) Andrew Bogott: [C: +2] nova firstboot script: disable 'growpart' in cloud-config [puppet] - https://gerrit.wikimedia.org/r/560516 (https://phabricator.wikimedia.org/T241322) (owner: Andrew Bogott)
[13:56:45] (PS3) Andrew Bogott: nova firstboot: move the serial tty logic out of the base image [puppet] - https://gerrit.wikimedia.org/r/560517 (https://phabricator.wikimedia.org/T181375)
[13:57:07] (PS2) Andrew Bogott: openstack nova: rename firstboot.sh to userdata.txt [puppet] - https://gerrit.wikimedia.org/r/560520
[13:58:27] (CR) Andrew Bogott: [C: +2] nova firstboot: move the serial tty logic out of the base image [puppet] - https://gerrit.wikimedia.org/r/560517 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott)
[13:58:28] !log tune2fs -m 0 /dev/mapper/wdqs1006--vg-data T241418
[13:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:35] T241418: wdqs1006 /srv low on disk space - https://phabricator.wikimedia.org/T241418
[13:58:49] jynus: can you hold up on the tune2fs?
[13:58:56] or is it already done?
[13:59:08] sorry, I did it already
[13:59:18] ok, no problem
[13:59:19] I can undo it
[13:59:47] I guessed it was ok if you planned on rebuilding it
[13:59:49] the journal is exploding, not the first time we have that, but we don't really understand how that happens
[14:00:06] I'm just going to copy the journal from another system, not do a full rebuild
[14:00:09] RECOVERY - Disk space on wdqs1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1006&var-datasource=eqiad+prometheus/ops
[14:00:41] (CR) Andrew Bogott: [C: +2] openstack nova: rename firstboot.sh to userdata.txt [puppet] - https://gerrit.wikimedia.org/r/560520 (owner: Andrew Bogott)
[14:00:51] (PS4) Andrew Bogott: Add initial config for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560372 (https://phabricator.wikimedia.org/T241347)
[14:00:53] (PS4) Andrew Bogott: Openstack Designate: add manifests for Openstack Pike [puppet] - https://gerrit.wikimedia.org/r/560373 (https://phabricator.wikimedia.org/T241348)
[14:00:55] (PS3) Andrew Bogott: keystone/pike: remove obsolete filter from paste.ini [puppet] - https://gerrit.wikimedia.org/r/560375 (https://phabricator.wikimedia.org/T241347)
[14:00:57] (PS3) Andrew Bogott: nova/pike: update policy.json for new Pike policy changes [puppet] - https://gerrit.wikimedia.org/r/560376 (https://phabricator.wikimedia.org/T241347)
[14:02:06] (PS2) Gehel: maps: Enable osm replication after state file update. [puppet] - https://gerrit.wikimedia.org/r/560459 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe)
[14:03:05] gehel: to be fair, I don't see many disadvantages in 0% reserved blocks with a 95% filesystem utilization on a non-root partition
[14:04:24] it gives us some head space if we have a more subtle disk space issue in the future, but agreed, not much of a change
[14:05:47] "disk space issue" like today :-P
[14:06:15] well, this one, I know what to do to fix it in the short term
[14:06:40] again, can be undone, I thought you were out and I was gaining a few hours
[14:06:47] I might not know what to do with the next one :)
[14:06:57] but more disks!
[14:06:59] *buy
[14:07:17] though honestly, I don't know what could be worse than having blazegraph unable to recover free space :/
[14:07:41] let me know if I can help
[14:08:04] jynus: thanks! but things are under control (well as much as they can be)
[14:11:04] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[14:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:27] Operations, Wikidata, Wikidata-Query-Service: Some queries causes wdqs-blazegraph on wdqs1006 to crash and restart - https://phabricator.wikimedia.org/T213191 (jcrespo) Issue started the 22 Dec at around 2:16 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wdqs1006&var-dat...
[14:13:43] !log data reload from wdqs1008 to wdqs1006 - T241418 [14:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:50] T241418: wdqs1006 /srv low on disk space - https://phabricator.wikimedia.org/T241418 [14:28:57] PROBLEM - nova-compute proc minimum on cloudvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:30:46] RECOVERY - nova-compute proc minimum on cloudvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:38:34] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [14:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:52] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:26] (03PS1) 10Arturo Borrero Gonzalez: openstack: don't show puppet diff in files which may contain passwords [puppet] - 10https://gerrit.wikimedia.org/r/560530 [14:48:17] (03CR) 10Andrew Bogott: [C: 03+1] openstack: don't show puppet diff in files which may contain passwords [puppet] - 10https://gerrit.wikimedia.org/r/560530 (owner: 10Arturo Borrero Gonzalez) [14:49:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: don't show puppet diff in files which may contain passwords [puppet] - 10https://gerrit.wikimedia.org/r/560530 (owner: 10Arturo Borrero Gonzalez) [15:13:16] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [15:13:21] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:53] !log gehel@cumin1001 END (PASS) - 
Cookbook sre.wdqs.data-transfer (exit_code=0) [15:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:01] PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 4756 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:41:14] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1008 is CRITICAL: 4559 ge 3600 Gehel recovery after data reload https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:53:25] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:20:51] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [16:21:01] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [16:32:59] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:35:27] (03PS1) 10Jcrespo: swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - 10https://gerrit.wikimedia.org/r/560538 [16:38:19] (03CR) 10Jcrespo: [C: 04-1] "Needs checking and fix, and reviewing of the other links, but uploading as a reminder to check later." [puppet] - 10https://gerrit.wikimedia.org/r/560538 (owner: 10Jcrespo) [16:39:04] (03PS2) 10Jcrespo: swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - 10https://gerrit.wikimedia.org/r/560538 [16:39:36] (03PS3) 10Jcrespo: swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - 10https://gerrit.wikimedia.org/r/560538 [16:40:09] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:40:43] RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1113 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:51:23] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [16:51:35] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [17:01:47] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:30:51] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad [17:31:01] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw [17:32:13] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:57:13] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:09:45] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:43:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:28:21] Operations, Machine vision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (MarcoAurelio) @jcrespo @Pchelolo Could this be related to T241294 somewhat? One job is from... [21:33:12] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (Reedy) Can someone try the large tiff(s) again? :) [21:37:06] reedy, my connection is prohibitively slow for that, or I would try it. but 350 mb upload? nope [21:37:28] it's already taking enough time just to download it :-P [21:37:33] heh, if I was at home I'd do it [21:37:46] In a B&B... but not sure on connection speed yet [21:37:51] mm [21:37:55] might be better than mine :-P [21:38:03] It is Sweden, and they do like their Fibre [21:38:15] I wish they would hurry up with the dang fiber rollout (=vdsl for us) [21:38:48] in theory fiber to the home is supposed to be possible in this neighborhood once the rollout is complete, but no one has said they would be offering it I guess [21:38:49] meh [21:39:01] only vdsl, but that will still be way better than what I have [21:39:14] anyways, why not do a speed test where you are? :-P [21:39:32] it's STILL downloading [21:40:17] done at last :-/ [21:41:48] 45-50 meg down... 
[21:41:49] 11 ish up [21:42:21] "Anna Norrie, rollporträtt - SMV - NN054.tif (326M) is too large for Google to scan for viruses" [21:42:26] Pfft. as if google doesn't have the resources [21:43:43] ~5 minutes to upload [21:43:45] Might as well try [21:43:53] 5-10, it's fluctuating [21:47:56] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (Bawolff) >Maybe it's just that Debian has adjusted the memory limit poli... [21:50:59] you have better upload speed than me indeed [21:51:04] welp, go go go [21:58:00] http://commons.wikimedia.org/wiki/File:Anna_Norrie,_rollportr%C3%A4tt_-_SMV_-_NN054.tif [21:59:51] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (Reedy) Open→Resolved a:Reedy It works! @Alicia_Fagerving_WM... [22:06:19] Operations, Commons, MediaWiki-extensions-PagedTiffHandler, Multimedia, and 2 others: Large TIFF files do not pass file verification (related to version of image magick installed) - https://phabricator.wikimedia.org/T240455 (Reedy) >>! In T240455#5762109, @TheDJ wrote: > This is possibly related... [22:22:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:24:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:40:30] (CR) BryanDavis: "I would personally be a lot happier with line length of 79. Black's formatting is tolerable otherwise. Black can be configured with a pypr" [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [22:41:25] bd808: 100 or bust! (I truly truly hate 79 as the cutoff0 [22:41:26] ) [22:41:43] and with that, good night from Greece where it is already the 25th, happy holidays, etc! [22:41:50] I truly hate anything >79 ;) [22:42:03] fight! fight! fight! :-P :-D [22:42:05] but not today [22:42:08] maybe next year :-D [22:42:20] * apergos pulls the covers up [22:45:17] I can fit an 80 char terminal and a 1024x800 browser side by side on my laptop screen. If I make the terminal 100 chars then my browser has to be squished to a size that modern websites decide to treat as a tablet. Vim soft wraps, so >80 char lines are visible, but ugly. [22:46:30] Also (and this won't win an argument with a.pergos) I have used 79 chars for editing for something more than 25 years and wider files look really really weird [23:14:28] Operations, Gerrit, serviceops, Patch-For-Review: Convert Gerrit to use H2 as the database - https://phabricator.wikimedia.org/T211139 (Paladox) >>! In T211139#4798560, @Dzahn wrote: > On one hand i would love this because it would make the gerrit codfw slave work which is blocked to lack of misc... 
[23:58:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [23:59:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets